
Saturday, February 25, 2012

Methods to Objectively Evaluate Speech Quality

This post gives an overview of methods to objectively evaluate the perceptual quality of speech. The science of perceptual speech quality assessment has been progressing for the past two decades. The current state of the art for full-reference measurements uses the third iteration of the standard. Good-quality implementations of the algorithms contained within the standards can deliver 97% correlation with the subjective test database, depending on the implementation of the test setup.

Commercial test solutions exist to measure speech from acoustic, electronic analog and digital interfaces.

Mean Opinion Score [MOS] is a scale used in voice telecommunications to predict the perceived quality of a speech sample. MOS describes speech clarity or intelligibility; it does not measure delay across a network or echo. The scale runs from 1 to 5, where a score of 1 is bad and 5 is excellent. MOS test sessions comprise 15 to 25 people listening to speech files of good quality and of poor quality with impairments, and scoring them subjectively. This subjective test process is specified in the standard ITU-T P.800.

Mean opinion score (MOS)
MOS	Quality	Impairment
5	Excellent	Imperceptible
4	Good	Perceptible but not annoying
3	Fair	Slightly annoying
2	Poor	Annoying
1	Bad	Very annoying
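
The arithmetic behind a panel MOS is simple: average the individual ratings and report how tightly they cluster. Here is a minimal sketch in Python, assuming a hypothetical panel of 20 listeners. The ratings are invented purely for illustration, and the function name `mean_opinion_score` is not taken from any standard.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of 1-5 ratings."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 opinion scale")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval; a t-quantile would be slightly
    # wider for panels this small.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from a 20-person panel for one speech sample.
panel = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 4, 3, 4, 4, 5, 3, 4, 4, 4, 3]
mos, ci = mean_opinion_score(panel)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

A real P.800 session involves far more than this (listening conditions, reference samples, screening of listeners), so treat the snippet only as the final averaging step.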

MOS is valuable for the characterization of any device where the voice is compressed and transmitted over networks. Such devices may be handsets or mobile phones, and the networks may be packet-based VoIP networks or wireless networks. Networks employing state-of-the-art codecs are optimized for the compression of voice and treat tones, such as DTMF tones, differently from the voice. Therefore, to characterize the quality of the network or the system under test, real speech files need to be transmitted. Frequency response, levels, echo and delay must also be measured, in addition to the perceived speech quality.

Subjective testing is time-consuming and expensive, so algorithms have been developed that allow a computer to take the clean reference speech file, compare it with the received degraded file, and calculate a MOS that correlates closely with subjective MOS. The first such objective speech quality measurement was standardized in 1998 as the PSQM algorithm, ITU-T P.861, "Objective quality measurement of telephone-band (300-3400 Hz) speech codecs". This was followed by ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ)", which has gained widespread worldwide usage as a reliable method for characterizing most narrowband telephony systems.
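
To make "a high degree of correlation" concrete, the usual check is the Pearson correlation between the algorithm's scores and the panel's scores across a set of test conditions. The sketch below assumes six hypothetical conditions. Both score lists are invented for illustration and do not come from any real test database.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-condition scores, invented for illustration only.
subjective_mos = [1.8, 2.4, 3.1, 3.6, 4.0, 4.3]
objective_mos = [1.9, 2.2, 3.3, 3.5, 4.1, 4.2]
r = pearson(subjective_mos, objective_mos)
print(f"correlation = {r:.3f}")
```

A coefficient close to 1 means the objective scores rank and space the conditions much as the human listeners did.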

The following are examples of Mean Opinion Scores for one implementation of different codecs:

Codec	Data rate [kbit/s]	Mean opinion score (MOS)
G.711 (ISDN)	64	4.1
iLBC	15.2	4.14
AMR	12.2	4.14
G.729	8	3.92
G.723.1	6.3	3.9
GSM EFR	12.2	3.8
G.726 ADPCM	32	3.85
G.729a	8	3.7
G.723.1	5.3	3.65
G.728	16	3.61
GSM FR	12.2	3.5

Wideband telephony networks are expected to improve the user experience, including the intelligibility of voice conversations over the highly compressed codecs used in both packet and wireless networks. Hopefully the tired phrase "can you hear me now" will become a less frequent part of our vocabulary.

High Definition or Wideband Telephony is just now coming into common usage. G.722 is the Wideband Telephony codec for VoIP, and AMR-WB is now being tested for wireless networks. However, these networks do need to be tested in order to tune codec implementations, packet loss concealment algorithms and performance in areas of poor coverage or high congestion.

[Figure: analog frequency definitions for the different forms of telecommunications]

PESQ was never designed to address wideband networks. In addition, vendors of time-warping codecs [e.g. EVRC], as well as Skype and iLBC, were not satisfied that PESQ accurately measured the full quality of their codecs.

In 2006 ITU-T commenced work on a new standard to address the limitations of PESQ and during 2011 the POLQA standard, Recommendation P.863, was published.

High Definition or Wideband Telephony speech uses the new POLQA speech quality metric for objective prediction of MOS. The old PESQ algorithm (ITU-T P.862) has been used for narrowband telephony since it was approved in 2000. PESQ was not designed for Wideband Telephony and also did not represent well the speech quality of time-warping codecs. POLQA addresses all these shortcomings and provides a scale that goes all the way to 24 kHz audio.

It is desirable to use the same scale so that laboratories can compare new results for wideband telephony with their old PESQ database. However, the question of human expectation comes into play, because all these objective measurements performed by computers must correlate with, or predict, subjective experience. If you watch a video on your smartphone, you might consider the picture quality to be good. Your expectations are set in the context of the small screen and the convenience of the video being played on a handheld device. If you were given the same video on your brand-new, expensive high-definition 1080p TV, you would be very disappointed, even if the pixel resolution had been scaled to the 62-inch screen size. Your expectation of quality is tempered by the format in which you are viewing it.

The same applies to speech and audio. If you were to participate in a MOS test and were invited into a studio with high-fidelity speakers and orchestral classical music playing, and were then asked to rate the quality of the High Definition speech you were about to hear, your expectations would be set high and you would be more critical. You would score the audio lower than if you had been asked to rate the speech quality of your most recent cellular phone call.

POLQA offers two scales: the narrowband scale and the super wideband scale. Super wideband telephony reaches 14 kHz analog audio frequency. The narrowband focus scale maps directly onto the old PESQ scale and makes use of the higher scores not given by test participants in narrowband tests.

• NB: Maximum MOS value 4.25

• WB: Maximum MOS value 4.5

• SWB: Maximum MOS value 4.75

So a score of 4.5 on the narrowband POLQA scale is experimentally the best value you will ever obtain with wideband telephony equipment. For POLQA, the maximum MOS value in tests is 4.75.

In future years, the industry will migrate exclusively to the super wideband POLQA scale, once users come to expect high-definition or hi-fi quality from communications audio.

Signal or codec	POLQA SWB	POLQA NB
14 kHz 16-bit linear	4.75
7 kHz 16-bit linear	4.5
AMR-WB	4
3.4 kHz 16-bit linear	3.8	4.5
G.711	3.7	4.3
EFR/AMR-FR 12.2 kbps	3.6	4.1
EVRC 9.5 kbps	3.4	3.9
EVRC-B 9.5 kbps	3.5	4
AMR-HR 7.95 kbps	3.4	3.8

Applications for Perceptual Speech Quality Measurements

Perceptual speech quality measurements are used to make end-to-end measurements on any network where voice codecs are used to compress the speech, or where speech transmission systems or networks may introduce impairments such as weak radio signals, multipath, packet loss or packet jitter.

Examples of systems where Perceptual Speech Quality Measurements are valuable:

Codec evaluation
Frame or packet concealment implementation
Headsets combining the digitization of sound
VoIP phone assessment, both softphone and dedicated VoIP phones
Mobile handsets
Digitizing radio & Intercom systems
VoIP networks
Wireless cellular networks
Speech enhancement and noise reduction systems
Transcoders

An important new feature added to POLQA is its ability to measure the improvement of speech quality for speech enhancement and noise reduction systems.

The picture shows the iLBC codec measuring 4.21 on the narrowband POLQA scale.

In over two decades of such tests, no statistically significant number of participants has ever scored any speech recording as excellent, or 5.0. The highest score typically obtained in any test was 4.54. So this measurement of 4.21 for iLBC is a good score for the codec.

For more information on making PESQ and POLQA measurements, make sure you contact only renowned and well-respected test vendors, because the science of speech quality measurement requires expertise and experience in many different areas: audio, analog electronics and computing. It is easy to make a measurement, but care is required to ensure that the measurement is accurate, correlates with human subjective experience, and is put into the context of the environment, resolution, format and so on.

One of the most admired vendors for speech quality metrics is Malden Electronics, available in the USA through Teraquant Corporation – www.teraquant.com

More information can be obtained at:

http://en.wikipedia.org/wiki/POLQA

Saturday, October 29, 2011

MOS [Mean Opinion Score] for High Definition or Wideband Telephony

Mean Opinion Score [MOS] is a scale from 1 to 5 indicating speech quality, where 1 is bad and 5 is excellent. MOS test sessions comprise 15 to 25 people listening to speech files of good quality and of poor quality with impairments, and scoring them subjectively. This subjective test process is specified in ITU-T P.800. In over 16 years of such tests, no statistically significant number of participants has ever scored any speech recording as excellent, or 5.0. The highest score typically obtained in any test was 4.5.

High Definition or Wideband Telephony speech uses the new POLQA speech quality metric for objective prediction of MOS. The old PESQ algorithm has been used for narrowband telephony since it was approved in 2000. It is desirable to use the same scale so that laboratories can compare new results for wideband telephony with their old PESQ database. However, the question of human expectation comes into play, because all these objective measurements performed by computers must correlate with, or predict, subjective experience. If you watch a video on your smartphone, you might consider the picture quality to be good. Your expectations are set in the context of the small screen and the convenience of the video being played on a handheld device. If you were given the same video on your brand-new, expensive high-definition 1080p TV, you would be very disappointed, even if the pixel resolution had been scaled to the 62-inch screen size. Your expectation of quality is tempered by the format in which you are viewing it.

The same applies to speech and audio. If you were to participate in a MOS test and were invited into a studio with high-fidelity speakers and orchestral classical music playing, and were then asked to rate the quality of the High Definition speech you were about to hear, your expectations would be set high and you would be more critical. You would score the audio lower than if you had been asked to rate the speech quality of your most recent cellular phone call.

POLQA offers two scales: the narrowband scale and the super wideband scale. Super wideband telephony reaches 14 kHz analog audio frequency. The narrowband focus scale maps directly onto the old PESQ scale and makes use of the higher scores not given by test participants in narrowband tests.

• NB: Maximum MOS value 4.25

• WB: Maximum MOS value 4.5

• SWB: Maximum MOS value 4.75

So a score of 4.5 on the narrowband POLQA scale is experimentally the best value you will ever obtain with wideband telephony equipment. You could conceivably measure a MOS value of 4.75 if you were measuring super wideband equipment.

In future years, the industry will migrate exclusively to the super wideband POLQA scale, once users come to expect high-definition or hi-fi quality from communications audio.

The picture shows the iLBC codec measuring 4.21 on the narrowband focus scale.

For more information on making PESQ and POLQA measurements, make sure you contact only renowned and well-respected test vendors, because the science of speech quality measurement requires expertise and experience in many different areas: audio, analog electronics and computing. It is easy to make a measurement, but care is required to ensure that the measurement is accurate and correlates with human subjective experience.

The most trusted vendor for speech quality metrics is Malden Electronics, available in USA through Teraquant Corporation – www.teraquant.com


Tuesday, October 25, 2011

Can You Hear Me Now?

POLQA is the new ITU-T standard for speech quality measurement, which embraces Wideband or High Definition Telephony.

“Can you hear me now?”

We’ve all heard the refrain. How often have you been on a mobile phone and not been able to hear your calling party? How often have you experienced drop-outs on a VoIP call and missed that vital clue, that important piece of information the caller mentioned which would have allowed you to understand their needs? Maybe you lost the business as a result. Good, clear speech quality means productivity, both in business and in personal life. Everyone is critically busy these days, and if you have to ask folks to repeat themselves, you waste time, frustrate meaningful conversation and miss information.

The existing telephony network uses 200-3400 Hz analog bandwidth, digitized at a sampling rate of 8 kHz. 8 bits of vertical resolution multiplied by 8,000 samples per second gives the traditional 64 kbps bandwidth required for a voice channel. Codecs such as G.729 and iLBC for VoIP, iSAC for Skype, and GSM-FR and EVRC for wireless compress traditional narrowband telephony to data rates as low as 4 kbps.
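
The 64 kbps figure is just sample rate times bits per sample. As a small worked example in Python (the helper name `pcm_bit_rate` is only illustrative):

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Uncompressed PCM bit rate in bits per second for one channel."""
    return sample_rate_hz * bits_per_sample

narrowband = pcm_bit_rate(8_000, 8)   # traditional telephony channel
print(narrowband // 1000, "kbps")     # 64 kbps

# Compression ratio of a codec relative to the 64 kbps channel,
# using the 8 kbps rate of G.729 as an example.
ratio = narrowband / 8_000
print(f"G.729 compression ratio: {ratio:.0f}:1")
```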
So now we can compress voice to very low bandwidths, and at the same time we have broadband Internet. So what can we do to improve speech quality?

Wideband or High Definition Telephony technology is now appearing in VoIP networks and wireless networks, using voice codecs such as G.722 and AMR-WB. This provides speech with an analog bandwidth of up to 7 kHz and gives a richer listening experience. The trouble you currently have recognizing which of your young nieces or nephews is speaking to you is due to the high frequencies being filtered out by narrowband telephony. Wideband telephony will reinstate these, enriching your telephone conversation experience and improving productivity through speech clarity. This technology will eventually carry telephony speech all the way up to 20 kHz, the limit of human hearing, equivalent to hi-fi music systems.

3GPP Release 5 introduces the AMR-WB codec, which gives enhanced speech quality using data rates of only 16 kbps. So wideband, or high definition, telephony is being made available to wireless cellular networks.

Tools to Automatically Measure Speech Quality

Determining the subjective speech quality of a transmission system has always been an expensive and laborious process. The tool described in ITU-T Rec. P.862, Perceptual Evaluation of Speech Quality (PESQ), provides a rapid and repeatable result in a few moments. PESQ is an objective measurement tool, i.e. a computer measures the quality of the received audio in relation to the audio that was transmitted. PESQ scores correlate very closely with the results of subjective listening tests [i.e. human beings listening to speech files] on telephony systems. The resulting quality score is analogous to the subjective “Mean Opinion Score” (MOS) measured using panel tests according to ITU-T P.800. Strictly speaking, MOS is a score derived from human subjective testing. The PESQ scores are calibrated using a large database of subjective tests.

The ITU-T selection process that resulted in the standardization of PESQ involved a wide range of conditions, with demanding correlation requirements set to ensure that it has good performance in assessing conventional fixed and mobile networks and packet-based transmission systems.

Since ITU-T Rec. P.862 was originally released in 2000, further mappings of the PESQ score have been created. PESQ-LQ modified the score to improve correlation with subjective test results at the high and low ends of the scale, where the raw PESQ score was found to be less accurate. A new mapping, described in ITU-T Rec. P.862.1, was later released; it further modified the raw score and correlated better with subjective testing.
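
The P.862.1 mapping is a logistic curve applied to the raw PESQ score. The sketch below uses the coefficients commonly quoted for P.862.1. They are reproduced here as an assumption and should be checked against the Recommendation itself before being relied upon. The function name `pesq_to_mos_lqo` is illustrative.

```python
import math

def pesq_to_mos_lqo(raw_pesq):
    """Map a raw P.862 score (roughly -0.5 to 4.5) to MOS-LQO.

    Coefficients are those commonly quoted for ITU-T P.862.1;
    verify against the Recommendation before use.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))

for raw in (-0.5, 1.0, 2.5, 4.5):
    print(f"raw PESQ {raw:4.1f} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

With these coefficients the mapped score tops out at about 4.55 for a raw score of 4.5, which fits the observation that listeners in subjective tests almost never award a full 5.0.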

PESQ Shortcomings - Time Warping

PESQ takes into account coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue network components. The user interfaces have been designed to provide simple access to this powerful algorithm, either directly from the analogue connection or from speech files recorded elsewhere.

PESQ Shortcomings - Noise Reduction (Subjective > PESQ)

The performance of a network or a network element can be fully characterized using high-quality analog test equipment and PESQ. High-quality analog interfaces are needed because the test equipment itself can very easily introduce impairments, which are included in the measurement and drag the PESQ score lower than it should be for the system under test or network element. Whilst it is possible to use phonetically balanced sentences and other test patterns, accurate and repeatable measurements of active speech level, activity, delay, echo, noise and speech quality can be obtained quickly using an artificial speech test stimulus in different languages, which comprehensively exercises all the voice sounds the codec may be presented with while keeping the test time short. A graphical mapping of the errors provides useful insight into how the signal has been degraded and exactly what kinds of sounds cause problems for the codec or system under test.

Since the launch of PESQ in 2000, there have been many advances in codec design. Unfortunately, PESQ was not trained on these later designs and can produce scores that are lower than expected from subjective tests. Time-warping and voice quality enhancement techniques are particularly difficult for PESQ. The ITU agreed on a new standard, P.863 POLQA, in 2010. POLQA addresses many of these issues and produces reliable scores for codecs both old and new. POLQA is now available on a couple of speech quality measurement platforms, but Malden is the only platform that provides a quiet, high-quality analog front-end and the only platform to be recommended.