A Composite Objective Measure on Subjective Evaluation of Speech Enhancement Algorithms

The purpose of speech enhancement algorithms is to improve speech quality, naturalness and intelligibility by suppressing background noise and improving the signal-to-noise ratio. Several objective measures exist for predicting the quality of noisy speech enhanced by noise suppression algorithms, and different objective measures capture different characteristics of the degraded signal. In this paper, multiple linear regression analysis is used to obtain a composite measure that correlates highly with subjective tests, and the performance of several speech enhancement algorithms under car noise conditions is compared. The uncertainty of the results of the proposed measures on different speech enhancement algorithms is analyzed, and the reliability of the results is discussed.


INTRODUCTION
Speech enhancement is concerned with improving perceptual aspects of speech that has been degraded by background noise. Its main aim is to improve speech quality and the signal-to-noise ratio (SNR) while preserving speech intelligibility. A large number of speech enhancement algorithms have been proposed, such as the spectral-subtractive algorithms, the Wiener algorithm, the minimum mean square error (MMSE) algorithms and the subspace algorithms [1].
Speech enhancement algorithms typically degrade the speech signal component while suppressing the background noise, particularly in low SNR conditions, which complicates their subjective evaluation. It is not clear whether listeners base their overall quality judgments on the signal distortion component, the noise distortion component, or both, and this uncertainty decreases the reliability of the ratings. Hence, ITU-T Rec. P.835 has been designed to lead the listeners to rate the speech signal, the background noise, and the overall effect of speech and noise separately [2].
Listening tests are usually time-consuming and expensive to conduct [3], so several objective measures have been proposed. However, most of these objective measures were developed for evaluating the distortions introduced by speech codecs and communication channels, and it is not clear whether they are suitable for evaluating the quality of speech processed by speech enhancement algorithms [4][5]. Only a small number of studies have examined the correlation between objective measures and the subjective quality evaluation of enhanced noisy speech. The measures examined include the perceptual evaluation of speech quality (PESQ), originally developed for speech codecs [6][7][8][9][10][11], the log likelihood ratio (LLR), the cepstrum (CEP) and the segmental SNR (segSNR). For enhanced speech, however, the PESQ measure did not yield correlation coefficients as high as those found for speech transmitted through networks, with a correlation coefficient of about 0.65 in terms of signal distortion. The other conventional objective measures (CEP, LLR and segSNR) performed moderately well (about 0.60) for overall quality but yielded poor correlation coefficients (about 0.30) with ratings of background noise distortion [1].
To further improve the correlation coefficients for the different types of distortion introduced by speech enhancement algorithms, a multiple linear regression analysis is used to obtain a new composite measure, which consists of only five different objective measures. The measurement uncertainty of the proposed measure under different speech enhancement algorithms is then investigated, and the reliability of the results is discussed.

A COMPOSITE MEASURE
Several existing objective measures have been combined to form new measures by means of linear regression analysis or nonlinear techniques [13]. Five widely used objective speech quality measures are selected in this paper: the perceptual evaluation of speech quality (PESQ), the log likelihood ratio (LLR), the cepstrum (CEP), the frequency-weighted segmental SNR (fwSNRseg) and the frequency-variant fwSNRseg with 25 bands (fwSNRsegVar). As mentioned above, each of these objective measures captures only particular characteristics of the distorted signal, so no single measure is suited to rating every kind of distortion [1].
The PESQ measure described in ITU-T P.862 is capable of performing reliably across a wide range of codecs and network conditions. However, the performance of PESQ is found to be sensitive to measurement noise when clean reference samples are used [14]. The range of the PESQ score is [0.5, 4.5]. The log likelihood ratio (LLR) measure and the cepstrum (CEP) measure are based on the dissimilarity between all-pole models of the clean and enhanced speech signals, and assume that speech can be represented by a p-th order all-pole model over short time intervals. The LLR measure represents the ratio of the energies of the prediction residuals of the enhanced and clean signals. The range of the LLR score is [0, 2]. The CEP measure provides an estimate of the log spectral distance between two spectra, with a score range of [0, 10]. The advantage of the fwSNRseg measure is the flexibility of assigning different weights to different frequency bands. The range of the fwSNRseg score is [10 dB, 35 dB]. Alternatively, the weights for each band can be obtained by regression analysis, which gives fwSNRsegVar, with a range of [10 dB, 35 dB].
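For reference, the two all-pole measures described above are commonly written as follows. This is a sketch of the usual textbook definitions rather than a transcription from this paper: the symbols $\mathbf{a}_c$, $\mathbf{a}_e$ (LPC coefficient vectors of the clean and enhanced frames), $\mathbf{R}_c$ (autocorrelation matrix of the clean frame) and $c_c(k)$, $c_e(k)$ (cepstral coefficients of the clean and enhanced frames) are notational assumptions.

```latex
% LLR: ratio of prediction-residual energies of the enhanced and clean frames
d_{\mathrm{LLR}}(\mathbf{a}_e, \mathbf{a}_c)
  = \log \frac{\mathbf{a}_e \mathbf{R}_c \mathbf{a}_e^{T}}
              {\mathbf{a}_c \mathbf{R}_c \mathbf{a}_c^{T}}

% CEP: log-spectral distance estimated from the first p cepstral coefficients
d_{\mathrm{CEP}}(\mathbf{c}_c, \mathbf{c}_e)
  = \frac{10}{\ln 10}
    \sqrt{2 \sum_{k=1}^{p} \bigl[ c_c(k) - c_e(k) \bigr]^{2}}
```

Frame-level values are typically averaged over the utterance, and the score ranges quoted above would then correspond to limits applied to these values.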
Various statistics have been used to evaluate inter-rater reliability. The most common statistic is Pearson's correlation coefficient between the first and second ratings. To obtain it, listeners are presented with the same speech samples at two different testing sessions. Pearson's correlation between the subjective quality ratings $S_d$ and the objective measure values $O_d$ is given by [1]

$$\rho = \frac{\sum_d (S_d - \bar{S}_d)(O_d - \bar{O}_d)}{\sqrt{\sum_d (S_d - \bar{S}_d)^2}\,\sqrt{\sum_d (O_d - \bar{O}_d)^2}}$$

where $\bar{S}_d$ and $\bar{O}_d$ are the mean values of $S_d$ and $O_d$, respectively. The standard deviation of the error when the objective measure is used in place of the subjective measure is given by

$$\hat{\sigma}_e = \hat{\sigma}_s \sqrt{1 - \rho^2}$$

where $\hat{\sigma}_s$ and $\hat{\sigma}_e$ are the standard deviations of $S_d$ and of the error, respectively. A smaller value of $\hat{\sigma}_e$ indicates that the objective measure is better at predicting subjective quality [13]. The first five columns (excluding the title column) in Table 1 show the correlation coefficients and standard deviations of the error for the five objective measures above, where the correlations were computed between the objective measures and the subjective rating scores. A total of 43008 subjective scores were included in the correlation computation, covering two SNR levels (5 dB and 10 dB). The noisy database contains 30 IEEE sentences, which were produced by three male and three female speakers, recorded in a sound-proof booth using Tucker Davis Technologies (TDT) recording equipment, sampled at 25 kHz and then downsampled to 8 kHz [1]. Table 1 shows that the fwSNRsegVar measure yields the highest correlation with the three subjective scales: OVL (overall quality), SIG (signal distortion) and BAK (background distortion). The second best measure is the PESQ measure. The LLR, CEP and fwSNRseg measures perform best when predicting overall quality and signal distortion, but with a large standard deviation of the error.
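The two statistics used in this comparison can be computed directly from paired scores. The sketch below assumes the standard Pearson formula and the usual relation in which the error standard deviation equals the standard deviation of the subjective scores times sqrt(1 - rho^2). The function names and the example scores are hypothetical and are not taken from the paper's data.

```python
import math


def pearson_correlation(subjective, objective):
    """Pearson correlation between paired subjective and objective scores."""
    n = len(subjective)
    if n != len(objective) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_s = sum(subjective) / n
    mean_o = sum(objective) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(subjective, objective))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_o = sum((o - mean_o) ** 2 for o in objective)
    return cov / math.sqrt(var_s * var_o)


def error_std(subjective, rho):
    """Error standard deviation, assumed to be sigma_s * sqrt(1 - rho^2)."""
    n = len(subjective)
    mean_s = sum(subjective) / n
    # Sample standard deviation of the subjective scores.
    sigma_s = math.sqrt(sum((s - mean_s) ** 2 for s in subjective) / (n - 1))
    return sigma_s * math.sqrt(max(0.0, 1.0 - rho ** 2))


# Hypothetical scores, for illustration only (not data from the paper).
subjective_scores = [2.1, 2.8, 3.0, 3.6, 4.2]
objective_scores = [1.9, 2.5, 3.2, 3.4, 4.0]

rho = pearson_correlation(subjective_scores, objective_scores)
sigma_e = error_std(subjective_scores, rho)
print(f"rho = {rho:.3f}, sigma_e = {sigma_e:.3f}")
```

A correlation close to 1 drives the error standard deviation towards zero, which is why the two columns of Table 1 tend to move in opposite directions.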
In order to improve the correlation coefficients, a multiple linear regression analysis is used to obtain a new composite measure. Based on the database mentioned above, a total of 14 listeners (22-50 years old) were recruited for the listening test. None of the listeners had participated in a listening test in the 3 months preceding this test. Correlations are calculated between the objective measures and the three subjective rating scores. A total of 5040 subjective listening scores for the three rating scales are obtained, covering two SNR levels (5 dB and 10 dB) and two different types of background noise. The regression analysis is applied to the objective scores of the five measures above and the subjective scores for the three scales, using a least-squares fit. The weighting coefficients of each parameter are obtained, and the derived composite measures for signal distortion (CSIG), noise distortion (CBAK), and overall quality (COVL) are as follows, where PESQ, LLR, CEP, fwSNRseg and fwSNRsegVar indicate the objective scores, and the subscript indicates the objective measure derived for signal distortion (SIG), background noise distortion (BAK) and overall quality (OVL). The last column in Table 1 shows the correlation coefficients and standard deviations of the error for the proposed composite measures. Compared with the other five objective measures, the proposed composite measures show moderate improvements in correlation, and their standard deviations of the error are smaller. The highest correlation ($\rho$ = 0.674) is obtained with the COVL measure. Compared with the fwSNRsegVar measure, the correlations of CSIG and COVL decline slightly; however, the proposed measure yields smaller standard deviations of the error. This property might make it better suited to evaluating the subjective quality of distorted speech [13].
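The least-squares fitting step can be sketched as follows. This is a minimal illustration, assuming an ordinary least-squares fit with an intercept solved through the normal equations; the function names, the synthetic scores and the weights in `TRUE_WEIGHTS` are hypothetical and are not the coefficients of the paper's CSIG, CBAK or COVL measures.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_composite(objective_rows, subjective):
    """Return least-squares weights [intercept, w1, ..., wk]."""
    design = [[1.0] + list(row) for row in objective_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, subjective)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_composite(weights, row):
    """Composite score: intercept plus weighted sum of the objective scores."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical weights used only to generate synthetic ratings.
TRUE_WEIGHTS = [1.0, 0.6, -0.5, -0.1, 0.05, 0.02]

rng = random.Random(0)
rows = [
    [rng.uniform(0.5, 4.5),     # PESQ
     rng.uniform(0.0, 2.0),     # LLR
     rng.uniform(0.0, 10.0),    # CEP
     rng.uniform(10.0, 35.0),   # fwSNRseg (dB)
     rng.uniform(10.0, 35.0)]   # fwSNRsegVar (dB)
    for _ in range(40)
]
ratings = [predict_composite(TRUE_WEIGHTS, row) for row in rows]
weights = fit_composite(rows, ratings)
print([round(w, 3) for w in weights])
```

Because the synthetic ratings are noise-free, the fit recovers the generating weights. With real listening scores the residual would be non-zero, and the fit would be repeated once for each of the three rating scales.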

Selection of Experiment Conditions and Results
In order to evaluate the performance of the proposed composite measure for different speech enhancement algorithms, the same database mentioned above is selected, with sentences corrupted only by car background noise.
The noise-corrupted sentences are processed by the speech enhancement algorithms mentioned above. Tables 2 and 3 present the objective scores obtained with the proposed measure, showing the average, the span of the objective values and the standard deviation. The tables show a large variability among the different algorithms. The average objective score decreases in the low SNR condition, which is consistent with how listeners perceive speech at low SNR. The objective values vary significantly even for the same algorithm. For example, for COVL at 10 dB SNR, the span for the KLT algorithm is as large as 0.817.

Statistical Analysis of the Uncertainty of the Proposed Measures Values
To analyze the probability distribution of the objective values, histograms of the data obtained with the six adopted algorithms are computed. For the sake of brevity, Fig. 1 shows only the histogram for the Wiener_as algorithm on CBAK. It is obtained by dividing the horizontal axis into bins of constant width equal to 0.1 and by reporting on the vertical axis the number of objective results falling into each bin. The mean and variance are 3.028 and 0.1342, respectively. When a normal distribution is fitted to the data, the fitted probability density function has a mean of 3.05 and a variance of 0.1308 [17]. The good agreement between the two curves suggests that the measurement results can be considered normally distributed. The normality assumption can be formally tested with the chi-square test, and the result shows that the assumption is not rejected at the 0.05 significance level. To reduce the risk of a Type II error (failing to reject a null hypothesis when it is false) in the normality test, the skewness and kurtosis of the measurement results are calculated, and the results are shown in Table 4. The skewness is a well-known indicator of the symmetry of a probability density function about its center value, whereas the kurtosis indicates whether the probability density function is peaked or flat with respect to a normal probability density function. In particular, zero skewness and a kurtosis of 3 are expected for normally distributed data [17]. As illustrated in Table 4, the measurement results of the Wiener_as algorithm can be considered approximately normally distributed. Similar calculations (not presented in the paper) show that the conclusion can be extended to the other algorithms as well. Based on this conclusion, the performance of the different algorithms under car background noise is compared, and the reliability of the results is analyzed.
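The histogram binning and the moment-based normality indicators described above can be sketched as follows. The sketch assumes population (biased) moment estimators and a non-excess kurtosis, so that a normal distribution gives a value near 3. The function names are hypothetical, and the samples are synthetic draws rather than the paper's CBAK values.

```python
import math
import random
from collections import Counter


def bin_counts(values, width=0.1):
    """Count values per constant-width bin; bin k covers [k*width, (k+1)*width)."""
    return Counter(math.floor(v / width) for v in values)


def skewness_kurtosis(values):
    """Return (skewness, kurtosis) from central moments; normal data gives about (0, 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        raise ValueError("all values are identical; moments are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Synthetic, normally distributed scores (illustration only).
rng = random.Random(1)
samples = [rng.gauss(3.0, 0.36) for _ in range(5000)]

skew, kurt = skewness_kurtosis(samples)
counts = bin_counts(samples, width=0.1)
print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f}, bins used = {len(counts)}")
```

Values of skewness far from 0 or kurtosis far from 3 would argue against the normality assumption even when a chi-square test fails to reject it.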
For the sake of brevity, Table 5 shows only the confidence intervals of the COVL measure. The second column of Table 5 gives the 95% confidence intervals of the means. The third and fourth columns give the margin of error and the 95% confidence lower limit, and the last column gives the subjective MOS values in 10 dB car noise, taken from Loizou [1]. Taking the MMSE algorithm as an example, the average of the COVL values lies between 3.127 and 3.259. If any value in this range is taken as an approximation of the true value, the error is no more than 0.1324 (twice the margin of error), with a confidence level of 95%. Since the 95% confidence intervals of the mean estimates overlap, a z-test was performed to further investigate the performance of the algorithms [17]. The results show that, at the 0.05 significance level, there is no significant difference among the means of logMMSE, MMSE and logMMSE_ne; the same holds for the means of Specsub and KLT. Overall, the COVL values of the six algorithms indicate that the performance of different algorithms differs significantly. This conclusion is consistent with the 95% confidence lower limits.
As can be seen from the last column, the subjective scores are consistent with the objective conclusion. The logMMSE, MMSE and logMMSE_ne algorithms perform better than the others, and the Wiener_as algorithm is better than the Specsub and KLT algorithms. The same conclusion can be drawn for the 5 dB SNR case according to Table 6. These results therefore illustrate that the differences between the speech enhancement algorithms are significant in both the objective and subjective tests. It should be noted that the actual values of the objective and subjective measures in Tables 5 and 6 do not line up well in some cases. The reason is that the correlations between objective and subjective measurements are low, especially for distorted speech.
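The interval and test computations described here can be sketched as follows. The sketch assumes a normal-approximation confidence interval for the mean and a two-sample z-test using the sample standard deviations. The function names and the score lists are hypothetical and are not the paper's COVL values.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper, margin) of a normal-approximation CI for the mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * stdev(values) / math.sqrt(len(values))
    centre = mean(values)
    return centre - margin, centre + margin, margin


def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for the difference between two means."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical composite scores for two algorithms (illustration only).
scores_a = [3.05, 3.21, 3.30, 3.12, 3.18, 3.26, 3.09, 3.24]
scores_b = [2.71, 2.95, 2.84, 2.66, 2.90, 2.78, 2.81, 2.73]

lower, upper, margin = mean_confidence_interval(scores_a)
z_stat, p_value = two_sample_z_test(scores_a, scores_b)
print(f"95% CI = [{lower:.3f}, {upper:.3f}], margin = {margin:.3f}")
print(f"z = {z_stat:.2f}, significant at 0.05: {p_value < 0.05}")
```

The margin of error is half the interval width, so an interval from 3.127 to 3.259 corresponds to a margin of about 0.066 and a worst-case error of about 0.132 for any value chosen inside the interval.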

CONCLUSIONS
A multiple linear regression analysis is used in this paper to obtain a new composite measure that has high correlation coefficients with small standard deviations of the error. With the proposed composite measure, the majority of the correlation coefficients in terms of BAK are improved by about 0.2, and the standard deviations of the error decrease by about 0.2 to 0.4 in terms of OVL, SIG and BAK. The uncertainty of the proposed measure under different test conditions is then analyzed, and the values obtained with the proposed measure are shown to be approximately normally distributed. Finally, six speech enhancement algorithms are investigated with the proposed measure, and the results show that the differences between the speech enhancement algorithms are significant in both the objective and subjective tests. The composite objective measure can be regarded not only as an estimator of subjective quality but also as an overall system performance parameter for speech enhancement algorithms.