对数似然比(log-likelihood ratio, LLR)测度
假设语音在短时间内可以表示为一个
p
p
p阶全极点的模型,即
x
(
n
)
=
∑
i
=
1
p
a
x
(
i
)
x
(
n
−
i
)
+
G
x
u
(
n
)
x\left( n \right)=\sum\limits_{i=1}^{p}{{{a}_{x}}\left( i \right)x\left( n-i \right)}+{{G}_{x}}u\left( n \right)
x(n)=i=1∑pax(i)x(n−i)+Gxu(n)
其中,
a
x
(
i
)
{{a}_{x}}\left( i \right)
ax(i)为滤波器系数,
G
x
{{G}_{x}}
Gx表示滤波器的增益大小,
u
(
n
)
u\left( n \right)
u(n)表示单位方差的白噪声。则,对数似然比(log-likelihood ratio, LLR)测度可以表示为
d
L
L
R
(
a
x
,
a
ˉ
x
^
)
=
log
a
ˉ
x
^
T
R
x
a
ˉ
x
^
a
x
T
R
x
a
x
{{d}_{LLR}}\left( {{\mathbf{a}}_{x}},{{{\mathbf{\bar{a}}}}_{{\hat{x}}}} \right)=\log \frac{\mathbf{\bar{a}}_{{\hat{x}}}^{T}{{\mathbf{R}}_{x}}{{{\mathbf{\bar{a}}}}_{{\hat{x}}}}}{\mathbf{a}_{x}^{T}{{\mathbf{R}}_{x}}{{\mathbf{a}}_{x}}}
dLLR(ax,aˉx^)=logaxTRxaxaˉx^TRxaˉx^
其中,
a
ˉ
x
=
[
1
,
−
α
x
(
1
)
,
−
α
x
(
2
)
,
⋯
,
−
α
x
(
p
)
]
T
{{\mathbf{\bar{a}}}_{x}}={{\left[ 1,-{{\alpha }_{x}}\left( 1 \right),-{{\alpha }_{x}}\left( 2 \right),\cdots ,-{{\alpha }_{x}}\left( p \right) \right]}^{T}}
aˉx=[1,−αx(1),−αx(2),⋯,−αx(p)]T表示干净语音的LPC系数,
a
ˉ
x
^
=
[
1
,
−
α
x
^
(
1
)
,
−
α
x
^
(
2
)
,
⋯
,
−
α
x
^
(
p
)
]
T
{{\mathbf{\bar{a}}}_{{\hat{x}}}}={{\left[ 1,-{{\alpha }_{{\hat{x}}}}\left( 1 \right),-{{\alpha }_{{\hat{x}}}}\left( 2 \right),\cdots ,-{{\alpha }_{{\hat{x}}}}\left( p \right) \right]}^{T}}
aˉx^=[1,−αx^(1),−αx^(2),⋯,−αx^(p)]T表示增强语音的LPC系数,
R
x
{{\mathbf{R}}_{x}}
Rx表示干净语音的自相关矩阵。
代码如下:
function llr_mean= comp_llr(cleanFile, enhancedFile);
% ----------------------------------------------------------------------
%
% Log Likelihood Ratio (LLR) Objective Speech Quality Measure
%
%
% This function implements the Log Likelihood Ratio Measure
% defined on page 48 of [1] (see Equation 2.18).
%
% Usage: llr=comp_llr(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% llr - computed likelihood ratio
%
% Note that the LLR measure is limited in the range [0, 2].
%
% Example call: llr =comp_llr('sp04.wav','enhanced.wav')
%
%
% References:
%
% [1] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements,
% Objective Measures of Speech Quality. Prentice Hall
% Advanced Reference Series, Englewood Cliffs, NJ, 1988,
% ISBN: 0-13-629056-6.
%
% Authors: Bryan L. Pellom and John H. L. Hansen (July 1998)
% Modified by: Philipos C. Loizou (Oct 2006) - limited LLR to be in [0,2]
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
% ----------------------------------------------------------------------
if nargin~=2
fprintf('USAGE: LLR=comp_llr(cleanFile.wav, enhancedFile.wav)\n');
fprintf('For more help, type: help comp_llr\n\n');
return;
end
alpha=0.95;
[data1, fs1]= audioread(cleanFile);
[data2, fs2]= audioread(enhancedFile);
if ( fs1~= fs2)
error( 'The two files do not match!\n');
end
len= min( length( data1), length( data2));
data1= data1( 1: len)+eps;
data2= data2( 1: len)+eps;
IS_dist= llr( data1, data2,fs1);
IS_len= round( length( IS_dist)* alpha);
IS= sort( IS_dist);
llr_mean= mean( IS( 1: IS_len));
end
function distortion = llr(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); %240; % window length in samples
skiprate = floor(winlength/4); % window skip in samples
if sample_rate<10000
P = 10; % LPC Analysis Order
else
P=16; % this could vary depending on sampling frequency.
end
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Log Likelihood Ratio
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Get the autocorrelation lags and LPC parameters used
% to compute the LLR measure.
% ----------------------------------------------------------
[R_clean, Ref_clean, A_clean] = ...
lpcoeff(clean_frame, P);
[R_processed, Ref_processed, A_processed] = ...
lpcoeff(processed_frame, P);
% ----------------------------------------------------------
% (3) Compute the LLR measure
% ----------------------------------------------------------
numerator = A_processed*toeplitz(R_clean)*A_processed';
denominator = A_clean*toeplitz(R_clean)*A_clean';
distortion(frame_count) = min(2,log(numerator/denominator));
start = start + skiprate;
end
end
function [acorr, refcoeff, lpparams] = lpcoeff(speech_frame, model_order)
% ----------------------------------------------------------
% (1) Compute Autocorrelation Lags
% ----------------------------------------------------------
winlength = max(size(speech_frame));
for k=1:model_order+1
R(k) = sum(speech_frame(1:winlength-k+1) ...
.*speech_frame(k:winlength));
end
% ----------------------------------------------------------
% (2) Levinson-Durbin
% ----------------------------------------------------------
a = ones(1,model_order);
E(1)=R(1);
for i=1:model_order
a_past(1:i-1) = a(1:i-1);
sum_term = sum(a_past(1:i-1).*R(i:-1:2));
rcoeff(i)=(R(i+1) - sum_term) / E(i);
a(i)=rcoeff(i);
a(1:i-1) = a_past(1:i-1) - rcoeff(i).*a_past(i-1:-1:1);
E(i+1)=(1-rcoeff(i)*rcoeff(i))*E(i);
end
acorr = R;
refcoeff = rcoeff;
lpparams = [1 -a];
end
Itakura-Satio(IS)测度
Itakura-Satio(IS)测度可以表示为
d
I
S
(
a
x
,
a
ˉ
x
^
)
=
G
x
a
ˉ
x
^
T
R
x
a
ˉ
x
^
G
ˉ
x
^
a
x
T
R
x
a
x
+
log
(
G
ˉ
x
^
G
x
)
−
1
{{d}_{IS}}\left( {{\mathbf{a}}_{x}},{{{\mathbf{\bar{a}}}}_{{\hat{x}}}} \right)=\frac{{{G}_{x}}\mathbf{\bar{a}}_{{\hat{x}}}^{T}{{\mathbf{R}}_{x}}{{{\mathbf{\bar{a}}}}_{{\hat{x}}}}}{{{{\bar{G}}}_{{\hat{x}}}}\mathbf{a}_{x}^{T}{{\mathbf{R}}_{x}}{{\mathbf{a}}_{x}}}+\log \left( \frac{{{{\bar{G}}}_{{\hat{x}}}}}{{{G}_{x}}} \right)-1
dIS(ax,aˉx^)=Gˉx^axTRxaxGxaˉx^TRxaˉx^+log(GxGˉx^)−1
其中,
G
x
{{G}_{x}}
Gx表示干净语音的全极点增益,
G
ˉ
x
^
{{\bar{G}}_{{\hat{x}}}}
Gˉx^表示增强语音的全极点增益,其中
G
x
{{G}_{x}}
Gx可以表示为
G
x
=
(
r
x
T
a
x
)
1
/
2
{{G}_{x}}\text{=}{{\left( \mathbf{r}_{x}^{T}{{\mathbf{a}}_{x}} \right)}^{1/2}}
Gx=(rxTax)1/2
其中,
r
x
=
[
r
x
(
0
)
,
r
x
(
1
)
,
⋯
,
r
x
(
p
)
]
T
{{\mathbf{r}}_{x}}={{\left[ {{r}_{x}}\left( 0 \right),{{r}_{x}}\left( 1 \right),\cdots ,{{r}_{x}}\left( p \right) \right]}^{T}}
rx=[rx(0),rx(1),⋯,rx(p)]T表示干净语音的自相关,也就是
R
x
{{\mathbf{R}}_{x}}
Rx的第一行。从中可以看出,IS测度表示的是全极点的增益之差,但这也是IS测度的缺点,因为幅度差异对音质的影响最小。
代码如下:
function is_mean= comp_is(cleanFile, enhancedFile);
% ----------------------------------------------------------------------
% Itakura-Saito (IS) Objective Speech Quality Measure
%
% This function implements the Itakura-Saito distance measure
% defined on page 50 of [1] (see Equation 2.26). See also
% Equation 12 (page 1480) of [2].
%
% Usage: IS=comp_is(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% IS - computed Itakura Saito measure
%
% Note that the IS measure is limited in the range [0, 100].
%
% Example call: IS =comp_is('sp04.wav','enhanced.wav')
%
%
% References:
%
% [1] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements,
% Objective Measures of Speech Quality. Prentice Hall
% Advanced Reference Series, Englewood Cliffs, NJ, 1988,
% ISBN: 0-13-629056-6.
%
% [2] B.-H. Juang, "On Using the Itakura-Saito Measures for
% Speech Coder Performance Evaluation", AT&T Bell
% Laboratories Technical Journal, Vol. 63, No. 8,
% October 1984, pp. 1477-1498.
%
% Authors: Bryan L. Pellom and John H. L. Hansen (July 1998)
% Modified by: Philipos C. Loizou (Oct 2006) - limited IS to be in [0,100]
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
% ----------------------------------------------------------------------
if nargin~=2
fprintf('USAGE: IS=comp_is(cleanFile.wav, enhancedFile.wav)\n');
fprintf('For more help, type: help comp_is\n\n');
return;
end
alpha=0.95;
[data1, fs1]= audioread(cleanFile);
[data2, fs2]= audioread(enhancedFile);
if ( fs1~= fs2)
error( 'The two files do not match!\n');
end
len= min( length( data1), length( data2));
data1= data1( 1: len)+eps;
data2= data2( 1: len)+eps;
IS_dist= is( data1, data2,fs1);
IS_len= round( length( IS_dist)* alpha);
IS= sort( IS_dist);
is_mean= mean( IS( 1: IS_len));
end
function distortion = is(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Scale both clean speech and processed speech to have same dynamic
% range. Also remove DC component from each signal
% ----------------------------------------------------------------------
%clean_speech = clean_speech - mean(clean_speech);
%processed_speech = processed_speech - mean(processed_speech);
%processed_speech = processed_speech.*(max(abs(clean_speech))/ max(abs(processed_speech)));
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
%sample_rate = 8000; % default sample rate
winlength = round(30*sample_rate/1000); %240; % window length in samples
skiprate = floor(winlength/4); % window skip in samples
if sample_rate<10000
P = 10; % LPC Analysis Order
else
P=16; % this could vary depending on sampling frequency.
end
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Itakura-Saito Measure
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Get the autocorrelation lags and LPC parameters used
% to compute the IS measure.
% ----------------------------------------------------------
[R_clean, Ref_clean, A_clean] = ...
lpcoeff(clean_frame, P);
[R_processed, Ref_processed, A_processed] = ...
lpcoeff(processed_frame, P);
% ----------------------------------------------------------
% (3) Compute the IS measure
% ----------------------------------------------------------
numerator = A_processed*toeplitz(R_clean)*A_processed';
denominator = max(A_clean*toeplitz(R_clean)*A_clean',eps);
gain_clean = max(R_clean*A_clean',eps); % this is gain
gain_processed = max(R_processed*A_processed',eps); % squared (sigma^2)
ISvalue=(gain_clean/gain_processed)*(numerator/denominator) + ...
log(gain_processed/gain_clean)-1;
distortion(frame_count) = min(ISvalue,100);
start = start + skiprate;
end
end
function [acorr, refcoeff, lpparams] = lpcoeff(speech_frame, model_order)
% ----------------------------------------------------------
% (1) Compute Autocorrelation Lags
% ----------------------------------------------------------
winlength = max(size(speech_frame));
for k=1:model_order+1
R(k) = sum(speech_frame(1:winlength-k+1) ...
.*speech_frame(k:winlength));
end
% ----------------------------------------------------------
% (2) Levinson-Durbin
% ----------------------------------------------------------
a = ones(1,model_order);
E(1)=R(1);
for i=1:model_order
a_past(1:i-1) = a(1:i-1);
sum_term = sum(a_past(1:i-1).*R(i:-1:2));
rcoeff(i)=(R(i+1) - sum_term) / E(i);
a(i)=rcoeff(i);
a(1:i-1) = a_past(1:i-1) - rcoeff(i).*a_past(i-1:-1:1);
E(i+1)=(1-rcoeff(i)*rcoeff(i))*E(i);
end
acorr = R;
refcoeff = rcoeff;
lpparams = [1 -a];
end
倒谱距离(Cepstrum Distance)测度
倒谱系数可以表示为
c
(
m
)
=
a
m
+
∑
k
=
1
m
−
1
k
m
c
(
k
)
a
m
−
k
,
1
≤
m
≤
p
c\left( m \right)={{a}_{m}}+\sum\limits_{k=1}^{m-1}{\frac{k}{m}c\left( k \right){{a}_{m-k}}}\ \ \ \ \ \ ,1\le m\le p
c(m)=am+k=1∑m−1mkc(k)am−k ,1≤m≤p
其中,
a
j
a_{j}
aj表示LPC系数。
那么一种倒谱距离测度可以表示为
d
c
e
p
(
c
x
,
c
ˉ
x
^
)
=
10
ln
10
2
∑
k
=
1
p
[
c
x
(
k
)
−
c
x
^
(
k
)
]
2
{{d}_{cep}}\left( {{\mathbf{c}}_{x}},{{{\mathbf{\bar{c}}}}_{{\hat{x}}}} \right)=\frac{10}{\ln 10}\sqrt{2\sum\limits_{k=1}^{p}{{{\left[ {{c}_{x}}\left( k \right)-{{c}_{{\hat{x}}}}\left( k \right) \right]}^{2}}}}
dcep(cx,cˉx^)=ln10102k=1∑p[cx(k)−cx^(k)]2
其中,
c
x
(
k
)
{{c}_{x}}\left( k \right)
cx(k)表示干净语音的倒谱系数,
c
x
^
(
k
)
{{c}_{{\hat{x}}}}\left( k \right)
cx^(k)表示增强语音的倒谱系数。
代码如下:
function cep_mean= comp_cep(cleanFile, enhancedFile);
% ----------------------------------------------------------------------
% Cepstrum Distance Objective Speech Quality Measure
%
% This function implements the cepstrum distance measure used
% in [1]
%
% Usage: CEP=comp_cep(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% CEP - computed cepstrum distance measure
%
% Note that the cepstrum measure is limited in the range [0, 10].
%
% Example call: CEP =comp_cep('sp04.wav','enhanced.wav')
%
%
% References:
%
% [1] Kitawaki, N., Nagabuchi, H., and Itoh, K. (1988). Objective quality
% evaluation for low bit-rate speech coding systems. IEEE J. Select.
% Areas in Comm., 6(2), 262-273.
%
% Author: Philipos C. Loizou
% (LPC routines were written by Bryan Pellom & John Hansen)
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
% ----------------------------------------------------------------------
if nargin~=2
fprintf('USAGE: CEP=comp_cep(cleanFile.wav, enhancedFile.wav)\n');
fprintf('For more help, type: help comp_cep\n\n');
return;
end
alpha=0.95;
[data1, fs1]= audioread(cleanFile);
[data2, fs2]= audioread(enhancedFile);
if ( fs1~= fs2)
error( 'The two files do not match!\n');
end
len= min( length( data1), length( data2));
data1= data1( 1: len)+eps;
data2= data2( 1: len)+eps;
IS_dist= cepstrum( data1, data2,fs1);
IS_len= round( length( IS_dist)* alpha);
IS= sort( IS_dist);
cep_mean= mean( IS( 1: IS_len));
end
function distortion = cepstrum(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Scale both clean speech and processed speech to have same dynamic
% range. Also remove DC component from each signal
% ----------------------------------------------------------------------
%clean_speech = clean_speech - mean(clean_speech);
%processed_speech = processed_speech - mean(processed_speech);
%processed_speech = processed_speech.*(max(abs(clean_speech))/ max(abs(processed_speech)));
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); %240; % window length in samples
skiprate = floor(winlength/4); % window skip in samples
if sample_rate<10000
P = 10; % LPC Analysis Order
else
P=16; % this could vary depending on sampling frequency.
end
C=10*sqrt(2)/log(10);
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Itakura-Saito Measure
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Get the autocorrelation lags and LPC parameters used
% to compute the IS measure.
% ----------------------------------------------------------
[R_clean, Ref_clean, A_clean] = ...
lpcoeff(clean_frame, P);
[R_processed, Ref_processed, A_processed] = ...
lpcoeff(processed_frame, P);
C_clean=lpc2cep(A_clean);
C_processed=lpc2cep(A_processed);
% ----------------------------------------------------------
% (3) Compute the cepstrum-distance measure
% ----------------------------------------------------------
distortion(frame_count) = min(10,C*norm(C_clean-C_processed,2));
start = start + skiprate;
end
end
function [acorr, refcoeff, lpparams] = lpcoeff(speech_frame, model_order)
% ----------------------------------------------------------
% (1) Compute Autocorrelation Lags
% ----------------------------------------------------------
winlength = max(size(speech_frame));
for k=1:model_order+1
R(k) = sum(speech_frame(1:winlength-k+1) ...
.*speech_frame(k:winlength));
end
% ----------------------------------------------------------
% (2) Levinson-Durbin
% ----------------------------------------------------------
a = ones(1,model_order);
E(1)=R(1);
for i=1:model_order
a_past(1:i-1) = a(1:i-1);
sum_term = sum(a_past(1:i-1).*R(i:-1:2));
rcoeff(i)=(R(i+1) - sum_term) / E(i);
a(i)=rcoeff(i);
a(1:i-1) = a_past(1:i-1) - rcoeff(i).*a_past(i-1:-1:1);
E(i+1)=(1-rcoeff(i)*rcoeff(i))*E(i);
end
acorr = R;
refcoeff = rcoeff;
lpparams = [1 -a];
end
%----------------------------------------------
function [cep]=lpc2cep(a)
%
% converts prediction to cepstrum coefficients
%
% Author: Philipos C. Loizou
M=length(a);
cep=zeros(1,M-1);
cep(1)=-a(2);
for k=2:M-1
ix=1:k-1;
vec1=cep(ix).*a(k-1+1:-1:2).*ix;
cep(k)=-(a(k+1)+sum(vec1)/k);
end
end
加权谱斜率(Weighted Spectral Slope, WSS)测度
该测度是基于每个频带的频谱斜率之间的加权差值进行计算的。首先找到每个频带的谱斜率,其中频谱斜率可以表示为
{
S
x
(
k
)
=
C
x
(
k
+
1
)
−
C
x
(
k
)
S
ˉ
x
^
(
k
)
=
C
ˉ
x
^
(
k
+
1
)
−
C
ˉ
x
^
(
k
)
\left\{ \begin{aligned} & {{S}_{x}}\left( k \right)={{C}_{x}}\left( k+1 \right)-{{C}_{x}}\left( k \right) \\ & {{{\bar{S}}}_{{\hat{x}}}}\left( k \right)={{{\bar{C}}}_{{\hat{x}}}}\left( k+1 \right)-{{{\bar{C}}}_{{\hat{x}}}}\left( k \right) \\ \end{aligned} \right.
{Sx(k)=Cx(k+1)−Cx(k)Sˉx^(k)=Cˉx^(k+1)−Cˉx^(k)
其中,
S
x
(
k
)
{{S}_{x}}\left( k \right)
Sx(k)和
S
ˉ
x
^
(
k
)
{{\bar{S}}_{{\hat{x}}}}\left( k \right)
Sˉx^(k)分别表示第
k
k
k个频带的干净语音和增强语音的频谱斜率,
C
x
(
k
)
{{C}_{x}}\left( k \right)
Cx(k)和
C
x
^
(
k
)
{{C}_{{\hat{x}}}}\left( k \right)
Cx^(k)分别表示干净语音和增强语音的关键频谱。根据其是否靠近谱峰或者谱谷以及峰值是否对应频谱的最大峰值,对频谱斜率之间的差值进行加权,其中第
k
k
k个频带的权重可以表示为
W
(
k
)
=
K
max
[
K
max
+
C
max
−
C
x
(
k
)
]
K
l
o
c
max
[
K
l
o
c
max
+
C
l
o
c
max
−
C
x
(
k
)
]
W\left( k \right)=\frac{{{K}_{\max }}}{\left[ {{K}_{\max }}+{{C}_{\max }}-{{C}_{x}}\left( k \right) \right]}\frac{{{K}_{loc\max }}}{\left[ {{K}_{loc\max }}+{{C}_{loc\max }}-{{C}_{x}}\left( k \right) \right]}
W(k)=[Kmax+Cmax−Cx(k)]Kmax[Klocmax+Clocmax−Cx(k)]Klocmax
其中,
C
max
{{C}_{\max }}
Cmax表示各个频带中最大的对数谱幅度,
C
l
o
c
max
{{C}_{loc\max }}
Clocmax表示最靠近第
k
k
k个频带的普峰值,
K
max
{{K}_{\max }}
Kmax和
K
l
o
c
max
{{K}_{loc\max }}
Klocmax均为常量,可以通过调整使得主观听音测试结果与客观测度之间的相关性最大。一般,
K
max
=
20
{{K}_{\max }}=20
Kmax=20且
K
l
o
c
max
=
1
{{K}_{loc\max }}=1
Klocmax=1时,具有高的相关性。
加权谱斜率(Weighted Spectral Slope, WSS)测度最终可以表示为
d
W
S
S
=
(
C
x
,
C
ˉ
x
)
=
∑
k
=
1
36
W
(
k
)
(
S
x
(
k
)
−
S
ˉ
x
^
(
k
)
)
2
{{d}_{WSS}}=\left( {{C}_{x}},{{{\bar{C}}}_{x}} \right)=\sum\limits_{k=1}^{36}{W\left( k \right){{\left( {{S}_{x}}\left( k \right)-{{{\bar{S}}}_{{\hat{x}}}}\left( k \right) \right)}^{2}}}
dWSS=(Cx,Cˉx)=k=1∑36W(k)(Sx(k)−Sˉx^(k))2
平均的WSS测度则是由所有帧的WSS平均得到,这里就不再叙述。
代码如下:
function wss_dist= comp_wss(cleanFile, enhancedFile);
% ----------------------------------------------------------------------
%
% Weighted Spectral Slope (WSS) Objective Speech Quality Measure
%
% This function implements the Weighted Spectral Slope (WSS)
% distance measure originally proposed in [1]. The algorithm
% works by first decomposing the speech signal into a set of
% frequency bands (this is done for both the test and reference
% frame). The intensities within each critical band are
% measured. Then, a weighted distances between the measured
% slopes of the log-critical band spectra are computed.
% This measure is also described in Section 2.2.9 (pages 56-58)
% of [2].
%
% Whereas Klatt's original measure used 36 critical-band
% filters to estimate the smoothed short-time spectrum, this
% implementation considers a bank of 25 filters spanning
% the 4 kHz bandwidth.
%
% Usage: wss_dist=comp_wss(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% wss_dist - computed spectral slope distance
%
% Example call: ws =comp_wss('sp04.wav','enhanced.wav')
%
% References:
%
% [1] D. H. Klatt, "Prediction of Perceived Phonetic Distance
% from Critical-Band Spectra: A First Step", Proc. IEEE
% ICASSP'82, Volume 2, pp. 1278-1281, May, 1982.
%
% [2] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements,
% Objective Measures of Speech Quality. Prentice Hall
% Advanced Reference Series, Englewood Cliffs, NJ, 1988,
% ISBN: 0-13-629056-6.
%
% Authors: Bryan L. Pellom and John H. L. Hansen (July 1998)
% Modified by: Philipos C. Loizou (Oct 2006)
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
%
% ----------------------------------------------------------------------
if nargin~=2
fprintf('USAGE: WSS=comp_wss(cleanFile.wav, enhancedFile.wav)\n');
fprintf('For more help, type: help comp_wss\n\n');
return;
end
alpha= 0.95;
[data1, fs1]= audioread(cleanFile);
[data2, fs2]= audioread(enhancedFile);
if ( fs1~= fs2)
error( 'The two files do not match!\n');
end
len= min( length( data1), length( data2));
data1= data1( 1: len)+eps;
data2= data2( 1: len)+eps;
wss_dist_vec= wss( data1, data2,fs1);
wss_dist_vec= sort( wss_dist_vec);
wss_dist= mean( wss_dist_vec( 1: round( length( wss_dist_vec)*alpha)));
function distortion = wss(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Files musthave same length.');
return
end
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); % window length in samples
skiprate = floor(winlength/4); % window skip in samples
max_freq = sample_rate/2; % maximum bandwidth
num_crit = 25; % number of critical bands
USE_FFT_SPECTRUM = 1; % defaults to 10th order LP spectrum
n_fft = 2^nextpow2(2*winlength);
n_fftby2 = n_fft/2; % FFT size/2
Kmax = 20; % value suggested by Klatt, pg 1280
Klocmax = 1; % value suggested by Klatt, pg 1280
% ----------------------------------------------------------------------
% Critical Band Filter Definitions (Center Frequency and Bandwidths in Hz)
% ----------------------------------------------------------------------
cent_freq(1) = 50.0000; bandwidth(1) = 70.0000;
cent_freq(2) = 120.000; bandwidth(2) = 70.0000;
cent_freq(3) = 190.000; bandwidth(3) = 70.0000;
cent_freq(4) = 260.000; bandwidth(4) = 70.0000;
cent_freq(5) = 330.000; bandwidth(5) = 70.0000;
cent_freq(6) = 400.000; bandwidth(6) = 70.0000;
cent_freq(7) = 470.000; bandwidth(7) = 70.0000;
cent_freq(8) = 540.000; bandwidth(8) = 77.3724;
cent_freq(9) = 617.372; bandwidth(9) = 86.0056;
cent_freq(10) = 703.378; bandwidth(10) = 95.3398;
cent_freq(11) = 798.717; bandwidth(11) = 105.411;
cent_freq(12) = 904.128; bandwidth(12) = 116.256;
cent_freq(13) = 1020.38; bandwidth(13) = 127.914;
cent_freq(14) = 1148.30; bandwidth(14) = 140.423;
cent_freq(15) = 1288.72; bandwidth(15) = 153.823;
cent_freq(16) = 1442.54; bandwidth(16) = 168.154;
cent_freq(17) = 1610.70; bandwidth(17) = 183.457;
cent_freq(18) = 1794.16; bandwidth(18) = 199.776;
cent_freq(19) = 1993.93; bandwidth(19) = 217.153;
cent_freq(20) = 2211.08; bandwidth(20) = 235.631;
cent_freq(21) = 2446.71; bandwidth(21) = 255.255;
cent_freq(22) = 2701.97; bandwidth(22) = 276.072;
cent_freq(23) = 2978.04; bandwidth(23) = 298.126;
cent_freq(24) = 3276.17; bandwidth(24) = 321.465;
cent_freq(25) = 3597.63; bandwidth(25) = 346.136;
bw_min = bandwidth (1); % minimum critical bandwidth
% ----------------------------------------------------------------------
% Set up the critical band filters. Note here that Gaussianly shaped
% filters are used. Also, the sum of the filter weights are equivalent
% for each critical band filter. Filter less than -30 dB and set to
% zero.
% ----------------------------------------------------------------------
min_factor = exp (-30.0 / (2.0 * 2.303)); % -30 dB point of filter
for i = 1:num_crit
f0 = (cent_freq (i) / max_freq) * (n_fftby2);
all_f0(i) = floor(f0);
bw = (bandwidth (i) / max_freq) * (n_fftby2);
norm_factor = log(bw_min) - log(bandwidth(i));
j = 0:1:n_fftby2-1;
crit_filter(i,:) = exp (-11 *(((j - floor(f0)) ./bw).^2) + norm_factor);
crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Weighted Spectral
% Slope Measure
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Compute the Power Spectrum of Clean and Processed
% ----------------------------------------------------------
if (USE_FFT_SPECTRUM)
clean_spec = (abs(fft(clean_frame,n_fft)).^2);
processed_spec = (abs(fft(processed_frame,n_fft)).^2);
else
a_vec = zeros(1,n_fft);
a_vec(1:11) = lpc(clean_frame,10);
clean_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
a_vec = zeros(1,n_fft);
a_vec(1:11) = lpc(processed_frame,10);
processed_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
end
% ----------------------------------------------------------
% (3) Compute Filterbank Output Energies (in dB scale)
% ----------------------------------------------------------
for i = 1:num_crit
clean_energy(i) = sum(clean_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
processed_energy(i) = sum(processed_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
end
clean_energy = 10*log10(max(clean_energy,1E-10));
processed_energy = 10*log10(max(processed_energy,1E-10));
% ----------------------------------------------------------
% (4) Compute Spectral Slope (dB[i+1]-dB[i])
% ----------------------------------------------------------
clean_slope = clean_energy(2:num_crit) - ...
clean_energy(1:num_crit-1);
processed_slope = processed_energy(2:num_crit) - ...
processed_energy(1:num_crit-1);
% ----------------------------------------------------------
% (5) Find the nearest peak locations in the spectra to
% each critical band. If the slope is negative, we
% search to the left. If positive, we search to the
% right.
% ----------------------------------------------------------
for i = 1:num_crit-1
% find the peaks in the clean speech signal
if (clean_slope(i)>0) % search to the right
n = i;
while ((n<num_crit) & (clean_slope(n) > 0))
n = n+1;
end
clean_loc_peak(i) = clean_energy(n-1);
else % search to the left
n = i;
while ((n>0) & (clean_slope(n) <= 0))
n = n-1;
end
clean_loc_peak(i) = clean_energy(n+1);
end
% find the peaks in the processed speech signal
if (processed_slope(i)>0) % search to the right
n = i;
while ((n<num_crit) & (processed_slope(n) > 0))
n = n+1;
end
processed_loc_peak(i) = processed_energy(n-1);
else % search to the left
n = i;
while ((n>0) & (processed_slope(n) <= 0))
n = n-1;
end
processed_loc_peak(i) = processed_energy(n+1);
end
end
% ----------------------------------------------------------
% (6) Compute the WSS Measure for this frame. This
% includes determination of the weighting function.
% ----------------------------------------------------------
dBMax_clean = max(clean_energy);
dBMax_processed = max(processed_energy);
% The weights are calculated by averaging individual
% weighting factors from the clean and processed frame.
% These weights W_clean and W_processed should range
% from 0 to 1 and place more emphasis on spectral
% peaks and less emphasis on slope differences in spectral
% valleys. This procedure is described on page 1280 of
% Klatt's 1982 ICASSP paper.
Wmax_clean = Kmax ./ (Kmax + dBMax_clean - ...
clean_energy(1:num_crit-1));
Wlocmax_clean = Klocmax ./ ( Klocmax + clean_loc_peak - ...
clean_energy(1:num_crit-1));
W_clean = Wmax_clean .* Wlocmax_clean;
Wmax_processed = Kmax ./ (Kmax + dBMax_processed - ...
processed_energy(1:num_crit-1));
Wlocmax_processed = Klocmax ./ ( Klocmax + processed_loc_peak - ...
processed_energy(1:num_crit-1));
W_processed = Wmax_processed .* Wlocmax_processed;
W = (W_clean + W_processed)./2.0;
distortion(frame_count) = sum(W.*(clean_slope(1:num_crit-1) - ...
processed_slope(1:num_crit-1)).^2);
% this normalization is not part of Klatt's paper, but helps
% to normalize the measure. Here we scale the measure by the
% sum of the weights.
distortion(frame_count) = distortion(frame_count)/sum(W);
start = start + skiprate;
end
复合测度
通过对客观评价指标线性组合后得到的复合测度,包括signal distortion(SIG),noise distortion(BAK),overall quality (OVRL),计算方式如下:
C
s
i
g
=
3.093
−
1.029
⋅
LLR
+
0.603
⋅
PESQ
−
0.009
⋅
WSS
{{C}_{sig}}=3.093-1.029\cdot \text{LLR}+0.603\cdot \text{PESQ}-0.009\cdot \text{WSS}
Csig=3.093−1.029⋅LLR+0.603⋅PESQ−0.009⋅WSS
C
b
a
k
=
1.634
+
0.478
⋅
PESQ-0.007
⋅
WSS+0.063
⋅
segSNR
{{C}_{bak}}=1.634+0.478\cdot \text{PESQ-0}\text{.007}\cdot \text{WSS+0}\text{.063}\cdot \text{segSNR}
Cbak=1.634+0.478⋅PESQ-0.007⋅WSS+0.063⋅segSNR
C
o
v
l
=
1.594
+
0.805
⋅
PESQ
−
0.512
⋅
LLR
−
0.007
⋅
WSS
{{C}_{ovl}}=1.594+0.805\cdot \text{PESQ}-0.512\cdot \text{LLR}-0.007\cdot \text{WSS}
Covl=1.594+0.805⋅PESQ−0.512⋅LLR−0.007⋅WSS
代码如下:
function [Csig,Cbak,Covl]= composite(cleanFile, enhancedFile);
% ----------------------------------------------------------------------
% Composite Objective Speech Quality Measure
%
% This function implements the composite objective measure proposed in
% [1].
%
% Usage: [sig,bak,ovl]=composite(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% sig - predicted rating [1-5] of speech distortion
% bak - predicted rating [1-5] of noise distortion
% ovl - predicted rating [1-5] of overall quality
%
% In addition to the above ratings (sig, bak, & ovl) it returns
% the individual values of the LLR, SNRseg, WSS and PESQ measures.
%
% Example call: [sig,bak,ovl] =composite('sp04.wav','enhanced.wav')
%
%
% References:
%
% [1] Hu, Y. and Loizou, P. (2006). Evaluation of objective measures
% for speech enhancement. Proc. Interspeech, Pittsburg, PA.
%
% Authors: Yi Hu and Philipos C. Loizou
% (the LLR, SNRseg and WSS measures were based on Bryan Pellom and John
% Hansen's implementations)
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
% ----------------------------------------------------------------------
if nargin~=2
fprintf('USAGE: [sig,bak,ovl]=composite(cleanFile.wav, enhancedFile.wav)\n');
fprintf('For more help, type: help composite\n\n');
return;
end
alpha= 0.95;
[data1, fs1]= audioread(cleanFile);
[data2, fs2]= audioread(enhancedFile);
if ( fs1~= fs2)
error( 'The two files do not match!\n');
end
len= min( length( data1), length( data2));
data1= data1( 1: len)+eps;
data2= data2( 1: len)+eps;
% -- compute the WSS measure ---
%
wss_dist_vec= wss( data1, data2,fs1);
wss_dist_vec= sort( wss_dist_vec);
wss_dist= mean( wss_dist_vec( 1: round( length( wss_dist_vec)*alpha)));
% --- compute the LLR measure ---------
%
LLR_dist= llr( data1, data2,fs1);
LLRs= sort(LLR_dist);
LLR_len= round( length(LLR_dist)* alpha);
llr_mean= mean( LLRs( 1: LLR_len));
% --- compute the SNRseg ----------------
%
[snr_dist, segsnr_dist]= snr( data1, data2,fs1);
snr_mean= snr_dist;
segSNR= mean( segsnr_dist);
% -- compute the pesq ----
%
% if Srate1==8000, mode='nb';
% elseif Srate1 == 16000, mode='wb';
% else,
% error ('Sampling freq in PESQ needs to be 8 kHz or 16 kHz');
% end
[pesq_mos_scores]= pesq(cleanFile, enhancedFile);
if length(pesq_mos_scores)==2
pesq_mos=pesq_mos_scores(1); % take the raw PESQ value instead of the
% MOS-mapped value (this composite
% measure was only validated with the raw
% PESQ value)
else
pesq_mos=pesq_mos_scores;
end
% --- now compute the composite measures ------------------
%
Csig = 3.093 - 1.029*llr_mean + 0.603*pesq_mos-0.009*wss_dist;
Csig = max(1,Csig); Csig=min(5, Csig); % limit values to [1, 5]
Cbak = 1.634 + 0.478 *pesq_mos - 0.007*wss_dist + 0.063*segSNR;
Cbak = max(1, Cbak); Cbak=min(5,Cbak); % limit values to [1, 5]
Covl = 1.594 + 0.805*pesq_mos - 0.512*llr_mean - 0.007*wss_dist;
Covl = max(1, Covl); Covl=min(5, Covl); % limit values to [1, 5]
fprintf('\n LLR=%f SNRseg=%f WSS=%f PESQ=%f\n',llr_mean,segSNR,wss_dist,pesq_mos);
return; %=================================================================
function distortion = wss(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Files musthave same length.');
return
end
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); %240; % window length in samples
skiprate = floor(winlength/4); % window skip in samples
max_freq = sample_rate/2; % maximum bandwidth
num_crit = 25; % number of critical bands
USE_FFT_SPECTRUM = 1; % defaults to 10th order LP spectrum
n_fft = 2^nextpow2(2*winlength);
n_fftby2 = n_fft/2; % FFT size/2
Kmax = 20; % value suggested by Klatt, pg 1280
Klocmax = 1; % value suggested by Klatt, pg 1280
% ----------------------------------------------------------------------
% Critical Band Filter Definitions (Center Frequency and Bandwidths in Hz)
% ----------------------------------------------------------------------
cent_freq(1) = 50.0000; bandwidth(1) = 70.0000;
cent_freq(2) = 120.000; bandwidth(2) = 70.0000;
cent_freq(3) = 190.000; bandwidth(3) = 70.0000;
cent_freq(4) = 260.000; bandwidth(4) = 70.0000;
cent_freq(5) = 330.000; bandwidth(5) = 70.0000;
cent_freq(6) = 400.000; bandwidth(6) = 70.0000;
cent_freq(7) = 470.000; bandwidth(7) = 70.0000;
cent_freq(8) = 540.000; bandwidth(8) = 77.3724;
cent_freq(9) = 617.372; bandwidth(9) = 86.0056;
cent_freq(10) = 703.378; bandwidth(10) = 95.3398;
cent_freq(11) = 798.717; bandwidth(11) = 105.411;
cent_freq(12) = 904.128; bandwidth(12) = 116.256;
cent_freq(13) = 1020.38; bandwidth(13) = 127.914;
cent_freq(14) = 1148.30; bandwidth(14) = 140.423;
cent_freq(15) = 1288.72; bandwidth(15) = 153.823;
cent_freq(16) = 1442.54; bandwidth(16) = 168.154;
cent_freq(17) = 1610.70; bandwidth(17) = 183.457;
cent_freq(18) = 1794.16; bandwidth(18) = 199.776;
cent_freq(19) = 1993.93; bandwidth(19) = 217.153;
cent_freq(20) = 2211.08; bandwidth(20) = 235.631;
cent_freq(21) = 2446.71; bandwidth(21) = 255.255;
cent_freq(22) = 2701.97; bandwidth(22) = 276.072;
cent_freq(23) = 2978.04; bandwidth(23) = 298.126;
cent_freq(24) = 3276.17; bandwidth(24) = 321.465;
cent_freq(25) = 3597.63; bandwidth(25) = 346.136;
bw_min = bandwidth (1); % minimum critical bandwidth
% ----------------------------------------------------------------------
% Set up the critical band filters. Note here that Gaussianly shaped
% filters are used. Also, the sum of the filter weights are equivalent
% for each critical band filter. Filter less than -30 dB and set to
% zero.
% ----------------------------------------------------------------------
min_factor = exp (-30.0 / (2.0 * 2.303)); % -30 dB point of filter
for i = 1:num_crit
f0 = (cent_freq (i) / max_freq) * (n_fftby2);
all_f0(i) = floor(f0);
bw = (bandwidth (i) / max_freq) * (n_fftby2);
norm_factor = log(bw_min) - log(bandwidth(i));
j = 0:1:n_fftby2-1;
crit_filter(i,:) = exp (-11 *(((j - floor(f0)) ./bw).^2) + norm_factor);
crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Weighted Spectral
% Slope Measure
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Compute the Power Spectrum of Clean and Processed
% ----------------------------------------------------------
if (USE_FFT_SPECTRUM)
clean_spec = (abs(fft(clean_frame,n_fft)).^2);
processed_spec = (abs(fft(processed_frame,n_fft)).^2);
else
a_vec = zeros(1,n_fft);
a_vec(1:11) = lpc(clean_frame,10);
clean_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
a_vec = zeros(1,n_fft);
a_vec(1:11) = lpc(processed_frame,10);
processed_spec = 1.0/(abs(fft(a_vec,n_fft)).^2)';
end
% ----------------------------------------------------------
% (3) Compute Filterbank Output Energies (in dB scale)
% ----------------------------------------------------------
for i = 1:num_crit
clean_energy(i) = sum(clean_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
processed_energy(i) = sum(processed_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
end
clean_energy = 10*log10(max(clean_energy,1E-10));
processed_energy = 10*log10(max(processed_energy,1E-10));
% ----------------------------------------------------------
% (4) Compute Spectral Slope (dB[i+1]-dB[i])
% ----------------------------------------------------------
clean_slope = clean_energy(2:num_crit) - ...
clean_energy(1:num_crit-1);
processed_slope = processed_energy(2:num_crit) - ...
processed_energy(1:num_crit-1);
% ----------------------------------------------------------
% (5) Find the nearest peak locations in the spectra to
% each critical band. If the slope is negative, we
% search to the left. If positive, we search to the
% right.
% ----------------------------------------------------------
for i = 1:num_crit-1
% find the peaks in the clean speech signal
if (clean_slope(i)>0) % search to the right
n = i;
while ((n<num_crit) & (clean_slope(n) > 0))
n = n+1;
end
clean_loc_peak(i) = clean_energy(n-1);
else % search to the left
n = i;
while ((n>0) & (clean_slope(n) <= 0))
n = n-1;
end
clean_loc_peak(i) = clean_energy(n+1);
end
% find the peaks in the processed speech signal
if (processed_slope(i)>0) % search to the right
n = i;
while ((n<num_crit) & (processed_slope(n) > 0))
n = n+1;
end
processed_loc_peak(i) = processed_energy(n-1);
else % search to the left
n = i;
while ((n>0) & (processed_slope(n) <= 0))
n = n-1;
end
processed_loc_peak(i) = processed_energy(n+1);
end
end
% ----------------------------------------------------------
% (6) Compute the WSS Measure for this frame. This
% includes determination of the weighting function.
% ----------------------------------------------------------
dBMax_clean = max(clean_energy);
dBMax_processed = max(processed_energy);
% The weights are calculated by averaging individual
% weighting factors from the clean and processed frame.
% These weights W_clean and W_processed should range
% from 0 to 1 and place more emphasis on spectral
% peaks and less emphasis on slope differences in spectral
% valleys. This procedure is described on page 1280 of
% Klatt's 1982 ICASSP paper.
Wmax_clean = Kmax ./ (Kmax + dBMax_clean - ...
clean_energy(1:num_crit-1));
Wlocmax_clean = Klocmax ./ ( Klocmax + clean_loc_peak - ...
clean_energy(1:num_crit-1));
W_clean = Wmax_clean .* Wlocmax_clean;
Wmax_processed = Kmax ./ (Kmax + dBMax_processed - ...
processed_energy(1:num_crit-1));
Wlocmax_processed = Klocmax ./ ( Klocmax + processed_loc_peak - ...
processed_energy(1:num_crit-1));
W_processed = Wmax_processed .* Wlocmax_processed;
W = (W_clean + W_processed)./2.0;
distortion(frame_count) = sum(W.*(clean_slope(1:num_crit-1) - ...
processed_slope(1:num_crit-1)).^2);
% this normalization is not part of Klatt's paper, but helps
% to normalize the measure. Here we scale the measure by the
% sum of the weights.
distortion(frame_count) = distortion(frame_count)/sum(W);
start = start + skiprate;
end
%-----------------------------------------------
function distortion = llr(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); % window length in samples
skiprate = floor(winlength/4); % window skip in samples
if sample_rate<10000
P = 10; % LPC Analysis Order
else
P=16; % this could vary depending on sampling frequency.
end
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Log Likelihood Ratio
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Get the autocorrelation lags and LPC parameters used
% to compute the LLR measure.
% ----------------------------------------------------------
[R_clean, Ref_clean, A_clean] = ...
lpcoeff(clean_frame, P);
[R_processed, Ref_processed, A_processed] = ...
lpcoeff(processed_frame, P);
% ----------------------------------------------------------
% (3) Compute the LLR measure
% ----------------------------------------------------------
numerator = A_processed*toeplitz(R_clean)*A_processed';
denominator = A_clean*toeplitz(R_clean)*A_clean';
distortion(frame_count) = log(numerator/denominator);
start = start + skiprate;
end
%---------------------------------------------
function [acorr, refcoeff, lpparams] = lpcoeff(speech_frame, model_order)
% ----------------------------------------------------------
% (1) Compute Autocorrelation Lags
% ----------------------------------------------------------
winlength = max(size(speech_frame));
for k=1:model_order+1
R(k) = sum(speech_frame(1:winlength-k+1) ...
.*speech_frame(k:winlength));
end
% ----------------------------------------------------------
% (2) Levinson-Durbin
% ----------------------------------------------------------
a = ones(1,model_order);
E(1)=R(1);
for i=1:model_order
a_past(1:i-1) = a(1:i-1);
sum_term = sum(a_past(1:i-1).*R(i:-1:2));
rcoeff(i)=(R(i+1) - sum_term) / E(i);
a(i)=rcoeff(i);
a(1:i-1) = a_past(1:i-1) - rcoeff(i).*a_past(i-1:-1:1);
E(i+1)=(1-rcoeff(i)*rcoeff(i))*E(i);
end
acorr = R;
refcoeff = rcoeff;
lpparams = [1 -a];
% ----------------------------------------------------------------------
function [overall_snr, segmental_snr] = snr(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Scale both clean speech and processed speech to have same dynamic
% range. Also remove DC component from each signal
% ----------------------------------------------------------------------
%clean_speech = clean_speech - mean(clean_speech);
%processed_speech = processed_speech - mean(processed_speech);
%processed_speech = processed_speech.*(max(abs(clean_speech))/ max(abs(processed_speech)));
overall_snr = 10* log10( sum(clean_speech.^2)/sum((clean_speech-processed_speech).^2));
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); %240; % window length in samples
skiprate = floor(winlength/4); % window skip in samples
MIN_SNR = -10; % minimum SNR in dB
MAX_SNR = 35; % maximum SNR in dB
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Segmental SNR
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1: num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Compute the Segmental SNR
% ----------------------------------------------------------
signal_energy = sum(clean_frame.^2);
noise_energy = sum((clean_frame-processed_frame).^2);
segmental_snr(frame_count) = 10*log10(signal_energy/(noise_energy+eps)+eps);
segmental_snr(frame_count) = max(segmental_snr(frame_count),MIN_SNR);
segmental_snr(frame_count) = min(segmental_snr(frame_count),MAX_SNR);
start = start + skiprate;
end
参考文献
[1] Quackenbush S R . Objective measures of speech quality. Georgia Institute of Technology, 1995.
[2] B.-H. Juang, “On Using the Itakura-Saito Measures for Speech Coder Performance Evaluation”, AT&T BellLaboratories Technical Journal, Vol. 63, No. 8 October 1984, pp. 1477-1498.
[3] N. Kitawaki, H. Nagabuchi and K. Itoh, “Objective quality evaluation for low-bit-rate speech coding systems,” in IEEE Journal on Selected Areas in Communications, vol. 6, no. 2, pp. 242-248, Feb. 1988, doi: 10.1109/49.601.
[4] D. Klatt, “Prediction of perceived phonetic distance from critical-band spectra: A first step,” ICASSP '82. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1982, pp. 1278-1281, doi: 10.1109/ICASSP.1982.1171512.
[5] Hu Y, Loizou P C. Evaluation of objective measures for speech enhancement[C]//Ninth international conference on spoken language processing. 2006.
[6] Loizou P C. Speech enhancement: theory and practice[M]. CRC press, 2007.