主观指标
MOS
最常用和相对简单的主观质量指标是分级判断方法,采取5个级别对被测语音的质量进行评价。待测语音的质量是在所有试听人员的评分上求平均得到的。这种方式被称作平均意见得分(Mean Opinion Score, MOS)。下表给出了语音评价的等级分级。
评分 | 语音质量 | 失真程度 |
---|---|---|
5 | 非常好(Excellent) | 不可察觉(Imperceptible) |
4 | 好(Good) | 略可察觉(Not annoying) |
3 | 一般(Fair) | 可察觉(Slightly annoying) |
2 | 差(Poor) | 尚可接受(Not objectionable) |
1 | 很差(Bad) | 难以接受(objectionable) |
MOS评分有两个阶段。训练阶段,听者需要听一系列参考信号,保证大家对质量评级的标准尽可能一致;评估阶段,试听人员对所听到的信号,进行主观打分。
客观指标
信噪比
SNR=10lo
g
10
∥
s
(
n
)
∥
2
∥
s
(
n
)
−
s
^
(
n
)
∥
2
\text{SNR=10lo}{{\text{g}}_{10}}\frac{{{\left\| \mathbf{s}\left( n \right) \right\|}^{2}}}{{{\left\| \mathbf{s}\left( n \right)-\mathbf{\hat{s}}\left( n \right) \right\|}^{2}}}
SNR=10log10∥s(n)−s^(n)∥2∥s(n)∥2
其中,
s
(
n
)
\mathbf{s}\left( n \right)
s(n)和
s
^
(
n
)
\mathbf{\hat{s}}\left( n \right)
s^(n)分别代表纯净语音和增强后的语音。
分段信噪比(时域)
SegSNR=
10
N
∑
n
=
1
N
log
10
(
∥
s
(
n
)
∥
2
∥
s
(
n
)
−
s
^
(
n
)
∥
2
)
\text{SegSNR=}\frac{10}{N}\sum\limits_{n=1}^{N}{{{\log }_{10}}\left( \frac{{{\left\| \mathbf{s}\left( n \right) \right\|}^{2}}}{{{\left\| \mathbf{s}\left( n \right)-\mathbf{\hat{s}}\left( n \right) \right\|}^{2}}} \right)}
SegSNR=N10n=1∑Nlog10(∥s(n)−s^(n)∥2∥s(n)∥2)
首先对语音信号进行分帧,
N
N
N为分帧的数目。需要注意,SegSNR是对所有语音帧求平均所得。该度量指标的使用,要先保证纯净语音和增强后的信号在时域上对齐,其与听者主观听觉感知有很高的关联。
代码如下:
function [snr_mean, segsnr_mean]= comp_snr(clean, enhanced, fs)
%
% Segmental Signal-to-Noise Ratio Objective Speech Quality Measure
%
% This function implements the segmental signal-to-noise ratio
% as defined in [1, p. 45] (see Equation 2.12).
%
% Usage: [SNRovl, SNRseg]=comp_snr(cleanFile.wav, enhancedFile.wav)
%
% cleanFile.wav - clean input file in .wav format
% enhancedFile - enhanced output file in .wav format
% SNRovl - overall SNR (dB)
% SNRseg - segmental SNR (dB)
%
% This function returns 2 parameters. The first item is the
% overall SNR for the two speech signals. The second value
% is the segmental signal-to-noise ratio (1 seg-snr per
% frame of input). The segmental SNR is clamped to range
% between 35dB and -10dB (see suggestions in [2]).
%
% Example call: [SNRovl,SNRseg]=comp_SNR('sp04.wav','enhanced.wav')
%
% References:
%
% [1] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements,
% Objective Measures of Speech Quality. Prentice Hall
% Advanced Reference Series, Englewood Cliffs, NJ, 1988,
% ISBN: 0-13-629056-6.
%
% [2] P. E. Papamichalis, Practical Approaches to Speech
% Coding, Prentice-Hall, Englewood Cliffs, NJ, 1987.
% ISBN: 0-13-689019-9. (see pages 179-181).
%
% Authors: Bryan L. Pellom and John H. L. Hansen (July 1998)
% Modified by: Philipos C. Loizou (Oct 2006)
%
% Copyright (c) 2006 by Philipos C. Loizou
% $Revision: 0.0 $ $Date: 10/09/2006 $
%-------------------------------------------------------------------------
% if nargin ~=2
% fprintf('USAGE: [snr_mean, segsnr_mean]= comp_SNR(cleanFile, enhdFile) \n');
% return;
% end
% [data1, Srate1, Nbits1]= audioread(cleanFile);
% [data2, Srate2, Nbits2]= audioread(enhdFile);
data1 = clean;
data2 = enhanced;
Srate1 = fs;
% if (( Srate1~= Srate2) | ( Nbits1~= Nbits2))
% error( 'The two files do not match!\n');
% end
len= min( length( data1), length( data2));
data1= data1( 1: len);
data2= data2( 1: len);
[snr_dist, segsnr_dist]= snr( data1, data2,Srate1);
snr_mean= snr_dist;
segsnr_mean= mean( segsnr_dist);
% =========================================================================
function [overall_snr, segmental_snr] = snr(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Both Speech Files must be same length.');
return
end
% ----------------------------------------------------------------------
% Scale both clean speech and processed speech to have same dynamic
% range. Also remove DC component from each signal
% ----------------------------------------------------------------------
% clean_speech = clean_speech - mean(clean_speech);
% processed_speech = processed_speech - mean(processed_speech);
%
% processed_speech = processed_speech.*(max(abs(clean_speech))/ max(abs(processed_speech)));
overall_snr = 10* log10( sum(clean_speech.^2)/sum((clean_speech-processed_speech).^2));
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
% winlength = round(30*sample_rate/1000); %240; % window length in samples for 30-msecs
% skiprate = floor(winlength/4); %60; % window skip in samples
winlength = round(25*sample_rate/1000); %240; % window length in samples for 30-msecs
skiprate = round(10*sample_rate/1000); %60; % window skip in samples
MIN_SNR = -10; % minimum SNR in dB
MAX_SNR = 35; % maximum SNR in dB
% ----------------------------------------------------------------------
% For each frame of input speech, calculate the Segmental SNR
% ----------------------------------------------------------------------
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1: num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Compute the Segmental SNR
% ----------------------------------------------------------
signal_energy = sum(clean_frame.^2);
noise_energy = sum((clean_frame-processed_frame).^2);
segmental_snr(frame_count) = 10*log10(signal_energy/(noise_energy+eps)+eps);
segmental_snr(frame_count) = max(segmental_snr(frame_count),MIN_SNR);
segmental_snr(frame_count) = min(segmental_snr(frame_count),MAX_SNR);
start = start + skiprate;
end
分段信噪比(频域)
fwSegSNR
=
10
N
∑
n
=
1
N
∑
k
=
1
K
W
(
k
,
n
)
log
10
(
∣
S
(
k
,
n
)
∣
2
/
(
∣
S
(
k
,
n
)
∣
−
∣
S
^
(
k
,
n
)
∣
)
2
)
∑
k
=
1
K
W
(
k
,
n
)
\text{fwSegSNR}=\frac{10}{N}\sum\limits_{n=1}^{N}{\frac{\sum\limits_{k=1}^{K}{W}(k,n){{\log }_{10}}\left( |S(k,n){{|}^{2}}/{{(|S(k,n)|-|\hat{S}(k,n)|)}^{2}} \right)}{\sum\limits_{k=1}^{K}{W}(k,n)}}
fwSegSNR=N10n=1∑Nk=1∑KW(k,n)k=1∑KW(k,n)log10(∣S(k,n)∣2/(∣S(k,n)∣−∣S^(k,n)∣)2)
其中,
S
(
k
,
n
)
S(k,n)
S(k,n)和
S
^
(
k
,
n
)
\hat{S}(k,n)
S^(k,n)分别对应STFT后得到的在
k
k
k频点和第
m
m
m帧的纯净语音和增强语音信号大小。
W
(
k
,
n
)
=
∣
S
(
k
,
n
)
∣
γ
W(k,n)\text{=}{{\left| S(k,n) \right|}^{\gamma }}
W(k,n)=∣S(k,n)∣γ对应频带的加权因子,
γ
\gamma
γ一般取0.2,
N
N
N表示分帧总数目。
代码如下:
function snr = fwsegsnr(clean_speech, processed_speech, fs)
[len, ~] = size(clean_speech);
if len==1
clean_speech = clean_speech.';
processed_speech = processed_speech.';
end
snr = mean(fwseg(clean_speech, processed_speech, fs));
end
function distortion = fwseg(clean_speech, processed_speech,sample_rate)
% ----------------------------------------------------------------------
% Check the length of the clean and processed speech. Must be the same.
% ----------------------------------------------------------------------
clean_length = length(clean_speech);
processed_length = length(processed_speech);
if (clean_length ~= processed_length)
disp('Error: Files must have same length.');
return
end
% ----------------------------------------------------------------------
% Global Variables
% ----------------------------------------------------------------------
winlength = round(30*sample_rate/1000); % window length in samples
skiprate = floor(winlength/4); % window skip in samples
max_freq = sample_rate/2; % maximum bandwidth
num_crit = 25; % number of critical bands
USE_25=1;
n_fft = 2^nextpow2(2*winlength);
n_fftby2 = n_fft/2; % FFT size/2
gamma=0.2; % power exponent
% ----------------------------------------------------------------------
% Critical Band Filter Definitions (Center Frequency and Bandwidths in Hz)
% ----------------------------------------------------------------------
cent_freq(1) = 50.0000; bandwidth(1) = 70.0000;
cent_freq(2) = 120.000; bandwidth(2) = 70.0000;
cent_freq(3) = 190.000; bandwidth(3) = 70.0000;
cent_freq(4) = 260.000; bandwidth(4) = 70.0000;
cent_freq(5) = 330.000; bandwidth(5) = 70.0000;
cent_freq(6) = 400.000; bandwidth(6) = 70.0000;
cent_freq(7) = 470.000; bandwidth(7) = 70.0000;
cent_freq(8) = 540.000; bandwidth(8) = 77.3724;
cent_freq(9) = 617.372; bandwidth(9) = 86.0056;
cent_freq(10) = 703.378; bandwidth(10) = 95.3398;
cent_freq(11) = 798.717; bandwidth(11) = 105.411;
cent_freq(12) = 904.128; bandwidth(12) = 116.256;
cent_freq(13) = 1020.38; bandwidth(13) = 127.914;
cent_freq(14) = 1148.30; bandwidth(14) = 140.423;
cent_freq(15) = 1288.72; bandwidth(15) = 153.823;
cent_freq(16) = 1442.54; bandwidth(16) = 168.154;
cent_freq(17) = 1610.70; bandwidth(17) = 183.457;
cent_freq(18) = 1794.16; bandwidth(18) = 199.776;
cent_freq(19) = 1993.93; bandwidth(19) = 217.153;
cent_freq(20) = 2211.08; bandwidth(20) = 235.631;
cent_freq(21) = 2446.71; bandwidth(21) = 255.255;
cent_freq(22) = 2701.97; bandwidth(22) = 276.072;
cent_freq(23) = 2978.04; bandwidth(23) = 298.126;
cent_freq(24) = 3276.17; bandwidth(24) = 321.465;
cent_freq(25) = 3597.63; bandwidth(25) = 346.136;
W=[ % articulation index weights
0.003
0.003
0.003
0.007
0.010
0.016
0.016
0.017
0.017
0.022
0.027
0.028
0.030
0.032
0.034
0.035
0.037
0.036
0.036
0.033
0.030
0.029
0.027
0.026
0.026];
W=W';
if USE_25==0 % use 13 bands
% ----- lump adjacent filters together ----------------
k=2;
cent_freq2(1)=cent_freq(1);
bandwidth2(1)=bandwidth(1)+bandwidth(2);
W2(1)=W(1);
for i=2:13
cent_freq2(i)=cent_freq2(i-1)+bandwidth2(i-1);
bandwidth2(i)=bandwidth(k)+bandwidth(k+1);
W2(i)=0.5*(W(k)+W(k+1));
k=k+2;
end
sumW=sum(W2);
bw_min = bandwidth2 (1); % minimum critical bandwidth
else
sumW=sum(W);
bw_min=bandwidth(1);
end
% ----------------------------------------------------------------------
% Set up the critical band filters. Note here that Gaussianly shaped
% filters are used. Also, the sum of the filter weights are equivalent
% for each critical band filter. Filter less than -30 dB and set to
% zero.
% ----------------------------------------------------------------------
min_factor = exp (-30.0 / (2.0 * 2.303)); % -30 dB point of filter
if USE_25==0
num_crit=length(cent_freq2);
for i = 1:num_crit
f0 = (cent_freq2 (i) / max_freq) * (n_fftby2);
all_f0(i) = floor(f0);
bw = (bandwidth2 (i) / max_freq) * (n_fftby2);
norm_factor = log(bw_min) - log(bandwidth2(i));
j = 0:1:n_fftby2-1;
crit_filter(i,:) = exp (-11 *(((j - floor(f0)) ./bw).^2) + norm_factor);
crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end
else
for i = 1:num_crit
f0 = (cent_freq (i) / max_freq) * (n_fftby2);
all_f0(i) = floor(f0);
bw = (bandwidth (i) / max_freq) * (n_fftby2);
norm_factor = log(bw_min) - log(bandwidth(i));
j = 0:1:n_fftby2-1;
crit_filter(i,:) = exp (-11 *(((j - floor(f0)) ./bw).^2) + norm_factor);
crit_filter(i,:) = crit_filter(i,:).*(crit_filter(i,:) > min_factor);
end
end
num_frames = clean_length/skiprate-(winlength/skiprate); % number of frames
start = 1; % starting sample
window = 0.5*(1 - cos(2*pi*(1:winlength)'/(winlength+1)));
for frame_count = 1:num_frames
% ----------------------------------------------------------
% (1) Get the Frames for the test and reference speech.
% Multiply by Hanning Window.
% ----------------------------------------------------------
clean_frame = clean_speech(start:start+winlength-1);
processed_frame = processed_speech(start:start+winlength-1);
clean_frame = clean_frame.*window;
processed_frame = processed_frame.*window;
% ----------------------------------------------------------
% (2) Compute the magnitude Spectrum of Clean and Processed
% ----------------------------------------------------------
clean_spec = abs(fft(clean_frame,n_fft));
processed_spec = abs(fft(processed_frame,n_fft));
% normalize spectra to have area of one
%
clean_spec=clean_spec/sum(clean_spec(1:n_fftby2));
processed_spec=processed_spec/sum(processed_spec(1:n_fftby2));
% ----------------------------------------------------------
% (3) Compute Filterbank Output Energies
% ----------------------------------------------------------
clean_energy=zeros(1,num_crit);
processed_energy=zeros(1,num_crit);
error_energy=zeros(1,num_crit);
W_freq=zeros(1,num_crit);
for i = 1:num_crit
clean_energy(i) = sum(clean_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
processed_energy(i) = sum(processed_spec(1:n_fftby2) ...
.*crit_filter(i,:)');
error_energy(i)=max((clean_energy(i)-processed_energy(i))^2,eps);
W_freq(i)=(clean_energy(i))^gamma;
end
SNRlog=10*log10((clean_energy.^2)./error_energy);
fwSNR=sum(W_freq.*SNRlog)/sum(W_freq);
distortion(frame_count)=min(max(fwSNR,-10),35);
start = start + skiprate;
end
end
PESQ
感知语音质量评价(Perceptual Evaluation of Speech Quality, PESQ)是评价语音质量最常用的指标之一,近似于MOS 值,但它的计算过程要复杂得多,包含了预处理、时间对齐、感知滤波、掩蔽效果等等,取值范围也改为−0.5 ∼ 4.5,PESQ 值越高则表明被测试的语音具有越好的听觉语音质量。
代码较长,请参考PESQ
也可以参考语音质量的评价指标介绍及MATLAB实现(二)这篇文章中的复合测度代码中的PESQ实现。
STOI
短时客观可懂度(Short-Time Objective Intelligibility, STOI)是衡量语音可懂度的重要指标之一。由于对于语音信号中的一个单词,只有能被听懂和不能被听懂两种情况,从这个角度可以认为可懂度是二值的,所以STOI 的取值范围也被量化在了0 ∼ 1 中,代表单词被正确理解的百分比,数值取1 时表示语音能够被充分理解。与PESQ 相比,STOI 的计算过程要相对简单得多,主要包括5个步骤
1)移除静音区:由于静音区没有需要被理解的语言内容,因此在计算STOI 指标前先移除掉信号的静音区部分;
2)STFT 变换:无论是待测试语音还是原始干净语音都需要进行STFT 变换以得到时频域特征,这样做的原因是时频域特征与人耳听觉系统中的语音特征有一定的相似之处;
3)三分之一倍频分析:该步骤主要是对时频点进行划分,以干净语音信号的时频点
S
(
k
,
n
)
S\left( k,n \right)
S(k,n)为例可得:
S
j
(
n
)
=
∑
k
∈
C
B
j
∣
S
(
k
,
n
)
∣
2
{{S}_{j}}\left( n \right)=\sqrt{\sum\limits_{k\in C{{B}_{j}}}^{{}}{{{\left| S\left( k,n \right) \right|}^{2}}}}
Sj(n)=k∈CBj∑∣S(k,n)∣2
其中,
j
=
1
,
2
,
⋯
,
J
j=1,2,\cdots ,J
j=1,2,⋯,J是三分之一倍频带的索引。
C
B
j
C{{B}_{j}}
CBj是STFT 系数中属于第
j
j
j个三分之一倍频带的索引集。基于上式可以得到干净语音的短时谱向量为
s
j
,
n
=
[
S
j
(
n
−
N
+
1
)
,
S
j
(
n
−
N
+
2
)
,
⋯
,
S
j
(
n
)
]
T
{{\mathbf{s}}_{j,n}}={{\left[ {{S}_{j}}(n-N+1),{{S}_{j}}(n-N+2),\cdots ,{{S}_{j}}(n) \right]}^{T}}
sj,n=[Sj(n−N+1),Sj(n−N+2),⋯,Sj(n)]T
通常
N
N
N取30;
4)归一化和裁剪:归一化的目的在于补偿信号整体的差异,裁剪的目的则是约束STOI 对于损坏较严重的语音的敏感度上限,通常归一化和裁剪操作都是针对待测试语音进行的,并将经过归一化和裁剪的语音短时谱向量标记为
s
~
j
,
n
{{\mathbf{\tilde{s}}}_{j,n}}
s~j,n;
5)可懂度计算:计算待测试语音和干净语音之间短时谱向量的相关系数
d
j
,
n
=
(
s
j
,
n
−
μ
s
j
,
n
)
T
(
s
~
j
,
n
−
μ
s
~
j
,
n
)
∥
s
j
,
n
−
μ
s
j
,
n
∥
2
∥
s
~
j
,
m
−
μ
s
~
j
,
n
∥
2
{{d}_{j,n}}=\frac{{{\left( {{\mathbf{s}}_{j,n}}-{{\mu }_{{{\mathbf{s}}_{j,n}}}} \right)}^{T}}\left( {{{\mathbf{\tilde{s}}}}_{j,n}}-{{\mu }_{{{{\mathbf{\tilde{s}}}}_{j,n}}}} \right)}{{{\left\| {{\mathbf{s}}_{j,n}}-{{\mu }_{{{\mathbf{s}}_{j,n}}}} \right\|}_{2}}{{\left\| {{{\mathbf{\tilde{s}}}}_{j,m}}-{{\mu }_{{{{\mathbf{\tilde{s}}}}_{j,n}}}} \right\|}_{2}}}
dj,n=∥∥sj,n−μsj,n∥∥2∥∥s~j,m−μs~j,n∥∥2(sj,n−μsj,n)T(s~j,n−μs~j,n)
其中,
μ
(
∙
)
\mu \left( \bullet \right)
μ(∙)表示对应向量的均值。最终的STOI 值则是通过对所有帧和频带的
d
j
,
n
{{d}_{j,n}}
dj,n求平均值得到
STOI=
1
J
N
∑
j
,
n
d
j
,
n
\text{STOI=}\frac{1}{JN}\sum\limits_{j,n}^{{}}{{{d}_{j,n}}}
STOI=JN1j,n∑dj,n
代码如下:
function d = stoi(x, y, fs_signal)
% d = stoi(x, y, fs_signal) returns the output of the short-time
% objective intelligibility (STOI) measure described in [1, 2], where x
% and y denote the clean and processed speech, respectively, with sample
% rate fs_signal in Hz. The output d is expected to have a monotonic
% relation with the subjective speech-intelligibility, where a higher d
% denotes better intelligible speech. See [1, 2] for more details.
%
% References:
% [1] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen 'A Short-Time
% Objective Intelligibility Measure for Time-Frequency Weighted Noisy
% Speech', ICASSP 2010, Texas, Dallas.
%
% [2] C.H.Taal, R.C.Hendriks, R.Heusdens, J.Jensen 'An Algorithm for
% Intelligibility Prediction of Time-Frequency Weighted Noisy Speech',
% IEEE Transactions on Audio, Speech, and Language Processing, 2011.
%
%
% Copyright 2009: Delft University of Technology, Signal & Information
% Processing Lab. The software is free for non-commercial use. This program
% comes WITHOUT ANY WARRANTY.
%
%
%
% Updates:
% 2011-04-26 Using the more efficient 'taa_corr' instead of 'corr'
if length(x)~=length(y)
error('x and y should have the same length');
end
% initialization
x = x(:); % clean speech column vector
y = y(:); % processed speech column vector
fs = 10000; % sample rate of proposed intelligibility measure
N_frame = 256; % window support
K = 512; % FFT size
J = 15; % Number of 1/3 octave bands
mn = 150; % Center frequency of first 1/3 octave band in Hz.
H = thirdoct(fs, K, J, mn); % Get 1/3 octave band matrix
N = 30; % Number of frames for intermediate intelligibility measure (Length analysis window)
Beta = -15; % lower SDR-bound
dyn_range = 40; % speech dynamic range
% resample signals if other samplerate is used than fs
if fs_signal ~= fs
x = resample(x, fs, fs_signal);
y = resample(y, fs, fs_signal);
end
% remove silent frames
[x y] = removeSilentFrames(x, y, dyn_range, N_frame, N_frame/2);
% apply 1/3 octave band TF-decomposition
x_hat = stdft(x, N_frame, N_frame/2, K); % apply short-time DFT to clean speech
y_hat = stdft(y, N_frame, N_frame/2, K); % apply short-time DFT to processed speech
x_hat = x_hat(:, 1:(K/2+1)).'; % take clean single-sided spectrum
y_hat = y_hat(:, 1:(K/2+1)).'; % take processed single-sided spectrum
X = zeros(J, size(x_hat, 2)); % init memory for clean speech 1/3 octave band TF-representation
Y = zeros(J, size(y_hat, 2)); % init memory for processed speech 1/3 octave band TF-representation
for i = 1:size(x_hat, 2)
X(:, i) = sqrt(H*abs(x_hat(:, i)).^2); % apply 1/3 octave bands as described in Eq.(1) [1]
Y(:, i) = sqrt(H*abs(y_hat(:, i)).^2);
end
% loop al segments of length N and obtain intermediate intelligibility measure for all TF-regions
d_interm = zeros(J, length(N:size(X, 2))); % init memory for intermediate intelligibility measure
c = 10^(-Beta/20); % constant for clipping procedure
for m = N:size(X, 2)
X_seg = X(:, (m-N+1):m); % region with length N of clean TF-units for all j
Y_seg = Y(:, (m-N+1):m); % region with length N of processed TF-units for all j
alpha = sqrt(sum(X_seg.^2, 2)./sum(Y_seg.^2, 2)); % obtain scale factor for normalizing processed TF-region for all j
aY_seg = Y_seg.*repmat(alpha, [1 N]); % obtain \alpha*Y_j(n) from Eq.(2) [1]
for j = 1:J
Y_prime = min(aY_seg(j, :), X_seg(j, :)+X_seg(j, :)*c); % apply clipping from Eq.(3)
d_interm(j, m-N+1) = taa_corr(X_seg(j, :).', Y_prime(:)); % obtain correlation coeffecient from Eq.(4) [1]
end
end
d = mean(d_interm(:)); % combine all intermediate intelligibility measures as in Eq.(4) [1]
end
%%
function [A cf] = thirdoct(fs, N_fft, numBands, mn)
% [A CF] = THIRDOCT(FS, N_FFT, NUMBANDS, MN) returns 1/3 octave band matrix
% inputs:
% FS: samplerate
% N_FFT: FFT size
% NUMBANDS: number of bands
% MN: center frequency of first 1/3 octave band
% outputs:
% A: octave band matrix
% CF: center frequencies
f = linspace(0, fs, N_fft+1);
f = f(1:(N_fft/2+1));
k = 0:(numBands-1);
cf = 2.^(k/3)*mn;
fl = sqrt((2.^(k/3)*mn).*2.^((k-1)/3)*mn);
fr = sqrt((2.^(k/3)*mn).*2.^((k+1)/3)*mn);
A = zeros(numBands, length(f));
for i = 1:(length(cf))
[a b] = min((f-fl(i)).^2);
fl(i) = f(b);
fl_ii = b;
[a b] = min((f-fr(i)).^2);
fr(i) = f(b);
fr_ii = b;
A(i,fl_ii:(fr_ii-1)) = 1;
end
rnk = sum(A, 2);
numBands = find((rnk(2:end)>=rnk(1:(end-1))) & (rnk(2:end)~=0)~=0, 1, 'last' )+1;
A = A(1:numBands, :);
cf = cf(1:numBands);
end
%%
function x_stdft = stdft(x, N, K, N_fft)
% X_STDFT = X_STDFT(X, N, K, N_FFT) returns the short-time
% hanning-windowed dft of X with frame-size N, overlap K and DFT size
% N_FFT. The columns and rows of X_STDFT denote the frame-index and
% dft-bin index, respectively.
frames = 1:K:(length(x)-N);
x_stdft = zeros(length(frames), N_fft);
w = hanning(N);
x = x(:);
for i = 1:length(frames)
ii = frames(i):(frames(i)+N-1);
x_stdft(i, :) = fft(x(ii).*w, N_fft);
end
end
%%
function [x_sil y_sil] = removeSilentFrames(x, y, range, N, K)
% [X_SIL Y_SIL] = REMOVESILENTFRAMES(X, Y, RANGE, N, K) X and Y
% are segmented with frame-length N and overlap K, where the maximum energy
% of all frames of X is determined, say X_MAX. X_SIL and Y_SIL are the
% reconstructed signals, excluding the frames, where the energy of a frame
% of X is smaller than X_MAX-RANGE
x = x(:);
y = y(:);
frames = 1:K:(length(x)-N);
w = hanning(N);
msk = zeros(size(frames));
for j = 1:length(frames)
jj = frames(j):(frames(j)+N-1);
msk(j) = 20*log10(norm(x(jj).*w)./sqrt(N));
end
msk = (msk-max(msk)+range)>0;
count = 1;
x_sil = zeros(size(x));
y_sil = zeros(size(y));
for j = 1:length(frames)
if msk(j)
jj_i = frames(j):(frames(j)+N-1);
jj_o = frames(count):(frames(count)+N-1);
x_sil(jj_o) = x_sil(jj_o) + x(jj_i).*w;
y_sil(jj_o) = y_sil(jj_o) + y(jj_i).*w;
count = count+1;
end
end
x_sil = x_sil(1:jj_o(end));
y_sil = y_sil(1:jj_o(end));
end
%%
function rho = taa_corr(x, y)
% RHO = TAA_CORR(X, Y) Returns correlation coeffecient between column
% vectors x and y. Gives same results as 'corr' from statistics toolbox.
xn = x-mean(x);
xn = xn/sqrt(sum(xn.^2));
yn = y-mean(y);
yn = yn/sqrt(sum(yn.^2));
rho = sum(xn.*yn);
end
SDR
源信号失真比(Source to Distortion Ratio, SDR)表示信号整体的失真情况
SIR
源信号干扰比(Source to Interferences Ratio, SIR)表示抑制干扰成分的能力
SAR
人造误差成分(Signal to Artifacts Ratio, SAR)表示了经算法处理生成的人造噪音的情况
SDR、SIR、SAR的计算方式如下
假设目标声源是纯净语音信号,干扰和噪声可以认为是另一个源信号,设投影算子
∏
{
x
1
,
x
2
,
⋯
,
x
k
}
\prod{\left\{ {{x}_{1}},{{x}_{2}},\cdots ,{{x}_{k}} \right\}}
∏{x1,x2,⋯,xk}
令
P
s
j
:
=
∏
{
s
j
}
{{P}_{{{s}_{j}}}}:=\prod{\left\{ {{s}_{j}} \right\}}
Psj:=∏{sj}
P
s
:
=
∏
{
(
s
j
′
)
1
≤
j
′
≤
n
}
{{P}_{s}}:=\prod{\left\{ {{\left( {{s}_{{{j}'}}} \right)}_{1\le {j}'\le n}} \right\}}
Ps:=∏{(sj′)1≤j′≤n}
P
s
,
n
:
=
∏
{
(
s
j
′
)
1
≤
j
′
≤
n
,
(
n
i
)
1
≤
i
≤
m
}
{{P}_{s,n}}:=\prod{\left\{ {{\left( {{s}_{{{j}'}}} \right)}_{1\le {j}'\le n}},{{\left( {{n}_{i}} \right)}_{1\le i\le m}} \right\}}
Ps,n:=∏{(sj′)1≤j′≤n,(ni)1≤i≤m}
其中,
n
n
n和
m
m
m代表纯净语音和混合语音的数目。显然,
P
s
j
{{P}_{{{s}_{j}}}}
Psj表示在源信号
s
j
{{s}_{j}}
sj上的正交投影,
P
s
{{P}_{s}}
Ps表示在源信号
(
s
j
′
)
1
≤
j
′
≤
n
{{\left( {{s}_{{{j}'}}} \right)}_{1\le {j}'\le n}}
(sj′)1≤j′≤n上的正交投影,
P
s
,
n
{{P}_{s,n}}
Ps,n表示在源信号
(
s
j
′
)
1
≤
j
′
≤
n
{{\left( {{s}_{{{j}'}}} \right)}_{1\le {j}'\le n}}
(sj′)1≤j′≤n和
(
n
i
)
1
≤
i
≤
m
{{\left( {{n}_{i}} \right)}_{1\le i\le m}}
(ni)1≤i≤m上的正交投影。假定
s
^
j
{{\mathbf{\hat{s}}}_{j}}
s^j是估计的源信号,则
s
^
j
{{\mathbf{\hat{s}}}_{j}}
s^j可分解如下
s
^
j
=
s
target
+
e
interf
+
e
noise
+
e
artif
{{\mathbf{\hat{s}}}_{j}}\text{=}{{\mathbf{s}}_{\text{target}}}+{{\mathbf{e}}_{\text{interf}}}+{{\mathbf{e}}_{\text{noise}}}+{{\mathbf{e}}_{\text{artif}}}
s^j=starget+einterf+enoise+eartif
其中
s
target
:
=
P
s
j
s
^
j
{{\mathbf{s}}_{\text{target }}}:={{P}_{{{s}_{j}}}}{{\widehat{\mathbf{s}}}_{j}}
starget :=Psjs
j
e
interf
:
=
P
s
s
^
j
−
P
s
j
P
s
j
{{\mathbf{e}}_{\text{interf}}}:={{P}_{\text{s}}}{{\widehat{\mathbf{s}}}_{j}}-{{P}_{{{s}_{j}}}}{{P}_{{{s}_{j}}}}
einterf:=Pss
j−PsjPsj
e
noise
:
=
P
s
,
n
s
^
j
−
P
s
s
^
j
{{\mathbf{e}}_{\text{noise }}}:={{P}_{\text{s},\mathbf{n}}}{{\widehat{\mathbf{s}}}_{j}}-{{P}_{\text{s}}}{{\widehat{\mathbf{s}}}_{j}}
enoise :=Ps,ns
j−Pss
j
e
artif
:
=
s
^
j
−
P
s
,
n
s
^
j
{{\mathbf{e}}_{\text{artif }}}:={{\widehat{\mathbf{s}}}_{j}}-{{P}_{\text{s},\mathbf{n}}}{{\widehat{\mathbf{s}}}_{j}}
eartif :=s
j−Ps,ns
j
则
SDR
=
10
log
10
∥
s
target
∥
2
∥
e
interf
+
e
noise
+
e
artif
∥
2
\text{SDR}=10{{\log }_{10}}\frac{{{\left\| {{\mathbf{s}}_{\text{target }}} \right\|}^{2}}}{{{\left\| {{\mathbf{e}}_{\text{interf }}}+{{\mathbf{e}}_{\text{noise }}}+{{\mathbf{e}}_{\text{artif }}} \right\|}^{2}}}
SDR=10log10∥einterf +enoise +eartif ∥2∥starget ∥2
SIR
=
10
log
10
∥
s
target
∥
2
∥
e
interf
∥
2
\text{SIR}=10{{\log }_{10}}\frac{{{\left\| {{\mathbf{s}}_{\text{target }}} \right\|}^{2}}}{{{\left\| {{\mathbf{e}}_{\text{interf }}} \right\|}^{2}}}
SIR=10log10∥einterf ∥2∥starget ∥2
SAR
=
10
log
10
∥
s
target
+
e
interf
+
e
noise
∥
2
∥
e
artif
∥
2
\text{SAR}=10{{\log }_{10}}\frac{{{\left\| {{\mathbf{s}}_{\text{target }}}+{{\mathbf{e}}_{\text{interf }}}+{{\mathbf{e}}_{\text{noise }}} \right\|}^{2}}}{{{\left\| {{\mathbf{e}}_{\text{artif }}} \right\|}^{2}}}
SAR=10log10∥eartif ∥2∥starget +einterf +enoise ∥2
由上述可知,SDR表示目标信号相对其余分量而言的整体降噪性能;SIR表示的是目标信号相对除去目其本身的剩余干扰分量而言的增强性能;SAR表示的是原语音信息相对于语音增强技术产生的人造分量的增强性能。
代码如下:
function [SDR,SIR,SAR,perm]=bss_eval_sources(se,s)
% BSS_EVAL_SOURCES Ordering and measurement of the separation quality for
% estimated source signals in terms of filtered true source, interference
% and artifacts.
%
% The decomposition allows a time-invariant filter distortion of length
% 512, as described in Section III.B of the reference below.
%
% [SDR,SIR,SAR,perm]=bss_eval_sources(se,s)
%
% Inputs:
% se: nsrc x nsampl matrix containing estimated sources
% s: nsrc x nsampl matrix containing true sources
%
% Outputs:
% SDR: nsrc x 1 vector of Signal to Distortion Ratios
% SIR: nsrc x 1 vector of Source to Interference Ratios
% SAR: nsrc x 1 vector of Sources to Artifacts Ratios
% perm: nsrc x 1 vector containing the best ordering of estimated sources
% in the mean SIR sense (estimated source number perm(j) corresponds to
% true source number j)
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright 2008 Emmanuel Vincent
% This software is distributed under the terms of the GNU Public License
% version 3 (http://www.gnu.org/licenses/gpl.txt)
% If you find it useful, please cite the following reference:
% Emmanuel Vincent, R閙i Gribonval, and C閐ric F関otte, "Performance
% measurement in blind audio source separation," IEEE Trans. on Audio,
% Speech and Language Processing, 14(4):1462-1469, 2006.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Errors %%%
if nargin<2, error('Not enough input arguments.'); end
[nsrc,nsampl]=size(se);
[nsrc2,nsampl2]=size(s);
if nsrc2~=nsrc, error('The number of estimated sources and reference sources must be equal.'); end
if nsampl2~=nsampl, error('The estimated sources and reference sources must have the same duration.'); end
%%% Performance criteria %%%
% Computation of the criteria for all possible pair matches
SDR=zeros(nsrc,nsrc);
SIR=zeros(nsrc,nsrc);
SAR=zeros(nsrc,nsrc);
for jest=1:nsrc,
for jtrue=1:nsrc,
[s_true,e_spat,e_interf,e_artif]=bss_decomp_mtifilt(se(jest,:),s,jtrue,512);
[SDR(jest,jtrue),SIR(jest,jtrue),SAR(jest,jtrue)]=bss_source_crit(s_true,e_spat,e_interf,e_artif);
end
end
% Selection of the best ordering
perm=perms(1:nsrc);
nperm=size(perm,1);
meanSIR=zeros(nperm,1);
for p=1:nperm,
meanSIR(p)=mean(SIR((0:nsrc-1)*nsrc+perm(p,:)));
end
[meanSIR,popt]=max(meanSIR);
perm=perm(popt,:).';
SDR=SDR((0:nsrc-1).'*nsrc+perm);
SIR=SIR((0:nsrc-1).'*nsrc+perm);
SAR=SAR((0:nsrc-1).'*nsrc+perm);
return;
function [s_true,e_spat,e_interf,e_artif]=bss_decomp_mtifilt(se,s,j,flen)
% BSS_DECOMP_MTIFILT Decomposition of an estimated source image into four
% components representing respectively the true source image, spatial (or
% filtering) distortion, interference and artifacts, derived from the true
% source images using multichannel time-invariant filters.
%
% [s_true,e_spat,e_interf,e_artif]=bss_decomp_mtifilt(se,s,j,flen)
%
% Inputs:
% se: nchan x nsampl matrix containing the estimated source image (one row per channel)
% s: nsrc x nsampl x nchan matrix containing the true source images
% j: source index corresponding to the estimated source image in s
% flen: length of the multichannel time-invariant filters in samples
%
% Outputs:
% s_true: nchan x nsampl matrix containing the true source image (one row per channel)
% e_spat: nchan x nsampl matrix containing the spatial (or filtering) distortion component
% e_interf: nchan x nsampl matrix containing the interference component
% e_artif: nchan x nsampl matrix containing the artifacts component
%%% Errors %%%
if nargin<4, error('Not enough input arguments.'); end
[nchan2,nsampl2]=size(se);
[nsrc,nsampl,nchan]=size(s);
if nchan2~=nchan, error('The number of channels of the true source images and the estimated source image must be equal.'); end
if nsampl2~=nsampl, error('The duration of the true source images and the estimated source image must be equal.'); end
%%% Decomposition %%%
% True source image
s_true=[reshape(s(j,:,:),nsampl,nchan).',zeros(nchan,flen-1)];
% Spatial (or filtering) distortion
e_spat=project(se,s(j,:,:),flen)-s_true;
% Interference
e_interf=project(se,s,flen)-s_true-e_spat;
% Artifacts
e_artif=[se,zeros(nchan,flen-1)]-s_true-e_spat-e_interf;
return;
function sproj=project(se,s,flen)
% SPROJ Least-squares projection of each channel of se on the subspace
% spanned by delayed versions of the channels of s, with delays between 0
% and flen-1
[nsrc,nsampl,nchan]=size(s);
s=reshape(permute(s,[3 1 2]),nchan*nsrc,nsampl);
%%% Computing coefficients of least squares problem via FFT %%%
% Zero padding and FFT of input data
s=[s,zeros(nchan*nsrc,flen-1)];
se=[se,zeros(nchan,flen-1)];
fftlen=2^nextpow2(nsampl+flen-1);
sf=fft(s,fftlen,2);
sef=fft(se,fftlen,2);
% Inner products between delayed versions of s
G=zeros(nchan*nsrc*flen);
for k1=0:nchan*nsrc-1,
for k2=0:k1,
ssf=sf(k1+1,:).*conj(sf(k2+1,:));
ssf=real(ifft(ssf));
ss=toeplitz(ssf([1 fftlen:-1:fftlen-flen+2]),ssf(1:flen));
G(k1*flen+1:k1*flen+flen,k2*flen+1:k2*flen+flen)=ss;
G(k2*flen+1:k2*flen+flen,k1*flen+1:k1*flen+flen)=ss.';
end
end
% Inner products between se and delayed versions of s
D=zeros(nchan*nsrc*flen,nchan);
for k=0:nchan*nsrc-1,
for i=1:nchan,
ssef=sf(k+1,:).*conj(sef(i,:));
ssef=real(ifft(ssef,[],2));
D(k*flen+1:k*flen+flen,i)=ssef(:,[1 fftlen:-1:fftlen-flen+2]).';
end
end
%%% Computing projection %%%
% Distortion filters
C=G\D;
C=reshape(C,flen,nchan*nsrc,nchan);
% Filtering
sproj=zeros(nchan,nsampl+flen-1);
for k=1:nchan*nsrc,
for i=1:nchan,
sproj(i,:)=sproj(i,:)+fftfilt(C(:,k,i).',s(k,:));
end
end
return;
function [SDR,SIR,SAR]=bss_source_crit(s_true,e_spat,e_interf,e_artif)
% BSS_SOURCE_CRIT Measurement of the separation quality for a given source
% in terms of filtered true source, interference and artifacts.
%
% [SDR,SIR,SAR]=bss_source_crit(s_true,e_spat,e_interf,e_artif)
%
% Inputs:
% s_true: nchan x nsampl matrix containing the true source image (one row per channel)
% e_spat: nchan x nsampl matrix containing the spatial (or filtering) distortion component
% e_interf: nchan x nsampl matrix containing the interference component
% e_artif: nchan x nsampl matrix containing the artifacts component
%
% Outputs:
% SDR: Signal to Distortion Ratio
% SIR: Source to Interference Ratio
% SAR: Sources to Artifacts Ratio
%%% Errors %%%
if nargin<4, error('Not enough input arguments.'); end
[nchant,nsamplt]=size(s_true);
[nchans,nsampls]=size(e_spat);
[nchani,nsampli]=size(e_interf);
[nchana,nsampla]=size(e_artif);
if ~((nchant==nchans)&&(nchant==nchani)&&(nchant==nchana)), error('All the components must have the same number of channels.'); end
if ~((nsamplt==nsampls)&&(nsamplt==nsampli)&&(nsamplt==nsampla)), error('All the components must have the same duration.'); end
%%% Energy ratios %%%
s_filt=s_true+e_spat;
% SDR
SDR=10*log10(sum(sum(s_filt.^2))/sum(sum((e_interf+e_artif).^2)));
% SIR
SIR=10*log10(sum(sum(s_filt.^2))/sum(sum(e_interf.^2)));
% SAR
SAR=10*log10(sum(sum((s_filt+e_interf).^2))/sum(sum(e_artif.^2)));
return;
参考文献:
[1]I. A. McCowan and H. Bourlard, “Microphone array post-filter based on noise field coherence,” in IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 709-716, Nov. 2003, doi: 10.1109/TSA.2003.818212.
[2]Y. Hu and P. C. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229-238, Jan. 2008, doi: 10.1109/TASL.2007.911054.
[3]E. Vincent, R. Gribonval and C. Fevotte, “Performance measurement in blind audio source separation,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, July 2006, doi: 10.1109/TSA.2005.858005.
[4]C. H. Taal, R. C. Hendriks, R. Heusdens and J. Jensen, “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, Sept. 2011, doi: 10.1109/TASL.2011.2114881.
[5]贾翔宇. 基于张量模型的语音增强算法研究[D].中国科学技术大学,2019.
[6]朱媛媛. 基于稀疏表示和深度学习的有监督语音增强算法研究[D].中国科学技术大学,2020.