Expand is relatively complex: when data is missing, it reconstructs the current data from previously received data.
Speech signals are divided into unvoiced and voiced sounds. For voiced sounds the vocal cords vibrate periodically during phonation, so the signal shows clear periodicity; the corresponding frequency is called the pitch (fundamental) frequency and the corresponding period the pitch period. For unvoiced sounds the vocal cords do not vibrate and there is no periodicity. Voiced sounds carry most of the energy of a speech signal.
How do we predict the lost data when a packet is lost? The key observations are:
speech signal = unvoiced + voiced. The unvoiced part resembles noise and can be produced by AR-filtering random noise; the voiced part is quasi-periodic, so the signal at the corresponding position one pitch period earlier can substitute for it. This is the core idea of the Expand algorithm.
So one of the key steps is computing the pitch period. The human pitch frequency range is roughly 60 Hz ~ 400 Hz, i.e. a pitch period of 2.5 ms ~ 16.67 ms. The code approximates the upper bound of the pitch period as 15 ms. Estimating the pitch period requires at least two periods, so at least 30 ms of data is needed.
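As a quick check of these numbers, a small helper (hypothetical, for illustration only, not part of the source) converts a pitch frequency into a pitch period in samples:

```cpp
#include <cstddef>

// Hypothetical helper, not from the NetEQ source: pitch period in samples
// for a given pitch frequency and sample rate.
constexpr size_t PitchPeriodSamples(int pitch_hz, int sample_rate_hz) {
  return static_cast<size_t>(sample_rate_hz / pitch_hz);
}
```

At 8 kHz, 400 Hz gives 20 samples (2.5 ms) and 60 Hz gives about 133 samples (16.67 ms); two 15 ms periods are 240 samples, i.e. the 30 ms mentioned above.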
First Expand
On the first packet-loss concealment, Expand::AnalyzeSignal is entered; in all other cases only noise needs to be generated. The main purpose of Expand::AnalyzeSignal is to produce the AR filter coefficients, so for the first lost packet the function to study is Expand::AnalyzeSignal.
AnalyzeSignal
Fetching historical audio data
On the first Expand, historical data is used to derive the AR model coefficients. The first task is computing the pitch period, which is estimated by maximizing the correlation, so two pitch periods of data are needed. As noted above, the pitch period is at most 15 ms, so 30 ms of data is taken.
The relevant code is:
const size_t signal_length = static_cast<size_t>(256 * fs_mult);
const size_t audio_history_position = sync_buffer_->Size() - signal_length;
std::unique_ptr<int16_t[]> audio_history(new int16_t[signal_length]);
(*sync_buffer_)[0].CopyTo(signal_length, audio_history_position,
audio_history.get());
Why take 256 * fs_mult samples here? In theory at least 30 ms of data is needed, i.e. 240 * fs_mult samples, but there are also 5 samples of overlap, so at least 245 * fs_mult samples are required; 256 * fs_mult is used in the end for ease of computation.
Correlation computation
The signal is first downsampled to 4 kHz and then correlated, with a correlation length of 60 samples. Only the last 248 * fs_mult samples are downsampled, yielding 124 samples after downsampling. In the end 54 correlation values are computed and stored in correlation_vector.
Why start from 10 samples? A lag of 10 samples corresponds to a pitch period of 2.5 ms; the maximum is 64 samples, corresponding to a pitch period of 16 ms.
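The lag search described above can be sketched in floating point (an illustrative sketch, not the fixed-point production code; the function name is an assumption): search lags 10 ~ 63 on the 4 kHz signal and keep the lag with the largest autocorrelation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative pitch-lag search on a 4 kHz signal: try lags from
// 10 samples (2.5 ms) to 63 samples (~16 ms) and return the lag with
// the largest autocorrelation value.
size_t FindPitchLag(const std::vector<float>& x) {
  const size_t kMinLag = 10;  // 2.5 ms at 4 kHz
  const size_t kMaxLag = 63;  // ~16 ms at 4 kHz
  size_t best_lag = kMinLag;
  float best_corr = -1e30f;
  for (size_t lag = kMinLag; lag <= kMaxLag; ++lag) {
    float corr = 0.0f;
    for (size_t n = lag; n < x.size(); ++n)
      corr += x[n] * x[n - lag];
    if (corr > best_corr) {
      best_corr = corr;
      best_lag = lag;
    }
  }
  return best_lag;
}
```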
Peak detection
Some fitting is applied to the 54 correlation values, and then the three largest values are selected. The peak positions are stored in best_correlation_index[0], best_correlation_index[1] and best_correlation_index[2], with the corresponding values in best_correlation[0], best_correlation[1] and best_correlation[2].
Refining the peak positions
The peaks above are selected by correlation value. Next, within 0.5 ms before and after each selected peak, the position with the smallest distortion is chosen: since a voiced signal is periodic, consecutive periods should not only correlate strongly but also be similar in shape.
The last 2.5 ms of audio_history is used as the reference. Within 0.5 ms before and after each correlation peak, a 2.5 ms segment is taken and the sum of absolute differences against the 2.5 ms reference is computed, recording the position with the smallest error. This yields three new peak positions and minimum error values, denoted best_distortion_index[0] ~ best_distortion_index[2] and best_distortion_w32[0] ~ best_distortion_w32[2].
Why take only 2.5 ms of data? Because the minimum human pitch period is taken to be 2.5 ms, comparing 2.5 ms of data suffices for the similarity check.
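The refinement step can be sketched as a minimum sum-of-absolute-differences search (illustrative only; names and parameters are assumptions, not the production code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstddef>
#include <vector>

// Around a candidate lag, search +/- `range` samples for the offset whose
// segment one period back best matches the reference segment (the last
// seg_len samples of the history), by minimum sum of absolute differences.
size_t RefineLagBySad(const std::vector<int16_t>& history,
                      size_t candidate_lag, size_t range, size_t seg_len) {
  const size_t ref_start = history.size() - seg_len;  // reference segment
  size_t best_lag = candidate_lag;
  long best_sad = -1;
  for (size_t lag = candidate_lag - range; lag <= candidate_lag + range;
       ++lag) {
    const size_t cand_start = ref_start - lag;  // one candidate period back
    long sad = 0;
    for (size_t i = 0; i < seg_len; ++i)
      sad += std::labs(static_cast<long>(history[ref_start + i]) -
                       history[cand_start + i]);
    if (best_sad < 0 || sad < best_sad) {
      best_sad = sad;
      best_lag = lag;
    }
  }
  return best_lag;
}
```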
Correlation and distortion are then weighed together, using correlation value / distortion value as the criterion; the best of the three candidates is selected and recorded as best_index, with the best ratio recorded as best_ratio.
distortion_lag is the pitch period derived from similarity and correlation_lag the pitch period derived from correlation; the two differ very little, within 0.5 ms. max_lag_ is the maximum of the two.
size_t distortion_lag = best_distortion_index[best_index];
size_t correlation_lag = best_correlation_index[best_index];
max_lag_ = std::max(distortion_lag, correlation_lag);
Saving ChannelParameters
struct ChannelParameters {
ChannelParameters();
int16_t mute_factor;
int16_t ar_filter[kUnvoicedLpcOrder + 1];
int16_t ar_filter_state[kUnvoicedLpcOrder];
int16_t ar_gain;
int16_t ar_gain_scale;
int16_t voice_mix_factor; /* Q14 */
int16_t current_voice_mix_factor; /* Q14 */
AudioVector *expand_vector0;
AudioVector *expand_vector1;
bool onset;
int mute_slope; /* Q20 */
};
The computation above is mainly for the pitch period, ultimately in order to fill in the ChannelParameters. The meaning of these fields is as follows:
- mute_factor
- ar_filter
AR filter coefficients, of size kUnvoicedLpcOrder + 1.
- ar_filter_state
The last kUnvoicedLpcOrder samples of audio_history.
- ar_gain
- ar_gain_scale
- voice_mix_factor
The amplitude proportion of the voiced component in the previous packet's speech signal (unvoiced + voiced).
- current_voice_mix_factor
The current proportion of voiced relative to unvoiced; the same concept as voice_mix_factor, except that it tracks the current proportion.
- expand_vector0 and expand_vector1
See "Computing expand_vector0 and expand_vector1" for details.
expand_vector1 and expand_vector0 hold the last two segments of max_lag_ + overlap_ samples from audio_history.
- onset
- mute_slope
For each channel the parameters are computed in the following steps:
Computing best_index
Between distortion_lag and correlation_lag, find the lag with the maximum correlation and record it as best_index.
correlation_length = std::max(std::min(distortion_lag + kMinLag, fs_mult_120),
static_cast<size_t>(kMaxLag * fs_mult));
size_t start_index = std::min(distortion_lag, correlation_lag);
size_t correlation_lags = static_cast<size_t>(
VXAUDIO_SPL_ABS_W16((distortion_lag - correlation_lag)) + 1);
Here, since the computational load is small, no downsampling is applied. As for the correlation length:
the original WebRTC code is shown below; the correlation length ranges over 7.5 ms ~ 15 ms, where 60 * fs_mult corresponds to 7.5 ms and fs_mult_120 to 15 ms. The 10 in distortion_lag + 10 has no special meaning; it just adds a small margin.
correlation_length = std::max(std::min(distortion_lag + 10, fs_mult_120),
static_cast<size_t>(60 * fs_mult));
Computing the signal energies and the correlation coefficient
int32_t energy1 = VxAudioSpl_DotProductWithScale(
&(audio_history[signal_length - correlation_length]),
&(audio_history[signal_length - correlation_length]),
correlation_length, correlation_scale);
int32_t energy2 = VxAudioSpl_DotProductWithScale(
&(audio_history[signal_length - correlation_length - best_index]),
&(audio_history[signal_length - correlation_length - best_index]),
correlation_length, correlation_scale);
// Calculate the correlation coefficient between the two portions of the
// signal.
int32_t corr_coefficient;
if ((energy1 > 0) && (energy2 > 0)) {
int energy1_scale = std::max(16 - VxAudioSpl_NormW32(energy1), 0);
int energy2_scale = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
// Make sure total scaling is even (to simplify scale factor after sqrt).
if ((energy1_scale + energy2_scale) & 1) {
// If sum is odd, add 1 to make it even.
energy1_scale += 1;
}
int32_t scaled_energy1 = energy1 >> energy1_scale;
int32_t scaled_energy2 = energy2 >> energy2_scale;
int16_t sqrt_energy_product = static_cast<int16_t>(
VxAudioSpl_SqrtFloor(scaled_energy1 * scaled_energy2));
// Calculate max_correlation / sqrt(energy1 * energy2) in Q14.
int cc_shift = 14 - (energy1_scale + energy2_scale) / 2;
max_correlation = VXAUDIO_SPL_SHIFT_W32(max_correlation, cc_shift);
corr_coefficient =
VxAudioSpl_DivW32W16(max_correlation, sqrt_energy_product);
// Cap at 1.0 in Q14.
corr_coefficient = std::min(16384, corr_coefficient);
} else {
corr_coefficient = 0;
}
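A floating-point analog of the Q14 computation above (an illustrative sketch, not the production fixed-point code): the normalized correlation between the newest segment and the segment one pitch lag earlier, capped at 1.0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized correlation coefficient, corr / sqrt(energy1 * energy2),
// between the last `len` samples of x and the `len` samples one `lag`
// earlier. Returns 0 when either energy is non-positive, capped at 1.0.
double CorrCoefficient(const std::vector<double>& x, size_t lag, size_t len) {
  const size_t a = x.size() - len;  // newest segment
  const size_t b = a - lag;         // segment one lag earlier
  double corr = 0.0, energy1 = 0.0, energy2 = 0.0;
  for (size_t i = 0; i < len; ++i) {
    corr += x[a + i] * x[b + i];
    energy1 += x[a + i] * x[a + i];
    energy2 += x[b + i] * x[b + i];
  }
  if (energy1 <= 0.0 || energy2 <= 0.0) return 0.0;
  return std::min(1.0, corr / std::sqrt(energy1 * energy2));
}
```

For a perfectly periodic signal the coefficient at the true pitch lag is 1.0, which is why the fixed-point version caps at 16384 (1.0 in Q14).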
Computing expand_vector0 and expand_vector1
Two cases are distinguished here. The energies of the two adjacent pitch periods, energy1 and energy2, were computed above, and the two cases are separated by the relative size of energy1 and energy2.
- energy2 / 4 < energy1 < 4 * energy2
if ((energy1 / 4 < energy2) && (energy1 > energy2 / 4)) {
// Energy constraint fulfilled. Use both vectors and scale them
// accordingly.
int32_t scaled_energy2 = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
int32_t scaled_energy1 = scaled_energy2 - 13;
// Calculate scaled_energy1 / scaled_energy2 in Q13.
int32_t energy_ratio =
VxAudioSpl_DivW32W16(VXAUDIO_SPL_SHIFT_W32(energy1, -scaled_energy1),
static_cast<int16_t>(energy2 >> scaled_energy2));
// Calculate sqrt ratio in Q13 (sqrt of en1/en2 in Q26).
amplitude_ratio =
static_cast<int16_t>(VxAudioSpl_SqrtFloor(energy_ratio << 13));
// Copy the two vectors and give them the same energy.
parameters.expand_vector0->Clear();
parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t *>(vector1), expansion_length);
parameters.expand_vector1->Clear();
if (parameters.expand_vector1->Size() < expansion_length) {
parameters.expand_vector1->Extend(expansion_length -
parameters.expand_vector1->Size());
}
std::unique_ptr<int16_t[]> temp_1(new int16_t[expansion_length]);
VxAudioSpl_AffineTransformVector(
temp_1.get(), const_cast<int16_t *>(vector2), amplitude_ratio, 4096,
13, expansion_length);
parameters.expand_vector1->OverwriteAt(reinterpret_cast<const int8_t*>(temp_1.get()), expansion_length, 0);
}
size_t expansion_length = max_lag_ + overlap_length_;
expand_vector0 holds the last expansion_length samples of audio_history; expansion_length adds overlap_length_ samples (5 samples @ 8 kHz) on top of max_lag_.
amplitude_ratio is the amplitude gain from energy2 to energy1, i.e. the amplitude gain between the two adjacent pitch periods. expand_vector1 holds the pitch period preceding expand_vector0, multiplied by amplitude_ratio.
The goal is to keep the amplitudes of the two adjacent pitch periods as consistent as possible.
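The energy equalization can be sketched in floating point (illustrative; the production code does this in Q13 fixed point): scale the earlier pitch period by sqrt(energy1 / energy2) so its energy matches the most recent one.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scale vector2 (previous pitch period) so its energy matches vector1
// (most recent pitch period): multiply by sqrt(energy1 / energy2).
std::vector<double> MatchEnergy(const std::vector<double>& vector1,
                                const std::vector<double>& vector2) {
  double energy1 = 0.0, energy2 = 0.0;
  for (double v : vector1) energy1 += v * v;
  for (double v : vector2) energy2 += v * v;
  const double amplitude_ratio = std::sqrt(energy1 / energy2);
  std::vector<double> scaled(vector2.size());
  for (size_t i = 0; i < vector2.size(); ++i)
    scaled[i] = vector2[i] * amplitude_ratio;
  return scaled;
}
```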
- All other cases
Here energy1 < energy2 / 4 or energy1 > 4 * energy2.
else {
// Energy change constraint not fulfilled. Only use last vector.
parameters.expand_vector0->Clear();
parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t*>(vector1), expansion_length);
// Copy from expand_vector0 to expand_vector1.
parameters.expand_vector0->CopyTo(parameters.expand_vector1);
// Set the energy_ratio since it is used by muting slope.
if ((energy1 / 4 < energy2) || (energy2 == 0)) {
amplitude_ratio = 4096; // 0.5 in Q13.
} else {
amplitude_ratio = 16384; // 2.0 in Q13.
}
}
expand_vector0 and expand_vector1 hold the same data, both the last expansion_length samples of audio_history.
amplitude_ratio is also set: when energy1 is much smaller than energy2 it is set to 0.5; conversely, when energy1 is much larger than energy2, it is set to 2.0.
Setting the three lag values
The results are saved in the expand_lags_ array as three candidate pitch periods. The three values differ only slightly; why this is done is not known.
if (distortion_lag == correlation_lag) {
expand_lags_[0] = distortion_lag;
expand_lags_[1] = distortion_lag;
expand_lags_[2] = distortion_lag;
} else {
// |distortion_lag| and |correlation_lag| are not equal; use different
// combinations of the two.
// First lag is |distortion_lag| only.
expand_lags_[0] = distortion_lag;
// Second lag is the average of the two.
expand_lags_[1] = (distortion_lag + correlation_lag) / 2;
// Third lag is the average again, but rounding towards |correlation_lag|.
if (distortion_lag > correlation_lag) {
expand_lags_[2] = (distortion_lag + correlation_lag - 1) / 2;
} else {
expand_lags_[2] = (distortion_lag + correlation_lag + 1) / 2;
}
}
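The lag selection above can be written as a standalone function (a sketch mirroring the snippet; note that the equal-lag special case falls out of the general formulas):

```cpp
#include <array>
#include <cstddef>

// Three candidate lags: the first is distortion_lag, the second the
// average of the two lags, the third the average rounded towards
// correlation_lag. When the lags are equal, all three coincide.
std::array<size_t, 3> ExpandLags(size_t distortion_lag,
                                 size_t correlation_lag) {
  std::array<size_t, 3> lags;
  lags[0] = distortion_lag;
  lags[1] = (distortion_lag + correlation_lag) / 2;
  if (distortion_lag > correlation_lag)
    lags[2] = (distortion_lag + correlation_lag - 1) / 2;
  else
    lags[2] = (distortion_lag + correlation_lag + 1) / 2;
  return lags;
}
```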
Computing the AR filter coefficients
The Levinson-Durbin algorithm is used, with the result stored in parameters.ar_filter.
// Calculate the LPC and the gain of the filters.
// Calculate kUnvoicedLpcOrder + 1 lags of the auto-correlation function.
size_t temp_index =
signal_length - fs_mult_lpc_analysis_len - kUnvoicedLpcOrder;
// Copy signal to temporary vector to be able to pad with leading zeros.
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
&audio_history[temp_index + kUnvoicedLpcOrder],
sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
delete[] temp_signal;
// Verify that variance is positive.
if (auto_correlation[0] > 0) {
// Estimate AR filter parameters using Levinson-Durbin algorithm;
// kUnvoicedLpcOrder + 1 filter coefficients.
int16_t stability =
VxAudioSpl_LevinsonDurbin(auto_correlation, parameters.ar_filter,
reflection_coeff, kUnvoicedLpcOrder);
// Keep filter parameters only if filter is stable.
if (stability != 1) {
MLOGW("LevinsonDurbin is unstable.");
// Set first coefficient to 4096 (1.0 in Q12).
parameters.ar_filter[0] = 4096;
// Set remaining |kUnvoicedLpcOrder| coefficients to zero.
VxAudioSpl_MemSetW16(parameters.ar_filter + 1, 0, kUnvoicedLpcOrder);
}
}
When the Levinson-Durbin result is unstable, fixed values are assigned. The correlation length used for Levinson-Durbin is fs_mult * kLpcAnalysisLength, i.e. 20 ms of data.
In the following code, why are the first kUnvoicedLpcOrder samples of temp_signal zeroed?
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
&audio_history[temp_index + kUnvoicedLpcOrder],
sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
In my opinion it should instead be:
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(temp_signal,
&audio_history[temp_index],
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
The two versions hardly differ in their results. Also, arguably the unvoiced part of audio_history should be used here, since the AR model targets the unvoiced data; it is not clear why it is done this way.
Random noise generation
The noise is extracted from RandomVector::kRandomTable and is later used to generate the unvoiced data; it serves the same purpose as "Generating random noise" in the non-first Expand path.
if (channel_ix == 0) {
// Extract a noise segment.
size_t noise_length;
if (distortion_lag < 40) {
noise_length = 2 * distortion_lag + 30;
} else {
noise_length = distortion_lag + 30;
}
if (noise_length <= RandomVector::kRandomTableSize) {
memcpy(random_vector, RandomVector::kRandomTable,
sizeof(int16_t) * noise_length);
} else {
// Only applies to SWB where length could be larger than
// |kRandomTableSize|.
memcpy(random_vector, RandomVector::kRandomTable,
sizeof(int16_t) * RandomVector::kRandomTableSize);
assert(noise_length <= kMaxSampleRate / 8000 * 120 + 30);
random_vector_->IncreaseSeedIncrement(2);
random_vector_->Generate(
noise_length - RandomVector::kRandomTableSize,
&random_vector[RandomVector::kRandomTableSize]);
}
}
The figure below shows the values of RandomVector::kRandomTable.
Saving ar_filter_state
memcpy(parameters.ar_filter_state,
&(audio_history[signal_length - kUnvoicedLpcOrder]),
sizeof(int16_t) * kUnvoicedLpcOrder);
The last kUnvoicedLpcOrder samples of audio_history are assigned to ar_filter_state, as shown in the figure below.
ar_filter_state is used later when estimating the unvoiced data: its contents are copied in front of unvoiced_vector.
Saving voice_mix_factor
// Calculate voice_mix_factor from corr_coefficient.
// Let x = corr_coefficient. Then, we compute:
// if (x > 0.48)
// voice_mix_factor = (-5179 + 19931x - 16422x^2 + 5776x^3) / 4096;
// else
// voice_mix_factor = 0;
if (corr_coefficient > 7875) {
int16_t x1, x2, x3;
// |corr_coefficient| is in Q14.
x1 = static_cast<int16_t>(corr_coefficient);
x2 = (x1 * x1) >> 14; // Shift 14 to keep result in Q14.
x3 = (x1 * x2) >> 14;
static const int kCoefficients[4] = {-5179, 19931, -16422, 5776};
int32_t temp_sum = kCoefficients[0] * 16384;
temp_sum += kCoefficients[1] * x1;
temp_sum += kCoefficients[2] * x2;
temp_sum += kCoefficients[3] * x3;
parameters.voice_mix_factor =
static_cast<int16_t>(std::min(temp_sum / 4096, 16384));
parameters.voice_mix_factor =
std::max(parameters.voice_mix_factor, static_cast<int16_t>(0));
} else {
parameters.voice_mix_factor = 0;
}
If the correlation coefficient is greater than 0.48, the two adjacent pitch periods are considered strongly correlated, which can be read as the speech being mostly voiced, so a cubic fit of the correlation coefficient gives the proportion of the voiced component in the total speech signal (unvoiced + voiced). Otherwise the two pitch periods are considered weakly correlated, the signal is treated as all unvoiced, and voice_mix_factor is set to 0.
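The cubic mapping can be extracted into a standalone function, keeping the Q14 fixed-point arithmetic of the snippet above (input expected in 0 ~ 16384, i.e. 0 ~ 1.0 in Q14):

```cpp
#include <algorithm>
#include <cstdint>

// corr_coefficient in Q14 (16384 == 1.0) -> voice_mix_factor in Q14,
// via the cubic -5179 + 19931x - 16422x^2 + 5776x^3 (then / 4096),
// with 0 returned below the ~0.48 threshold.
int16_t VoiceMixFactor(int32_t corr_coefficient) {
  if (corr_coefficient <= 7875)  // <= ~0.48: treat as unvoiced.
    return 0;
  const int16_t x1 = static_cast<int16_t>(corr_coefficient);
  const int16_t x2 = static_cast<int16_t>((x1 * x1) >> 14);  // x^2 in Q14
  const int16_t x3 = static_cast<int16_t>((x1 * x2) >> 14);  // x^3 in Q14
  static const int kCoefficients[4] = {-5179, 19931, -16422, 5776};
  int32_t temp_sum = kCoefficients[0] * 16384;
  temp_sum += kCoefficients[1] * x1;
  temp_sum += kCoefficients[2] * x2;
  temp_sum += kCoefficients[3] * x3;
  const int32_t mix = std::min(temp_sum / 4096, 16384);
  return static_cast<int16_t>(std::max(mix, 0));
}
```

At full correlation (16384, i.e. 1.0) the cubic slightly exceeds 1.0 and is capped at 16384; at 0.5 (8192) it yields 5612, roughly 0.34 in Q14.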
Saving mute_slope
The computation is shown in the code below.
There are two cases:
- slope > 1.5
mute_slope is defined as the slope at which the mute factor decreases from 1.0 to 1 / slope over distortion_lag samples.
When slope > 1.8 the slope is additionally divided by 2; when 1.5 < slope <= 1.8 it is additionally divided by 8.
- Otherwise
When slope <= 1.5, mute_slope is defined as the slope at which mute_factor decreases from 1.0 to slope over distortion_lag samples.
// Calculate muting slope. Reuse value from earlier scaling of
// |expand_vector0| and |expand_vector1|.
int16_t slope = amplitude_ratio;
if (slope > 12288) {
// slope > 1.5.
// Calculate (1 - (1 / slope)) / distortion_lag =
// (slope - 1) / (distortion_lag * slope).
// |slope| is in Q13, so 1 corresponds to 8192. Shift up to Q25 before
// the division.
// Shift the denominator from Q13 to Q5 before the division. The result of
// the division will then be in Q20.
int16_t denom = saturated_cast<int16_t>((distortion_lag * slope) >> 8);
int temp_ratio = VxAudioSpl_DivW32W16((slope - 8192) << 12, denom);
if (slope > 14746) {
// slope > 1.8.
// Divide by 2, with proper rounding.
parameters.mute_slope = (temp_ratio + 1) / 2;
} else {
// Divide by 8, with proper rounding.
parameters.mute_slope = (temp_ratio + 4) / 8;
}
parameters.onset = true;
} else {
// Calculate (1 - slope) / distortion_lag.
// Shift |slope| by 7 to Q20 before the division. The result is in Q20.
parameters.mute_slope = VxAudioSpl_DivW32W16(
(8192 - slope) * 128, static_cast<int16_t>(distortion_lag));
if (parameters.voice_mix_factor <= 13107) { // corresponding to 0.8
// Make sure the mute factor decreases from 1.0 to 0.9 in no more than
// 6.25 ms.
// mute_slope >= 0.005 / fs_mult in Q20.
parameters.mute_slope = std::max(static_cast<int>(5243 / fs_mult), parameters.mute_slope);
} else if (slope > 8028) { // corresponding to 0.98
parameters.mute_slope = 0;
}
parameters.onset = false;
}
In particular, a note on the following comment:
// Make sure the mute factor decreases from 1.0 to 0.9 in no more than
// 6.25 ms.
// mute_slope >= 0.005 / fs_mult in Q20.
Assuming a sample rate of 8 kHz, fs_mult = 1 and 6.25 ms corresponds to 50 samples, so the slope for decreasing from 1.0 to 0.9 is (1.0 - 0.9) / 50 = 0.002. Either the code or the comment is wrong here: either the comment should say a decrease from 1.0 to 0.75, or the 5243 in the code should be 2097.
The following comment and code appear elsewhere in Expand, which confirms that this is a typo here.
if (consecutive_expands_ == 7) {
// Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
// mute_slope = 0.0020 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, 2097 / fs_mult);
}
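The Q20 arithmetic behind this can be checked numerically (a small illustrative helper, not source code): a per-sample slope s in Q20 is round(s * 2^20), and a drop d over 6.25 ms at 8 kHz (50 samples) needs slope d / 50.

```cpp
#include <cmath>

// Per-sample muting slope in Q20 for a total amplitude drop `drop`
// spread over `samples` samples: round(drop / samples * 2^20).
int SlopeQ20(double drop, int samples) {
  return static_cast<int>(std::lround(drop / samples * (1 << 20)));
}
```

SlopeQ20(0.10, 50) gives 2097 (a 1.0 to 0.9 drop over 6.25 ms) while SlopeQ20(0.25, 50) gives 5243 (a 1.0 to 0.75 drop), consistent with the mismatch discussed above.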
Other processing
The remaining processing matches the steps after "Generating random noise" in the non-first Expand path; see the corresponding flow under "Non-first Expand".
Non-first Expand
Generating random noise
Random noise of length max_lag_ is generated; it is later fed through the AR filter to produce the unvoiced data.
size_t rand_length = max_lag_;
// This only applies to SWB where length could be larger than 256.
assert(rand_length <= kMaxSampleRate / 8000 * 120 + 30);
GenerateRandomVector(2, rand_length, random_vector);
Updating the lag index
current_lag_index_ = current_lag_index_ + lag_index_direction_;
// Change direction if needed.
if (current_lag_index_ <= 0) {
lag_index_direction_ = 1;
}
if (current_lag_index_ >= kNumLags - 1) {
lag_index_direction_ = -1;
}
Fetching the voiced data
For each channel, the voiced data is fetched according to current_lag_index_, mainly from expand_vector0 and expand_vector1.
If current_lag_index_ = 0, the voiced data equals expand_vector0; if current_lag_index_ = 1, it is 3/4 * expand_vector0 + 1/4 * expand_vector1; if current_lag_index_ = 2, it is 1/2 * expand_vector0 + 1/2 * expand_vector1.
The result is stored in voiced_vector_storage.
Why is it done this way?
if (current_lag_index_ == 0) {
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(voiced_vector_storage));
} else if (current_lag_index_ == 1) {
std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_0.get()));
std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t *>(temp_1.get()));
// Mix 3/4 of expand_vector0 with 1/4 of expand_vector1.
VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 3, temp_1.get(), 1, 2,
voiced_vector_storage, temp_length);
} else if (current_lag_index_ == 2) {
std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_0.get()));
std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_1.get()));
VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 1, temp_1.get(), 1, 1,
voiced_vector_storage, temp_length);
}
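Judging from its name and these call sites, VxAudioSpl_ScaleAndAddVectorsWithRound presumably computes (in0*gain0 + in1*gain1 + rounding) >> shift per sample; a sketch under that assumption (gains 3 and 1 with shift 2 give the 3/4 + 1/4 mix, gains 1 and 1 with shift 1 the half-half mix):

```cpp
#include <cstdint>
#include <cstddef>

// Weighted mix with rounding:
// out[i] = (in0[i]*gain0 + in1[i]*gain1 + 2^(shift-1)) >> shift.
void ScaleAndAddWithRound(const int16_t* in0, int16_t gain0,
                          const int16_t* in1, int16_t gain1,
                          int shift, int16_t* out, size_t length) {
  const int32_t rounding = 1 << (shift - 1);
  for (size_t i = 0; i < length; ++i) {
    out[i] = static_cast<int16_t>(
        (in0[i] * gain0 + in1[i] * gain1 + rounding) >> shift);
  }
}
```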
Smoothing the overlap data in sync_buffer
When mute_factor is greater than 0.05 and current_voice_mix_factor is greater than 0.5, it is assumed that the speech amplitude needs some time to decay and that the voiced proportion exceeds 50%, so the overlap data in sync_buffer is smoothed by weighting voiced_vector against the overlap data already in sync_buffer.
// Smooth the expanded if it has not been muted to a low amplitude and
// |current_voice_mix_factor| is larger than 0.5.
if ((parameters.mute_factor > 819) &&
(parameters.current_voice_mix_factor > 8192)) {
size_t start_ix = sync_buffer_->Size() - overlap_length_;
for (size_t i = 0; i < overlap_length_; i++) {
// Do overlap add between new vector and overlap.
(*sync_buffer_)[channel_ix][start_ix + i] =
(((*sync_buffer_)[channel_ix][start_ix + i] * muting_window) +
(((parameters.mute_factor * voiced_vector_storage[i]) >> 14) *
unmuting_window) +
16384) >>
15;
muting_window += muting_window_increment;
unmuting_window += unmuting_window_increment;
}
} else if (parameters.mute_factor == 0) {
// The expanded signal will consist of only comfort noise if
// mute_factor = 0. Set the output length to 15 ms for best noise
// production.
// TODO(hlundin): This has been disabled since the length of
// parameters.expand_vector0 and parameters.expand_vector1 no longer
// match with expand_lags_, causing invalid reads and writes. Is it a good
// idea to enable this again, and solve the vector size problem?
// max_lag_ = fs_mult * 120;
// expand_lags_[0] = fs_mult * 120;
// expand_lags_[1] = fs_mult * 120;
// expand_lags_[2] = fs_mult * 120;
}
Fetching the unvoiced data
// Unvoiced part.
// Filter |scaled_random_vector| through |ar_filter_|.
memcpy(unvoiced_vector - kUnvoicedLpcOrder, parameters.ar_filter_state,
sizeof(int16_t) * kUnvoicedLpcOrder);
int32_t add_constant = 0;
if (parameters.ar_gain_scale > 0) {
add_constant = 1 << (parameters.ar_gain_scale - 1);
}
VxAudioSpl_AffineTransformVector(scaled_random_vector, random_vector,
parameters.ar_gain, add_constant,
parameters.ar_gain_scale, current_lag);
VxAudioSpl_FilterARFastQ12(scaled_random_vector, unvoiced_vector,
parameters.ar_filter, kUnvoicedLpcOrder + 1,
current_lag);
memcpy(parameters.ar_filter_state,
&(unvoiced_vector[current_lag - kUnvoicedLpcOrder]),
sizeof(int16_t) * kUnvoicedLpcOrder);
First random_vector goes through an affine transform to give scaled_random_vector, which is then passed through the AR filter to produce the unvoiced data unvoiced_vector.
At the same time, parameters.ar_filter_state is updated.
Mixing the unvoiced and voiced data
// Combine voiced and unvoiced contributions.
// Set a suitable cross-fading slope.
// For lag =
// <= 31 * fs_mult => go from 1 to 0 in about 8 ms;
// (>= 31 .. <= 63) * fs_mult => go from 1 to 0 in about 16 ms;
// >= 64 * fs_mult => go from 1 to 0 in about 32 ms.
// temp_shift = getbits(max_lag_) - 5.
int temp_shift =
(31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;
int16_t mix_factor_increment = 256 >> temp_shift;
if (stop_muting_) {
mix_factor_increment = 0;
}
// Create combined signal by shifting in more and more of unvoiced part.
temp_shift = 8 - temp_shift; // = getbits(mix_factor_increment).
size_t temp_length =
(parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
temp_shift;
temp_length = std::min(temp_length, current_lag);
DspHelper::CrossFade(voiced_vector, unvoiced_vector, temp_length,
¶meters.current_voice_mix_factor,
mix_factor_increment, temp_data);
First the cross-fade length for the unvoiced and voiced data, temp_length, must be computed. The idea is roughly as follows:
1. max_lag_ corresponds to the lag at the 4 kHz sampling rate, so lag <= 31 * fs_mult means a pitch period below 8 ms;
31 <= lag <= 63 means a pitch period of 8 ms ~ 16 ms;
lag >= 64 means a pitch period of 16 ms ~ 32 ms (the human pitch period is roughly in the 2.5 ms ~ 16 ms range).
2. temp_shift = getbits(max_lag_) - 5. Why the number of bits of max_lag_ minus 5? It looks like fixed-point scaling, but the exact reason is not apparent.
int temp_shift =
(31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;
3. voice_mix_factor is assumed to change linearly; the distance from the previous voice_mix_factor to the current one gives the length over which unvoiced and voiced are cross-faded.
size_t temp_length =
(parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
temp_shift;
parameters.current_voice_mix_factor is the current proportion of the voiced component; temp_length is the length over which voiced and unvoiced are cross-faded, and the final result is stored in temp_data.
Why does expand carry overlap extra samples? To reduce the boundary effect at the starting position when the voiced and unvoiced data are blended.
As the diagram above illustrates, voiced_vector_storage takes the last current_lag + overlap_length_ samples of audio_history, unvoiced_vector holds current_lag samples of unvoiced data, and the unvoiced/voiced cross-fade skips the first overlap_length_ samples of the voiced data.
When the cross-faded length is less than current_lag, the remaining samples still need handling: they are obtained by weighting the un-faded parts of voiced_vector and unvoiced_vector with a fixed mixing factor.
// End of cross-fading period was reached before end of expanded signal
// path. Mix the rest with a fixed mixing factor.
if (temp_length < current_lag) {
if (mix_factor_increment != 0) {
parameters.current_voice_mix_factor = parameters.voice_mix_factor;
}
int16_t temp_scale = 16384 - parameters.current_voice_mix_factor;
VxAudioSpl_ScaleAndAddVectorsWithRound(
voiced_vector + temp_length, parameters.current_voice_mix_factor,
unvoiced_vector + temp_length, temp_scale, 14,
temp_data + temp_length, current_lag - temp_length);
}
Updating the muting slope
The update depends on how many consecutive expands have been performed; the update algorithm remains to be studied.
// Select muting slope depending on how many consecutive expands we have
// done.
if (consecutive_expands_ == 3) {
// Let the mute factor decrease from 1.0 to 0.95 in 6.25 ms.
// mute_slope = 0.0010 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(1049 / fs_mult));
}
if (consecutive_expands_ == 7) {
// Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
// mute_slope = 0.0020 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(2097 / fs_mult));
}
// Mute segment according to slope value.
if ((consecutive_expands_ != 0) || !parameters.onset) {
// Mute to the previous level, then continue with the muting.
VxAudioSpl_AffineTransformVector(
temp_data, temp_data, parameters.mute_factor, 8192, 14, current_lag);
if (!stop_muting_) {
DspHelper::MuteSignal(temp_data, parameters.mute_slope, current_lag);
// Shift by 6 to go from Q20 to Q14.
// TODO(hlundin): Adding 8192 before shifting 6 steps seems wrong.
// Legacy.
int16_t gain = static_cast<int16_t>(
16384 - (((current_lag * parameters.mute_slope) + 8192) >> 6));
gain = ((gain * parameters.mute_factor) + 8192) >> 14;
// Guard against getting stuck with very small (but sometimes audible)
// gain.
if ((consecutive_expands_ > 3) && (gain >= parameters.mute_factor)) {
parameters.mute_factor = 0;
} else {
parameters.mute_factor = gain;
}
}
}
Generating background noise
// Background noise part.
GenerateBackgroundNoise(
random_vector, channel_ix, channel_parameters_[channel_ix].mute_slope,
TooManyExpands(), current_lag, unvoiced_array_memory);
The result is stored at unvoiced_array_memory + kNoiseLpcOrder.
Adding the background noise to temp_data, with the result written to algorithm_buffer
// Add background noise to the combined voiced-unvoiced signal.
for (size_t i = 0; i < current_lag; i++) {
temp_data[i] = temp_data[i] + noise_vector[i];
}
if (channel_ix == 0) {
output->AssertSize(current_lag);
} else {
assert(output->Size() == current_lag);
}
(*output)[channel_ix].OverwriteAt(reinterpret_cast<const int8_t *>(temp_data), current_lag, 0);