Expand is relatively complex: when data is missing, it reconstructs the current data from previously received data.
Speech signals are divided into unvoiced and voiced sounds. For voiced sounds the vocal cords vibrate periodically during phonation, so the signal shows clear periodicity; the corresponding frequency is called the pitch (fundamental) frequency and the corresponding period the pitch period. For unvoiced sounds the vocal cords do not vibrate and there is no periodicity. Voiced sounds carry most of the energy of a speech signal.
How do we predict the lost data when a packet is lost? The key observations are:
speech signal = unvoiced + voiced. The unvoiced part resembles noise and can be produced by AR-filtering random noise; the voiced part is quasi-periodic, so the signal at the corresponding position one pitch period earlier can substitute for it. This is the core idea of the Expand algorithm.
So one of the key steps is computing the pitch period. The human pitch frequency range is roughly 60 Hz ~ 400 Hz, i.e. a pitch period of 2.5 ms ~ 16.67 ms. The code approximates the upper bound of the pitch period as 15 ms. Estimating the pitch period requires at least two periods, so at least 30 ms of data is needed.
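As a quick check of these numbers, a small helper (hypothetical, for illustration only, not part of the source) converts a pitch frequency into a pitch period in samples:

```cpp
#include <cstddef>

// Hypothetical helper, not from the NetEQ source: pitch period in samples
// for a given pitch frequency and sample rate.
constexpr size_t PitchPeriodSamples(int pitch_hz, int sample_rate_hz) {
  return static_cast<size_t>(sample_rate_hz / pitch_hz);
}
```

At 8 kHz, 400 Hz gives 20 samples (2.5 ms) and 60 Hz gives about 133 samples (16.67 ms); two 15 ms periods are 240 samples, i.e. the 30 ms mentioned above.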
First Expand
On the first packet-loss concealment, Expand::AnalyzeSignal is entered; in all other cases only noise needs to be generated. The main purpose of Expand::AnalyzeSignal is to produce the AR filter coefficients, so for the first lost packet the function to study is Expand::AnalyzeSignal.
AnalyzeSignal
Fetching historical audio data
On the first Expand, historical data is used to derive the AR model coefficients. The first task is computing the pitch period, which is estimated by maximizing the correlation, so two pitch periods of data are needed. As noted above, the pitch period is at most 15 ms, so 30 ms of data is taken.
The relevant code is:
const size_t signal_length = static_cast<size_t>(256 * fs_mult);
const size_t audio_history_position = sync_buffer_->Size() - signal_length;
std::unique_ptr<int16_t[]> audio_history(new int16_t[signal_length]);
(*sync_buffer_)[0].CopyTo(signal_length, audio_history_position,
audio_history.get());
Why take 256 * fs_mult samples here? In theory at least 30 ms of data is needed, i.e. 240 * fs_mult samples, but there are also 5 samples of overlap, so at least 245 * fs_mult samples are required; 256 * fs_mult is used in the end for ease of computation.
Correlation computation
The signal is first downsampled to 4 kHz and then correlated, with a correlation length of 60 samples. Only the last 248 * fs_mult samples are downsampled, yielding 124 samples after downsampling. In the end 54 correlation values are computed and stored in correlation_vector.
Why start from 10 samples? A lag of 10 samples corresponds to a pitch period of 2.5 ms; the maximum is 64 samples, corresponding to a pitch period of 16 ms.
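The lag search described above can be sketched in floating point (an illustrative sketch, not the fixed-point production code; the function name is an assumption): search lags 10 ~ 63 on the 4 kHz signal and keep the lag with the largest autocorrelation.

```cpp
#include <cstddef>
#include <vector>

// Illustrative pitch-lag search on a 4 kHz signal: try lags from
// 10 samples (2.5 ms) to 63 samples (~16 ms) and return the lag with
// the largest autocorrelation value.
size_t FindPitchLag(const std::vector<float>& x) {
  const size_t kMinLag = 10;  // 2.5 ms at 4 kHz
  const size_t kMaxLag = 63;  // ~16 ms at 4 kHz
  size_t best_lag = kMinLag;
  float best_corr = -1e30f;
  for (size_t lag = kMinLag; lag <= kMaxLag; ++lag) {
    float corr = 0.0f;
    for (size_t n = lag; n < x.size(); ++n)
      corr += x[n] * x[n - lag];
    if (corr > best_corr) {
      best_corr = corr;
      best_lag = lag;
    }
  }
  return best_lag;
}
```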
Peak detection
Some fitting is applied to the 54 correlation values, and then the three largest values are selected. The peak positions are stored in best_correlation_index[0], best_correlation_index[1] and best_correlation_index[2], with the corresponding values in best_correlation[0], best_correlation[1] and best_correlation[2].
Refining the peak positions
The peaks above are selected by correlation value. Next, within 0.5 ms before and after each selected peak, the position with the smallest distortion is chosen: since a voiced signal is periodic, consecutive periods should not only correlate strongly but also be similar in shape.
The last 2.5 ms of audio_history is used as the reference. Within 0.5 ms before and after each correlation peak, a 2.5 ms segment is taken and the sum of absolute differences against the 2.5 ms reference is computed, recording the position with the smallest error. This yields three new peak positions and minimum error values, denoted best_distortion_index[0] ~ best_distortion_index[2] and best_distortion_w32[0] ~ best_distortion_w32[2].
Why take only 2.5 ms of data? Because the minimum human pitch period is taken to be 2.5 ms, comparing 2.5 ms of data suffices for the similarity check.
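The refinement step can be sketched as a minimum sum-of-absolute-differences search (illustrative only; names and parameters are assumptions, not the production code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstddef>
#include <vector>

// Around a candidate lag, search +/- `range` samples for the offset whose
// segment one period back best matches the reference segment (the last
// seg_len samples of the history), by minimum sum of absolute differences.
size_t RefineLagBySad(const std::vector<int16_t>& history,
                      size_t candidate_lag, size_t range, size_t seg_len) {
  const size_t ref_start = history.size() - seg_len;  // reference segment
  size_t best_lag = candidate_lag;
  long best_sad = -1;
  for (size_t lag = candidate_lag - range; lag <= candidate_lag + range;
       ++lag) {
    const size_t cand_start = ref_start - lag;  // one candidate period back
    long sad = 0;
    for (size_t i = 0; i < seg_len; ++i)
      sad += std::labs(static_cast<long>(history[ref_start + i]) -
                       history[cand_start + i]);
    if (best_sad < 0 || sad < best_sad) {
      best_sad = sad;
      best_lag = lag;
    }
  }
  return best_lag;
}
```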
Correlation and distortion are then weighed together, using correlation value / distortion value as the criterion; the best of the three candidates is selected and recorded as best_index, with the best ratio recorded as best_ratio.
distortion_lag is the pitch period derived from similarity and correlation_lag the pitch period derived from correlation; the two differ very little, within 0.5 ms. max_lag_ is the maximum of the two.
size_t distortion_lag = best_distortion_index[best_index];
size_t correlation_lag = best_correlation_index[best_index];
max_lag_ = std::max(distortion_lag, correlation_lag);
Saving ChannelParameters
struct ChannelParameters {
ChannelParameters();
int16_t mute_factor;
int16_t ar_filter[kUnvoicedLpcOrder + 1];
int16_t ar_filter_state[kUnvoicedLpcOrder];
int16_t ar_gain;
int16_t ar_gain_scale;
int16_t voice_mix_factor; /* Q14 */
int16_t current_voice_mix_factor; /* Q14 */
AudioVector *expand_vector0;
AudioVector *expand_vector1;
bool onset;
int mute_slope; /* Q20 */
};
The computation above is mainly for the pitch period, ultimately in order to fill in the ChannelParameters. The meaning of these fields is as follows:
- mute_factor
- ar_filter
AR filter coefficients, of size kUnvoicedLpcOrder + 1.
- ar_filter_state
The last kUnvoicedLpcOrder samples of audio_history.
- ar_gain
- ar_gain_scale
- voice_mix_factor
The amplitude proportion of the voiced component in the previous packet's speech signal (unvoiced + voiced).
- current_voice_mix_factor
The current proportion of voiced relative to unvoiced; the same concept as voice_mix_factor, except that it tracks the current proportion.
- expand_vector0 and expand_vector1
See "Computing expand_vector0 and expand_vector1" for details.
expand_vector1 and expand_vector0 hold the last two segments of max_lag_ + overlap_ samples from audio_history.
- onset
- mute_slope
For each channel the parameters are computed in the following steps:
Computing best_index
Between distortion_lag and correlation_lag, find the lag with the maximum correlation and record it as best_index.
correlation_length = std::max(std::min(distortion_lag + kMinLag, fs_mult_120),
static_cast<size_t>(kMaxLag * fs_mult));
size_t start_index = std::min(distortion_lag, correlation_lag);
size_t correlation_lags = static_cast<size_t>(
VXAUDIO_SPL_ABS_W16((distortion_lag - correlation_lag)) + 1);
Here, since the computational load is small, no downsampling is applied. As for the correlation length:
the original WebRTC code is shown below; the correlation length ranges over 7.5 ms ~ 15 ms, where 60 * fs_mult corresponds to 7.5 ms and fs_mult_120 to 15 ms. The 10 in distortion_lag + 10 has no special meaning; it just adds a small margin.
correlation_length = std::max(std::min(distortion_lag + 10, fs_mult_120),
static_cast<size_t>(60 * fs_mult));
Computing the signal energies and the correlation coefficient
int32_t energy1 = VxAudioSpl_DotProductWithScale(
&(audio_history[signal_length - correlation_length]),
&(audio_history[signal_length - correlation_length]),
correlation_length, correlation_scale);
int32_t energy2 = VxAudioSpl_DotProductWithScale(
&(audio_history[signal_length - correlation_length - best_index]),
&(audio_history[signal_length - correlation_length - best_index]),
correlation_length, correlation_scale);
// Calculate the correlation coefficient between the two portions of the
// signal.
int32_t corr_coefficient;
if ((energy1 > 0) && (energy2 > 0)) {
int energy1_scale = std::max(16 - VxAudioSpl_NormW32(energy1), 0);
int energy2_scale = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
// Make sure total scaling is even (to simplify scale factor after sqrt).
if ((energy1_scale + energy2_scale) & 1) {
// If sum is odd, add 1 to make it even.
energy1_scale += 1;
}
int32_t scaled_energy1 = energy1 >> energy1_scale;
int32_t scaled_energy2 = energy2 >> energy2_scale;
int16_t sqrt_energy_product = static_cast<int16_t>(
VxAudioSpl_SqrtFloor(scaled_energy1 * scaled_energy2));
// Calculate max_correlation / sqrt(energy1 * energy2) in Q14.
int cc_shift = 14 - (energy1_scale + energy2_scale) / 2;
max_correlation = VXAUDIO_SPL_SHIFT_W32(max_correlation, cc_shift);
corr_coefficient =
VxAudioSpl_DivW32W16(max_correlation, sqrt_energy_product);
// Cap at 1.0 in Q14.
corr_coefficient = std::min(16384, corr_coefficient);
} else {
corr_coefficient = 0;
}
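A floating-point analog of the Q14 computation above (an illustrative sketch, not the production fixed-point code): the normalized correlation between the newest segment and the segment one pitch lag earlier, capped at 1.0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized correlation coefficient, corr / sqrt(energy1 * energy2),
// between the last `len` samples of x and the `len` samples one `lag`
// earlier. Returns 0 when either energy is non-positive, capped at 1.0.
double CorrCoefficient(const std::vector<double>& x, size_t lag, size_t len) {
  const size_t a = x.size() - len;  // newest segment
  const size_t b = a - lag;         // segment one lag earlier
  double corr = 0.0, energy1 = 0.0, energy2 = 0.0;
  for (size_t i = 0; i < len; ++i) {
    corr += x[a + i] * x[b + i];
    energy1 += x[a + i] * x[a + i];
    energy2 += x[b + i] * x[b + i];
  }
  if (energy1 <= 0.0 || energy2 <= 0.0) return 0.0;
  return std::min(1.0, corr / std::sqrt(energy1 * energy2));
}
```

For a perfectly periodic signal the coefficient at the true pitch lag is 1.0, which is why the fixed-point version caps at 16384 (1.0 in Q14).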
Computing expand_vector0 and expand_vector1
Two cases are distinguished here. The energies of the two adjacent pitch periods, energy1 and energy2, were computed above, and the two cases are separated by the relative size of energy1 and energy2.
- energy2 / 4 < energy1 < 4 * energy2
if ((energy1 / 4 < energy2) && (energy1 > energy2 / 4)) {
// Energy constraint fulfilled. Use both vectors and scale them
// accordingly.
int32_t scaled_energy2 = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
int32_t scaled_energy1 = scaled_energy2 - 13;
// Calculate scaled_energy1 / scaled_energy2 in Q13.
int32_t energy_ratio =
VxAudioSpl_DivW32W16(VXAUDIO_SPL_SHIFT_W32(energy1, -scaled_energy1),
static_cast<int16_t>(energy2 >> scaled_energy2));
// Calculate sqrt ratio in Q13 (sqrt of en1/en2 in Q26).
amplitude_ratio =
static_cast<int16_t>(VxAudioSpl_SqrtFloor(energy_ratio << 13));
// Copy the two vectors and give them the same energy.
parameters.expand_vector0->Clear();
parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t *>(vector1), expansion_length);
parameters.expand_vector1->Clear();
if (parameters.expand_vector1->Size() < expansion_length) {
parameters.expand_vector1->Extend(expansion_length -
parameters.expand_vector1->Size());
}
std::unique_ptr<int16_t[]> temp_1(new int16_t[expansion_length]);
VxAudioSpl_AffineTransformVector(
temp_1.get(), const_cast<int16_t *>(vector2), amplitude_ratio, 4096,
13, expansion_length);
parameters.expand_vector1->OverwriteAt(reinterpret_cast<const int8_t*>(temp_1.get()), expansion_length, 0);
}
size_t expansion_length = max_lag_ + overlap_length_;
expand_vector0 holds the last expansion_length samples of audio_history; expansion_length adds overlap_length_ samples (5 samples @ 8 kHz) on top of max_lag_.
amplitude_ratio is the amplitude gain from energy2 to energy1, i.e. the amplitude gain between the two adjacent pitch periods. expand_vector1 holds the pitch period preceding expand_vector0, multiplied by amplitude_ratio.
The goal is to keep the amplitudes of the two adjacent pitch periods as consistent as possible.
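The energy equalization can be sketched in floating point (illustrative; the production code does this in Q13 fixed point): scale the earlier pitch period by sqrt(energy1 / energy2) so its energy matches the most recent one.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scale vector2 (previous pitch period) so its energy matches vector1
// (most recent pitch period): multiply by sqrt(energy1 / energy2).
std::vector<double> MatchEnergy(const std::vector<double>& vector1,
                                const std::vector<double>& vector2) {
  double energy1 = 0.0, energy2 = 0.0;
  for (double v : vector1) energy1 += v * v;
  for (double v : vector2) energy2 += v * v;
  const double amplitude_ratio = std::sqrt(energy1 / energy2);
  std::vector<double> scaled(vector2.size());
  for (size_t i = 0; i < vector2.size(); ++i)
    scaled[i] = vector2[i] * amplitude_ratio;
  return scaled;
}
```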
- All other cases
Here energy1 < energy2 / 4 or energy1 > 4 * energy2.
else {
// Energy change constraint not fulfilled. Only use last vector.
parameters.expand_vector0->Clear();
parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t*>(vector1), expansion_length);
// Copy from expand_vector0 to expand_vector1.
parameters.expand_vector0->CopyTo(parameters.expand_vector1);
// Set the energy_ratio since it is used by muting slope.
if ((energy1 / 4 < energy2) || (energy2 == 0)) {
amplitude_ratio = 4096; // 0.5 in Q13.
} else {
amplitude_ratio = 16384; // 2.0 in Q13.
}
}
expand_vector0 and expand_vector1 hold the same data, both the last expansion_length samples of audio_history.
amplitude_ratio is also set: when energy1 is much smaller than energy2 it is set to 0.5; conversely, when energy1 is much larger than energy2, it is set to 2.0.
Setting the three lag values
The results are saved in the expand_lags_ array as three candidate pitch periods. The three values differ only slightly; why this is done is not known.
if (distortion_lag == correlation_lag) {
expand_lags_[0] = distortion_lag;
expand_lags_[1] = distortion_lag;
expand_lags_[2] = distortion_lag;
} else {
// |distortion_lag| and |correlation_lag| are not equal; use different
// combinations of the two.
// First lag is |distortion_lag| only.
expand_lags_[0] = distortion_lag;
// Second lag is the average of the two.
expand_lags_[1] = (distortion_lag + correlation_lag) / 2;
// Third lag is the average again, but rounding towards |correlation_lag|.
if (distortion_lag > correlation_lag) {
expand_lags_[2] = (distortion_lag + correlation_lag - 1) / 2;
} else {
expand_lags_[2] = (distortion_lag + correlation_lag + 1) / 2;
}
}
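The lag selection above can be written as a standalone function (a sketch mirroring the snippet; note that the equal-lag special case falls out of the general formulas):

```cpp
#include <array>
#include <cstddef>

// Three candidate lags: the first is distortion_lag, the second the
// average of the two lags, the third the average rounded towards
// correlation_lag. When the lags are equal, all three coincide.
std::array<size_t, 3> ExpandLags(size_t distortion_lag,
                                 size_t correlation_lag) {
  std::array<size_t, 3> lags;
  lags[0] = distortion_lag;
  lags[1] = (distortion_lag + correlation_lag) / 2;
  if (distortion_lag > correlation_lag)
    lags[2] = (distortion_lag + correlation_lag - 1) / 2;
  else
    lags[2] = (distortion_lag + correlation_lag + 1) / 2;
  return lags;
}
```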
Computing the AR filter coefficients
The Levinson-Durbin algorithm is used, with the result stored in parameters.ar_filter.
// Calculate the LPC and the gain of the filters.
// Calculate kUnvoicedLpcOrder + 1 lags of the auto-correlation function.
size_t temp_index =
signal_length - fs_mult_lpc_analysis_len - kUnvoicedLpcOrder;
// Copy signal to temporary vector to be able to pad with leading zeros.
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
&audio_history[temp_index + kUnvoicedLpcOrder],
sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
delete[] temp_signal;
// Verify that variance is positive.
if (auto_correlation[0] > 0) {
// Estimate AR filter parameters using Levinson-Durbin algorithm;
// kUnvoicedLpcOrder + 1 filter coefficients.
int16_t stability =
VxAudioSpl_LevinsonDurbin(auto_correlation, parameters.ar_filter,
reflection_coeff, kUnvoicedLpcOrder);
// Keep filter parameters only if filter is stable.
if (stability != 1) {
MLOGW("LevinsonDurbin is unstable.");
// Set first coefficient to 4096 (1.0 in Q12).
parameters.ar_filter[0] = 4096;
// Set remaining |kUnvoicedLpcOrder| coefficients to zero.
VxAudioSpl_MemSetW16(parameters.ar_filter + 1, 0, kUnvoicedLpcOrder);
}
}
When the Levinson-Durbin result is unstable, fixed values are assigned. The correlation length used for Levinson-Durbin is fs_mult * kLpcAnalysisLength, i.e. 20 ms of data.
In the following code, why are the first kUnvoicedLpcOrder samples of temp_signal zeroed?
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
&audio_history[temp_index + kUnvoicedLpcOrder],
sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
In my opinion it should instead be:
int16_t *temp_signal =
new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(temp_signal,
&audio_history[temp_index],
sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
CrossCorrelationWithAutoShift(
&temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
The two versions hardly differ in their results. Also, arguably the unvoiced part of audio_history should be used here, since the AR model targets the unvoiced data; it is not clear why it is done this way.
Random noise generation
The noise is extracted from RandomVector::kRandomTable and is later used to generate the unvoiced data; it serves the same purpose as "Generating random noise" in the non-first Expand path.
if (channel_ix == 0) {
// Extract a noise segment.
size_t noise_length;
if (distortion_lag < 40) {
noise_length = 2 * distortion_lag + 30;
} else {
noise_length = distortion_lag + 30;
}
if (noise_length <= RandomVector::kRandomTableSize) {
memcpy(random_vector, RandomVector::kRandomTable,
sizeof(int16_t) * noise_length);
} else {
// Only applies to SWB where length could be larger than
// |kRandomTableSize|.
memcpy(random_vector, RandomVector::kRandomTable,
sizeof(int16_t) * RandomVector::kRandomTableSize);
assert(noise_length <= kMaxSampleRate / 8000 * 120 + 30);
random_vector_->IncreaseSeedIncrement(2);
random_vector_->Generate(
noise_length - RandomVector::kRandomTableSize,
&random_vector[RandomVector::kRandomTableSize]);
}
}
The figure below shows the values of RandomVector::kRandomTable.
Saving ar_filter_state
memcpy(parameters.ar_filter_state,
&(audio_history[signal_length - kUnvoicedLpcOrder]),
sizeof(int16_t) * kUnvoicedLpcOrder);
The last kUnvoicedLpcOrder samples of audio_history are assigned to ar_filter_state, as shown in the figure below.
ar_filter_state is used later when estimating the unvoiced data: its contents are copied in front of unvoiced_vector.
Saving voice_mix_factor
// Calculate voice_mix_factor from corr_coefficient.
// Let x = corr_coefficient. Then, we compute:
// if (x > 0.48)
// voice_mix_factor = (-5179 + 19931x - 16422x^2 + 5776x^3) / 4096;
// else
// voice_mix_factor = 0;
if (corr_coefficient > 7875) {
int16_t x1, x2, x3;
// |corr_coefficient| is in Q14.
x1 = static_cast<int16_t>(corr_coefficient);
x2 = (x1 * x1) >> 14; // Shift 14 to keep result in Q14.
x3 = (x1 * x2) >> 14;
static const int kCoefficients[4] = {-5179, 19931, -16422, 5776};
int32_t temp_sum = kCoefficients[0] * 16384;
temp_sum += kCoefficients[1] * x1;
temp_sum += kCoefficients[2] * x2;
temp_sum += kCoefficients[3] * x3;
parameters.voice_mix_factor =
static_cast<int16_t>(std::min(temp_sum / 4096, 16384));
parameters.voice_mix_factor =
std::max(parameters.voice_mix_factor, static_cast<int16_t>(0));
} else {
parameters.voice_mix_factor = 0;
}
If the correlation coefficient is greater than 0.48, the two adjacent pitch periods are considered strongly correlated, which can be read as the speech being mostly voiced, so a cubic fit of the correlation coefficient gives the proportion of the voiced component in the total speech signal (unvoiced + voiced). Otherwise the two pitch periods are considered weakly correlated, the signal is treated as all unvoiced, and voice_mix_factor is set to 0.
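The cubic mapping can be extracted into a standalone function, keeping the Q14 fixed-point arithmetic of the snippet above (input expected in 0 ~ 16384, i.e. 0 ~ 1.0 in Q14):

```cpp
#include <algorithm>
#include <cstdint>

// corr_coefficient in Q14 (16384 == 1.0) -> voice_mix_factor in Q14,
// via the cubic -5179 + 19931x - 16422x^2 + 5776x^3 (then / 4096),
// with 0 returned below the ~0.48 threshold.
int16_t VoiceMixFactor(int32_t corr_coefficient) {
  if (corr_coefficient <= 7875)  // <= ~0.48: treat as unvoiced.
    return 0;
  const int16_t x1 = static_cast<int16_t>(corr_coefficient);
  const int16_t x2 = static_cast<int16_t>((x1 * x1) >> 14);  // x^2 in Q14
  const int16_t x3 = static_cast<int16_t>((x1 * x2) >> 14);  // x^3 in Q14
  static const int kCoefficients[4] = {-5179, 19931, -16422, 5776};
  int32_t temp_sum = kCoefficients[0] * 16384;
  temp_sum += kCoefficients[1] * x1;
  temp_sum += kCoefficients[2] * x2;
  temp_sum += kCoefficients[3] * x3;
  const int32_t mix = std::min(temp_sum / 4096, 16384);
  return static_cast<int16_t>(std::max(mix, 0));
}
```

At full correlation (16384, i.e. 1.0) the cubic slightly exceeds 1.0 and is capped at 16384; at 0.5 (8192) it yields 5612, roughly 0.34 in Q14.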
Saving mute_slope
The computation is shown in the code below.
There are two cases:
- slope > 1.5
mute_slope is defined as the slope at which the mute factor decreases from 1.0 to 1 / slope over distortion_lag samples.
When slope > 1.8 the slope is additionally divided by 2; when 1.5 < slope <= 1.8 it is additionally divided by 8.
- Otherwise
When slope <= 1.5, mute_slope is defined as the slope at which mute_factor decreases from 1.0 to slope over distortion_lag samples.
// Calculate muting slope. Reuse value from earlier scaling of
// |expand_vector0| and |expand_vector1|.
int16_t slope = amplitude_ratio;
if (slope > 12288) {
// slope > 1.5.
// Calculate (1 - (1 / slope)) / distortion_lag =
// (slope - 1) / (distortion_lag * slope).
// |slope| is in Q13, so 1 corresponds to 8192. Shift up to Q25 before
// the division.
// Shift the denominator from Q13 to Q5 before the division. The result of
// the division will then be in Q20.
int16_t denom = saturated_cast<int16_t>((distortion_lag * slope) >> 8);
int temp_ratio = VxAudioSpl_DivW32W16((slope - 8192) << 12, denom);
if (slope > 14746) {
// slope > 1.8.
// Divide by 2, with proper rounding.
parameters.mute_slope = (temp_ratio + 1) / 2;
} else {
// Divide by 8, with proper rounding.
parameters.mute_slope = (temp_ratio + 4) / 8;
}
parameters.onset = true;
} else {
// Calculate (1 - slope) / distortion_lag.
// Shift |slope| by 7 to Q20 before the division. The result is in Q20.
parameters.mute_slope = VxAudioSpl_DivW32W16(
(8192 - slope) * 128, static_cast<int16_t>(distortion_lag));
if (parameters.voice_mix_factor <= 13107) { // corresponding to 0.8
// Make sure the mute factor decreases from 1.0 to 0.9 in no more than
// 6.25 ms.
// mute_slope >= 0.005 / fs_mult in Q20.
parameters.mute_slope = std::max(static_cast<int>(5243 / fs_mult), parameters.mute_slope);
} else if (slope > 8028) { // corresponding to 0.98
parameters.mute_slope = 0;
}
parameters.onset = false;
}
In particular, a note on the following comment:
// Make sure the mute factor decreases from 1.0 to 0.9 in no more than
// 6.25 ms.
// mute_slope >= 0.005 / fs_mult in Q20.
Assuming a sample rate of 8 kHz, fs_mult = 1 and 6.25 ms corresponds to 50 samples, so the slope for decreasing from 1.0 to 0.9 is (1.0 - 0.9) / 50 = 0.002. Either the code or the comment is wrong here: either the comment should say a decrease from 1.0 to 0.75, or the 5243 in the code should be 2097.
The following comment and code appear elsewhere in Expand, which confirms that this is a typo here.
if (consecutive_expands_ == 7) {
// Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
// mute_slope = 0.0020 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, 2097 / fs_mult);
}
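The Q20 arithmetic behind this can be checked numerically (a small illustrative helper, not source code): a per-sample slope s in Q20 is round(s * 2^20), and a drop d over 6.25 ms at 8 kHz (50 samples) needs slope d / 50.

```cpp
#include <cmath>

// Per-sample muting slope in Q20 for a total amplitude drop `drop`
// spread over `samples` samples: round(drop / samples * 2^20).
int SlopeQ20(double drop, int samples) {
  return static_cast<int>(std::lround(drop / samples * (1 << 20)));
}
```

SlopeQ20(0.10, 50) gives 2097 (a 1.0 to 0.9 drop over 6.25 ms) while SlopeQ20(0.25, 50) gives 5243 (a 1.0 to 0.75 drop), consistent with the mismatch discussed above.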
Other processing
The remaining processing matches the steps after "Generating random noise" in the non-first Expand path; see the corresponding flow under "Non-first Expand".
Non-first Expand
Generating random noise
Random noise of length max_lag_ is generated; it is later fed through the AR filter to produce the unvoiced data.
size_t rand_length = max_lag_;
// This only applies to SWB where length could be larger than 256.
assert(rand_length <= kMaxSampleRate / 8000 * 120 + 30);
GenerateRandomVector(2, rand_length, random_vector);
Updating the lag index
current_lag_index_ = current_lag_index_ + lag_index_direction_;
// Change direction if needed.
if (current_lag_index_ <= 0) {
lag_index_direction_ = 1;
}
if (current_lag_index_ >= kNumLags - 1) {
lag_index_direction_ = -1;
}
Fetching the voiced data
For each channel, the voiced data is fetched according to current_lag_index_, mainly from expand_vector0 and expand_vector1.
If current_lag_index_ = 0, the voiced data equals expand_vector0; if current_lag_index_ = 1, it is 3/4 * expand_vector0 + 1/4 * expand_vector1; if current_lag_index_ = 2, it is 1/2 * expand_vector0 + 1/2 * expand_vector1.
The result is stored in voiced_vector_storage.
Why is it done this way?
if (current_lag_index_ == 0) {
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(voiced_vector_storage));
} else if (current_lag_index_ == 1) {
std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_0.get()));
std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t *>(temp_1.get()));
// Mix 3/4 of expand_vector0 with 1/4 of expand_vector1.
VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 3, temp_1.get(), 1, 2,
voiced_vector_storage, temp_length);
} else if (current_lag_index_ == 2) {
std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_0.get()));
std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
reinterpret_cast<int8_t*>(temp_1.get()));
VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 1, temp_1.get(), 1, 1,
voiced_vector_storage, temp_length);
}
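Judging from its name and these call sites, VxAudioSpl_ScaleAndAddVectorsWithRound presumably computes (in0*gain0 + in1*gain1 + rounding) >> shift per sample; a sketch under that assumption (gains 3 and 1 with shift 2 give the 3/4 + 1/4 mix, gains 1 and 1 with shift 1 the half-half mix):

```cpp
#include <cstdint>
#include <cstddef>

// Weighted mix with rounding:
// out[i] = (in0[i]*gain0 + in1[i]*gain1 + 2^(shift-1)) >> shift.
void ScaleAndAddWithRound(const int16_t* in0, int16_t gain0,
                          const int16_t* in1, int16_t gain1,
                          int shift, int16_t* out, size_t length) {
  const int32_t rounding = 1 << (shift - 1);
  for (size_t i = 0; i < length; ++i) {
    out[i] = static_cast<int16_t>(
        (in0[i] * gain0 + in1[i] * gain1 + rounding) >> shift);
  }
}
```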
Smoothing the overlap data in sync_buffer
When mute_factor is greater than 0.05 and current_voice_mix_factor is greater than 0.5, it is assumed that the speech amplitude needs some time to decay and that the voiced proportion exceeds 50%, so the overlap data in sync_buffer is smoothed by weighting voiced_vector against the overlap data already in sync_buffer.
// Smooth the expanded if it has not been muted to a low amplitude and
// |current_voice_mix_factor| is larger than 0.5.
if ((parameters.mute_factor > 819) &&
(parameters.current_voice_mix_factor > 8192)) {
size_t start_ix = sync_buffer_->Size() - overlap_length_;
for (size_t i = 0; i < overlap_length_; i++) {
// Do overlap add between new vector and overlap.
(*sync_buffer_)[channel_ix][start_ix + i] =
(((*sync_buffer_)[channel_ix][start_ix + i] * muting_window) +
(((parameters.mute_factor * voiced_vector_storage[i]) >> 14) *
unmuting_window) +
16384) >>
15;
muting_window += muting_window_increment;
unmuting_window += unmuting_window_increment;
}
} else if (parameters.mute_factor == 0) {
// The expanded signal will consist of only comfort noise if
// mute_factor = 0. Set the output length to 15 ms for best noise
// production.
// TODO(hlundin): This has been disabled since the length of
// parameters.expand_vector0 and parameters.expand_vector1 no longer
// match with expand_lags_, causing invalid reads and writes. Is it a good
// idea to enable this again, and solve the vector size problem?
// max_lag_ = fs_mult * 120;
// expand_lags_[0] = fs_mult * 120;
// expand_lags_[1] = fs_mult * 120;
// expand_lags_[2] = fs_mult * 120;
}
Fetching the unvoiced data
// Unvoiced part.
// Filter |scaled_random_vector| through |ar_filter_|.
memcpy(unvoiced_vector - kUnvoicedLpcOrder, parameters.ar_filter_state,
sizeof(int16_t) * kUnvoicedLpcOrder);
int32_t add_constant = 0;
if (parameters.ar_gain_scale > 0) {
add_constant = 1 << (parameters.ar_gain_scale - 1);
}
VxAudioSpl_AffineTransformVector(scaled_random_vector, random_vector,
parameters.ar_gain, add_constant,
parameters.ar_gain_scale, current_lag);
VxAudioSpl_FilterARFastQ12(scaled_random_vector, unvoiced_vector,
parameters.ar_filter, kUnvoicedLpcOrder + 1,
current_lag);
memcpy(parameters.ar_filter_state,
&(unvoiced_vector[current_lag - kUnvoicedLpcOrder]),
sizeof(int16_t) * kUnvoicedLpcOrder);
First random_vector goes through an affine transform to give scaled_random_vector, which is then passed through the AR filter to produce the unvoiced data unvoiced_vector.
At the same time, parameters.ar_filter_state is updated.
Mixing the unvoiced and voiced data
// Combine voiced and unvoiced contributions.
// Set a suitable cross-fading slope.
// For lag =
// <= 31 * fs_mult => go from 1 to 0 in about 8 ms;
// (>= 31 .. <= 63) * fs_mult => go from 1 to 0 in about 16 ms;
// >= 64 * fs_mult => go from 1 to 0 in about 32 ms.
// temp_shift = getbits(max_lag_) - 5.
int temp_shift =
(31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;
int16_t mix_factor_increment = 256 >> temp_shift;
if (stop_muting_) {
mix_factor_increment = 0;
}
// Create combined signal by shifting in more and more of unvoiced part.
temp_shift = 8 - temp_shift; // = getbits(mix_factor_increment).
size_t temp_length =
(parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
temp_shift;
temp_length = std::min(temp_length, current_lag);
DspHelper::CrossFade(voiced_vector, unvoiced_vector, temp_length,
¶meters.current_voice_mix_factor,
mix_factor_increment, temp_data);
First the cross-fade length for the unvoiced and voiced data, temp_length, must be computed. The idea is roughly as follows:
1. max_lag_ corresponds to the lag at the 4 kHz sampling rate, so lag <= 31 * fs_mult means a pitch period below 8 ms;
31 <= lag <= 63 means a pitch period of 8 ms ~ 16 ms;
lag >= 64 means a pitch period of 16 ms ~ 32 ms (the human pitch period is roughly in the 2.5 ms ~ 16 ms range).
2. temp_shift = getbits(max_lag_) - 5. Why the number of bits of max_lag_ minus 5? It looks like fixed-point scaling, but the exact reason is not apparent.
int temp_shift =
(31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;
3. voice_mix_factor is assumed to change linearly; the distance from the previous voice_mix_factor to the current one gives the length over which unvoiced and voiced are cross-faded.
size_t temp_length =
(parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
temp_shift;
parameters.current_voice_mix_factor is the current proportion of the voiced component; temp_length is the length over which voiced and unvoiced are cross-faded, and the final result is stored in temp_data.
Why does expand carry overlap extra samples? To reduce the boundary effect at the starting position when the voiced and unvoiced data are blended.
As the diagram above illustrates, voiced_vector_storage takes the last current_lag + overlap_length_ samples of audio_history, unvoiced_vector holds current_lag samples of unvoiced data, and the unvoiced/voiced cross-fade skips the first overlap_length_ samples of the voiced data.
When the cross-faded length is less than current_lag, the remaining samples still need handling: they are obtained by weighting the un-faded parts of voiced_vector and unvoiced_vector with a fixed mixing factor.
// End of cross-fading period was reached before end of expanded signal
// path. Mix the rest with a fixed mixing factor.
if (temp_length < current_lag) {
if (mix_factor_increment != 0) {
parameters.current_voice_mix_factor = parameters.voice_mix_factor;
}
int16_t temp_scale = 16384 - parameters.current_voice_mix_factor;
VxAudioSpl_ScaleAndAddVectorsWithRound(
voiced_vector + temp_length, parameters.current_voice_mix_factor,
unvoiced_vector + temp_length, temp_scale, 14,
temp_data + temp_length, current_lag - temp_length);
}
Updating the muting slope
The update depends on how many consecutive expands have been performed; the update algorithm remains to be studied.
// Select muting slope depending on how many consecutive expands we have
// done.
if (consecutive_expands_ == 3) {
// Let the mute factor decrease from 1.0 to 0.95 in 6.25 ms.
// mute_slope = 0.0010 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(1049 / fs_mult));
}
if (consecutive_expands_ == 7) {
// Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
// mute_slope = 0.0020 / fs_mult in Q20.
parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(2097 / fs_mult));
}
// Mute segment according to slope value.
if ((consecutive_expands_ != 0) || !parameters.onset) {
// Mute to the previous level, then continue with the muting.
VxAudioSpl_AffineTransformVector(
temp_data, temp_data, parameters.mute_factor, 8192, 14, current_lag);
if (!stop_muting_) {
DspHelper::MuteSignal(temp_data, parameters.mute_slope, current_lag);
// Shift by 6 to go from Q20 to Q14.
// TODO(hlundin): Adding 8192 before shifting 6 steps seems wrong.
// Legacy.
int16_t gain = static_cast<int16_t>(
16384 - (((current_lag * parameters.mute_slope) + 8192) >> 6));
gain = ((gain * parameters.mute_factor) + 8192) >> 14;
// Guard against getting stuck with very small (but sometimes audible)
// gain.
if ((consecutive_expands_ > 3) && (gain >= parameters.mute_factor)) {
parameters.mute_factor = 0;
} else {
parameters.mute_factor = gain;
}
}
}
Generating background noise
// Background noise part.
GenerateBackgroundNoise(
random_vector, channel_ix, channel_parameters_[channel_ix].mute_slope,
TooManyExpands(), current_lag, unvoiced_array_memory);
The result is stored at unvoiced_array_memory + kNoiseLpcOrder.
Adding the background noise to temp_data, with the result written to algorithm_buffer
// Add background noise to the combined voiced-unvoiced signal.
for (size_t i = 0; i < current_lag; i++) {
temp_data[i] = temp_data[i] + noise_vector[i];
}
if (channel_ix == 0) {
output->AssertSize(current_lag);
} else {
assert(output->Size() == current_lag);
}
(*output)[channel_ix].OverwriteAt(reinterpret_cast<const int8_t *>(temp_data), current_lag, 0);