NetEq (5) ---- Algorithm Processing: Expand

Expand is relatively complex: when data is missing, it uses previously received data to reconstruct the current data.

Speech signals are divided into unvoiced and voiced sounds. Voiced sounds are produced by periodic vibration of the vocal cords and show clear periodicity; the corresponding frequency is called the fundamental (pitch) frequency, and the corresponding period the pitch period. Unvoiced sounds are produced without vocal-cord vibration and have no periodicity. Voiced sounds carry most of the energy of a speech signal.

When packets are lost, the prediction of the missing data rests mainly on the following:

Speech signal = unvoiced + voiced. The unvoiced part resembles noise and is obtained by passing random noise through an AR filter; the voiced part is quasi-periodic and can be replaced by the signal at the corresponding position one pitch period earlier. This is the core idea of the Expand algorithm.

A key step is therefore estimating the pitch period. The human pitch frequency ranges roughly from 60 Hz to 400 Hz, i.e. pitch periods of 2.5 ms to 16.67 ms. The code approximates the upper bound of the pitch period as 15 ms. Estimating the pitch period requires at least two periods, so at least 30 ms of data is needed.
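
As a quick sanity check, here is a minimal sketch (my own illustration, not NetEq source) converting these bounds into sample counts; fs_mult is the same multiplier used throughout the code (sample rate / 8000):

// Sketch (assumption, not from NetEq): sample counts implied by the
// pitch bounds above, at a sample rate of 8000 * fs_mult Hz.
constexpr int MinPitchSamples(int fs_mult) { return 20 * fs_mult; }   // 2.5 ms
constexpr int MaxPitchSamples(int fs_mult) { return 120 * fs_mult; }  // 15 ms cap
constexpr int MinHistorySamples(int fs_mult) {
    return 2 * MaxPitchSamples(fs_mult);  // two full periods
}
static_assert(MinHistorySamples(1) == 240, "30 ms at 8 kHz");
static_assert(MinHistorySamples(2) == 480, "30 ms at 16 kHz");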

First Expand

On the first packet-loss concealment, Expand::AnalyzeSignal is entered; in all other cases only noise needs to be generated. The main purpose of Expand::AnalyzeSignal is to generate the AR filter coefficients, so for the first lost packet the function to study is Expand::AnalyzeSignal.

AnalyzeSignal

Fetching historical audio data

When Expand runs for the first time, the AR model coefficients must be derived from historical data. The first task is to estimate the pitch period, which is done by finding the lag with maximum correlation; this requires two pitch periods of data. As noted above, the maximum pitch period is taken to be 15 ms, so 30 ms of data is used.

The relevant code is as follows:

const size_t signal_length = static_cast<size_t>(256 * fs_mult);
const size_t audio_history_position = sync_buffer_->Size() - signal_length;
std::unique_ptr<int16_t[]> audio_history(new int16_t[signal_length]);
(*sync_buffer_)[0].CopyTo(signal_length, audio_history_position,
                          audio_history.get());

Why take 256 * fs_mult samples here? In theory at least 30 ms of data is needed, i.e. 240 * fs_mult samples, but there are also 5 samples of overlap, so at least 245 * fs_mult samples are required; 256 * fs_mult is used for computational convenience.

Computing the correlation

The signal is first downsampled to 4 kHz and then correlated, with a correlation length of 60 samples. Only the last 248 * fs_mult samples are downsampled, yielding 124 samples. In the end 54 correlation values are computed and stored in correlation_vector.

Why start from a lag of 10 samples? At 4 kHz, 10 samples correspond to a 2.5 ms pitch period; the maximum lag of 64 samples corresponds to a 16 ms pitch period.
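
To make this concrete, here is a minimal floating-point sketch of a correlation-based pitch search over lags 10..63 at 4 kHz (an illustration under these assumptions, not the fixed-point NetEq code):

#include <cstddef>
#include <vector>

// Sketch (assumption): pick the lag in [10, 63] samples (2.5..16 ms at
// 4 kHz) whose correlation with the newest 60 samples is largest.
// Assumes x holds at least 124 downsampled samples.
size_t FindPitchLag(const std::vector<float>& x) {
    const size_t kCorrLen = 60;  // correlation length in samples
    const size_t n = x.size();
    size_t best_lag = 10;
    float best_corr = -1e30f;
    for (size_t lag = 10; lag <= 63; ++lag) {
        float corr = 0.f;
        // Correlate the newest kCorrLen samples against the same segment
        // shifted back by |lag|.
        for (size_t i = 0; i < kCorrLen; ++i) {
            corr += x[n - kCorrLen + i] * x[n - kCorrLen + i - lag];
        }
        if (corr > best_corr) {
            best_corr = corr;
            best_lag = lag;
        }
    }
    return best_lag;  // pitch period estimate in samples at 4 kHz
}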

Peak detection

Some fitting/interpolation is applied to the 54 correlation values, and the three largest peaks are selected. The peak positions are stored in best_correlation_index[0], best_correlation_index[1], and best_correlation_index[2], and the corresponding values in best_correlation[0], best_correlation[1], and best_correlation[2].

Fine-tuning the peak positions

The peaks above were selected by correlation value. Here, each peak position is further refined by searching within ±0.5 ms around it for the position with minimum distortion: since a voiced signal is periodic, it should show not only strong correlation but also strong similarity.

Using the last 2.5 ms of audio_history as the reference, 2.5 ms segments are taken within ±0.5 ms of each correlation peak, and the sum of absolute differences against the 2.5 ms reference is computed; the position with the smallest sum is recorded. This yields three new peak positions and minimum distortion values, denoted best_distortion_index[0]~best_distortion_index[2] and best_distortion_w32[0]~best_distortion_w32[2].

Why only 2.5 ms of data? Because the minimum human pitch period is taken to be 2.5 ms, comparing 2.5 ms of data suffices for the similarity check.

Correlation and distortion are then weighed together, with correlation value / distortion value as the criterion; the best of the three candidates is selected and recorded as best_index, with the best ratio recorded as best_ratio.

distortion_lag is the pitch period obtained from similarity, and correlation_lag the one obtained from correlation. The two differ very little (within 0.5 ms); max_lag_ is the larger of the two.

size_t distortion_lag = best_distortion_index[best_index];
size_t correlation_lag = best_correlation_index[best_index];
max_lag_ = std::max(distortion_lag, correlation_lag);
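
The refinement and selection can be sketched in floating point as follows (an illustration under assumptions, not the NetEq source): each correlation peak is refined by minimizing the sum of absolute differences within ±0.5 ms, and the candidates are then rated by correlation / distortion:

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch (assumption): refine one correlation peak by minimizing the sum
// of absolute differences (SAD) of a 2.5 ms segment against the newest
// 2.5 ms of history, searching within +-0.5 ms of the peak.
// Assumes corr_lag > half_range and hist is long enough.
struct Refined {
    size_t lag;
    float distortion;
};

Refined RefineLag(const std::vector<float>& hist, size_t corr_lag,
                  size_t seg_len /* 2.5 ms */, size_t half_range /* 0.5 ms */) {
    const size_t n = hist.size();
    Refined best{corr_lag, 1e30f};
    for (size_t lag = corr_lag - half_range; lag <= corr_lag + half_range; ++lag) {
        float sad = 0.f;
        for (size_t i = 0; i < seg_len; ++i) {
            // Compare the newest seg_len samples with the segment |lag| earlier.
            sad += std::fabs(hist[n - seg_len + i] - hist[n - seg_len + i - lag]);
        }
        if (sad < best.distortion) best = {lag, sad};
    }
    return best;
}
// The three refined candidates are then ranked by correlation / distortion;
// the largest ratio gives best_index and best_ratio.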

Saving ChannelParameters

struct ChannelParameters {
    ChannelParameters();
    int16_t mute_factor;
    int16_t ar_filter[kUnvoicedLpcOrder + 1];
    int16_t ar_filter_state[kUnvoicedLpcOrder];
    int16_t ar_gain;
    int16_t ar_gain_scale;
    int16_t voice_mix_factor;         /* Q14 */
    int16_t current_voice_mix_factor; /* Q14 */
    AudioVector *expand_vector0;
    AudioVector *expand_vector1;
    bool onset;
    int mute_slope; /* Q20 */
};

The computation above mainly serves to estimate the pitch period, and ultimately to fill in the ChannelParameters structure. The meanings of its fields are as follows:

  • mute_factor
  • ar_filter

AR filter coefficients; the array size is kUnvoicedLpcOrder + 1.

  • ar_filter_state

The last kUnvoicedLpcOrder samples of audio_history.

  • ar_gain
  • ar_gain_scale
  • voice_mix_factor

The amplitude proportion of the voiced component in the previous packet's speech signal (unvoiced + voiced).

  • current_voice_mix_factor

The current proportion of voiced relative to unvoiced; the same concept as voice_mix_factor, except that it records the current value.

  • expand_vector0和expand_vector1

See "Computing expand_vector0 and expand_vector1" below for details: expand_vector1 and expand_vector0 hold the last two max_lag_ + overlap_ segments of audio_history, respectively.

  • onset
  • mute_slope

For each channel, the parameters are computed in the following steps:

Computing best_index

Between distortion_lag and correlation_lag, find the lag with the largest correlation, recorded as best_index.

correlation_length = std::max(std::min(distortion_lag + kMinLag, fs_mult_120),
                                        static_cast<size_t>(kMaxLag * fs_mult));
size_t start_index = std::min(distortion_lag, correlation_lag);
size_t correlation_lags = static_cast<size_t>(
    VXAUDIO_SPL_ABS_W16((distortion_lag - correlation_lag)) + 1);

Because the amount of computation here is small, no downsampling is applied. Regarding the correlation length computation:

The original WebRTC code is shown below. The correlation length ranges from 7.5 ms to 15 ms: 60 * fs_mult corresponds to 7.5 ms and fs_mult_120 to 15 ms. The 10 in distortion_lag + 10 has no special meaning; it merely adds a small margin.

correlation_length = std::max(std::min(distortion_lag + 10, fs_mult_120),
                              static_cast<size_t>(60 * fs_mult));

Computing the signal energies and correlation coefficient

int32_t energy1 = VxAudioSpl_DotProductWithScale(
    &(audio_history[signal_length - correlation_length]),
    &(audio_history[signal_length - correlation_length]),
    correlation_length, correlation_scale);
int32_t energy2 = VxAudioSpl_DotProductWithScale(
    &(audio_history[signal_length - correlation_length - best_index]),
    &(audio_history[signal_length - correlation_length - best_index]),
    correlation_length, correlation_scale);

// Calculate the correlation coefficient between the two portions of the
// signal.
int32_t corr_coefficient;
if ((energy1 > 0) && (energy2 > 0)) {
    int energy1_scale = std::max(16 - VxAudioSpl_NormW32(energy1), 0);
    int energy2_scale = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
    // Make sure total scaling is even (to simplify scale factor after sqrt).
    if ((energy1_scale + energy2_scale) & 1) {
        // If sum is odd, add 1 to make it even.
        energy1_scale += 1;
    }
    int32_t scaled_energy1 = energy1 >> energy1_scale;
    int32_t scaled_energy2 = energy2 >> energy2_scale;
    int16_t sqrt_energy_product = static_cast<int16_t>(
        VxAudioSpl_SqrtFloor(scaled_energy1 * scaled_energy2));
    // Calculate max_correlation / sqrt(energy1 * energy2) in Q14.
    int cc_shift = 14 - (energy1_scale + energy2_scale) / 2;
    max_correlation = VXAUDIO_SPL_SHIFT_W32(max_correlation, cc_shift);
    corr_coefficient =
        VxAudioSpl_DivW32W16(max_correlation, sqrt_energy_product);
    // Cap at 1.0 in Q14.
    corr_coefficient = std::min(16384, corr_coefficient);
} else {
    corr_coefficient = 0;
}
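
In floating point, this fixed-point block computes nothing more than the normalized cross-correlation, capped at 1.0 (16384 in Q14). A minimal sketch under that reading (illustration, not the NetEq source):

#include <algorithm>
#include <cmath>

// Sketch (assumption): float equivalent of the Q14 block above.
// corr_coefficient = max_correlation / sqrt(energy1 * energy2), in [0, 1].
float CorrCoefficient(float max_correlation, float energy1, float energy2) {
    if (energy1 <= 0.f || energy2 <= 0.f) return 0.f;
    const float c = max_correlation / std::sqrt(energy1 * energy2);
    return std::min(c, 1.f);  // the fixed-point code caps at 16384 (1.0 in Q14)
}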

Computing expand_vector0 and expand_vector1

There are two cases here. The energies of the two adjacent pitch periods, energy1 and energy2, were computed above; the two cases are distinguished by the relative magnitude of energy1 and energy2.

  • energy2 / 4 < energy1 < 4 * energy2
if ((energy1 / 4 < energy2) && (energy1 > energy2 / 4)) {
    // Energy constraint fulfilled. Use both vectors and scale them
    // accordingly.
    int32_t scaled_energy2 = std::max(16 - VxAudioSpl_NormW32(energy2), 0);
    int32_t scaled_energy1 = scaled_energy2 - 13;
    // Calculate scaled_energy1 / scaled_energy2 in Q13.
    int32_t energy_ratio =
        VxAudioSpl_DivW32W16(VXAUDIO_SPL_SHIFT_W32(energy1, -scaled_energy1),
                             static_cast<int16_t>(energy2 >> scaled_energy2));
    // Calculate sqrt ratio in Q13 (sqrt of en1/en2 in Q26).
    amplitude_ratio =
        static_cast<int16_t>(VxAudioSpl_SqrtFloor(energy_ratio << 13));
    // Copy the two vectors and give them the same energy.
    parameters.expand_vector0->Clear();
    parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t *>(vector1), expansion_length);
    parameters.expand_vector1->Clear();
    if (parameters.expand_vector1->Size() < expansion_length) {
        parameters.expand_vector1->Extend(expansion_length -
                                          parameters.expand_vector1->Size());
    }
    std::unique_ptr<int16_t[]> temp_1(new int16_t[expansion_length]);
    VxAudioSpl_AffineTransformVector(
        temp_1.get(), const_cast<int16_t *>(vector2), amplitude_ratio, 4096,
        13, expansion_length);
    parameters.expand_vector1->OverwriteAt(reinterpret_cast<const int8_t*>(temp_1.get()), expansion_length, 0);
}

size_t expansion_length = max_lag_ + overlap_length_;

expand_vector0 holds the last expansion_length samples of audio_history; expansion_length adds overlap_length_ samples (5 samples @ 8 kHz) on top of max_lag_.

amplitude_ratio is the amplitude gain from energy2 to energy1, i.e. the gain between two adjacent pitch periods. expand_vector1 holds the pitch period preceding expand_vector0, multiplied by amplitude_ratio. The goal is to make the amplitudes of the two adjacent pitch periods as consistent as possible (a floating-point sketch of this scaling appears after the second case below).

  • All other cases

In this case, energy1 < energy2 / 4 or energy1 > 4 * energy2.

else {
    // Energy change constraint not fulfilled. Only use last vector.
    parameters.expand_vector0->Clear();
    parameters.expand_vector0->PushBack(reinterpret_cast<const int8_t*>(vector1), expansion_length);
    // Copy from expand_vector0 to expand_vector1.
    parameters.expand_vector0->CopyTo(parameters.expand_vector1);
    // Set the energy_ratio since it is used by muting slope.
    if ((energy1 / 4 < energy2) || (energy2 == 0)) {
        amplitude_ratio = 4096;  // 0.5 in Q13.
    } else {
        amplitude_ratio = 16384;  // 2.0 in Q13.
    }
}

expand_vector0 and expand_vector1 are identical, both holding the last expansion_length samples of audio_history.

amplitude_ratio is also set: when energy1 is much smaller than energy2 it is set to 0.5; conversely, when energy1 is much larger than energy2 it is set to 2.0.
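
Returning to the first case: in floating point, the scaling applied to expand_vector1 is simply the square root of the energy ratio. A minimal sketch mirroring the Q13 code (an assumption, not the NetEq source):

#include <cmath>
#include <cstddef>

// Sketch (assumption): give the previous pitch period (vector2) the same
// energy as the newest one (vector1) by scaling with sqrt(energy1 / energy2).
// Assumes energy2 > 0.
void MatchEnergy(const float* vector2, float* out, size_t len,
                 float energy1, float energy2) {
    const float amplitude_ratio = std::sqrt(energy1 / energy2);
    for (size_t i = 0; i < len; ++i) {
        out[i] = vector2[i] * amplitude_ratio;
    }
}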

Setting the three lag values

The results are stored in the expand_lags_ array as three candidate pitch periods. The three values differ only slightly; the reason for this design is not clear.

if (distortion_lag == correlation_lag) {
    expand_lags_[0] = distortion_lag;
    expand_lags_[1] = distortion_lag;
    expand_lags_[2] = distortion_lag;
} else {
    // |distortion_lag| and |correlation_lag| are not equal; use different
    // combinations of the two.
    // First lag is |distortion_lag| only.
    expand_lags_[0] = distortion_lag;
    // Second lag is the average of the two.
    expand_lags_[1] = (distortion_lag + correlation_lag) / 2;
    // Third lag is the average again, but rounding towards |correlation_lag|.
    if (distortion_lag > correlation_lag) {
        expand_lags_[2] = (distortion_lag + correlation_lag - 1) / 2;
    } else {
        expand_lags_[2] = (distortion_lag + correlation_lag + 1) / 2;
    }
}

Computing the AR filter coefficients

The Levinson-Durbin algorithm is used; the result is stored in parameters.ar_filter.

// Calculate the LPC and the gain of the filters.

// Calculate kUnvoicedLpcOrder + 1 lags of the auto-correlation function.
size_t temp_index =
    signal_length - fs_mult_lpc_analysis_len - kUnvoicedLpcOrder;
// Copy signal to temporary vector to be able to pad with leading zeros.
int16_t *temp_signal =
    new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
       sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
       &audio_history[temp_index + kUnvoicedLpcOrder],
       sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
    &temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
    fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);
delete[] temp_signal;

// Verify that variance is positive.
if (auto_correlation[0] > 0) {
    // Estimate AR filter parameters using Levinson-Durbin algorithm;
    // kUnvoicedLpcOrder + 1 filter coefficients.
    int16_t stability =
        VxAudioSpl_LevinsonDurbin(auto_correlation, parameters.ar_filter,
                                  reflection_coeff, kUnvoicedLpcOrder);

    // Keep filter parameters only if filter is stable.
    if (stability != 1) {
        MLOGW("LevinsonDurbin is unstable.");
        // Set first coefficient to 4096 (1.0 in Q12).
        parameters.ar_filter[0] = 4096;
        // Set remaining |kUnvoicedLpcOrder| coefficients to zero.
        VxAudioSpl_MemSetW16(parameters.ar_filter + 1, 0, kUnvoicedLpcOrder);
    }
}

If the Levinson-Durbin result is unstable, fixed values are assigned instead. The autocorrelation analysis length for Levinson-Durbin is fs_mult * kLpcAnalysisLength, i.e. 20 ms of data.
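
For reference, a compact textbook floating-point Levinson-Durbin recursion (a sketch, not the fixed-point VxAudioSpl_LevinsonDurbin), which solves the normal equations given the autocorrelation lags:

#include <cstddef>
#include <vector>

// Textbook sketch (not VxAudioSpl_LevinsonDurbin): given autocorrelation
// r[0..order], compute LPC coefficients a[0..order] with a[0] = 1.
// Returns false if the recursion becomes unstable (prediction error <= 0),
// analogous to the stability check above.
bool LevinsonDurbin(const std::vector<double>& r, size_t order,
                    std::vector<double>* a) {
    a->assign(order + 1, 0.0);
    (*a)[0] = 1.0;
    double err = r[0];
    if (err <= 0.0) return false;
    for (size_t i = 1; i <= order; ++i) {
        // Reflection coefficient k_i from the current residual correlation.
        double acc = r[i];
        for (size_t j = 1; j < i; ++j) acc += (*a)[j] * r[i - j];
        const double k = -acc / err;
        // Symmetric in-place coefficient update.
        for (size_t j = 1; j <= i / 2; ++j) {
            const double tmp = (*a)[j] + k * (*a)[i - j];
            (*a)[i - j] += k * (*a)[j];
            (*a)[j] = tmp;
        }
        (*a)[i] = k;
        err *= (1.0 - k * k);
        if (err <= 0.0) return false;
    }
    return true;
}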

Why does the following code zero the first kUnvoicedLpcOrder entries of temp_signal?

int16_t *temp_signal =
    new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
       sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(&temp_signal[kUnvoicedLpcOrder],
       &audio_history[temp_index + kUnvoicedLpcOrder],
       sizeof(int16_t) * fs_mult_lpc_analysis_len);
CrossCorrelationWithAutoShift(
    &temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
    fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);

In my view it should be changed to:

int16_t *temp_signal =
    new int16_t[fs_mult_lpc_analysis_len + kUnvoicedLpcOrder];
memset(temp_signal, 0,
       sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
memcpy(temp_signal,
       &audio_history[temp_index],
       sizeof(int16_t) * (fs_mult_lpc_analysis_len + kUnvoicedLpcOrder));
CrossCorrelationWithAutoShift(
    &temp_signal[kUnvoicedLpcOrder], &temp_signal[kUnvoicedLpcOrder],
    fs_mult_lpc_analysis_len, kUnvoicedLpcOrder + 1, -1, auto_correlation);

The difference between the two has little effect on the result in any case. Another point: arguably the unvoiced portion of audio_history should be used here, since the AR model targets unvoiced data; it is not clear why the code does it this way.

Random noise generation

The noise is taken from RandomVector::kRandomTable; it is used later to generate the unvoiced data, serving the same purpose as the "Generating random noise" step of the non-first Expand path.

if (channel_ix == 0) {
    // Extract a noise segment.
    size_t noise_length;
    if (distortion_lag < 40) {
        noise_length = 2 * distortion_lag + 30;
    } else {
        noise_length = distortion_lag + 30;
    }
    if (noise_length <= RandomVector::kRandomTableSize) {
        memcpy(random_vector, RandomVector::kRandomTable,
               sizeof(int16_t) * noise_length);
    } else {
        // Only applies to SWB where length could be larger than
        // |kRandomTableSize|.
        memcpy(random_vector, RandomVector::kRandomTable,
               sizeof(int16_t) * RandomVector::kRandomTableSize);
        assert(noise_length <= kMaxSampleRate / 8000 * 120 + 30);
        random_vector_->IncreaseSeedIncrement(2);
        random_vector_->Generate(
            noise_length - RandomVector::kRandomTableSize,
            &random_vector[RandomVector::kRandomTableSize]);
    }
}

(Figure omitted: the contents of RandomVector::kRandomTable.)

Saving ar_filter_state

memcpy(parameters.ar_filter_state,
       &(audio_history[signal_length - kUnvoicedLpcOrder]),
       sizeof(int16_t) * kUnvoicedLpcOrder);

The last kUnvoicedLpcOrder samples of audio_history are copied into ar_filter_state.

ar_filter_state is used later, when estimating the unvoiced data, by copying its contents in front of unvoiced_vector.

Saving voice_mix_factor

// Calculate voice_mix_factor from corr_coefficient.
// Let x = corr_coefficient. Then, we compute:
// if (x > 0.48)
//   voice_mix_factor = (-5179 + 19931x - 16422x^2 + 5776x^3) / 4096;
// else
//   voice_mix_factor = 0;
if (corr_coefficient > 7875) {
    int16_t x1, x2, x3;
    // |corr_coefficient| is in Q14.
    x1 = static_cast<int16_t>(corr_coefficient);
    x2 = (x1 * x1) >> 14;  // Shift 14 to keep result in Q14.
    x3 = (x1 * x2) >> 14;
    static const int kCoefficients[4] = {-5179, 19931, -16422, 5776};
    int32_t temp_sum = kCoefficients[0] * 16384;
    temp_sum += kCoefficients[1] * x1;
    temp_sum += kCoefficients[2] * x2;
    temp_sum += kCoefficients[3] * x3;
    parameters.voice_mix_factor =
        static_cast<int16_t>(std::min(temp_sum / 4096, 16384));
    parameters.voice_mix_factor =
        std::max(parameters.voice_mix_factor, static_cast<int16_t>(0));
} else {
    parameters.voice_mix_factor = 0;
}

If the correlation coefficient exceeds 0.48, the two adjacent pitch periods are considered strongly correlated, which can be read as the speech being mostly voiced; the proportion of voiced signal in the total (unvoiced + voiced) is then obtained from the correlation coefficient via a cubic polynomial fit. Otherwise the correlation is considered weak, the signal is treated as all unvoiced, and voice_mix_factor is set to 0.
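
In floating point the mapping is just the cubic from the code comment, clamped to [0, 1]; a minimal sketch (assumption):

#include <algorithm>

// Sketch (assumption): float equivalent of the Q14 code above. x is the
// correlation coefficient in [0, 1]; 0.48 corresponds to 7875 in Q14.
float VoiceMixFactor(float x) {
    if (x <= 0.48f) return 0.f;
    // Same polynomial as the comment:
    // (-5179 + 19931x - 16422x^2 + 5776x^3) / 4096.
    const float f =
        (-5179.f + 19931.f * x - 16422.f * x * x + 5776.f * x * x * x) / 4096.f;
    return std::min(std::max(f, 0.f), 1.f);  // the Q14 code clamps to [0, 16384]
}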

Saving mute_slope

The computation is shown in the code below.

Two cases are distinguished:

  • slope > 1.5

mute_slope is defined as the slope with which the mute factor decreases from 1.0 to 1 / slope over distortion_lag samples.

When slope > 1.8, the slope is further divided by 2; when 1.5 < slope <= 1.8, it is further divided by 8.

  • Otherwise

When slope <= 1.5, mute_slope is defined as the slope with which mute_factor decreases from 1.0 to slope over distortion_lag samples.

// Calculate muting slope. Reuse value from earlier scaling of
// |expand_vector0| and |expand_vector1|.
int16_t slope = amplitude_ratio;
if (slope > 12288) {
    // slope > 1.5.
    // Calculate (1 - (1 / slope)) / distortion_lag =
    // (slope - 1) / (distortion_lag * slope).
    // |slope| is in Q13, so 1 corresponds to 8192. Shift up to Q25 before
    // the division.
    // Shift the denominator from Q13 to Q5 before the division. The result of
    // the division will then be in Q20.
    int16_t denom = saturated_cast<int16_t>((distortion_lag * slope) >> 8);
    int temp_ratio = VxAudioSpl_DivW32W16((slope - 8192) << 12, denom);
    if (slope > 14746) {
        // slope > 1.8.
        // Divide by 2, with proper rounding.
        parameters.mute_slope = (temp_ratio + 1) / 2;
    } else {
        // Divide by 8, with proper rounding.
        parameters.mute_slope = (temp_ratio + 4) / 8;
    }
    parameters.onset = true;
} else {
    // Calculate (1 - slope) / distortion_lag.
    // Shift |slope| by 7 to Q20 before the division. The result is in Q20.
    parameters.mute_slope = VxAudioSpl_DivW32W16(
        (8192 - slope) * 128, static_cast<int16_t>(distortion_lag));
    if (parameters.voice_mix_factor <= 13107) {  // Corresponding to 0.8.
        // Make sure the mute factor decreases from 1.0 to 0.9 in no more than
        // 6.25 ms.
        // mute_slope >= 0.005 / fs_mult in Q20.
        parameters.mute_slope =
            std::max(static_cast<int>(5243 / fs_mult), parameters.mute_slope);
    } else if (slope > 8028) {  // Corresponding to 0.98.
        parameters.mute_slope = 0;
    }
    parameters.onset = false;
}

In particular, it is worth explaining the following comment:

// Make sure the mute factor decreases from 1.0 to 0.9 in no more than
// 6.25 ms.
// mute_slope >= 0.005 / fs_mult in Q20.

Assume an 8 kHz sample rate, so fs_mult = 1 and 6.25 ms corresponds to 50 samples. The slope for decreasing from 1.0 to 0.9 is then (1.0 - 0.9) / 50 = 0.002, so either the code or the comment is wrong here: either the comment should say a decrease from 1.0 to 0.75, or the 5243 in the code should be 2097.
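
A quick check of the Q20 arithmetic behind this claim (a hypothetical helper, written only to verify the constants):

#include <cmath>
#include <cstdio>

// Hypothetical helper: the Q20 slope needed to drop the mute factor by
// |drop| over |ms| milliseconds at 8 kHz (fs_mult = 1).
int SlopeQ20(double drop, double ms) {
    const double samples = ms * 8.0;  // 6.25 ms -> 50 samples
    return static_cast<int>(std::lround(drop / samples * (1 << 20)));
}

int main() {
    std::printf("1.0 -> 0.9  in 6.25 ms: %d\n", SlopeQ20(0.10, 6.25));  // 2097
    std::printf("1.0 -> 0.75 in 6.25 ms: %d\n", SlopeQ20(0.25, 6.25));  // 5243
}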

Elsewhere in Expand the following comment and code appear, which confirms that the line above is a typo.

if (consecutive_expands_ == 7) {
  // Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
  // mute_slope = 0.0020 / fs_mult in Q20.
  parameters.mute_slope = std::max(parameters.mute_slope, 2097 / fs_mult);
}

Other processing

The remaining processing is the same as the non-first Expand path after its "Generating random noise" step; see the corresponding flow under "Non-first Expand" below.

Non-first Expand

Generating random noise

Random noise of length max_lag_ is generated; it is later fed through the AR filter to produce the unvoiced data.

size_t rand_length = max_lag_;
// This only applies to SWB where length could be larger than 256.
assert(rand_length <= kMaxSampleRate / 8000 * 120 + 30);
GenerateRandomVector(2, rand_length, random_vector);

Updating the lag index

current_lag_index_ = current_lag_index_ + lag_index_direction_;
// Change direction if needed.
if (current_lag_index_ <= 0) {
    lag_index_direction_ = 1;
}
if (current_lag_index_ >= kNumLags - 1) {
    lag_index_direction_ = -1;
}

Obtaining the voiced data

For each channel, the voiced data is fetched according to current_lag_index_, mainly from expand_vector0 and expand_vector1.

If current_lag_index_ == 0, the voiced data equals expand_vector0; if current_lag_index_ == 1, it is 3/4 * expand_vector0 + 1/4 * expand_vector1; if current_lag_index_ == 2, it is 1/2 * expand_vector0 + 1/2 * expand_vector1.

The result is stored in voiced_vector_storage.

Why do it this way? A plausible reading is that alternating the mix, together with the three expand_lags_ values, adds slight variation across consecutive expands so the output does not repeat a single pitch period exactly.

if (current_lag_index_ == 0) {
    parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
                                      reinterpret_cast<int8_t*>(voiced_vector_storage));
} else if (current_lag_index_ == 1) {
    std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
    parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
                                      reinterpret_cast<int8_t*>(temp_0.get()));
    std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
    parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
                                      reinterpret_cast<int8_t *>(temp_1.get()));
    // Mix 3/4 of expand_vector0 with 1/4 of expand_vector1.
    VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 3, temp_1.get(), 1, 2,
                                           voiced_vector_storage, temp_length);
} else if (current_lag_index_ == 2) {

    std::unique_ptr<int16_t[]> temp_0(new int16_t[temp_length]);
    parameters.expand_vector0->CopyTo(temp_length, expansion_vector_position,
                                      reinterpret_cast<int8_t*>(temp_0.get()));
    std::unique_ptr<int16_t[]> temp_1(new int16_t[temp_length]);
    parameters.expand_vector1->CopyTo(temp_length, expansion_vector_position,
                                      reinterpret_cast<int8_t*>(temp_1.get()));
    VxAudioSpl_ScaleAndAddVectorsWithRound(temp_0.get(), 1, temp_1.get(), 1, 1,
                                           voiced_vector_storage, temp_length);
}

Smoothing the overlap data in the sync buffer

When mute_factor is greater than 0.05 and current_voice_mix_factor is greater than 0.5, it is assumed that the speech amplitude should take some time to decay and that the voiced share exceeds 50%; the overlap data in sync_buffer is then smoothed by weighting voiced_vector against the existing overlap data in sync_buffer (an overlap-add).

// Smooth the expanded if it has not been muted to a low amplitude and
// |current_voice_mix_factor| is larger than 0.5.
if ((parameters.mute_factor > 819) &&
    (parameters.current_voice_mix_factor > 8192)) {
    size_t start_ix = sync_buffer_->Size() - overlap_length_;
    for (size_t i = 0; i < overlap_length_; i++) {
        // Do overlap add between new vector and overlap.
        (*sync_buffer_)[channel_ix][start_ix + i] =
            (((*sync_buffer_)[channel_ix][start_ix + i] * muting_window) +
             (((parameters.mute_factor * voiced_vector_storage[i]) >> 14) *
              unmuting_window) +
             16384) >>
            15;
        muting_window += muting_window_increment;
        unmuting_window += unmuting_window_increment;
    }
} else if (parameters.mute_factor == 0) {
    // The expanded signal will consist of only comfort noise if
    // mute_factor = 0. Set the output length to 15 ms for best noise
    // production.
    // TODO(hlundin): This has been disabled since the length of
    // parameters.expand_vector0 and parameters.expand_vector1 no longer
    // match with expand_lags_, causing invalid reads and writes. Is it a good
    // idea to enable this again, and solve the vector size problem?
    //      max_lag_ = fs_mult * 120;
    //      expand_lags_[0] = fs_mult * 120;
    //      expand_lags_[1] = fs_mult * 120;
    //      expand_lags_[2] = fs_mult * 120;
}

Obtaining the unvoiced data

// Unvoiced part.
// Filter |scaled_random_vector| through |ar_filter_|.
memcpy(unvoiced_vector - kUnvoicedLpcOrder, parameters.ar_filter_state,
       sizeof(int16_t) * kUnvoicedLpcOrder);
int32_t add_constant = 0;
if (parameters.ar_gain_scale > 0) {
    add_constant = 1 << (parameters.ar_gain_scale - 1);
}
VxAudioSpl_AffineTransformVector(scaled_random_vector, random_vector,
                                 parameters.ar_gain, add_constant,
                                 parameters.ar_gain_scale, current_lag);
VxAudioSpl_FilterARFastQ12(scaled_random_vector, unvoiced_vector,
                           parameters.ar_filter, kUnvoicedLpcOrder + 1,
                           current_lag);
memcpy(parameters.ar_filter_state,
       &(unvoiced_vector[current_lag - kUnvoicedLpcOrder]),
       sizeof(int16_t) * kUnvoicedLpcOrder);

scaled_random_vector is first obtained from random_vector through an affine transform, then passed through the AR filter to produce the unvoiced data unvoiced_vector. parameters.ar_filter_state is updated at the same time.
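
Conceptually, the AR (all-pole) synthesis filter computes each output sample from the current input and the previous outputs. A minimal float sketch of this step (an assumption, not VxAudioSpl_FilterARFastQ12):

#include <cstddef>
#include <vector>

// Sketch (assumption) of all-pole AR synthesis, the role played by
// VxAudioSpl_FilterARFastQ12 above: with a[0] = 1,
//   y[n] = x[n] - a[1]*y[n-1] - ... - a[order]*y[n-order].
// |state| holds the previous |order| outputs (the ar_filter_state analogue).
// Assumes state->size() == a.size() - 1 and x.size() >= that order.
void FilterAR(const std::vector<float>& x, const std::vector<float>& a,
              std::vector<float>* state, std::vector<float>* y) {
    const size_t order = a.size() - 1;
    y->resize(x.size());
    for (size_t n = 0; n < x.size(); ++n) {
        float acc = x[n];
        for (size_t k = 1; k <= order; ++k) {
            const float past =
                (n >= k) ? (*y)[n - k] : (*state)[state->size() - (k - n)];
            acc -= a[k] * past;
        }
        (*y)[n] = acc;
    }
    // Update the filter state with the newest |order| outputs.
    for (size_t k = 0; k < order; ++k) {
        (*state)[k] = (*y)[y->size() - order + k];
    }
}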

Mixing the unvoiced and voiced data

// Combine voiced and unvoiced contributions.

// Set a suitable cross-fading slope.
// For lag =
//   <= 31 * fs_mult            => go from 1 to 0 in about 8 ms;
//  (>= 31 .. <= 63) * fs_mult  => go from 1 to 0 in about 16 ms;
//   >= 64 * fs_mult            => go from 1 to 0 in about 32 ms.
// temp_shift = getbits(max_lag_) - 5.
int temp_shift =
    (31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;
int16_t mix_factor_increment = 256 >> temp_shift;
if (stop_muting_) {
    mix_factor_increment = 0;
}

// Create combined signal by shifting in more and more of unvoiced part.
temp_shift = 8 - temp_shift;  // = getbits(mix_factor_increment).
size_t temp_length =
    (parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
    temp_shift;
temp_length = std::min(temp_length, current_lag);
DspHelper::CrossFade(voiced_vector, unvoiced_vector, temp_length,
                     &parameters.current_voice_mix_factor,
                     mix_factor_increment, temp_data);

First, the length over which the unvoiced and voiced parts must be blended, temp_length, is computed. The idea is roughly as follows:

  1. max_lag_ corresponds to a lag at the 4 kHz reference rate: lag <= 31 * fs_mult means a pitch period below about 8 ms; 31 <= lag <= 63 means roughly 8 ms to 16 ms; lag >= 64 means roughly 16 ms to 32 ms. (The human pitch period is only about 2.5 ms to 16 ms.)

  2. temp_shift = getbits(max_lag_) - 5. Why subtract 5 from the number of bits max_lag_ occupies? It appears to be fixed-point bookkeeping, but the exact reason is not apparent.

int temp_shift =
    (31 - VxAudioSpl_NormW32(dchecked_cast<int32_t>(max_lag_))) - 5;

  3. voice_mix_factor is assumed to vary linearly; the length needed to move from the previous voice_mix_factor to the current one is computed, which is the length over which unvoiced and voiced are cross-faded.

size_t temp_length =
    (parameters.current_voice_mix_factor - parameters.voice_mix_factor) >>
    temp_shift;

parameters.current_voice_mix_factor is the current voiced share, and temp_length is the length over which voiced and unvoiced are blended. The final result is stored in temp_data (see the sketch below).
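
The cross-fade itself can be sketched in floating point as follows (an illustrative assumption, not DspHelper::CrossFade): the mix factor ramps linearly per sample while the two vectors are blended:

#include <cstddef>

// Sketch (assumption): blend voiced and unvoiced while ramping |mix|
// down by |decrement| per sample, shifting toward the unvoiced part.
void CrossFadeSketch(const float* voiced, const float* unvoiced, size_t length,
                     float* mix, float decrement, float* out) {
    for (size_t i = 0; i < length; ++i) {
        out[i] = (*mix) * voiced[i] + (1.f - (*mix)) * unvoiced[i];
        *mix -= decrement;
    }
}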

Why does the expand data carry overlap_length_ extra samples? This is to weaken edge effects at the starting position when the voiced data is blended with the data already in the sync buffer.

As the (omitted) schematic showed: voiced_vector_storage takes the last current_lag + overlap_length_ samples of audio_history, unvoiced_vector holds current_lag samples of unvoiced data, and the unvoiced/voiced cross-fade skips the first overlap_length_ samples of the voiced data.

When the blended length is less than current_lag, the remaining samples still need handling: they are obtained as a weighted sum of the unblended parts of voiced_vector and unvoiced_vector.

// End of cross-fading period was reached before end of expanded signal
// path. Mix the rest with a fixed mixing factor.
if (temp_length < current_lag) {
    if (mix_factor_increment != 0) {
        parameters.current_voice_mix_factor = parameters.voice_mix_factor;
    }
    int16_t temp_scale = 16384 - parameters.current_voice_mix_factor;
    VxAudioSpl_ScaleAndAddVectorsWithRound(
        voiced_vector + temp_length, parameters.current_voice_mix_factor,
        unvoiced_vector + temp_length, temp_scale, 14,
        temp_data + temp_length, current_lag - temp_length);
}

Updating the muting slope

The update depends on how many consecutive expands have been performed; the details of the update rule remain to be studied.

// Select muting slope depending on how many consecutive expands we have
// done.
if (consecutive_expands_ == 3) {
    // Let the mute factor decrease from 1.0 to 0.95 in 6.25 ms.
    // mute_slope = 0.0010 / fs_mult in Q20.
    parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(1049 / fs_mult));
}
if (consecutive_expands_ == 7) {
    // Let the mute factor decrease from 1.0 to 0.90 in 6.25 ms.
    // mute_slope = 0.0020 / fs_mult in Q20.
    parameters.mute_slope = std::max(parameters.mute_slope, static_cast<int>(2097 / fs_mult));
}

// Mute segment according to slope value.
if ((consecutive_expands_ != 0) || !parameters.onset) {
    // Mute to the previous level, then continue with the muting.
    VxAudioSpl_AffineTransformVector(
        temp_data, temp_data, parameters.mute_factor, 8192, 14, current_lag);

    if (!stop_muting_) {
        DspHelper::MuteSignal(temp_data, parameters.mute_slope, current_lag);

        // Shift by 6 to go from Q20 to Q14.
        // TODO(hlundin): Adding 8192 before shifting 6 steps seems wrong.
        // Legacy.
        int16_t gain = static_cast<int16_t>(
            16384 - (((current_lag * parameters.mute_slope) + 8192) >> 6));
        gain = ((gain * parameters.mute_factor) + 8192) >> 14;

        // Guard against getting stuck with very small (but sometimes audible)
        // gain.
        if ((consecutive_expands_ > 3) && (gain >= parameters.mute_factor)) {
            parameters.mute_factor = 0;
        } else {
            parameters.mute_factor = gain;
        }
    }
}

Generating background noise

// Background noise part.
GenerateBackgroundNoise(
    random_vector, channel_ix, channel_parameters_[channel_ix].mute_slope,
    TooManyExpands(), current_lag, unvoiced_array_memory);

The result is stored at unvoiced_array_memory + kNoiseLpcOrder.

Adding the background noise to temp_data and writing the result to algorithm_buffer

// Add background noise to the combined voiced-unvoiced signal.
for (size_t i = 0; i < current_lag; i++) {
    temp_data[i] = temp_data[i] + noise_vector[i];
}
if (channel_ix == 0) {
    output->AssertSize(current_lag);
} else {
    assert(output->Size() == current_lag);
}
(*output)[channel_ix].OverwriteAt(reinterpret_cast<const int8_t *>(temp_data), current_lag, 0);
