mfcc代码调用 Java_Kaldi的MFCC特征提取代码分析

最新推荐文章于 2022-08-31 22:35:28 发布

weixin_39614657

最新推荐文章于 2022-08-31 22:35:28 发布

阅读量402

点赞数

文章标签： mfcc代码调用 Java

本文链接：https://blog.csdn.net/weixin_39614657/article/details/115060863

版权

本文介绍featbin/compute-mfcc-feats.cc的代码。对MFCC不了解的读者可以参考MFCC特征提取和Kaldi文档解读：Kaldi的特征提取。更多本系列文章请点击Kaldi代码分析。

运行

我们首先看一看这个函数的参数：

kaldi/src/featbin$ compute-mfcc-feats

./compute-mfcc-feats

Create MFCC feature files.

Usage: compute-mfcc-feats [options...]

Options:

--allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)

--allow-upsample : If true, allow the input waveform to have a lower frequency than the specified --sample-frequency (and we'll upsample). (bool, default = false)

--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)

--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)

--channel : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)

--debug-mel : Print out debugging information for mel bin computation (bool, default = false)

--dither : Dithering constant (0.0 means no dither). If you turn this off, you should set the --energy-floor option, e.g. to 1.0 or 0.1 (float, default = 1)

--energy-floor : Floor on energy (absolute, not relative) in MFCC computation. Only makes a difference if --use-energy=true; only necessary if --dither=0.0. Suggested values: 0.1 or 1.0 (float, default = 0)

--frame-length : Frame length in milliseconds (float, default = 25)

--frame-shift : Frame shift in milliseconds (float, default = 10)

--high-freq : High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

--htk-compat : If true, put energy or C0 last and use a factor of sqrt(2) on C0. Warning: not sufficient to get HTK compatible features (need to change other parameters). (bool, default = false)

--low-freq : Low cutoff frequency for mel bins (float, default = 20)

--max-feature-vectors : Memory optimization. If larger than 0, periodically remove feature vectors so that only this number of the latest feature vectors is retained. (int, default = -1)

--min-duration : Minimum duration of segments to process (in seconds). (float, default = 0)

--num-ceps : Number of cepstra in MFCC computation (including C0) (int, default = 13)

--num-mel-bins : Number of triangular mel-frequency bins (int, default = 23)

--output-format : Format of the output files [kaldi, htk] (string, default = "kaldi")

--preemphasis-coefficient : Coefficient for use in signal preemphasis (float, default = 0.97)

--raw-energy : If true, compute energy before preemphasis and windowing (bool, default = true)

--remove-dc-offset : Subtract mean from waveform on each frame (bool, default = true)

--round-to-power-of-two : If true, round window size to power of two by zero-padding input to FFT. (bool, default = true)

--sample-frequency : Waveform data sample frequency (must match the waveform file, if specified there) (float, default = 16000)

--snip-edges : If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends. (bool, default = true)

--subtract-mean : Subtract mean of each feature file [CMS]; not recommended to do it this way. (bool, default = false)

--use-energy : Use energy (not C0) in MFCC computation (bool, default = true)

--utt2spk : Utterance to speaker-id map rspecifier (if doing VTLN and you have warps per speaker) (string, default = "")

--vtln-high : High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq (float, default = -500)

--vtln-low : Low inflection point in piecewise linear VTLN warping function (float, default = 100)

--vtln-map : Map from utterance or speaker-id to vtln warp factor (rspecifier) (string, default = "")

--vtln-warp : Vtln warp factor (only applicable if vtln-map not specified) (float, default = 1)

--window-type : Type of window ("hamming"|"hanning"|"povey"|"rectangular"|"blackmann") (string, default = "povey")

--write-utt2dur : Wspecifier to write duration of each utterance in seconds, e.g. 'ark,t:utt2dur'. (string, default = "")

Standard options:

--config : Configuration file to read (this option may be repeated) (string, default = "")

--help : Print out usage message (bool, default = false)

--print-args : Print the command line arguments (to stderr) (bool, default = true)

--verbose : Verbose level (higher->more logging) (int, default = 0)

很多参数，但是必须的参数只有两个：输入的wav-rspecifier和输出的feats-wspecifier。不了解rspecifier和wspecifier的读者请先阅读Kaldi文档解读和Kaldi I/O mechanisms(这个文档目前还没来得及翻译)。我们首先需要准备输入的文件：

$ cat wav.scp

1088-134315-0000 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0000.flac |

1088-134315-0001 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0001.flac |

1088-134315-0002 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0002.flac |

compute-mfcc-feats只能读取WAV格式的数据，其它的格式需要转换成WAV格式。转换可以”离线”的方式提前用工具转好，也可以on-the-fly的用命令行工具实现，比如我上面的例子是mini-librispeech的数据，它是flac格式的，可以使用flac工具on-the-fly的转好后通过管道传给Kaldi。

下面我们来运行一下(为了避免刷屏，我只显示前10行)：

$ compute-mfcc-feats scp:wav.scp ark,t:-|head

compute-mfcc-feats scp:wav.scp ark,t:-

1088-134315-0000 [

15.80592 -19.65983 -44.14267 -10.36022 -21.32767 -3.718761 -16.35019 -2.041183 1.383831 -6.504814 -15.53979 4.017059 -1.488521

15.80938 -19.18601 -36.05476 1.974784 -16.47062 9.139902 -17.07959 -2.513923 12.35208 0.3189384 -16.05096 -10.61343 -5.74498

15.56385 -20.34504 -41.09424 -5.030499 -14.83514 7.03503 -8.041247 3.885094 17.55143 -2.833649 -17.50364 -7.040634 -2.903327

15.47452 -18.549 -31.8735 -3.966199 -19.83389 -2.875786 -18.66935 5.130469 9.293931 -11.26405 -14.64669 -5.246292 3.741609

15.60467 -18.61004 -35.10693 1.910538 -24.71902 0.7080836 -16.67995 9.620322 5.798228 -7.07972 -12.95449 7.126602 11.80375

15.79984 -17.4576 -41.98382 -7.244875 -20.32612 4.706924 -15.74146 4.090664 9.880718 2.285589 -9.086823 4.295952 11.07193

15.91331 -13.78769 -35.16341 -2.869465 -16.56114 6.949731 -3.440318 -0.4015923 1.338336 -13.04577 -10.98161 -0.4763947 9.889109

16.03437 -11.54739 -36.76908 -4.536423 -24.24585 9.95239 -15.34235 -0.2514563 19.19995 5.409452 -2.524536 4.166977 10.39466

16.39969 -8.784815 -41.10754 -6.428618 -22.93435 2.602599 -15.65443 -2.352367 8.833455 -6.360246 -14.28245 -1.446442 2.700707

上面的输出类型是t(文本)，并且用”-“表示输出到标准输出(屏幕)，这样便于查看。

调试

阅读代码最好的方式就是单步执行查看变量，调试Kaldi的代码请参考Kaldi教程(三)的代码调试部分。

注意编译时要把kaldi.mk里的-O1去掉改成”-O0 -DKALDI_PARANOID”，具体参考上面的文档。

compute-mfcc-feats.cc

这个文件是提取mfcc的工具，包括入口的main函数，它只是做参数的parse，真正干活的代码不多，这里只保留主要的代码。

int main(int argc, char *argv[]) {

try {

ParseOptions po(usage);

MfccOptions mfcc_opts;

// Register the MFCC option struct.

mfcc_opts.Register(&po);

// Register the options.

po.Register("output-format", &output_format, "Format of the output "

"files [kaldi, htk]");

std::string wav_rspecifier = po.GetArg(1);

std::string output_wspecifier = po.GetArg(2);

Mfcc mfcc(mfcc_opts);

SequentialTableReader reader(wav_rspecifier);

BaseFloatMatrixWriter kaldi_writer; // typedef to TableWriter.

int32 num_utts = 0, num_success = 0;

for (; !reader.Done(); reader.Next()) {

num_utts++;

std::string utt = reader.Key();

const WaveData &wave_data = reader.Value();

SubVector waveform(wave_data.Data(), this_chan);

Matrix features;

try {

mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),

vtln_warp_local, &features);

} catch (...) {

}

if (subtract_mean) {

Vector mean(features.NumCols());

mean.AddRowSumMat(1.0, features);

mean.Scale(1.0 / features.NumRows());

for (int32 i = 0; i < features.NumRows(); i++)

features.Row(i).AddVec(-1.0, mean);

}

if (output_format == "kaldi") {

kaldi_writer.Write(utt, features);

} else {

}

if (utt2dur_writer.IsOpen()) {

utt2dur_writer.Write(utt, wave_data.Duration());

}

} catch(const std::exception &e) {

}

前面的parse参数和选项我们可以略过，”Mfcc mfcc(mfcc_opts);”，这是真正干活的。

接下来”SequentialTableReader reader(wav_rspecifier);”用于读取wav文件。然后是用for循环读取每一个wav文件。for循环里面是处理每一个wav文件的代码：

const WaveData &wave_data = reader.Value();

// 使用SubVector取wav文件的一个通道

SubVector waveform(wave_data.Data(), this_chan);

Matrix features;

try {

//使用mfcc提取特征。

mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),

vtln_warp_local, &features);

如果有subtract_mean，则同一个utterance的每一维特征都减去均值，最后使用”kaldi_writer.Write(utt, features);”输出。

WaveHolder读取WAV文件

我们是通过下面的代码来构造WAV文件的reader：

SequentialTableReader reader(wav_rspecifier);

这里使用SequentialTableReader，它会读取前面的wav.scp文件，但是具体怎么读取”flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0000.flac”输出的WAV文件，则需要靠WaveHolder了。

它读取WAV的代码是：

bool Read(std::istream &is) {

// We don't look for the binary-mode header here [always binary]

try {

t_.Read(is); // Throws exception on failure.

return true;

} catch (const std::exception &e) {

KALDI_WARN << "Exception caught in WaveHolder::Read(). " << e.what();

return false;

}

上面的t_是成员变量(Kaldi的coding style是私有成员以下划线结尾)，在WaveHolder的开始我们可以看到”typedef WaveData T;”，因此最终调用到WaveData类的Read()：

void WaveData::Read(std::istream &is) {

const uint32 kBlockSize = 1024 * 1024;

WaveInfo header;

header.Read(is);

data_.Resize(0, 0); // clear the data.

samp_freq_ = header.SampFreq();

std::vector buffer;

uint32 bytes_to_go = header.IsStreamed() ? kBlockSize : header.DataBytes();

// Once in a while header.DataBytes() will report an insane value;

// read the file to the end

while (is && bytes_to_go > 0) {

uint32 block_bytes = std::min(bytes_to_go, kBlockSize);

uint32 offset = buffer.size();

buffer.resize(offset + block_bytes);

is.read(&buffer[offset], block_bytes);

uint32 bytes_read = is.gcount();

buffer.resize(offset + bytes_read);

if (!header.IsStreamed())

bytes_to_go -= bytes_read;

}

uint16 *data_ptr = reinterpret_cast(&buffer[0]);

// The matrix is arranged row per channel, column per sample.

data_.Resize(header.NumChannels(),

buffer.size() / header.BlockAlign());

for (uint32 i = 0; i < data_.NumCols(); ++i) {

for (uint32 j = 0; j < data_.NumRows(); ++j) {

int16 k = *data_ptr++;

if (header.ReverseBytes())

KALDI_SWAP2(k);

data_(j, i) = k;

}

上面的代码显示读取WAV的Header信息，确保文件的采样率，比特率等和compute-mfcc-feats要求的(我们没有设置的话都有合理的默认值，这些默认值请查看看compute-mfcc-feats的帮助)是否一样。关于WAV格式，感兴趣的读者可以参考Digital Audio - Creating a WAV (RIFF) file和WAVE PCM soundfile format。读者可以对着WAV格式头部说明然后单步调试void WaveInfo::Read(std::istream &is)函数。

Mfcc

Mfcc是OfflineFeatureTpl模板类使用MfccComputer的typedef：

typedef OfflineFeatureTpl Mfcc;

OfflineFeatureTpl

我们先看一下OfflineFeatureTpl的文档

/// This templated class is intended for offline feature extraction, i.e. where

/// you have access to the entire signal at the start. It exists mainly to be

/// drop-in replacement for the old (pre-2016) classes Mfcc, Plp and so on, for

/// use in the offline case. In April 2016 we reorganized the online

/// feature-computation code for greater modularity and to have correct support

/// for the snip-edges=false option.

template

class OfflineFeatureTpl {

大意是：2016年前的代码是Mfcc和Plp这些具体的类，它们用于计算离线的特征，而之后重构了代码。

ComputeFeatures

我们来看计算特征的函数：

template

void OfflineFeatureTpl::ComputeFeatures(

const VectorBase &wave,

BaseFloat sample_freq,

BaseFloat vtln_warp,

Matrix *output) {

KALDI_ASSERT(output != NULL);

BaseFloat new_sample_freq = computer_.GetFrameOptions().samp_freq;

if (sample_freq == new_sample_freq) {

Compute(wave, vtln_warp, output);

} else {

if (new_sample_freq < sample_freq &&

! computer_.GetFrameOptions().allow_downsample)

KALDI_ERR << "Waveform and config sample Frequency mismatch: "

<< sample_freq << " .vs " << new_sample_freq

<< " (use --allow-downsample=true to allow "

<< " downsampling the waveform).";

else if (new_sample_freq > sample_freq &&

! computer_.GetFrameOptions().allow_upsample)

KALDI_ERR << "Waveform and config sample Frequency mismatch: "

<< sample_freq << " .vs " << new_sample_freq

<< " (use --allow-upsample=true option to allow "

<< " upsampling the waveform).";

// Resample the waveform.

Vector resampled_wave(wave);

ResampleWaveform(sample_freq, wave,

new_sample_freq, &resampled_wave);

Compute(resampled_wave, vtln_warp, output);

}

它其实只是检查从WAV头部读取的采样率和compute-mfcc-feats传入的是否一致，如果一致使用Compute函数计算，否则如果运行的话对WAV文件进行上采样或者下采样以便满足compute-mfcc-feats的要求，最终还是调用Compute函数。

Compute

template

void OfflineFeatureTpl::Compute(

const VectorBase &wave,

BaseFloat vtln_warp,

Matrix *output) {

//.......

int32 rows_out = NumFrames(wave.Dim(), computer_.GetFrameOptions()),

cols_out = computer_.Dim();

//.......

for (int32 r = 0; r < rows_out; r++) { // r is frame index.

BaseFloat raw_log_energy = 0.0;

ExtractWindow(0, wave, r, computer_.GetFrameOptions(),

feature_window_function_, &window,

(use_raw_log_energy ? &raw_log_energy : NULL));

SubVector output_row(*output, r);

computer_.Compute(raw_log_energy, vtln_warp, &window, &output_row);

}

首先使用NumFrames计算WAV有多少帧，然后遍历每一帧：使用ExtractWindow抽取每一帧，然后使用computer_.Compute提取特征。

NumFrames

目前默认的方式是snip_edges，也和HTK一致，也就是保证不需要padding，如果往后移动超出范围，那就不要了。代码这里就不介绍了。

ExtractWindow

我们只看一下函数的参数和说明。

// ExtractWindow extracts a windowed frame of waveform with a power-of-two,

// padded size. It does mean subtraction, pre-emphasis and dithering as

// requested.

void ExtractWindow(int64 sample_offset,

const VectorBase &wave,

int32 f, // with 0 <= f < NumFrames(feats, opts)

const FrameExtractionOptions &opts,

const FeatureWindowFunction &window_function,

Vector *window,

BaseFloat *log_energy_pre_window) {

因为要做FFT，所以要求采样点是2的幂，所以会在后面padding。此外如果FrameExtractionOptions需要减去均值或者pre-emphasis或者dithering(增加很小的随机噪声防止log为0)，都会在这里处理，最后会把这些点乘以窗口函数FeatureWindowFunction(比如Hamming窗)。

MfccComputer

最终到了干活的代码了，上面会调用它的Compute函数，请阅读注释：

void MfccComputer::Compute(BaseFloat signal_log_energy,

BaseFloat vtln_warp,

VectorBase *signal_frame,

VectorBase *feature) {

KALDI_ASSERT(signal_frame->Dim() == opts_.frame_opts.PaddedWindowSize() &&

feature->Dim() == this->Dim());

// 获取Mel FilterBank，为了复用，会把每一个VLTN的alpha作为key存在map里。

const MelBanks &mel_banks = *(GetMelBanks(vtln_warp));

if (opts_.use_energy && !opts_.raw_energy)

// 用向量向量乘法计算能量

signal_log_energy = Log(std::max(VecVec(*signal_frame, *signal_frame),

std::numeric_limits::epsilon()));

// FFT，默认是split-radix算法

if (srfft_ != NULL) // Compute FFT using the split-radix algorithm.

srfft_->Compute(signal_frame->Data(), true);

else // An alternative algorithm that works for non-powers-of-two.

RealFft(signal_frame, true);

// FFT得到的复数计算其模得到功率谱

ComputePowerSpectrum(signal_frame);

// 因为是实数，FFT只需要N/2+1个点。

SubVector power_spectrum(*signal_frame, 0,

signal_frame->Dim() / 2 + 1);

// 使用Filter bank滤波器组提取每个bin的能量

mel_banks.Compute(power_spectrum, &mel_energies_);

// 避免对零取log (如果有dithering那么不应该是零，但是dithering是可选的，所以保险一点还是要处理)

mel_energies_.ApplyFloor(std::numeric_limits::epsilon());

mel_energies_.ApplyLog(); // 取log

feature->SetZero(); // in case there were NaNs.

// feature = dct_matrix_ * mel_energies [which now have log]

// 进行DCT得到倒谱

feature->AddMatVec(1.0, dct_matrix_, kNoTrans, mel_energies_, 0.0);

if (opts_.cepstral_lifter != 0.0)

// 倒谱系数的lifting

feature->MulElements(lifter_coeffs_);

if (opts_.use_energy) {

// 如果使用能量，那么把这一帧的能量替换掉倒谱的第一个系数。

if (opts_.energy_floor > 0.0 && signal_log_energy < log_energy_floor_)

signal_log_energy = log_energy_floor_;

(*feature)(0) = signal_log_energy;

}

if (opts_.htk_compat) {

// HTK的能量参数放在最后面，为了便于对比，MfccOptions有一个htk_compat选项。

BaseFloat energy = (*feature)(0);

for (int32 i = 0; i < opts_.num_ceps - 1; i++)

(*feature)(i) = (*feature)(i+1);

if (!opts_.use_energy)

energy *= M_SQRT2; // scale on C0 (actually removing a scale

// we previously added that's part of one common definition of

// the cosine transform.)

(*feature)(opts_.num_ceps - 1) = energy;

}

代码其实很清晰，不了解的读者也可以参考MFCC特征提取和模块二：语音信号处理。这两篇文章都有Python的代码实现可以参考。

weixin_39614657

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫