mfcc代码调用 Java_Kaldi的MFCC特征提取代码分析

本文介绍featbin/compute-mfcc-feats.cc的代码。对MFCC不了解的读者可以参考MFCC特征提取和Kaldi文档解读:Kaldi的特征提取。更多本系列文章请点击Kaldi代码分析。

目录

运行

我们首先看一看这个函数的参数:

kaldi/src/featbin$ compute-mfcc-feats

./compute-mfcc-feats

Create MFCC feature files.

Usage: compute-mfcc-feats [options...]

Options:

--allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)

--allow-upsample : If true, allow the input waveform to have a lower frequency than the specified --sample-frequency (and we'll upsample). (bool, default = false)

--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)

--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)

--channel : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)

--debug-mel : Print out debugging information for mel bin computation (bool, default = false)

--dither : Dithering constant (0.0 means no dither). If you turn this off, you should set the --energy-floor option, e.g. to 1.0 or 0.1 (float, default = 1)

--energy-floor : Floor on energy (absolute, not relative) in MFCC computation. Only makes a difference if --use-energy=true; only necessary if --dither=0.0. Suggested values: 0.1 or 1.0 (float, default = 0)

--frame-length : Frame length in milliseconds (float, default = 25)

--frame-shift : Frame shift in milliseconds (float, default = 10)

--high-freq : High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

--htk-compat : If true, put energy or C0 last and use a factor of sqrt(2) on C0. Warning: not sufficient to get HTK compatible features (need to change other parameters). (bool, default = false)

--low-freq : Low cutoff frequency for mel bins (float, default = 20)

--max-feature-vectors : Memory optimization. If larger than 0, periodically remove feature vectors so that only this number of the latest feature vectors is retained. (int, default = -1)

--min-duration : Minimum duration of segments to process (in seconds). (float, default = 0)

--num-ceps : Number of cepstra in MFCC computation (including C0) (int, default = 13)

--num-mel-bins : Number of triangular mel-frequency bins (int, default = 23)

--output-format : Format of the output files [kaldi, htk] (string, default = "kaldi")

--preemphasis-coefficient : Coefficient for use in signal preemphasis (float, default = 0.97)

--raw-energy : If true, compute energy before preemphasis and windowing (bool, default = true)

--remove-dc-offset : Subtract mean from waveform on each frame (bool, default = true)

--round-to-power-of-two : If true, round window size to power of two by zero-padding input to FFT. (bool, default = true)

--sample-frequency : Waveform data sample frequency (must match the waveform file, if specified there) (float, default = 16000)

--snip-edges : If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends. (bool, default = true)

--subtract-mean : Subtract mean of each feature file [CMS]; not recommended to do it this way. (bool, default = false)

--use-energy : Use energy (not C0) in MFCC computation (bool, default = true)

--utt2spk : Utterance to speaker-id map rspecifier (if doing VTLN and you have warps per speaker) (string, default = "")

--vtln-high : High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq (float, default = -500)

--vtln-low : Low inflection point in piecewise linear VTLN warping function (float, default = 100)

--vtln-map : Map from utterance or speaker-id to vtln warp factor (rspecifier) (string, default = "")

--vtln-warp : Vtln warp factor (only applicable if vtln-map not specified) (float, default = 1)

--window-type : Type of window ("hamming"|"hanning"|"povey"|"rectangular"|"blackmann") (string, default = "povey")

--write-utt2dur : Wspecifier to write duration of each utterance in seconds, e.g. 'ark,t:utt2dur'. (string, default = "")

Standard options:

--config : Configuration file to read (this option may be repeated) (string, default = "")

--help : Print out usage message (bool, default = false)

--print-args : Print the command line arguments (to stderr) (bool, default = true)

--verbose : Verbose level (higher->more logging) (int, default = 0)

很多参数,但是必须的参数只有两个:输入的wav-rspecifier和输出的feats-wspecifier。不了解rspecifier和wspecifier的读者请先阅读Kaldi文档解读和Kaldi I/O mechanisms(这个文档目前还没来得及翻译)。我们首先需要准备输入的文件:

$ cat wav.scp

1088-134315-0000 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0000.flac |

1088-134315-0001 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0001.flac |

1088-134315-0002 flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0002.flac |

compute-mfcc-feats只能读取WAV格式的数据,其它的格式需要转换成WAV格式。转换可以”离线”的方式提前用工具转好,也可以on-the-fly的用命令行工具实现,比如我上面的例子是mini-librispeech的数据,它是flac格式的,可以使用flac工具on-the-fly的转好后通过管道传给Kaldi。

下面我们来运行一下(为了避免刷屏,我只显示前10行):

$ compute-mfcc-feats scp:wav.scp ark,t:-|head

compute-mfcc-feats scp:wav.scp ark,t:-

1088-134315-0000 [

15.80592 -19.65983 -44.14267 -10.36022 -21.32767 -3.718761 -16.35019 -2.041183 1.383831 -6.504814 -15.53979 4.017059 -1.488521

15.80938 -19.18601 -36.05476 1.974784 -16.47062 9.139902 -17.07959 -2.513923 12.35208 0.3189384 -16.05096 -10.61343 -5.74498

15.56385 -20.34504 -41.09424 -5.030499 -14.83514 7.03503 -8.041247 3.885094 17.55143 -2.833649 -17.50364 -7.040634 -2.903327

15.47452 -18.549 -31.8735 -3.966199 -19.83389 -2.875786 -18.66935 5.130469 9.293931 -11.26405 -14.64669 -5.246292 3.741609

15.60467 -18.61004 -35.10693 1.910538 -24.71902 0.7080836 -16.67995 9.620322 5.798228 -7.07972 -12.95449 7.126602 11.80375

15.79984 -17.4576 -41.98382 -7.244875 -20.32612 4.706924 -15.74146 4.090664 9.880718 2.285589 -9.086823 4.295952 11.07193

15.91331 -13.78769 -35.16341 -2.869465 -16.56114 6.949731 -3.440318 -0.4015923 1.338336 -13.04577 -10.98161 -0.4763947 9.889109

16.03437 -11.54739 -36.76908 -4.536423 -24.24585 9.95239 -15.34235 -0.2514563 19.19995 5.409452 -2.524536 4.166977 10.39466

16.39969 -8.784815 -41.10754 -6.428618 -22.93435 2.602599 -15.65443 -2.352367 8.833455 -6.360246 -14.28245 -1.446442 2.700707

上面的输出类型是t(文本),并且用”-“表示输出到标准输出(屏幕),这样便于查看。

调试

阅读代码最好的方式就是单步执行查看变量,调试Kaldi的代码请参考Kaldi教程(三)的代码调试部分。

注意编译时要把kaldi.mk里的-O1去掉改成”-O0 -DKALDI_PARANOID”,具体参考上面的文档。

compute-mfcc-feats.cc

这个文件是提取mfcc的工具,包括入口的main函数,它只是做参数的parse,真正干活的代码不多,这里只保留主要的代码。

int main(int argc, char *argv[]) {

try {

ParseOptions po(usage);

MfccOptions mfcc_opts;

// Register the MFCC option struct.

mfcc_opts.Register(&po);

// Register the options.

po.Register("output-format", &output_format, "Format of the output "

"files [kaldi, htk]");

std::string wav_rspecifier = po.GetArg(1);

std::string output_wspecifier = po.GetArg(2);

Mfcc mfcc(mfcc_opts);

SequentialTableReader reader(wav_rspecifier);

BaseFloatMatrixWriter kaldi_writer; // typedef to TableWriter.

int32 num_utts = 0, num_success = 0;

for (; !reader.Done(); reader.Next()) {

num_utts++;

std::string utt = reader.Key();

const WaveData &wave_data = reader.Value();

SubVector waveform(wave_data.Data(), this_chan);

Matrix features;

try {

mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),

vtln_warp_local, &features);

} catch (...) {

}

if (subtract_mean) {

Vector mean(features.NumCols());

mean.AddRowSumMat(1.0, features);

mean.Scale(1.0 / features.NumRows());

for (int32 i = 0; i < features.NumRows(); i++)

features.Row(i).AddVec(-1.0, mean);

}

if (output_format == "kaldi") {

kaldi_writer.Write(utt, features);

} else {

}

if (utt2dur_writer.IsOpen()) {

utt2dur_writer.Write(utt, wave_data.Duration());

}

}

} catch(const std::exception &e) {

}

}

前面的parse参数和选项我们可以略过,”Mfcc mfcc(mfcc_opts);”,这是真正干活的。

接下来”SequentialTableReader reader(wav_rspecifier);”用于读取wav文件。然后是用for循环读取每一个wav文件。for循环里面是处理每一个wav文件的代码:

const WaveData &wave_data = reader.Value();

// 使用SubVector取wav文件的一个通道

SubVector waveform(wave_data.Data(), this_chan);

Matrix features;

try {

//使用mfcc提取特征。

mfcc.ComputeFeatures(waveform, wave_data.SampFreq(),

vtln_warp_local, &features);

如果有subtract_mean,则同一个utterance的每一维特征都减去均值,最后使用”kaldi_writer.Write(utt, features);”输出。

WaveHolder读取WAV文件

我们是通过下面的代码来构造WAV文件的reader:

SequentialTableReader reader(wav_rspecifier);

这里使用SequentialTableReader,它会读取前面的wav.scp文件,但是具体怎么读取”flac -c -d -s /home/lili/codes/kaldi/egs/mini_librispeech/s5/mini-libridata/LibriSpeech/train-clean-5/1088/134315/1088-134315-0000.flac”输出的WAV文件,则需要靠WaveHolder了。

它读取WAV的代码是:

bool Read(std::istream &is) {

// We don't look for the binary-mode header here [always binary]

try {

t_.Read(is); // Throws exception on failure.

return true;

} catch (const std::exception &e) {

KALDI_WARN << "Exception caught in WaveHolder::Read(). " << e.what();

return false;

}

}

上面的t_是成员变量(Kaldi的coding style是私有成员以下划线结尾),在WaveHolder的开始我们可以看到”typedef WaveData T;”,因此最终调用到WaveData类的Read():

void WaveData::Read(std::istream &is) {

const uint32 kBlockSize = 1024 * 1024;

WaveInfo header;

header.Read(is);

data_.Resize(0, 0); // clear the data.

samp_freq_ = header.SampFreq();

std::vector buffer;

uint32 bytes_to_go = header.IsStreamed() ? kBlockSize : header.DataBytes();

// Once in a while header.DataBytes() will report an insane value;

// read the file to the end

while (is && bytes_to_go > 0) {

uint32 block_bytes = std::min(bytes_to_go, kBlockSize);

uint32 offset = buffer.size();

buffer.resize(offset + block_bytes);

is.read(&buffer[offset], block_bytes);

uint32 bytes_read = is.gcount();

buffer.resize(offset + bytes_read);

if (!header.IsStreamed())

bytes_to_go -= bytes_read;

}

uint16 *data_ptr = reinterpret_cast(&buffer[0]);

// The matrix is arranged row per channel, column per sample.

data_.Resize(header.NumChannels(),

buffer.size() / header.BlockAlign());

for (uint32 i = 0; i < data_.NumCols(); ++i) {

for (uint32 j = 0; j < data_.NumRows(); ++j) {

int16 k = *data_ptr++;

if (header.ReverseBytes())

KALDI_SWAP2(k);

data_(j, i) = k;

}

}

}

上面的代码显示读取WAV的Header信息,确保文件的采样率,比特率等和compute-mfcc-feats要求的(我们没有设置的话都有合理的默认值,这些默认值请查看看compute-mfcc-feats的帮助)是否一样。关于WAV格式,感兴趣的读者可以参考Digital Audio - Creating a WAV (RIFF) file和WAVE PCM soundfile format。读者可以对着WAV格式头部说明然后单步调试void WaveInfo::Read(std::istream &is)函数。

Mfcc

Mfcc是OfflineFeatureTpl模板类使用MfccComputer的typedef:

typedef OfflineFeatureTpl Mfcc;

OfflineFeatureTpl

我们先看一下OfflineFeatureTpl的文档

/// This templated class is intended for offline feature extraction, i.e. where

/// you have access to the entire signal at the start. It exists mainly to be

/// drop-in replacement for the old (pre-2016) classes Mfcc, Plp and so on, for

/// use in the offline case. In April 2016 we reorganized the online

/// feature-computation code for greater modularity and to have correct support

/// for the snip-edges=false option.

template

class OfflineFeatureTpl {

大意是:2016年前的代码是Mfcc和Plp这些具体的类,它们用于计算离线的特征,而之后重构了代码。

ComputeFeatures

我们来看计算特征的函数:

template

void OfflineFeatureTpl::ComputeFeatures(

const VectorBase &wave,

BaseFloat sample_freq,

BaseFloat vtln_warp,

Matrix *output) {

KALDI_ASSERT(output != NULL);

BaseFloat new_sample_freq = computer_.GetFrameOptions().samp_freq;

if (sample_freq == new_sample_freq) {

Compute(wave, vtln_warp, output);

} else {

if (new_sample_freq < sample_freq &&

! computer_.GetFrameOptions().allow_downsample)

KALDI_ERR << "Waveform and config sample Frequency mismatch: "

<< sample_freq << " .vs " << new_sample_freq

<< " (use --allow-downsample=true to allow "

<< " downsampling the waveform).";

else if (new_sample_freq > sample_freq &&

! computer_.GetFrameOptions().allow_upsample)

KALDI_ERR << "Waveform and config sample Frequency mismatch: "

<< sample_freq << " .vs " << new_sample_freq

<< " (use --allow-upsample=true option to allow "

<< " upsampling the waveform).";

// Resample the waveform.

Vector resampled_wave(wave);

ResampleWaveform(sample_freq, wave,

new_sample_freq, &resampled_wave);

Compute(resampled_wave, vtln_warp, output);

}

}

它其实只是检查从WAV头部读取的采样率和compute-mfcc-feats传入的是否一致,如果一致使用Compute函数计算,否则如果运行的话对WAV文件进行上采样或者下采样以便满足compute-mfcc-feats的要求,最终还是调用Compute函数。

Compute

template

void OfflineFeatureTpl::Compute(

const VectorBase &wave,

BaseFloat vtln_warp,

Matrix *output) {

//.......

int32 rows_out = NumFrames(wave.Dim(), computer_.GetFrameOptions()),

cols_out = computer_.Dim();

//.......

for (int32 r = 0; r < rows_out; r++) { // r is frame index.

BaseFloat raw_log_energy = 0.0;

ExtractWindow(0, wave, r, computer_.GetFrameOptions(),

feature_window_function_, &window,

(use_raw_log_energy ? &raw_log_energy : NULL));

SubVector output_row(*output, r);

computer_.Compute(raw_log_energy, vtln_warp, &window, &output_row);

}

首先使用NumFrames计算WAV有多少帧,然后遍历每一帧:使用ExtractWindow抽取每一帧,然后使用computer_.Compute提取特征。

NumFrames

目前默认的方式是snip_edges,也和HTK一致,也就是保证不需要padding,如果往后移动超出范围,那就不要了。代码这里就不介绍了。

ExtractWindow

我们只看一下函数的参数和说明。

// ExtractWindow extracts a windowed frame of waveform with a power-of-two,

// padded size. It does mean subtraction, pre-emphasis and dithering as

// requested.

void ExtractWindow(int64 sample_offset,

const VectorBase &wave,

int32 f, // with 0 <= f < NumFrames(feats, opts)

const FrameExtractionOptions &opts,

const FeatureWindowFunction &window_function,

Vector *window,

BaseFloat *log_energy_pre_window) {

因为要做FFT,所以要求采样点是2的幂,所以会在后面padding。此外如果FrameExtractionOptions需要减去均值或者pre-emphasis或者dithering(增加很小的随机噪声防止log为0),都会在这里处理,最后会把这些点乘以窗口函数FeatureWindowFunction(比如Hamming窗)。

MfccComputer

最终到了干活的代码了,上面会调用它的Compute函数,请阅读注释:

void MfccComputer::Compute(BaseFloat signal_log_energy,

BaseFloat vtln_warp,

VectorBase *signal_frame,

VectorBase *feature) {

KALDI_ASSERT(signal_frame->Dim() == opts_.frame_opts.PaddedWindowSize() &&

feature->Dim() == this->Dim());

// 获取Mel FilterBank,为了复用,会把每一个VLTN的alpha作为key存在map里。

const MelBanks &mel_banks = *(GetMelBanks(vtln_warp));

if (opts_.use_energy && !opts_.raw_energy)

// 用向量向量乘法计算能量

signal_log_energy = Log(std::max(VecVec(*signal_frame, *signal_frame),

std::numeric_limits::epsilon()));

// FFT,默认是split-radix算法

if (srfft_ != NULL) // Compute FFT using the split-radix algorithm.

srfft_->Compute(signal_frame->Data(), true);

else // An alternative algorithm that works for non-powers-of-two.

RealFft(signal_frame, true);

// FFT得到的复数计算其模得到功率谱

ComputePowerSpectrum(signal_frame);

// 因为是实数,FFT只需要N/2+1个点。

SubVector power_spectrum(*signal_frame, 0,

signal_frame->Dim() / 2 + 1);

// 使用Filter bank滤波器组提取每个bin的能量

mel_banks.Compute(power_spectrum, &mel_energies_);

// 避免对零取log (如果有dithering那么不应该是零,但是dithering是可选的,所以保险一点还是要处理)

mel_energies_.ApplyFloor(std::numeric_limits::epsilon());

mel_energies_.ApplyLog(); // 取log

feature->SetZero(); // in case there were NaNs.

// feature = dct_matrix_ * mel_energies [which now have log]

// 进行DCT得到倒谱

feature->AddMatVec(1.0, dct_matrix_, kNoTrans, mel_energies_, 0.0);

if (opts_.cepstral_lifter != 0.0)

// 倒谱系数的lifting

feature->MulElements(lifter_coeffs_);

if (opts_.use_energy) {

// 如果使用能量,那么把这一帧的能量替换掉倒谱的第一个系数。

if (opts_.energy_floor > 0.0 && signal_log_energy < log_energy_floor_)

signal_log_energy = log_energy_floor_;

(*feature)(0) = signal_log_energy;

}

if (opts_.htk_compat) {

// HTK的能量参数放在最后面,为了便于对比,MfccOptions有一个htk_compat选项。

BaseFloat energy = (*feature)(0);

for (int32 i = 0; i < opts_.num_ceps - 1; i++)

(*feature)(i) = (*feature)(i+1);

if (!opts_.use_energy)

energy *= M_SQRT2; // scale on C0 (actually removing a scale

// we previously added that's part of one common definition of

// the cosine transform.)

(*feature)(opts_.num_ceps - 1) = energy;

}

}

代码其实很清晰,不了解的读者也可以参考MFCC特征提取和模块二:语音信号处理。这两篇文章都有Python的代码实现可以参考。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
以下是一个基本的MFCC特征提取的MATLAB代码: % 定义常量 frame_size = 256; % 帧大小 frame_shift = 128; % 帧移 num_filters = 20; % 滤波器数量 num_ceps = 12; % MFCC系数数量 lifter = 22; % lifter系数 % 读取音频文件 [x, fs] = audioread('sample.wav'); % 预加重 pre_emph = [1 -0.97]; x = filter(pre_emph, 1, x); % 分帧 frames = buffer(x, frame_size, frame_size-frame_shift, 'nodelay'); % 加窗 win = hamming(frame_size); frames = frames .* repmat(win, 1, size(frames, 2)); % 计算功率谱 pow_frames = abs(fft(frames)).^2; pow_frames = pow_frames(1:frame_size/2+1, :); % 计算Mel滤波器组 mel_filters = melFilterBank(fs, num_filters, frame_size); % 应用Mel滤波器组 mel_pow_frames = mel_filters * pow_frames; % 取对数 log_mel_pow_frames = log(mel_pow_frames); % 应用DCT dct_mel_pow_frames = dct(log_mel_pow_frames); % 取MFCC系数 mfcc = dct_mel_pow_frames(1:num_ceps, :); % 应用lifter lifter_coeffs = (1:lifter)'; lifter_coeffs = sin(pi/lifter*(lifter_coeffs-0.5)); mfcc = mfcc .* repmat(lifter_coeffs, 1, size(mfcc, 2)); % 显示MFCC系数 imagesc(mfcc); colorbar; % Mel滤波器组函数 function mel_filters = melFilterBank(fs, num_filters, frame_size) % 定义频率范围 f_min = 0; f_max = fs/2; mel_min = 1125*log(1+f_min/700); mel_max = 1125*log(1+f_max/700); % 计算Mel刻度的间隔 mel_interval = (mel_max - mel_min) / (num_filters + 1); % 将Mel刻度转换为频率 freqs = 700*(exp(mel_min+(0:num_filters+1)*mel_interval/1125)-1); % 将频率转换为FFT索引 fft_indices = floor((frame_size+1)*freqs/fs); % 构建滤波器组 mel_filters = zeros(num_filters, frame_size/2+1); for i = 1:num_filters left = fft_indices(i); center = fft_indices(i+1); right = fft_indices(i+2); for j = left:center mel_filters(i, j) = (j-left)/(center-left); end for j = center+1:right mel_filters(i, j) = (right-j)/(right-center); end end end
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值