Building a speaker recognition system on the AISHELL corpus with Kaldi's x-vector recipe

Preface

The system has three parts. First, front-end preprocessing: MFCC feature extraction, voice activity detection (VAD), and data augmentation (adding reverberation and several types of noise). Second, a TDNN-based feature extractor that produces a speaker representation, also called an embedding or x-vector. Third, the back end: the speaker representations are reduced in dimension with LDA, and a PLDA model is trained to score the test trials.

The x-vector paper was published at ICASSP 2018; Daniel Povey, Kaldi's core developer, is one of its authors. The paper:

X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION

Building the system

Building an x-vector speaker recognition system with Kaldi is driven mostly by shell scripts. The official x-vector recipe (~/kaldi/egs/sre16/v2) runs everything from a single run.sh. Since parts of the code had to be modified when porting it to AISHELL, I split the original run.sh into nine smaller scripts and ran them in a Jupyter notebook; the notebook keeps the log of each script, which makes it easier to understand what each step does.

The code is available on GitHub:
https://github.com/qiny1012/kaldi_x-vector_aishell

Preparation
  1. Have the AISHELL, MUSAN, and RIRS_NOISES corpora ready;

  2. Copy the x-vector recipe to a convenient location; the original lives in ~/kaldi/egs/sre16/v2;

  3. Set the location of your Kaldi installation in path.sh; this script adds the Kaldi directories to the environment variables;

  4. Edit cmd.sh: change queue.pl to run.pl and set a memory limit suitable for your machine. Note that queue.pl is for running jobs in parallel across multiple machines, while run.pl runs everything on a single machine;

  5. Copy the following scripts into the project's local/ directory:

    ~/kaldi/egs/aishell/v1/local/aishell_data_prep.sh

    ~/kaldi/egs/aishell/v1/local/produce_trials.py

    ~/kaldi/egs/aishell/v1/local/split_data_enroll_eval.py

    ~/kaldi/egs/voxceleb/v2/local/prepare_for_eer.py
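The cmd.sh edit in item 4 can be as small as the following sketch (the 4G memory value is an assumption; choose one that fits your machine):

```shell
# cmd.sh: run jobs on the local machine with run.pl instead of a grid via queue.pl
export train_cmd="run.pl --mem 4G"  # assumed memory limit; adjust as needed
echo "$train_cmd"
```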

Running the scripts

1. Prepare the train and test data files

# set up environment variables
. ./path.sh
set -e # exit on error
# location of the AISHELL corpus
data=/home/your_aishell_dir
# prepare the data
local/aishell_data_prep.sh $data/data_aishell/wav $data/data_aishell/transcript

This script creates a data directory in the project and generates the data files inside it.

The core files are:

data
├── test
│   ├── spk2utt
│   ├── text
│   ├── utt2spk
│   └── wav.scp
└── train
    ├── spk2utt
    ├── text
    ├── utt2spk
    └── wav.scp

Here spk2utt maps each speaker to their utterances, text holds the transcript of each utterance (not used here), utt2spk maps each utterance to its speaker, and wav.scp maps each utterance to the location of its audio file.
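All four are plain text files. A toy example with made-up ids (real AISHELL utterance ids look like BAC009S0002W0122), plus the utt2spk-to-spk2utt inversion that Kaldi's utils/utt2spk_to_spk2utt.pl performs, sketched here in awk:

```shell
# utt2spk: "<utterance-id> <speaker-id>", one pair per line, sorted by utterance id
cat > utt2spk <<'EOF'
S0002_0001 S0002
S0002_0002 S0002
S0003_0001 S0003
EOF

# spk2utt is the inverse mapping: "<speaker-id> <utt1> <utt2> ...";
# group the utterances by speaker, then sort (Kaldi expects sorted files)
spk2utt=$(awk '{u[$2] = u[$2] " " $1} END {for (s in u) print s u[s]}' utt2spk | sort)
echo "$spk2utt"
```

Kaldi's validation scripts (utils/fix_data_dir.sh and friends) check exactly this consistency between utt2spk and spk2utt.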

2. Extract MFCC features from the raw audio and run VAD

## set the train_cmd variable
. ./cmd.sh
# set up environment variables
. ./path.sh

## where the features will be stored
mfccdir=feature/mfcc
vaddir=feature/mfcc

for name in train test; do
    steps/make_mfcc.sh --write-utt2num-frames true --mfcc-config conf/mfcc.conf --nj 20 --cmd "$train_cmd" \
        data/${name} exp/make_mfcc $mfccdir
    utils/fix_data_dir.sh data/${name}
    sid/compute_vad_decision.sh --nj 20 --cmd "$train_cmd" \
        data/${name} exp/make_vad $vaddir
    utils/fix_data_dir.sh data/${name}
done

This extracts MFCC features for the utterances listed in data/train and data/test and runs VAD on them. The MFCC parameters are stored in conf/mfcc.conf.
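conf/mfcc.conf is just a list of options for Kaldi's compute-mfcc-feats, one per line. A sketch of what it might contain for 16 kHz AISHELL audio; the exact values here are assumptions, not necessarily the recipe's, so check the conf/ directory of your copy:

```shell
# write a sample mfcc.conf; each line is one compute-mfcc-feats option
cat > mfcc.conf <<'EOF'
--sample-frequency=16000
--frame-length=25
--frame-shift=10
--num-mel-bins=23
--num-ceps=23
EOF
grep -c '^--' mfcc.conf  # count the option lines
```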

3. Data augmentation (add reverberation and noise)

. ./cmd.sh
. ./path.sh

## generate the data files for the reverberated copy, data/train_reverb
steps/data/reverberate_data_dir.py \
    --rir-set-parameters "0.5, /home/your_dir/RIRS_NOISES/simulated_rirs/smallroom/rir_list" \
    --rir-set-parameters "0.5, /home/your_dir/RIRS_NOISES/simulated_rirs/mediumroom/rir_list" \
    --speech-rvb-probability 1 \
    --pointsource-noise-addition-probability 0 \
    --isotropic-noise-addition-probability 0 \
    --num-replications 1 \
    --source-sampling-rate 16000 \
    data/train data/train_reverb

cp data/train/vad.scp data/train_reverb/
utils/copy_data_dir.sh --utt-suffix "-reverb" data/train_reverb data/train_reverb.new
rm -rf data/train_reverb
mv data/train_reverb.new data/train_reverb

## prepare the MUSAN data files
steps/data/make_musan.sh --sampling-rate 16000 /home/qinyc/openSLR/SLR17_musan/musan data
# Get the duration of the MUSAN recordings.  This will be used by the
# script augment_data_dir.py.
for name in speech noise music; do
    utils/data/get_utt2dur.sh data/musan_${name}
    mv data/musan_${name}/utt2dur data/musan_${name}/reco2dur
done

# Augment with musan_noise
steps/data/augment_data_dir.py --utt-suffix "noise" --fg-interval 1 --fg-snrs "15:10:5:0" --fg-noise-dir "data/musan_noise" data/train data/train_noise
# Augment with musan_music
steps/data/augment_data_dir.py --utt-suffix "music" --bg-snrs "15:10:8:5" --num-bg-noises "1" --bg-noise-dir "data/musan_music" data/train data/train_music
# Augment with musan_speech
steps/data/augment_data_dir.py --utt-suffix "babble" --bg-snrs "20:17:15:13" --num-bg-noises "3:4:5:6:7" --bg-noise-dir "data/musan_speech" data/train data/train_babble

# combine the train_reverb, train_noise, train_music, and train_babble data files into data/train_aug
utils/combine_data.sh data/train_aug data/train_reverb data/train_noise data/train_music data/train_babble

Each utterance gets a reverberated copy, a noise copy, a music copy, and a babble (speech) copy, so the augmentation expands the data fourfold.
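The four copies are told apart only by a suffix on the utterance id (-reverb, -noise, -music, -babble), so combining them with the clean set gives five entries per original utterance. A toy illustration of the bookkeeping:

```shell
# a clean utt2spk with two utterances
cat > utt2spk <<'EOF'
utt1 spkA
utt2 spkB
EOF

# each augmentation type contributes a suffixed copy of every utterance
for suf in reverb noise music babble; do
  awk -v s=$suf '{print $1 "-" s, $2}' utt2spk
done > utt2spk_aug

# 2 clean + 4 x 2 augmented = 10 utterances in total
total=$(cat utt2spk utt2spk_aug | wc -l | tr -d ' ')
echo "$total"
```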

During the reverberation augmentation, the generated wav pipelines were missing the path prefix of the corpus, so the next step could not synthesize the audio and extract features.

Workaround: change line 539 of steps/data/reverberate_data_dir.py to: rir.rir_rspecifier = "sox /home/your_dir/{0} -r {1} -t wav - |".format(rir.rir_rspecifier, sampling_rate)

4. Extract MFCC features for the augmented data

. ./cmd.sh
. ./path.sh

mfccdir=feature/mfcc
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 20 --cmd "$train_cmd" \
    data/train_aug exp/make_mfcc $mfccdir

5. Filter the utterances

. ./cmd.sh
. ./path.sh

# 1. combine the train_aug data and the train data
utils/combine_data.sh data/train_combined data/train_aug data/train
utils/fix_data_dir.sh data/train_combined 
 
# 2. this script applies CMVN and removes non-speech frames; it writes a new copy of the features, so make sure there is enough disk space
local/nnet3/xvector/prepare_feats_for_egs.sh --nj 40 --cmd "$train_cmd" \
    data/train_combined data/train_combined_no_sil exp/train_combined_no_sil
utils/fix_data_dir.sh data/train_combined_no_sil
  
# 3. remove utterances of at most 200 frames; with a 25 ms window and 10 ms shift, 200 frames is about 2 s
min_len=200
mv data/train_combined_no_sil/utt2num_frames data/train_combined_no_sil/utt2num_frames.bak
awk -v min_len=${min_len} '$2 > min_len {print $1, $2}' data/train_combined_no_sil/utt2num_frames.bak > data/train_combined_no_sil/utt2num_frames
utils/filter_scp.pl data/train_combined_no_sil/utt2num_frames data/train_combined_no_sil/utt2spk > data/train_combined_no_sil/utt2spk.new
mv data/train_combined_no_sil/utt2spk.new data/train_combined_no_sil/utt2spk
utils/fix_data_dir.sh data/train_combined_no_sil

# 4. remove speakers with fewer than 8 utterances
min_num_utts=8
awk '{print $1, NF-1}' data/train_combined_no_sil/spk2utt > data/train_combined_no_sil/spk2num
awk -v min_num_utts=${min_num_utts} '$2 >= min_num_utts {print $1, $2}' data/train_combined_no_sil/spk2num | utils/filter_scp.pl - data/train_combined_no_sil/spk2utt > data/train_combined_no_sil/spk2utt.new
mv data/train_combined_no_sil/spk2utt.new data/train_combined_no_sil/spk2utt
utils/spk2utt_to_utt2spk.pl data/train_combined_no_sil/spk2utt > data/train_combined_no_sil/utt2spk

utils/filter_scp.pl data/train_combined_no_sil/utt2spk data/train_combined_no_sil/utt2num_frames > data/train_combined_no_sil/utt2num_frames.new
mv data/train_combined_no_sil/utt2num_frames.new data/train_combined_no_sil/utt2num_frames
# Now we're ready to create training examples.
utils/fix_data_dir.sh data/train_combined_no_sil

As the comments indicate, this step does four things. First, it merges the data files of data/train and data/train_aug, the same combine operation used earlier. Second, it removes silence frames according to the VAD decisions and applies per-utterance mean normalization (CMVN). Third, it removes utterances shorter than about 2 s; this threshold must stay consistent with the maximum frames-per-chunk parameter used during training. Fourth, it removes speakers with fewer than 8 utterances.
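Both filters are plain awk over text files. Here they are on toy data (the ids and counts are made up):

```shell
# utt2num_frames: "<utt-id> <number-of-frames>"
cat > utt2num_frames <<'EOF'
utt1 150
utt2 250
utt3 300
EOF
min_len=200
# keep only utterances longer than min_len frames (same filter as step 3)
kept_utts=$(awk -v min_len=$min_len '$2 > min_len {print $1}' utt2num_frames)

# spk2utt: "<spk-id> <utt1> <utt2> ..."; NF-1 is the utterance count
cat > spk2utt <<'EOF'
spkA u1 u2 u3
spkB u4
EOF
min_num_utts=2
# keep only speakers with at least min_num_utts utterances (as in step 4)
kept_spks=$(awk -v m=$min_num_utts 'NF-1 >= m {print $1}' spk2utt)

echo "$kept_utts"  # utt2 and utt3 survive the length filter
echo "$kept_spks"  # only spkA has enough utterances
```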

6. Train the x-vector network

Before running the script below, generate a 0.raw file; without this step an error occurs. The fix comes from: https://blog.csdn.net/weixin_43056919/article/details/87480205

/home/your_kaldi_dir/kaldi/src/nnet3bin/nnet3-init ./exp/xvector_nnet_1a/nnet.config ./exp/xvector_nnet_1a/0.raw
stage=0
train_stage=0
use_gpu=true
remove_egs=false

data=data/train_combined_no_sil 
nnet_dir=exp/xvector_nnet_1a
egs_dir=exp/xvector_nnet_1a/egs

. ./path.sh
. ./cmd.sh
. ./utils/parse_options.sh

num_pdfs=$(awk '{print $2}' $data/utt2spk | sort | uniq -c | wc -l)

if [ $stage -le 4 ]; then
  echo "$0: Getting neural network training egs";
  # dump egs.
  if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d $egs_dir/storage ]; then
    utils/create_split_dir.pl \
     /home/qinyc/aishell/v2/xvector-$(date +'%m_%d_%H_%M')/$egs_dir/storage $egs_dir/storage
  fi
  # --max-frames-per-chunk 200 matches the 200-frame minimum-length filter above
  sid/nnet3/xvector/get_egs.sh --cmd "$train_cmd" \
    --nj 8 \
    --stage 0 \
    --frames-per-iter 1000000000 \
    --frames-per-iter-diagnostic 100000 \
    --min-frames-per-chunk 100 \
    --max-frames-per-chunk 200 \
    --num-diagnostic-archives 3 \
    --num-repeats 35 \
    "$data" $egs_dir
fi

if [ $stage -le 5 ]; then
  echo "$0: creating neural net configs using the xconfig parser";
  num_targets=$(wc -w $egs_dir/pdf2num | awk '{print $1}')
  feat_dim=$(cat $egs_dir/info/feat_dim)

  # This chunk-size corresponds to the maximum number of frames the
  # stats layer is able to pool over.  In this script, it corresponds
  # to 100 seconds.  If the input recording is greater than 100 seconds,
  # we will compute multiple xvectors from the same recording and average
  # to produce the final xvector.
  max_chunk_size=10000

  # The smallest number of frames we're comfortable computing an xvector from.
  # Note that the hard minimum is given by the left and right context of the
  # frame-level layers.
  min_chunk_size=25
  mkdir -p $nnet_dir/configs
  cat <<EOF > $nnet_dir/configs/network.xconfig
  # please note that it is important to have input layer with the name=input

  # The frame-level layers
  input dim=${feat_dim} name=input
  relu-batchnorm-layer name=tdnn1 input=Append(-2,-1,0,1,2) dim=512
  relu-batchnorm-layer name=tdnn2 input=Append(-2,0,2) dim=512
  relu-batchnorm-layer name=tdnn3 input=Append(-3,0,3) dim=512
  relu-batchnorm-layer name=tdnn4 dim=512
  relu-batchnorm-layer name=tdnn5 dim=1500

  # The stats pooling layer. Layers after this are segment-level.
  # In the config below, the first and last argument (0, and ${max_chunk_size})
  # means that we pool over an input segment starting at frame 0
  # and ending at frame ${max_chunk_size} or earlier.  The other arguments (1:1)
  # mean that no subsampling is performed.
  stats-layer name=stats config=mean+stddev(0:1:1:${max_chunk_size})

  # This is where we usually extract the embedding (aka xvector) from.
  relu-batchnorm-layer name=tdnn6 dim=512 input=stats

  # This is where another layer the embedding could be extracted
  # from, but usually the previous one works better.
  relu-batchnorm-layer name=tdnn7 dim=512
  output-layer name=output include-log-softmax=true dim=${num_targets}
EOF

  steps/nnet3/xconfig_to_configs.py \
      --xconfig-file $nnet_dir/configs/network.xconfig \
      --config-dir $nnet_dir/configs/
  cp $nnet_dir/configs/final.config $nnet_dir/nnet.config

  # These three files will be used by sid/nnet3/xvector/extract_xvectors.sh
  echo "output-node name=output input=tdnn6.affine" > $nnet_dir/extract.config
  echo "$max_chunk_size" > $nnet_dir/max_chunk_size
  echo "$min_chunk_size" > $nnet_dir/min_chunk_size
fi

dropout_schedule='0,0@0.20,0.1@0.50,0'
srand=123
if [ $stage -le 6 ]; then
  steps/nnet3/train_raw_dnn.py --stage=$train_stage \
    --cmd="$train_cmd" \
    --trainer.optimization.proportional-shrink 10 \
    --trainer.optimization.momentum=0.5 \
    --trainer.optimization.num-jobs-initial=1 \
    --trainer.optimization.num-jobs-final=1 \
    --trainer.optimization.initial-effective-lrate=0.001 \
    --trainer.optimization.final-effective-lrate=0.0001 \
    --trainer.optimization.minibatch-size=64 \
    --trainer.srand=$srand \
    --trainer.max-param-change=2 \
    --trainer.num-epochs=80 \
    --trainer.dropout-schedule="$dropout_schedule" \
    --trainer.shuffle-buffer-size=1000 \
    --egs.frames-per-eg=1 \
    --egs.dir="$egs_dir" \
    --cleanup.remove-egs $remove_egs \
    --cleanup.preserve-model-interval=10 \
    --use-gpu=true \
    --dir=$nnet_dir  || exit 1;
fi
exit 0;

The final trained model is stored at exp/xvector_nnet_1a/final.raw

7. Extract x-vectors for the training data, then reduce them with LDA and train PLDA

. ./path.sh
. ./cmd.sh

# extract x-vectors for the training set
nnet_dir=exp/xvector_nnet_1a
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 12G" --nj 20 \
    $nnet_dir data/train_combined \
    exp/xvectors_train_combined
    
# compute the mean of the training x-vectors
$train_cmd exp/xvectors_train_combined/log/compute_mean.log \
    ivector-mean scp:exp/xvectors_train_combined/xvector.scp \
    exp/xvectors_train_combined/mean.vec || exit 1;

# reduce the dimensionality with LDA
lda_dim=150
$train_cmd exp/xvectors_train_combined/log/lda.log \
    ivector-compute-lda --total-covariance-factor=0.0 --dim=$lda_dim \
    "ark:ivector-subtract-global-mean scp:exp/xvectors_train_combined/xvector.scp ark:- |" \
    ark:data/train_combined/utt2spk exp/xvectors_train_combined/transform.mat || exit 1;

# Train the PLDA model.
$train_cmd exp/xvectors_train_combined/log/plda.log \
    ivector-compute-plda ark:data/train_combined/spk2utt \
    "ark:ivector-subtract-global-mean scp:exp/xvectors_train_combined/xvector.scp ark:- | transform-vec exp/xvectors_train_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:-  ark:- |" \
    exp/xvectors_train_combined/plda || exit 1;

8. Build trials on the test set and extract their x-vectors

. ./path.sh
. ./cmd.sh

nnet_dir=exp/xvector_nnet_1a
mfccdir=mfcc
vaddir=vad

#split the test to enroll and eval
mkdir -p data/test/enroll data/test/eval
cp data/test/spk2utt data/test/enroll
cp data/test/spk2utt data/test/eval
cp data/test/feats.scp data/test/enroll
cp data/test/feats.scp data/test/eval
cp data/test/vad.scp data/test/enroll
cp data/test/vad.scp data/test/eval

# split into enrollment and evaluation utterances
local/split_data_enroll_eval.py data/test/utt2spk  data/test/enroll/utt2spk  data/test/eval/utt2spk
trials=data/test/aishell_speaker_ver.lst
local/produce_trials.py data/test/eval/utt2spk $trials
utils/fix_data_dir.sh data/test/enroll
utils/fix_data_dir.sh data/test/eval
# extract x-vectors for the test set
sid/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd --mem 4G" --nj 1 --use-gpu true \
    $nnet_dir data/test \
    $nnet_dir/xvectors_test

9. Score each trial with PLDA and compute the final EER and minDCF

. ./path.sh
. ./cmd.sh

nnet_dir=exp/xvector_nnet_1a
mfccdir=mfcc
vaddir=vad
trials=data/test/aishell_speaker_ver.lst
# ivector-plda-scoring arguments: the PLDA model, the per-utterance test
# x-vectors, the per-speaker enrollment x-vectors, the trials, and the
# output score file
$train_cmd exp/scores/log/test_score.log \
    ivector-plda-scoring --normalize-length=true \
    "ivector-copy-plda --smoothing=0.0 exp/xvectors_train_combined/plda - |" \
    "ark:ivector-subtract-global-mean exp/xvectors_train_combined/mean.vec scp:$nnet_dir/xvectors_test/xvector.scp ark:- | transform-vec exp/xvectors_train_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "ark:ivector-subtract-global-mean exp/xvectors_train_combined/mean.vec scp:$nnet_dir/xvectors_test/spk_xvector.scp ark:- | transform-vec exp/xvectors_train_combined/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
    "cat '$trials' | cut -d\  --fields=1,2 |" exp/scores_test || exit 1;
# the scores are written to exp/scores_test

# compute EER, minDCF, etc.
eer=`compute-eer <(local/prepare_for_eer.py $trials exp/scores_test) 2> /dev/null`
mindcf1=`sid/compute_min_dcf.py --p-target 0.01 exp/scores_test $trials 2> /dev/null`
mindcf2=`sid/compute_min_dcf.py --p-target 0.001 exp/scores_test $trials 2> /dev/null`
echo "EER: $eer%"
echo "minDCF(p-target=0.01): $mindcf1"
echo "minDCF(p-target=0.001): $mindcf2"
Results after training for 80 epochs:
EER: 0.6745%
minDCF(p-target=0.01): 0.1043
minDCF(p-target=0.001): 0.1816
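compute-eer reads lines of "<score> target|nontarget" (what local/prepare_for_eer.py produces from the trials and the score file) and finds the operating point where the false-accept rate equals the false-reject rate. A simplified sketch of that computation in awk, on made-up scores:

```shell
# scores.txt: "<score> target|nontarget", the input format of compute-eer
cat > scores.txt <<'EOF'
2.0 target
1.5 target
0.5 target
1.0 nontarget
0.0 nontarget
-1.0 nontarget
EOF

# try every score as the decision threshold; report the operating point
# where |false-accept rate - false-reject rate| is smallest
eer=$(awk '
  { s[NR] = $1; l[NR] = $2; if ($2 == "target") nt++; else nn++ }
  END {
    best = 1e9
    for (i = 1; i <= NR; i++) {
      fa = 0; fr = 0
      for (j = 1; j <= NR; j++) {
        if (l[j] == "nontarget" && s[j] >= s[i]) fa++   # accepted impostor
        if (l[j] == "target"    && s[j] <  s[i]) fr++   # rejected target
      }
      far = fa / nn; frr = fr / nt
      d = far - frr; if (d < 0) d = -d
      if (d < best) { best = d; e = (far + frr) / 2 }
    }
    printf "%.1f", 100 * e
  }' scores.txt)
echo "EER: ${eer}%"  # here 1 of 3 impostors accepted = 1 of 3 targets rejected
```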