How to Train a DNN Acoustic Model with Kaldi


会飞行的小蜗牛

Original English article: click to open link

My translation follows:

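1. Introduction

First, you need to finish training a standard GMM-HMM acoustic model.

The monophone model is trained by the GMM-HMM system from utterance-level transcriptions, that is, it learns the mapping from labels to audio.

The triphone model is trained by the GMM-HMM system to produce phoneme-to-audio alignments.

The DNN therefore depends strictly on the quality of the GMM-HMM. If the GMM-HMM is poor, the DNN will not do much better, no matter how many epochs you run, which cost function you pick, or how clever your learning rate is. Conversely, if the GMM-HMM is of high quality, the DNN results improve considerably.

A neural network is a classifier: it assigns new features (such as acoustic features) to a class. The input nodes of the DNN are typically 39-dimensional MFCC features, and the output nodes correspond to the labels (e.g. 900 outputs <-> 900 context-dependent triphones, i.e. decision tree leaves). In other words, the acoustic features used to train the GMM-HMM and the decision tree are what fix the two ends of the acoustic model: the input layer and the output layer.

The size of the hidden layers is not constrained by the GMM-HMM structure or by the dimensionality of the acoustic features; it is up to the researcher or developer building the model.

Once the dimensions of the DNN's input and output nodes are fixed, we can generate the phoneme-to-audio alignments and train the network.

Audio feature frames are fed to the input layer, and the network assigns a phoneme label to each frame. For any given frame we already have a gold-standard label (i.e. the phoneme label from the GMM-HMM alignments), so we can compare the network's predicted label with the true one. Using a loss function and backpropagation, we iterate over all frames to learn suitable weights and biases for each layer.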
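As a rough sketch of that comparison, assume the loss is frame-level cross-entropy (the text above only says "loss function", so this choice is an assumption). The symbols are introduced here for illustration: $x_t$ is the input frame at time $t$, $s_t$ is its aligned label from the GMM-HMM, $y_\theta(s \mid x_t)$ is the softmax output of the network with parameters $\theta$, and $\eta$ is the learning rate.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log y_\theta(s_t \mid x_t),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)
```

The first expression penalizes the network whenever it puts low probability on the aligned label; the second is the gradient step that backpropagation makes possible.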

Note that, unlike GMM-HMM training, where the EM algorithm iteratively realigns the transcriptions to the audio frames, DNN training does no such realignment.

In the end, the goal is a DNN that assigns the correct phoneme label to each input audio frame.

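2. Training a DNN

The simplified training scripts used here leave out the following pieces of the standard nnet2 recipe: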

CMVN adaptation of raw (MFCC/PLP) features
pnorm non-linearities
online preconditioning of weights and biases
all training diagnostics (no more validation examples)
final model combination
weighting of posteriors of silence phones
“mixing up” the number of nodes before the Softmax layer

1) First Things First: train a GMM system and generate alignments

a training data dir (as generated by a prepare_data.sh script in a s5/local directory)
a language dir (which has information on your phones, decision tree, etc, probably generated by prepare_lang.sh)
an alignment dir (generated by something like align_si.sh).
a feature dir (for example MFCCs; made by the make_mfcc.sh script)

The train directory is structured as follows:

In my project the train directory is data/train

train/
├── feats.scp
└── split4
    ├── 1
    │   └── feats.scp
    ├── 2
    │   └── feats.scp
    ├── 3
    │   └── feats.scp
    └── 4
        └── feats.scp

The lang directory is structured as follows:

In my project the lang directory is data/lang

lang/
└── topo

The align directory is structured as follows:

In my project the align directory is exp/tri1_ali, exp/tri2_ali, and so on

triphones_aligned/
├── ali.1.gz
├── ali.2.gz
├── ali.3.gz
├── ali.4.gz
├── final.mdl
├── num_jobs
└── tree

The mfcc directory is structured as follows:

In my project the mfcc directory is exp/make_mfcc

mfcc/
├── raw_mfcc_train.1.ark
├── raw_mfcc_train.1.scp
├── raw_mfcc_train.2.ark
├── raw_mfcc_train.2.scp
├── raw_mfcc_train.3.ark
├── raw_mfcc_train.3.scp
├── raw_mfcc_train.4.ark
└── raw_mfcc_train.4.scp

2) The main run script for DNN training

To keep the overall framework clear, the other tedious details have been removed.

run_nnet2.sh looks roughly like this:

#!/bin/bash

# Joshua Meyer 2017
# This script is based off the run_nnet2_baseline.sh script from the wsj eg
# This is very much a toy example, intended to be for learning the ropes of 
# nnet2 training and testing in Kaldi. You will not get state-of-the-art
# results.
# The default parameters here are in general low, to make training and 
# testing faster on a CPU.

stage=1
experiment_dir=experiment/nnet2/nnet2_simple
num_threads=4
minibatch_size=128
unknown_phone=SPOKEN_NOISE # having these explicit is just something I did when
silence_phone=SIL          # I was debugging, they are now required by decode_simple.sh


. ./path.sh
. ./utils/parse_options.sh

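Stepping one level further in, the run script calls the training script (train_pnorm_fast.sh, i.e. steps/nnet2/train_pnorm_fast.sh, in standard Kaldi; the simplified version called here is train_simple.sh), roughly as follows:

Tip: in my project this block is part of run_nnet2.sh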

if [ $stage -le 1 ]; then

    echo ""
    echo "######################"
    echo "### BEGIN TRAINING ###"
    echo "######################"

    mkdir -p $experiment_dir

    steps/nnet2/train_simple.sh \
        --stage -10 \
        --num-threads "$num_threads" \
        --feat-type raw \
        --splice-width 4 \
        --lda-dim 65 \
        --num-hidden-layers 2 \
        --hidden-layer-dim 50 \
        --add-layers-period 5 \
        --num-epochs 10 \
        --iters-per-epoch 2 \
        --initial-learning-rate 0.02 \
        --final-learning-rate 0.004 \
        --minibatch-size "$minibatch_size" \
        data/train \
        data/lang \
        experiment/triphones_aligned \
        $experiment_dir \
        || exit 1;

    echo ""
    echo "####################"
    echo "### END TRAINING ###"
    echo "####################"

fi

As you can see, the main positional arguments are:

the training data
the language dir
our alignments from our previous GMM-HMM model
the name of the dir where we will save our new DNN model

Once training is done, the testing part looks like this:

if [ $stage -le 2 ]; then

    echo ""
    echo "#####################"
    echo "### BEGIN TESTING ###"
    echo "#####################"

    steps/nnet2/decode_simple.sh \
        --num-threads "$num_threads" \
        --beam 8 \
        --max-active 500 \
        --lattice-beam 3 \
        experiment/triphones/graph \
        data/test \
        $experiment_dir/final.mdl \
        $unknown_phone \
        $silence_phone \
        $experiment_dir/decode \
        || exit 1;

    for x in ${experiment_dir}/decode*; do
        [ -d $x ] && grep WER $x/wer_* | \
            utils/best_wer.sh > nnet2_simple_wer.txt;
    done

    echo ""
    echo "###################"
    echo "### END TESTING ###"
    echo "###################"

fi

The decoding call takes the following six arguments:

the original decoding graph from your GMM-HMM
dir for your test data
the final, trained DNN acoustic model
the “unknown” phone (eg. UNK)
the “silence” phone (eg. SIL)
new dir to save decoding information in (lattices, etc)

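The model trained above will probably not reach a very low word error rate. You can improve it by tuning the parameters, adding more sophisticated non-linearities, trying different weightings, and applying CMVN or speaker adaptation.

3) The main scripts

First, some of the default parameter settings in steps/nnet2/train_pnorm_fast.sh: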

#!/bin/bash

# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey). 
#           2013  Xiaohui Zhang
#           2013  Guoguo Chen
#           2014  Vimal Manohar
# Apache 2.0.
#

# Begin configuration section.
cmd=run.pl
stage=-4
num_epochs=15      # Number of epochs of training
initial_learning_rate=0.04
final_learning_rate=0.004
bias_stddev=0.5
hidden_layer_dim=0
add_layers_period=2 # by default, add new layers every 2 iterations.
num_hidden_layers=3
minibatch_size=128 # by default use a smallish minibatch size for neural net
                   # training; this controls instability which would otherwise
                   # be a problem with multi-threaded update. 
num_threads=4   # Number of jobs to run in parallel.
splice_width=4 # meaning +- 4 frames on each side for second LDA
lda_dim=40
feat_type=raw  # raw, untransformed features (probably MFCC or PLP)
iters_per_epoch=5

. ./path.sh || exit 1; # make sure we have a path.sh script
. ./utils/parse_options.sh || exit 1;

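Once the command-line options above have been parsed, the script checks that the files produced by GMM-HMM training, which DNN training depends on, actually exist.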

data_dir=$1
lang_dir=$2
ali_dir=$3
exp_dir=$4

# Check some files from our GMM-HMM system
for f in \
    $data_dir/feats.scp \
    $lang_dir/topo \
    $ali_dir/ali.1.gz \
    $ali_dir/final.mdl \
    $ali_dir/tree \
    $ali_dir/num_jobs;
    do [ ! -f $f ] && echo "$0: no such file $f" && exit 1;
done

Once those files have been confirmed, the next step is to extract parameter information from them.

# Set number of leaves
num_leaves=`tree-info $ali_dir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1;

# set up some dirs and parameter definition files
# (create the experiment dir first so the writes below cannot fail)
mkdir -p $exp_dir/log
nj=`cat $ali_dir/num_jobs` || exit 1;
echo $nj > $exp_dir/num_jobs
cp $ali_dir/tree $exp_dir/tree

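The script above defines a series of variables, creates two files, tree (copied from the GMM-HMM) and num_jobs, and creates an empty log directory. The resulting directory structure is: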

experiment/nnet2/
└── nnet2_simple
    ├── log
    ├── num_jobs
    └── tree

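Next comes data preparation before training. The script local/train_mllt.sh estimates the LDA feature transform (the simplified code below calls steps/nnet2/get_lda_simple.sh for this step). The resulting transformation matrix is applied to the spliced features before they enter the DNN.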

if [ $stage -le -5 ]; then

    echo ""
    echo "###############################"
    echo "### BEGIN GET LDA TRANSFORM ###"
    echo "###############################"

    steps/nnet2/get_lda_simple.sh \
        --cmd "$cmd" \
        --lda-dim $lda_dim \
        --feat-type $feat_type \
        --splice-width $splice_width \
        $data_dir \
        $lang_dir \
        $ali_dir \
        $exp_dir \
        || exit 1;

    # these files should have been written by get_lda.sh
    feat_dim=$(cat $exp_dir/feat_dim) || exit 1;
    lda_dim=$(cat $exp_dir/lda_dim) || exit 1;
    lda_mat=$exp_dir/lda.mat || exit;

    echo ""
    echo "#############################"
    echo "### END GET LDA TRANSFORM ###"
    echo "#############################"
fi

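The script above outputs the LDA transform matrix. When the neural network is initialized, this matrix becomes the "FixedAffineComponent" of the DNN, placed right after the splicing at the input layer. In other words, once we have the LDA transform, it is applied to every input. Because it is a fixed component, the LDA transform matrix is not updated by back-propagation. The output produced is: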

experiment/nnet2/
└── nnet2_simple
    ├── feat_dim
    ├── lda.1.acc
    ├── lda.2.acc
    ├── lda.3.acc
    ├── lda.4.acc
    ├── lda.acc
    ├── lda_dim
    ├── lda.mat
    ├── log
    │   ├── lda_acc.1.log
    │   ├── lda_acc.2.log
    │   ├── lda_acc.3.log
    │   ├── lda_acc.4.log
    │   ├── lda_est.log
    │   └── lda_sum.log
    ├── num_jobs
    └── tree
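The dimension bookkeeping behind splicing and LDA can be sketched in a few lines of shell. This is only an illustration: `spliced_dim` is a hypothetical helper (not a Kaldi tool), and the values 13, 4 and 40 are taken from the nnet-am-info output shown later in this article.

```shell
#!/bin/bash
# Sketch: dimension bookkeeping for frame splicing followed by the LDA transform.
# spliced_dim is a hypothetical helper, not part of Kaldi.

# Each frame is concatenated with splice_width frames on either side.
spliced_dim() {
    local feat_dim=$1 splice_width=$2
    echo $(( feat_dim * (2 * splice_width + 1) ))
}

feat_dim=13       # raw MFCC dimension (value assumed for illustration)
splice_width=4    # +-4 frames of context
lda_dim=40        # output dimension of the LDA-like transform

echo "spliced input: $(spliced_dim "$feat_dim" "$splice_width") dims"
echo "after LDA:     $lda_dim dims"
```

With these values the spliced vector has 13 * 9 = 117 dimensions, and lda.mat maps it down to 40.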

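With the LDA transform in hand, the next step is to format the training data. In the script "get_egs.sh", the training data is split into training and validation sets, and the validation set is used for diagnostics during the training iterations.

For simplicity, the script below drops validation and diagnostics altogether: it only formats the training data and does not split off any subsets for diagnostics.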

if [ $stage -le -4 ]; then

    echo ""
    echo "###################################"
    echo "### BEGIN GET TRAINING EXAMPLES ###"
    echo "###################################"

    steps/nnet2/get_egs_simple.sh \
        --cmd "$cmd" \
        --feat-type $feat_type \
        --splice-width $splice_width \
        --num-jobs-nnet $num_threads \
        --iters-per-epoch $iters_per_epoch \
        $data_dir \
        $ali_dir \
        $exp_dir \
        || exit 1;

    # this is the path to the new egs dir that was just created
    egs_dir=$exp_dir/egs

    echo ""
    echo "#################################"
    echo "### END GET TRAINING EXAMPLES ###"
    echo "#################################"

fi

Running the script above creates a new directory, with the following structure:

experiment/nnet2/
└── nnet2_simple
    ├── egs
    │   ├── egs.1.0.ark
    │   ├── egs.1.1.ark
    │   ├── egs.2.0.ark
    │   ├── egs.2.1.ark
    │   ├── egs.3.0.ark
    │   ├── egs.3.1.ark
    │   ├── egs.4.0.ark
    │   ├── egs.4.1.ark
    │   ├── iters_per_epoch
    │   └── num_jobs_nnet
    ├── feat_dim
    ├── lda.1.acc
    ├── lda.2.acc
    ├── lda.3.acc
    ├── lda.4.acc
    ├── lda.acc
    ├── lda_dim
    ├── lda.mat
    ├── log
    │   ├── get_egs.1.log
    │   ├── get_egs.2.log
    │   ├── get_egs.3.log
    │   ├── get_egs.4.log
    │   ├── lda_acc.1.log
    │   ├── lda_acc.2.log
    │   ├── lda_acc.3.log
    │   ├── lda_acc.4.log
    │   ├── lda_est.log
    │   ├── lda_sum.log
    │   ├── shuffle.0.1.log
    │   ├── shuffle.0.2.log
    │   ├── shuffle.0.3.log
    │   ├── shuffle.0.4.log
    │   ├── shuffle.1.1.log
    │   ├── shuffle.1.2.log
    │   ├── shuffle.1.3.log
    │   ├── shuffle.1.4.log
    │   ├── split_egs.1.log
    │   ├── split_egs.2.log
    │   ├── split_egs.3.log
    │   └── split_egs.4.log
    ├── num_jobs
    └── tree
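The archive names follow the pattern egs.JOB.ITER.ark, and the main training loop shown later picks the archive with `$[$x%$iters_per_epoch]`. A minimal sketch of that mapping, where `egs_archive` is a hypothetical helper introduced only for illustration:

```shell
#!/bin/bash
# Sketch: which egs archive a given job reads at training iteration x.
# egs_archive is a hypothetical helper; the naming follows the listing above.
egs_archive() {
    local job=$1 x=$2 iters_per_epoch=$3
    echo "egs.${job}.$(( x % iters_per_epoch )).ark"
}

iters_per_epoch=2
for x in 0 1 2 3; do
    echo "iteration $x, job 1 reads $(egs_archive 1 "$x" "$iters_per_epoch")"
done
```

With 2 iterations per epoch, each job alternates between its .0 and .1 archive, so every example is seen once per epoch.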

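At this point the training examples (phone-to-frame alignments) are correctly formatted and ordered, so we can move on to initializing the neural network.

Just as a topo configuration file is used in GMM-HMM training, we need to specify the size and structure of the neural network before initializing it. This information lives in the configuration file "exp/tri4-si/nnet.config", which looks like this: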

SpliceComponent input-dim=$feat_dim left-context=$splice_width right-context=$splice_width 
FixedAffineComponent matrix=$lda_mat 
AffineComponent input-dim=$lda_dim output-dim=$hidden_layer_dim learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev 
TanhComponent dim=$hidden_layer_dim 
AffineComponent input-dim=$hidden_layer_dim output-dim=$num_leaves learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev 
SoftmaxComponent dim=$num_leaves

The components have the following meaning:

SpliceComponent defines the size of the window of feature-frame-splicing to perform.
FixedAffineComponent is our LDA-like transform created by get_lda_simple.sh.
AffineComponent is the standard Wx+b affine transform found in neural nets. This first AffineComponent represents the weights and biases between the input layer and the first hidden layer.
TanhComponent is the standard tanh nonlinearity.
AffineComponent is the standard Wx+b affine transform found in neural nets. This second AffineComponent represents the weights and biases between the hidden layer and the output layer.
SoftmaxComponent is the final nonlinearity that produces properly normalized probabilities at the output.

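The same components, together with those used by the full p-norm recipe (which do not appear in the simplified nnet.config above):

    SpliceComponent: defines the window size for feature-frame splicing. The window is centered on the current frame with four frames on each side, so 9 frames are combined into one input. For the usual 40-dimensional features (MFCC+splice+LDA+MLLT+fMLLR), a splicing width of 4 is generally found to work best.
    FixedAffineComponent: the LDA-like decorrelating transform. It is fixed, so it is not updated during training.
    AffineComponentPreconditionedOnline: a refinement of AffineComponent. AffineComponent is a standard weight matrix plus bias, trained by standard stochastic gradient descent with a global learning rate. The refinement uses not only the global learning rate but also a matrix-valued learning rate to precondition the gradient descent. See dnn2_preconditioning.
    PnormComponent: the non-linearity. A more traditional neural network would use TanhComponent here.
    NormalizeComponent: stabilizes the training of p-norm networks. It is a fixed, non-trainable non-linearity. It does not act on individual activations but on the whole vector for a single frame, renormalizing it to unit standard deviation.
    SoftmaxComponent: the final non-linearity, which produces properly normalized probabilities at the output.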

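The configuration file above initializes a DNN with a single hidden layer.

That is, there are 6 Kaldi components but only 3 layers in the network.

Consequently there are only 2 updatable weight matrices and 2 updatable bias vectors. Looking back at the definitions in nnet.config, there are indeed only 2 updatable components, both of type AffineComponent.

Additional hidden layers are defined in the file "exp/tri4-si/hidden.config".

Its content is: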

AffineComponent input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev 
TanhComponent dim=$hidden_layer_dim

Once again, the affine transform is immediately followed by a non-linearity.
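The `$variables` in nnet.config have to be filled in by the training script before nnet-init can read the file. One way to do that is a here-document, sketched below. The concrete values are illustrative (taken from the nnet-am-info output later in this article), and the rule `stddev = 1/sqrt(hidden_layer_dim)` is an assumption about how the initial weight spread is chosen; the text above does not state it.

```shell
#!/bin/bash
# Sketch: filling in nnet.config from shell variables with a here-document.
# All values are illustrative. stddev = 1/sqrt(hidden_layer_dim) is an
# assumed initialization rule, not something stated in the article.
feat_dim=13
splice_width=4
lda_dim=40
lda_mat=lda.mat
hidden_layer_dim=100
num_leaves=1759
initial_learning_rate=0.02
bias_stddev=0.5
stddev=$(awk -v d="$hidden_layer_dim" 'BEGIN { printf "%g", 1.0 / sqrt(d) }')

config=$(mktemp)   # temporary file standing in for $exp_dir/nnet.config
cat > "$config" <<EOF
SpliceComponent input-dim=$feat_dim left-context=$splice_width right-context=$splice_width
FixedAffineComponent matrix=$lda_mat
AffineComponent input-dim=$lda_dim output-dim=$hidden_layer_dim learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
TanhComponent dim=$hidden_layer_dim
AffineComponent input-dim=$hidden_layer_dim output-dim=$num_leaves learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev
SoftmaxComponent dim=$num_leaves
EOF

echo "components written: $(grep -c '' "$config")"
echo "updatable (AffineComponent) lines: $(grep -c '^AffineComponent' "$config")"
```

The two counts correspond to the 6 components and 2 updatable components discussed above; FixedAffineComponent is deliberately not counted as updatable.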

Now we can take the decision tree, the HMM topology file and the nnet.config file and initialize the first neural network, 0.mdl, as follows:

$cmd $exp_dir/log/nnet_init.log \
 nnet-am-init \
 $ali_dir/tree \
 $lang_dir/topo \
 "nnet-init $exp_dir/nnet.config -|" \
 $exp_dir/0.mdl \
 || exit 1;

Let's check in and see which files have been produced:

nnet2/
└── nnet2_simple
    ├── 0.mdl
    ├── hidden.config
    ├── log
    │   └── nnet_init.log
    └── nnet.config

We can also take a look at the untrained model by running nnet-am-info on exp/tri4-si/0.mdl:

num-components 6
num-updatable-components 2
left-context 4
right-context 4
input-dim 13
output-dim 1759
parameter-dim 181759
component 0 : SpliceComponent, input-dim=13, output-dim=117, context=-4 -3 -2 -1 0 1 2 3 4 
component 1 : FixedAffineComponent, input-dim=117, output-dim=40, linear-params-stddev=0.0146923, bias-params-stddev=2.91086
component 2 : AffineComponent, input-dim=40, output-dim=100, linear-params-stddev=0.100784, bias-params-stddev=0.49376, learning-rate=0.02
component 3 : TanhComponent, input-dim=100, output-dim=100
component 4 : AffineComponent, input-dim=100, output-dim=1759, linear-params-stddev=0, bias-params-stddev=0, learning-rate=0.02
component 5 : SoftmaxComponent, input-dim=1759, output-dim=1759
prior dimension: 0
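The reported parameter-dim can be checked by hand: only the two AffineComponents are updatable, and each one holds a weight matrix plus a bias vector. A small sketch of the arithmetic, where `affine_params` is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: count the trainable parameters of the two AffineComponents.
# affine_params is a hypothetical helper: weight matrix plus bias vector.
affine_params() {
    local in_dim=$1 out_dim=$2
    echo $(( in_dim * out_dim + out_dim ))
}

hidden=$(affine_params 40 100)      # component 2: LDA output -> hidden layer
output=$(affine_params 100 1759)    # component 4: hidden layer -> tree leaves
echo "total updatable parameters: $(( hidden + output ))"
```

The total, 4100 + 177659, matches the parameter-dim of 181759 in the output above. The SpliceComponent, FixedAffineComponent, TanhComponent and SoftmaxComponent contribute nothing to this count.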

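Now that we have an initialized model and labeled training examples, we can train the HMM transitions of the DNN-HMM acoustic model. In GMM-HMM training the transitions are updated during the EM stage, but since DNN training does no realignment, the initial transition probabilities are good enough.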

$cmd $exp_dir/log/train_trans.log \
    nnet-train-transitions \
        $exp_dir/0.mdl \
        "ark:gunzip -c $ali_dir/ali.*.gz|" \
        $exp_dir/0.mdl \
        || exit 1;

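Because we seed with the original 0.mdl and write the result back under the same name, 0.mdl, the only new file this command produces is the log file "train_trans.log".

Kaldi's nnet-train-transitions does the following. It computes the transition probabilities that will be used in the HMMs during decoding (this has nothing to do with the neural network itself), and it computes the prior probabilities of the "targets" (the several thousand context-dependent states). Later, during decoding, the posteriors computed by the network are divided by these priors to obtain "pseudo-likelihoods", which are more compatible with the HMM framework than the raw posteriors.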
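The division of posteriors by priors is easiest to see in the log domain. The sketch below uses made-up numbers, and `log_pseudo_likelihood` is a hypothetical helper rather than a Kaldi command:

```shell
#!/bin/bash
# Sketch: turn a network posterior into a log pseudo-likelihood by dividing
# by the state prior. The numbers are made up for illustration.
log_pseudo_likelihood() {
    local posterior=$1 prior=$2
    awk -v p="$posterior" -v q="$prior" 'BEGIN { printf "%.6f\n", log(p / q) }'
}

# A state whose posterior (0.5) is twice its prior (0.25).
log_pseudo_likelihood 0.5 0.25
```

A state that the network favors more than its prior would suggest gets a positive score; a frequent state with an unremarkable posterior gets a score near zero or below.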

Running nnet-am-info on exp/tri4-si/0.mdl again shows these priors:

nnet-am-info 0.mdl 
num-components 6
num-updatable-components 2
left-context 4
right-context 4
input-dim 13
output-dim 1759
parameter-dim 181759
component 0 : SpliceComponent, input-dim=13, output-dim=117, context=-4 -3 -2 -1 0 1 2 3 4 
component 1 : FixedAffineComponent, input-dim=117, output-dim=40, linear-params-stddev=0.0146923, bias-params-stddev=2.91086
component 2 : AffineComponent, input-dim=40, output-dim=100, linear-params-stddev=0.100784, bias-params-stddev=0.49376, learning-rate=0.02
component 3 : TanhComponent, input-dim=100, output-dim=100
component 4 : AffineComponent, input-dim=100, output-dim=1759, linear-params-stddev=0, bias-params-stddev=0, learning-rate=0.02
component 5 : SoftmaxComponent, input-dim=1759, output-dim=1759
prior dimension: 1759, prior sum: 1, prior min: 1.68406e-05

Next comes the main training loop, which updates the parameters using backpropagation.

if [ $stage -le -2 ]; then

    echo ""
    echo "#################################"
    echo "### BEGIN TRAINING NEURAL NET ###"
    echo "#################################"
    
    # get some info on iterations and number of models we're training
    iters_per_epoch=`cat $egs_dir/iters_per_epoch` || exit 1;
    num_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;
    num_tot_iters=$[$num_epochs * $iters_per_epoch]

    echo "Will train for $num_epochs epochs = $num_tot_iters iterations"
    
    # Main training loop
    x=0
    while [ $x -lt $num_tot_iters ]; do
            
        echo "Training neural net (pass $x)"
        
        # IF *not* first iteration \
        # AND we still have layers to add \
        # AND its the right time to add a layer
        if [ $x -gt 0 ] \
            && [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] \
            && [ $[($x-1) % $add_layers_period] -eq 0 ]; 
        then
            echo "Adding new hidden layer"
            mdl="nnet-init --srand=$x $exp_dir/hidden.config - |"
            mdl="$mdl nnet-insert $exp_dir/$x.mdl - - |" 
        else
            # otherwise just use the past model
            mdl=$exp_dir/$x.mdl
        fi
        
        # Shuffle examples and train nets with SGD
        $cmd JOB=1:$num_jobs_nnet $exp_dir/log/train.$x.JOB.log \
            nnet-shuffle-egs \
                --srand=$x \
                ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark \
                ark:- \| \
            nnet-train-parallel \
                --num-threads=$num_threads \
                --minibatch-size=$minibatch_size \
                --srand=$x \
                "$mdl" \
                ark:- \
                $exp_dir/$[$x+1].JOB.mdl \
                || exit 1;
        
        # Get a list of all the nnets which were run on different jobs
        nnets_list=
        for n in `seq 1 $num_jobs_nnet`; do
            nnets_list="$nnets_list $exp_dir/$[$x+1].$n.mdl"
        done
        
        learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_tot_iters $initial_learning_rate $final_learning_rate`;
        
        # Average all SGD-trained models for this iteration
        $cmd $exp_dir/log/average.$x.log \
            nnet-am-average \
                $nnets_list - \| \
            nnet-am-copy \
                --learning-rate=$learning_rate \
                - \
                $exp_dir/$[$x+1].mdl \
                || exit 1;
        
        # on to the next model
        x=$[$x+1]
        
    done;
    
    # copy and rename final model as final.mdl
    cp $exp_dir/$x.mdl $exp_dir/final.mdl
    
    echo ""
    echo "################################"
    echo "### DONE TRAINING NEURAL NET ###"
    echo "################################"
    
fi
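The perl one-liner in the loop implements an exponentially decaying learning rate that starts near the initial rate and reaches the final rate on the last iteration. A minimal awk rewrite of the same formula, where `learning_rate_at` is a hypothetical helper and the numbers come from run_nnet2.sh above (10 epochs, 2 iterations per epoch, rates 0.02 and 0.004):

```shell
#!/bin/bash
# Sketch: the exponentially decaying learning rate from the training loop,
# rewritten with awk. learning_rate_at is a hypothetical helper name.
learning_rate_at() {
    local x=$1 n=$2 initial=$3 final=$4
    awk -v x="$x" -v n="$n" -v i="$initial" -v f="$final" \
        'BEGIN { printf "%.6f\n", (x >= n) ? f : i * exp(x * log(f / i) / n) }'
}

num_tot_iters=20   # 10 epochs * 2 iterations per epoch
for x in 1 10 20; do
    echo "iteration $x: $(learning_rate_at "$x" "$num_tot_iters" 0.02 0.004)"
done
```

Halfway through training the rate is the geometric mean of the two endpoints, which is what makes the decay exponential rather than linear.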

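In the loop above, the actual training is done by "nnet-train-parallel".

What nnet-train-parallel does: it trains the neural network parameters with backpropagation and stochastic gradient descent using minibatches. It works like nnet-train-simple, but uses multiple threads in a Hogwild-style update (for CPU, not GPU).

With this parallelized training procedure, we therefore actually train several DNNs at every iteration.

As the log file name "train.$x.JOB.log" shows, $x is the iteration number and JOB is the job number. My laptop has only four processors, so I run 4 jobs per iteration. This also means that at every iteration I have to merge these 4 networks in some sensible way, or pick the best one.

The original train_pnorm_simple script is set up to either average the models or pick the best one, for the following reason.

On some iterations the model is unstable, so it is better to pick the best model than to take the average.

An unstable iteration here means the first iteration, or any iteration on which a new hidden layer is added.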
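To see which iterations are affected, the three conditions from the training loop's if-statement can be replayed on their own. In this sketch `layer_add_iters` is a hypothetical helper that lists the iterations on which a hidden layer would be inserted:

```shell
#!/bin/bash
# Sketch: list the iterations at which the training loop inserts a new
# hidden layer, using the same three conditions as the loop's if-statement.
layer_add_iters() {
    local num_hidden_layers=$1 add_layers_period=$2 num_tot_iters=$3
    local x=0 out=""
    while [ "$x" -lt "$num_tot_iters" ]; do
        if [ "$x" -gt 0 ] \
            && [ "$x" -le $(( (num_hidden_layers - 1) * add_layers_period )) ] \
            && [ $(( (x - 1) % add_layers_period )) -eq 0 ]; then
            out="$out $x"
        fi
        x=$(( x + 1 ))
    done
    echo "${out# }"
}

layer_add_iters 3 2 10   # script defaults: 3 hidden layers, period 2
layer_add_iters 2 5 20   # run_nnet2.sh settings: 2 hidden layers, period 5
```

Since the network starts with one hidden layer, a network with N hidden layers sees N-1 insertions, all of them early in training.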

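A tip from the Kaldi scripts:

On iteration 0, or when a layer has just been added, use a smaller minibatch size and only one job. When the model is changing too fast, model averaging does not seem to help (that is, it worsens the objective function), and the smaller minibatch size keeps the update stable.

I have removed the "pick the best job" option from train_simple.sh. This can certainly lead to instability, but it simplifies the training procedure and makes the flow easier to follow.

In addition, the original script normally lets you "mix up" the number of components in the neural network. To keep the network as small as possible, I removed the mixing-up option.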
