Kaldi MMI/bMMI/MPE/sMBR Training and Source Code


Training Pipeline

  1. make_denlats.sh
  • Builds the decoding graph for a unigram (1-gram) language model
  • nnet-latgen-faster generates the denominator lattices, which serve as the denominator in sMBR training, i.e., the competing paths
  2. align.sh
  • Builds a WFST using the reference transcription as G
  • Decoding finds the best path, producing the per-frame numerator alignment, which serves as the numerator in sMBR training, i.e., the reference path
  3. get_egs_discriminative2.sh
    Repackages the data: the numerator alignments and denominator lattices from the two steps above, together with the input features, are organized into the following data structure (with split and excise operations where necessary):
struct DiscriminativeNnetExample {
  /// The weight we assign to this example;
  /// this will typically be one, but we include it
  /// for the sake of generality.  
  BaseFloat weight; 

  /// The numerator alignment
  std::vector<int32> num_ali; 

  /// The denominator lattice.  Note: any acoustic
  /// likelihoods in the denominator lattice will be
  /// recomputed at the time we train.
  CompactLattice den_lat; 

  /// The input data-- typically with a number of frames [NumRows()] larger than
  /// labels.size(), because it includes features to the left and right as
  /// needed for the temporal context of the network.  (see also the
  /// left_context variable).
  /// Caution: when we write this to disk, we do so as CompressedMatrix.
  /// Because we do various manipulations on these things in memory, such
  /// as splitting, we don't want it to be a CompressedMatrix in memory
  /// as this would be wasteful in time and also would lead to further loss of
  /// accuracy.
  Matrix<BaseFloat> input_frames;

  /// The number of frames of left context in the features (we can work out the
  /// #frames of right context from input_frames.NumRows(), num_ali.size(), and
  /// this).
  int32 left_context;


  /// spk_info contains any component of the features that varies slowly or not
  /// at all with time (and hence, we would lose little by averaging it over
  /// time and storing the average).  We'll append this to each of the input
  /// features, if used.
  Vector<BaseFloat> spk_info;
};
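As the comment on left_context notes, the right context is implicit in the stored fields; a hypothetical helper (not part of Kaldi) that recovers it:

// Recover the implicit right context: total input rows minus the labeled
// frames and the left context (see the left_context comment above).
int32 RightContext(const DiscriminativeNnetExample &eg) {
  return eg.input_frames.NumRows()
         - static_cast<int32>(eg.num_ali.size())
         - eg.left_context;
}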
  4. train_discriminative2.sh
  • nnet-combine-egs-discriminative: regroups the egs so that 512 frames form one eg, corresponding to one minibatch
  • Runs the discriminative training itself; the outermost loop controls the number of iterations, i.e., the epochs are converted into an iteration count:
    num_archives = 5000 (e.g., the number of degs archives produced)
    num_jobs_nnet = 4
    num_epochs = 4
    num_iters = num_epochs * num_archives / num_jobs_nnet = 5000
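As a worked example of that conversion (plain C++, using the values listed above):

// Epoch-to-iteration conversion in train_discriminative2.sh.
int num_archives = 5000;  // e.g., the number of degs archives produced
int num_jobs_nnet = 4;    // parallel training jobs per iteration
int num_epochs = 4;
int num_iters = num_epochs * num_archives / num_jobs_nnet;  // = 5000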

Kaldi Source Code

Data

The data members in egs_ include:

  • num_ali
    dimension: number of frames
    the alignment sequence, used as the reference
  • den_lat
    the corresponding denominator lattice
  • input_frames
    dimension: (left_context + number of frames + right_context) × frame_dim
    the acoustic features

The model part includes:

  • am_nnet
    used to compute P(s|o), and updated during training
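The division by the prior that LatticeComputation() performs below is, in the log domain, just a subtraction; a one-line sketch with illustrative names:

// P(o_t|s) ∝ P(s|o_t) / P(s): the pseudo log-likelihood placed on the
// lattice arcs is the network's log-posterior minus the log-prior.
float PseudoLogLikelihood(float log_posterior, float log_prior) {
  return log_posterior - log_prior;
}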

Training Procedure

  1. Propagate()
    Computes the posterior P(s|o) for the input features.
  2. LatticeComputation()
    Computes the loss for the chosen criterion.
  • Divide the P(s|o) from the previous step by the prior P(s) to obtain the likelihood P(o|s), and substitute it for P(o|s) on the lattice arcs
  • Compute, according to the criterion, $post=\frac{\partial J}{\partial \log P(o_t|s)}$
    <1> LatticeForwardBackwardMpeVariants
    computes $post=\frac{\partial J}{\partial \log P(o_t|s)}=r_q\,(c(q)-c_{avg})$
    $r_q$: the posterior (occupancy) probability of arc q, i.e., $\frac{\alpha_q \beta_q}{\sum_r \alpha_r \beta_r}$
    $c(q)$: the average state accuracy of all paths passing through arc q, corresponding to alpha_smbr[q] + beta_smbr[q]
    $c_{avg}$: the average state accuracy over all paths, corresponding to tot_forward_score
    <2> LatticeForwardBackwardMmi
    LatticeForwardBackward: corresponds to the second term $\frac{\sum_{w:s_t=i} p(o^m|s^m)^k P(w)}{\sum_w p(o^m|s^m)^k P(w)}$
    AlignmentToPosterior: corresponds to the first term $\delta(i=s_t^m)$
  • CompObjfAndDeriv
    Computes the derivative with respect to the posterior, which satisfies (see the sketch after this list)
    $\frac{\partial J}{\partial P(s|o_t)}=\frac{\partial J}{\partial \log P(o_t|s)}\,\frac{\partial \log P(o_t|s)}{\partial P(s|o_t)}=post\,\frac{1}{P(s|o_t)}$
  3. Backprop()
    Back-propagates through the network layer by layer.
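A minimal sketch of that last conversion (plain C++ with standard containers in place of Kaldi's Matrix and Posterior types; names are illustrative, not Kaldi's API): each lattice posterior contributes post / P(s|o_t) to the derivative handed to Backprop().

#include <vector>
#include <utility>

typedef float BaseFloat;
// Per frame: a list of (pdf-id, post) pairs, mirroring Kaldi's Posterior.
typedef std::vector<std::vector<std::pair<int, BaseFloat> > > Posterior;

// net_post[t][s] holds P(s|o_t) from Propagate(); 'post' holds
// dJ/dlog P(o_t|s) from LatticeComputation(). Assumes net_post non-empty.
void PosteriorToDeriv(const std::vector<std::vector<BaseFloat> > &net_post,
                      const Posterior &post,
                      std::vector<std::vector<BaseFloat> > *deriv) {
  // One row per frame, one column per pdf, initialized to zero.
  deriv->assign(net_post.size(),
                std::vector<BaseFloat>(net_post[0].size(), 0.0f));
  for (size_t t = 0; t < post.size(); t++)
    for (size_t i = 0; i < post[t].size(); i++) {
      int s = post[t][i].first;
      BaseFloat w = post[t][i].second;       // dJ/dlog P(o_t|s)
      (*deriv)[t][s] += w / net_post[t][s];  // = post * 1/P(s|o_t)
    }
}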

Derivations

For the detailed derivations, refer to this blog post.
The forward-backward algorithm actually used in sMBR training follows the pseudocode in the "Computation for approximate MPE" section of [1]. The quantities $\alpha$ and $\beta$ are easy to follow, but the $\alpha'$ and $\beta'$ introduced to compute $A(s,s^m)$ are less intuitive: they can be read as the average state accuracy of the partial paths reaching an arc.
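As a hedged sketch of that recursion (notation illustrative, following [1] and the structure of LatticeForwardBackwardMpeVariants): for a state $s$ with incoming arcs $q$ from predecessor $p$, arc weight $w(q)$, and per-frame accuracy $A(q)$,

$\alpha(s)=\sum_{p\xrightarrow{q}s}\alpha(p)\,w(q), \qquad \alpha'(s)=\frac{\sum_{p\xrightarrow{q}s}\alpha(p)\,w(q)\,\bigl(\alpha'(p)+A(q)\bigr)}{\alpha(s)}$

with $\beta$ and $\beta'$ defined symmetrically in the backward pass; the average accuracy of the paths through an arc $q=(p\to n)$ is then $c(q)=\alpha'(p)+A(q)+\beta'(n)$.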

Other Notes

  1. Each lattice arc carries two weights:
  • graph cost: LM + transition + pronunciation
  • acoustic cost: $-\log P(o|s)$
  2. The main difference between MPE and sMBR shows up in how $A(s,s^m)$ is computed:
  • MPE: whether the phone at each frame matches the alignment
  • sMBR: whether the state at each frame matches the alignment
  3. The one_silence_class option affects how a match is decided in $A(s,s^m)$ (see the sketch after this list):
  • true: counted as equal if pdf == ref_pdf, or if both pdf and ref_pdf are silence
  • false: counted as equal only if pdf == ref_pdf and pdf is not silence
  4. drop_frames
    According to [2], in MMI training, when a state from the numerator alignment does not appear in the denominator lattice, the gradient for that frame becomes excessively large, so the frame's gradient is discarded. This mostly happens in utterances shorter than 50 frames, and frame rejection (drop_frames) addresses it. Causes of the mismatch include:
  • search error
  • poor match of the acoustics to the model
  • errors in the reference
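The matching rules in points 2 and 3 can be collapsed into a single sketch of the per-frame accuracy decision (illustrative names, modeled on the behaviour described above; the compared unit is a phone for MPE and a state/pdf for sMBR):

// Sketch of the A(s, s^m) decision from points 2 and 3 above.
float FrameAcc(int unit, int ref_unit, bool is_sil, bool ref_is_sil,
               bool one_silence_class) {
  if (one_silence_class)  // silence may match any silence reference
    return (unit == ref_unit || (is_sil && ref_is_sil)) ? 1.0f : 0.0f;
  return (unit == ref_unit && !is_sil) ? 1.0f : 0.0f;  // strict match
}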

Practical Settings

Summarizing from [1]:

  • Setting the lattice beam too small hurts accuracy (MPE is less affected than MMI); regenerating lattices during training yields little gain and is very time-consuming
  • Use a unigram (1-gram) language model
  • Keep the acoustic-model scale the same as in decoding; sometimes a smaller value (e.g., 1/4 of the original) helps

sMBR only pays off with large amounts of training data.

References

[1] D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition," Ph.D. thesis, University of Cambridge, 2003.
[2] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative Training of Deep Neural Networks," in Proc. Interspeech, 2013.
