The CE criterion used in DNN training optimizes frame-by-frame classification, minimizing the frame error rate. Speech recognition, however, is really a sequence classification problem, where sequence-level accuracy is what matters. Sequence-discriminative training (SDT) uses training criteria that better match the actual task, which helps recognition accuracy. Common criteria include MMI/BMMI, MPE, and sMBR.
| Criterion | Optimization target |
| --- | --- |
| CE | frame error rate |
| MMI/BMMI | sentence correctness |
| MPE | phone error rate |
| sMBR | state error rate |
## MMI
The MMI (maximum mutual information) criterion maximizes the mutual information between the distributions of the observation sequence and the word sequence, thereby reducing the sentence error rate.
Assume an observation sequence $o^m=o_1^m,\ldots,o_{T_m}^m$ and a word sequence $w^m=w_1^m,\ldots,w_{N_m}^m$, where $m$ indexes the utterance, $T_m$ is the number of frames, and $N_m$ is the number of words. With the training set $S=\{(o^m,w^m)\mid 0\le m\le M\}$, the MMI criterion can be written as:
$$J_{MMI}(\theta;S)=\sum_{m=1}^M J_{MMI}(\theta;o^m,w^m)=\sum_{m=1}^M \log P(w^m|o^m;\theta)$$

$$=\sum_{m=1}^M \log\frac{p(o^m|s^m;\theta)^k P(w^m)}{\sum_w p(o^m|s^w;\theta)^k P(w)}$$
where $k$ is the acoustic scale, $\theta$ the model parameters, and $s^m$ the state sequence corresponding to $w^m$. Intuitively, the numerator is the total (acoustic plus language) score of the path for the correct transcription, while the denominator is the total score summed over all paths; for computational tractability the denominator sum is in practice approximated with a lattice.
The gradient of the objective with respect to the model parameters can be written as:

$$\nabla J_{MMI}(\theta;S)=\sum_m\sum_t \nabla_{z_{mt}^L}J_{MMI}(\theta;o^m,w^m)\,\frac{\partial z_{mt}^L}{\partial \theta}=\sum_m\sum_t \ddot{e}_{mt}^L\,\frac{\partial z_{mt}^L}{\partial \theta}$$
where $z_{mt}^L$ denotes the input to the softmax layer (before the softmax is applied). The difference from the CE criterion lies entirely in the error signal $\ddot{e}_{mt}^L$, which expands to:
$$\ddot{e}_{mt}^L(i)=\nabla_{z_{mt}^L(i)}J_{MMI}(\theta;o^m,w^m)=\sum_r \frac{\partial J_{MMI}(\theta;o^m,w^m)}{\partial \log p(o_t^m|r)}\,\frac{\partial \log p(o_t^m|r)}{\partial z_{mt}^L(i)}$$
### Part 1
The first factor is:

$$\frac{\partial J_{MMI}(\theta;o^m,w^m)}{\partial \log p(o_t^m|r)}=\frac{\partial \log\frac{p(o^m|s^m)^k P(w^m)}{\sum_w p(o^m|s^w)^k P(w)}}{\partial \log p(o_t^m|r)}$$
$$=k\,\frac{\partial \log p(o^m|s^m)}{\partial \log p(o_t^m|r)}-\frac{\partial \log\sum_w p(o^m|s^w)^k P(w)}{\partial \log p(o_t^m|r)}$$
Since $p(o^m|s^m)=p(o_1^m|s_1^m)\,p(o_2^m|s_2^m)\cdots p(o_{T_m}^m|s_{T_m}^m)$, the first term simplifies to:
$$k\,\frac{\partial \log p(o^m|s^m)}{\partial \log p(o_t^m|r)}=k\,\delta(r=s_t^m)$$
The second term can be differentiated step by step:
$$\frac{\partial \log\sum_w p(o^m|s^w)^k P(w)}{\partial \log p(o_t^m|r)}=\frac{\partial \log\sum_w e^{\log\left(p(o^m|s^w)^k P(w)\right)}}{\partial \log p(o_t^m|r)}$$

$$=\frac{1}{\sum_w e^{\log\left(p(o^m|s^w)^k P(w)\right)}}\,\frac{\partial \sum_w e^{\log\left(p(o^m|s^w)^k P(w)\right)}}{\partial \log p(o_t^m|r)}$$

$$=\frac{1}{\sum_w p(o^m|s^w)^k P(w)}\,\sum_w p(o^m|s^w)^k P(w)\,\frac{\partial \log\left(p(o^m|s^w)^k P(w)\right)}{\partial \log p(o_t^m|r)}$$

Since $\log\left(p(o^m|s^w)^k P(w)\right)=k\sum_\tau \log p(o_\tau^m|s_\tau^w)+\log P(w)$, its derivative with respect to $\log p(o_t^m|r)$ is $k\,\delta(s_t^w=r)$, so:

$$=\frac{1}{\sum_w p(o^m|s^w)^k P(w)}\,\sum_w p(o^m|s^w)^k P(w)\,k\,\delta(s_t^w=r)$$

$$=k\,\frac{\sum_{w:s_t^w=r}\, p(o^m|s^w)^k P(w)}{\sum_w p(o^m|s^w)^k P(w)}$$
Combining the first and second terms:

$$\frac{\partial J_{MMI}(\theta;o^m,w^m)}{\partial \log p(o_t^m|r)}=k\left(\delta(r=s_t^m)-\frac{\sum_{w:s_t^w=r}\, p(o^m|s^w)^k P(w)}{\sum_w p(o^m|s^w)^k P(w)}\right)$$

The second term inside the parentheses is the denominator-lattice posterior of occupying state $r$ at time $t$; it will be written $\ddot{\gamma}_{mt}^{DEN}(r)$ below.
### Part 2
Using Bayes' rule, $p(x|y)\,p(y)=p(y|x)\,p(x)$, the second factor can be written as:
$$\frac{\partial \log p(o_t^m|r)}{\partial z_{mt}^L(i)}=\frac{\partial\left[\log p(r|o_t^m)-\log p(r)+\log p(o_t^m)\right]}{\partial z_{mt}^L(i)}=\frac{\partial \log p(r|o_t^m)}{\partial z_{mt}^L(i)}$$
where $p(r|o_t^m)$ is the $r$-th output of the DNN:

$$p(r|o_t^m)=\mathrm{softmax}_r(z_{mt}^L)=\frac{e^{z_{mt}^L(r)}}{\sum_j e^{z_{mt}^L(j)}}$$
Therefore,

$$\frac{\partial \log p(o_t^m|r)}{\partial z_{mt}^L(i)}=\delta(r=i)$$
Strictly speaking this drops a term: the softmax denominator also contains $z_{mt}^L(i)$, so the full derivative is $\frac{\partial \log p(r|o_t^m)}{\partial z_{mt}^L(i)}=\delta(r=i)-p(i|o_t^m)$. The simplification is harmless, however: the extra $-p(i|o_t^m)$ term contributes $-p(i|o_t^m)\sum_r k\left(\delta(r=s_t^m)-\ddot{\gamma}_{mt}^{DEN}(r)\right)=0$ after the sum over $r$, because the Kronecker delta and the posteriors $\ddot{\gamma}_{mt}^{DEN}(r)$ each sum to one. The final error signal is therefore unaffected.
Combining the two parts gives the final error signal:
$$\ddot{e}_{mt}^L(i)=k\left(\delta(i=s_t^m)-\frac{\sum_{w:s_t^w=i}\, p(o^m|s^w)^k P(w)}{\sum_w p(o^m|s^w)^k P(w)}\right)=k\left(\delta(i=s_t^m)-\ddot{\gamma}_{mt}^{DEN}(i)\right)$$
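As a sanity check on the Part 2 simplification, the following sketch (all values are made-up toy numbers) computes the error signal twice: once through the full chain rule with the exact log-softmax Jacobian $\delta(r=i)-p(i|o_t^m)$, and once with the simplified formula $k\left(\delta(i=s_t^m)-\ddot{\gamma}^{DEN}(i)\right)$. The two agree because the extra Jacobian term cancels:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, k, s_t = 5, 0.1, 2                  # toy state count, acoustic scale, reference state

z = rng.normal(size=n_states)                 # pre-softmax activations z_{mt}^L
y = np.exp(z) / np.exp(z).sum()               # softmax outputs p(i|o_t^m)
gamma_den = rng.dirichlet(np.ones(n_states))  # toy denominator posteriors

# Part 1 factor: dJ/d log p(o_t|r) = k * (delta(r = s_t) - gamma_den(r))
part1 = k * ((np.arange(n_states) == s_t) - gamma_den)

# Exact Part 2 factor: d log p(r|o_t) / d z(i) = delta(r = i) - y(i)
jacobian = np.eye(n_states) - y[None, :]
e_exact = part1 @ jacobian                    # sum over r

# Simplified formula: k * (delta(i = s_t) - gamma_den(i))
e_simple = k * ((np.arange(n_states) == s_t) - gamma_den)

assert np.allclose(e_exact, e_simple)         # the -y(i) term cancels: part1 sums to 0
```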
## Boosted MMI
$$J_{BMMI}(\theta;S)=\sum_{m=1}^M J_{BMMI}(\theta;o^m,w^m)=\sum_{m=1}^M \log\frac{P(w^m|o^m)}{\sum_w P(w|o^m)\,e^{-bA(w,w^m)}}$$

$$=\sum_{m=1}^M \log\frac{p(o^m|s^m)^k P(w^m)}{\sum_w p(o^m|s^w)^k P(w)\,e^{-bA(w,w^m)}}$$
Compared with MMI, BMMI adds a weighting factor $e^{-bA(w,w^m)}$ to each denominator term, where typically $b=0.5$ and $A(w,w^m)$ is an accuracy measure between $w$ and $w^m$, which can be computed at the word, phoneme, or state level.
Intuition: as [3] puts it, "We boost the likelihood of the sentences that have more errors, thus generating more confusable data. Boosted MMI can be viewed as trying to enforce a soft margin that is proportional to the number of errors in a hypothesised sentence."
In terms of the parameters: the closer $w$ is to $w^m$ (fewer word errors), the larger $A(w,w^m)$ and hence the smaller the weight $e^{-bA(w,w^m)}$; conversely, hypotheses with more errors receive larger weights, making the training data more confusable.
The error signal can be derived in the same way as for MMI:

$$\ddot{e}_{mt}^L(i)=k\left(\delta(i=s_t^m)-\frac{\sum_{w:s_t^w=i}\, p(o^m|s^w)^k P(w)\,e^{-bA(w,w^m)}}{\sum_w p(o^m|s^w)^k P(w)\,e^{-bA(w,w^m)}}\right)$$
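The boosting amounts to rescaling each denominator hypothesis score by $e^{-bA(w,w^m)}$ before posteriors are computed. A minimal sketch along the lines of the earlier MMI example (the hypothesis scores and accuracies are toy assumptions):

```python
import numpy as np
from scipy.special import logsumexp

log_path = np.array([-18.0, -18.3, -19.1, -20.0])  # k*log p(o|s^w) + log P(w), toy values
accuracy = np.array([4.0, 3.0, 1.0, 0.0])          # A(w, w^m), e.g. number of correct words
b = 0.5

# BMMI: subtract b*A(w, w^m) from each denominator log score,
# i.e. multiply scores by e^{-b A(w, w^m)}: worse hypotheses are boosted.
boosted = log_path - b * accuracy
gamma_den = np.exp(boosted - logsumexp(boosted))   # boosted denominator posteriors
print(gamma_den)  # error-prone hypotheses now carry more posterior mass
```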
## MPE/sMBR
The MBR (minimum Bayes risk) family of objectives minimizes the expected error at some chosen granularity (equivalently, it maximizes the expected accuracy): MPE minimizes phone-level errors, sMBR state-level errors. The objective is:
$$J_{MBR}(\theta;S)=\sum_{m=1}^M J_{MBR}(\theta;o^m,w^m)=\sum_{m=1}^M \sum_w P(w|o^m)\,A(w,w^m)$$

$$=\sum_{m=1}^M \frac{\sum_w P(o^m|s^w)^k P(w)\,A(w,w^m)}{\sum_{w'} P(o^m|s^{w'})^k P(w')}$$
where $A(w,w^m)$ measures the agreement between the two sequences: for MPE it is the number of correct phones, for sMBR the number of correct states.
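In other words, $J_{MBR}$ is the posterior-weighted average accuracy over the competing hypotheses. A minimal sketch with toy values (a real system would sum over lattice paths rather than an explicit hypothesis list):

```python
import numpy as np
from scipy.special import logsumexp

log_path = np.array([-18.0, -18.3, -19.1, -20.0])  # k*log P(o^m|s^w) + log P(w), toy values
accuracy = np.array([5.0, 4.0, 2.0, 1.0])          # A(w, w^m): correct phones (MPE) or states (sMBR)

posterior = np.exp(log_path - logsumexp(log_path)) # P(w|o^m) over the toy hypothesis set
j_mbr = posterior @ accuracy                       # expected accuracy, to be maximized
print(j_mbr)
```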
Differentiating:

$$\ddot{e}_{mt}^L(i)=\nabla_{z_{mt}^L(i)}J_{MBR}(\theta;o^m,w^m)=\sum_r \frac{\partial J_{MBR}(\theta;o^m,w^m)}{\partial \log p(o_t^m|r)}\,\frac{\partial \log p(o_t^m|r)}{\partial z_{mt}^L(i)}$$
### Part 1
For MPE, following [4], first split the sums in the numerator and denominator of $J_{MBR}(\theta;o^m,s^m)$ into two parts: the state sequences $s$ with $r\in s$ and those with $r\notin s$ (where $r\in s$ is shorthand for $s_t=r$ at the current frame $t$):
$$J_{MBR}(\theta;o^m,s^m)=\frac{\sum_s P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s'} P(o^m|s')^k P(s')}$$

$$=\frac{\sum_{s:r\in s} P(o^m|s)^k P(s)\,A(s,s^m)+\sum_{s:r\notin s} P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s':r\in s'} P(o^m|s')^k P(s')+\sum_{s':r\notin s'} P(o^m|s')^k P(s')}$$
- If $r\in s$, the derivative satisfies:
$$\frac{\partial P(o^m|s)^k}{\partial \log p(o_t^m|r)}=\frac{\partial e^{k\,\log P(o^m|s)}}{\partial \log p(o_t^m|r)}=k\,P(o^m|s)^k$$
- If $r\notin s$, the derivative vanishes:
$$\frac{\partial P(o^m|s)^k}{\partial \log p(o_t^m|r)}=0$$
Applying the quotient rule, it follows that:
$$\frac{\partial J_{MBR}(\theta;o^m,s^m)}{\partial \log p(o_t^m|r)}=k\,\frac{\sum_{s:r\in s} P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s'} P(o^m|s')^k P(s')}-k\,\frac{\sum_s P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s'} P(o^m|s')^k P(s')}\cdot\frac{\sum_{s:r\in s} P(o^m|s)^k P(s)}{\sum_{s'} P(o^m|s')^k P(s')}$$
This can be written compactly as:
$$\frac{\partial J_{MBR}(\theta;o^m,s^m)}{\partial \log p(o_t^m|r)}=k\,\ddot{\gamma}_{mt}^{DEN}(r)\left(\bar{A}^m(r=s_t^m)-\bar{A}^m\right)$$
with the following definitions:
$$\ddot{\gamma}_{mt}^{DEN}(r)=\frac{\sum_{s:r\in s} P(o^m|s)^k P(s)}{\sum_{s'} P(o^m|s')^k P(s')}$$
$$\bar{A}^m=\frac{\sum_s P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s'} P(o^m|s')^k P(s')}$$
$$\bar{A}^m(r=s_t^m)=\mathbb{E}\left[A(s,s^m)\,\middle|\,r\in s\right]=\frac{\sum_{s:r\in s} P(o^m|s)^k P(s)\,A(s,s^m)}{\sum_{s':r\in s'} P(o^m|s')^k P(s')}$$
- The first quantity is the occupancy statistic: the denominator-lattice posterior of being in state $r$ at time $t$.
- The second is the average accuracy over all paths in the lattice.
- The third is the average accuracy over all lattice paths that pass through $r$ at time $t$, i.e. the mean of $A(s,s^m)$ restricted to those paths; substituting the three definitions back into the compact form recovers the full expression above.
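On an explicit toy path list the three statistics, and the resulting gradient, can be computed directly (a real implementation accumulates them with forward-backward passes over the lattice; all numbers below are made-up):

```python
import numpy as np
from scipy.special import logsumexp

# Toy denominator "lattice": 4 explicit paths, their combined log scores
# k*log P(o^m|s) + log P(s), their accuracies A(s, s^m), and whether each
# path passes through state r at the current frame t.
log_score = np.array([-18.0, -18.3, -19.1, -20.0])
accuracy = np.array([5.0, 4.0, 2.0, 1.0])
through_r = np.array([True, True, False, False])
k = 0.1

posterior = np.exp(log_score - logsumexp(log_score))

gamma_den_r = posterior[through_r].sum()   # occupancy of r at time t
a_bar = posterior @ accuracy               # average accuracy over all paths
a_bar_r = (posterior[through_r] @ accuracy[through_r]) / gamma_den_r  # through r only

grad_r = k * gamma_den_r * (a_bar_r - a_bar)  # dJ_MBR / d log p(o_t^m | r)
print(gamma_den_r, a_bar, a_bar_r, grad_r)
```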
### Part 2
The second factor is the same as in the MMI derivation.
## tricks
### lattice generation
Generating high-quality lattices is important for discriminative training: use the best available model both to generate the lattices and as the seed model.
### lattice compensation
If the lattices are poorly generated, the computed gradients can be abnormal, e.g. when the reference path in the numerator does not appear in the denominator lattice. This is especially common for silence frames: silence often appears in the numerator lattice but is easily dropped from the denominator lattice. Possible remedies:
- frame rejection: simply drop such frames
- correct the lattice against the reference hypothesis, e.g. artificially add silence arcs to the lattice
### frame smoothing
SDT overfits easily, for two reasons:
- sparse lattices
- compared with frame-level training, sequence-level training raises the modeling granularity, so the posterior distribution fitted on the training set diverges more easily from that of the test set

Overfitting can be reduced by modifying the training criterion to interpolate the sequence criterion with the frame criterion:
$$J_{FS\text{-}SEQ}(\theta;S)=(1-H)\,J_{CE}(\theta;S)+H\,J_{SEQ}(\theta;S)$$
$H$ is called the smoothing factor; empirically it is set between $4/5$ and $10/11$.
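Frame smoothing is then just a convex combination of the two objectives. A minimal sketch reusing the toy MMI example above (all values are assumptions):

```python
import numpy as np
from scipy.special import logsumexp

H = 10 / 11  # smoothing factor, empirically between 4/5 and 10/11

# Toy frame criterion: sum of per-frame log posteriors of the reference states
log_post_ref = np.array([-0.2, -0.5, -0.1, -0.3])
j_ce = log_post_ref.sum()

# Toy sequence criterion: the MMI objective from the earlier sketch
log_path = np.array([-18.0, -18.3, -19.1, -20.0])  # toy path scores
j_seq = log_path[0] - logsumexp(log_path)          # reference at index 0

# Frame-smoothed objective (both terms are maximized)
j_fs_seq = (1 - H) * j_ce + H * j_seq
print(j_fs_seq)
```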
### learning rate
The learning rate for SDT should be smaller than for CE, because:
- SDT usually starts from a model already trained with CE
- SDT training is prone to overfitting
### criterion selection
sMBR tends to perform slightly better than the other criteria; MMI is easier to understand and implement.
### noise contrastive estimation
NCE can be used to speed up training.
## References
[1] *Automatic Speech Recognition: A Deep Learning Approach*, chapter 8.
[2] Veselý et al., "Sequence-discriminative training of deep neural networks".
[3] Povey et al., "Boosted MMI for model and feature-space discriminative training".
[4] Povey, "Discriminative training for large vocabulary speech recognition" (Daniel Povey's PhD thesis, chapter 6).