Lecture 8: RNN & LM
Language Model
A language model computes a probability for a sequence of words: $P(w_1, w_2, \dots, w_T)$.
- Useful for machine translation
- Word ordering: p(the cat is small) > p(small the is cat)
- Word choice: p(walking home after school) > p(walking house after school)
The probability is usually conditioned on a window of the previous n words: an incorrect but necessary Markov assumption!
$$P(w_1, w_2, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n}, w_{i-n+1}, \dots, w_{i-1})$$
To estimate these probabilities, we can simply count words (n-grams):
$$p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}$$
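A minimal sketch of this count-based estimate and of the Markov factorization above, assuming a toy pre-tokenized corpus (the corpus and all names here are illustrative, not from the lecture):

```python
from collections import Counter

# Toy pre-tokenized corpus (an illustration only).
corpus = [["the", "cat", "is", "small"], ["the", "dog", "is", "small"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood estimate p(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sentence_prob(sent):
    """Markov (bigram) factorization: product of p(w_i | w_{i-1}),
    starting from the second word (p(w_1) itself is not modeled here)."""
    prob = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(p_bigram("cat", "the"))                        # 0.5
print(sentence_prob(["the", "cat", "is", "small"]))  # 0.5 * 1.0 * 1.0 = 0.5
```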
Recurrent Neural Networks
At a single time step:
$$h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]})$$
$$\hat{y}_t = \mathrm{softmax}(W^{(S)} h_t)$$
$$\hat{P}(x_{t+1} = v_j \mid x_t, \dots, x_1) = \hat{y}_{t,j}$$
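A minimal numpy sketch of one such time step, assuming sigmoid for $\sigma$ and tiny random weights (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                 # vocabulary size |V|, hidden size (toy)

W_hh = rng.normal(scale=0.1, size=(d, d))    # hidden-to-hidden
W_hx = rng.normal(scale=0.1, size=(d, d))    # input-to-hidden (word vectors are d-dim here)
W_s = rng.normal(scale=0.1, size=(V, d))     # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t), y_hat = softmax(W_s h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    y_hat = softmax(W_s @ h_t)               # y_hat[j] = P(x_{t+1} = v_j | x_t, ..., x_1)
    return h_t, y_hat

h = np.zeros(d)                              # h_0
x = rng.normal(size=d)                       # word vector for the current word
h, y_hat = rnn_step(h, x)
print(y_hat.sum())                           # ~1.0: a distribution over the vocabulary
```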
Loss function (cross-entropy at time step $t$):
$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
RNN LM
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
Perplexity: $2^{J}$
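A quick numerical check of the loss and perplexity on made-up distributions. Note $2^J$ presumes $J$ uses log base 2; with the natural log, perplexity is $e^J$, and the sketch shows the two agree:

```python
import numpy as np

# Toy values: model distributions over |V| words at each of T steps,
# plus the indices of the true target words.
rng = np.random.default_rng(1)
T, V = 5, 10
y_hats = rng.dirichlet(np.ones(V), size=T)   # each row sums to 1
targets = rng.integers(0, V, size=T)

# One-hot y_{t,j} picks a single term per step:
# J = -(1/T) * sum_t log y_hat[t, target_t]
J_nats = -np.mean(np.log(y_hats[np.arange(T), targets]))
J_bits = -np.mean(np.log2(y_hats[np.arange(T), targets]))

print(2.0 ** J_bits, np.exp(J_nats))         # identical perplexity either way
```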
Vanishing gradient problem
$$\begin{aligned} \frac{\partial E_t}{\partial W} &= \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W} \\ \frac{\partial h_t}{\partial h_k} &= \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \end{aligned}$$
When we compute the derivative $\frac{\partial h_j}{\partial h_{j-1}}$, note that the sigmoid activation has range $(0,1)$, so its derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is at most $1/4$. The chain rule therefore multiplies many such factors smaller than one, so the product shrinks exponentially and the gradient vanishes.
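A small numpy demonstration of this effect: iterating the hidden state and accumulating the Jacobian product $\prod_j \mathrm{diag}(\sigma'(a_j))\, W^{(hh)}$ shows its norm collapsing. Weights and sizes are toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_hh = rng.normal(scale=0.5, size=(d, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Iterate the hidden state alone (dropping the input term only changes the
# pre-activations a_j, not the Jacobian structure) and accumulate
# prod_j dh_j/dh_{j-1}, where dh_j/dh_{j-1} = diag(sigma'(a_j)) @ W_hh
# and sigma'(a) = s(1 - s) <= 1/4.
h = rng.normal(size=d)
jac = np.eye(d)
for t in range(1, 31):
    a = W_hh @ h
    h = sigmoid(a)
    jac = (np.diag(h * (1.0 - h)) @ W_hh) @ jac
    if t % 10 == 0:
        print(t, np.linalg.norm(jac))        # norm shrinks roughly geometrically
```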
Trick for exploding gradients: clip the gradient back to a fixed norm whenever it exceeds a threshold:
$$\hat{g} \leftarrow \frac{\text{threshold}}{\lVert \hat{g} \rVert}\, \hat{g} \qquad \text{when } \lVert \hat{g} \rVert > \text{threshold}$$
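A sketch of this clipping rule (the threshold value here is an arbitrary illustration):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g to have norm `threshold` whenever its norm exceeds it;
    gradients already inside the ball are left untouched."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([30.0, 40.0])                   # ||g|| = 50
print(clip_gradient(g))                      # [3. 4.], rescaled to norm 5
```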
The softmax over the full vocabulary is too slow; factor it through word classes:
$$\begin{aligned} p(w_t \mid \text{history}) &= p(c_t \mid \text{history})\, p(w_t \mid c_t) \\ &= p(c_t \mid h_t)\, p(w_t \mid c_t) \end{aligned}$$
More classes -> better perplexity but worse speed.
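A sketch of the class-factored softmax, with a made-up fixed word-to-class assignment and toy sizes; it checks that the factorization still yields a proper distribution over the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(3)
V, C, d = 12, 3, 4                           # vocab size, #classes, hidden size (toy)
word_class = np.arange(V) % C                # fixed word -> class map (illustrative)

W_c = rng.normal(scale=0.1, size=(C, d))     # class scores from h_t
W_w = rng.normal(scale=0.1, size=(V, d))     # within-class word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_word(h_t, w):
    """p(w_t | h_t) = p(c_t | h_t) * p(w_t | c_t): two small softmaxes
    (over C classes, then over the words in one class) instead of one over |V|."""
    c = word_class[w]
    p_class = softmax(W_c @ h_t)[c]
    members = np.flatnonzero(word_class == c)            # words belonging to class c
    p_in_class = softmax(W_w[members] @ h_t)[np.where(members == w)[0][0]]
    return p_class * p_in_class

h = rng.normal(size=d)
print(sum(p_word(h, w) for w in range(V)))   # ~1.0: still a valid distribution
```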
Bidirectional RNNs
$$\begin{aligned} h^{(1)}_t &= f(W^{(1)} x_t + V^{(1)} h^{(1)}_{t-1} + b^{(1)}) \\ h^{(2)}_t &= f(W^{(2)} x_t + V^{(2)} h^{(2)}_{t+1} + b^{(2)}) \\ y_t &= g(U[h^{(1)}_t; h^{(2)}_t] + c) \end{aligned}$$
Here the forward states $h^{(1)}_t$ read the sequence left to right, while the backward states $h^{(2)}_t$ read it right to left, which is why they recur on $h^{(2)}_{t+1}$.
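A numpy sketch of the two passes, assuming tanh for $f$ and taking $g$ as the identity for simplicity (all sizes and names are toy choices):

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_in, d_h = 6, 3, 4                       # sequence length, input dim, hidden dim
xs = rng.normal(size=(T, d_in))

def init(shape):
    return rng.normal(scale=0.1, size=shape)

W1, V1, b1 = init((d_h, d_in)), init((d_h, d_h)), np.zeros(d_h)   # forward params
W2, V2, b2 = init((d_h, d_in)), init((d_h, d_h)), np.zeros(d_h)   # backward params
U, c = init((2, 2 * d_h)), np.zeros(2)                            # 2 output classes (toy)

f = np.tanh

# Forward pass: h1_t depends on the past; backward pass: h2_t depends on the future.
h1 = np.zeros((T, d_h))
h2 = np.zeros((T, d_h))
for t in range(T):
    prev = h1[t - 1] if t > 0 else np.zeros(d_h)
    h1[t] = f(W1 @ xs[t] + V1 @ prev + b1)
for t in reversed(range(T)):
    nxt = h2[t + 1] if t < T - 1 else np.zeros(d_h)
    h2[t] = f(W2 @ xs[t] + V2 @ nxt + b2)

# Each y_t sees both left and right context via the concatenation [h1_t; h2_t].
ys = np.array([U @ np.concatenate([h1[t], h2[t]]) + c for t in range(T)])
print(ys.shape)                              # (T, 2)
```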
Stacking several BiRNN layers gives deep bidirectional RNNs.