CS224N notes_chapter8_RNN & LM

Lecture 8: RNN & LM

Language Model

A language model computes a probability for a sequence of words: $P(w_1, w_2, ..., w_T)$

  • Useful for machine translation
    • Word ordering: p(the cat is small)>p(small the is cat)
    • Word choice: p(walking home after school)>p(walking house after school)

The probability is usually conditioned on a window of the previous $n$ words.
An incorrect but necessary Markov assumption!
$$P(w_1, w_2, ..., w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, w_2, ..., w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n}, w_{i-n+1}, ..., w_{i-1})$$
To estimate these probabilities, we can simply count the words (n-grams):
$$p(w_2 \mid w_1) = \frac{count(w_1, w_2)}{count(w_1)}$$
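A minimal Python sketch of this count-based estimate, assuming a toy corpus and optional add-$\alpha$ smoothing (both illustrative, not from the lecture):

```python
# A minimal sketch of a count-based bigram LM; the corpus and smoothing
# constant are illustrative placeholders, not from the notes.
from collections import Counter

corpus = "the cat is small the cat is cute".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2, alpha=1.0):
    """P(w2 | w1) = count(w1, w2) / count(w1), with add-alpha smoothing."""
    vocab_size = len(unigram)
    return (bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * vocab_size)

print(p_bigram("the", "cat"))   # higher: the pair appears in the corpus
print(p_bigram("cat", "the"))   # lower: unseen pair, only smoothing mass
```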

Recurrent Neural Networks

At a single time step:
$$h_t = \sigma(W^{hh} h_{t-1} + W^{hx} x_{[t]}) \\ \hat y_t = \mathrm{softmax}(W^{S} h_t) \\ \hat P(x_{t+1} = v_j \mid x_t, ..., x_1) = \hat y_{t,j}$$
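A small numpy sketch of one forward step under these equations; the dimensions, random weights, and the sigmoid/softmax helpers are illustrative assumptions:

```python
# One RNN time step: h_t = sigma(W_hh h_{t-1} + W_hx x_t), y_hat_t = softmax(W_S h_t).
import numpy as np

rng = np.random.default_rng(0)
V, d_h, d_x = 10, 8, 6                  # vocab size, hidden size, input size (illustrative)

W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hx = rng.normal(scale=0.1, size=(d_h, d_x))
W_S  = rng.normal(scale=0.1, size=(V, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)
    y_hat_t = softmax(W_S @ h_t)        # predicted distribution over the next word
    return h_t, y_hat_t

h0 = np.zeros(d_h)
x1 = rng.normal(size=d_x)               # word vector of the first input word
h1, y_hat_1 = rnn_step(h0, x1)
print(y_hat_1.sum())                    # ~1.0, a valid distribution
```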
Cross-entropy loss at time step $t$:
$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
For the RNN LM, average over the whole corpus:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
Perplexity: $2^J$ (with the logarithm taken base 2).
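A sketch of computing $J$ and the perplexity $2^J$, assuming made-up predicted distributions and target indices, with the log taken base 2 so the perplexity formula holds:

```python
# Corpus-level cross-entropy loss J and perplexity 2^J.
# y_hat and targets are random stand-ins purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
T, V = 5, 10
y_hat = rng.dirichlet(np.ones(V), size=T)   # T predicted distributions over V words
targets = rng.integers(V, size=T)           # true next-word index at each step

# With one-hot y_t, the inner sum over j picks out a single log-probability.
J = -np.mean(np.log2(y_hat[np.arange(T), targets]))
perplexity = 2.0 ** J
print(J, perplexity)
```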

Vanishing gradient problem

$$\begin{aligned} \frac{\partial E_t}{\partial W} &= \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W} \\ \frac{\partial h_t}{\partial h_k} &= \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \end{aligned}$$
Note: each factor $\frac{\partial h_j}{\partial h_{j-1}}$ contains the derivative of the sigmoid activation, which lies in $(0, \frac{1}{4}]$, so the chain rule multiplies many such small factors together and the gradient vanishes over long time spans.
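A quick numerical illustration of this effect: the accumulated Jacobian $\prod_j \mathrm{diag}(\sigma'(a_j))\, W^{hh}$ shrinks as the number of steps grows (the weight scale and hidden size here are illustrative choices):

```python
# Demo of the vanishing-gradient effect: the norm of the accumulated
# Jacobian dh_t/dh_k decays with the number of time steps.
import numpy as np

rng = np.random.default_rng(0)
d_h = 20
W_hh = rng.normal(scale=0.5 / np.sqrt(d_h), size=(d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = rng.normal(size=d_h)
J = np.eye(d_h)                          # accumulated Jacobian dh_t/dh_k
for step in range(1, 31):
    a = W_hh @ h                         # pre-activation (input term omitted)
    h = sigmoid(a)
    # dh_j/dh_{j-1} = diag(sigma'(a_j)) @ W_hh, and sigma'(a) = h(1-h) <= 1/4
    J = np.diag(h * (1.0 - h)) @ W_hh @ J
    if step % 10 == 0:
        print(step, np.linalg.norm(J))   # norm decays toward 0
```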
Trick for exploding gradients: clip the gradient norm. If $\|\hat g\| \geq threshold$, rescale
$$\hat g \leftarrow \frac{threshold}{\|\hat g\|} \hat g$$
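A sketch of this clipping rule in numpy; the threshold value is an illustrative choice:

```python
# Gradient-norm clipping: rescale g whenever its norm exceeds the threshold.
import numpy as np

def clip_gradient(g, threshold=5.0):
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g       # g keeps its direction, norm == threshold
    return g

g = np.array([30.0, -40.0])              # ||g|| = 50
print(clip_gradient(g))                  # rescaled to norm 5
```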
The full-vocabulary softmax is too slow; factor it through word classes:
$$\begin{aligned} p(w_t \mid history) &= p(c_t \mid history)\, p(w_t \mid c_t) \\ &= p(c_t \mid h_t)\, p(w_t \mid c_t) \end{aligned}$$
More classes → better perplexity but worse speed.
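A toy sketch of the two-step (class-based) softmax, assuming a fixed assignment of words to classes and random weights purely for illustration:

```python
# Class-based softmax: predict a word class from h_t, then a word within the class.
import numpy as np

rng = np.random.default_rng(0)
d_h, n_classes, words_per_class = 8, 4, 5        # |V| = 20 (illustrative sizes)

W_class = rng.normal(scale=0.1, size=(n_classes, d_h))
W_word  = rng.normal(scale=0.1, size=(n_classes, words_per_class, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_word(h_t, c, j):
    """p(w_t = word j of class c | h_t) = p(c | h_t) * p(j | c, h_t)."""
    p_c = softmax(W_class @ h_t)[c]
    p_w_given_c = softmax(W_word[c] @ h_t)[j]
    return p_c * p_w_given_c

h_t = rng.normal(size=d_h)
# Only |C| + |V|/|C| scores are needed per word instead of |V|.
print(p_word(h_t, c=2, j=3))
```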

Bidirectional RNNs

$$\begin{aligned} h^{(1)}_t &= f(W^{(1)} x_t + V^{(1)} h^{(1)}_{t-1} + b^{(1)}) \\ h^{(2)}_t &= f(W^{(2)} x_t + V^{(2)} h^{(2)}_{t+1} + b^{(2)}) \\ y_t &= g(U[h^{(1)}_t; h^{(2)}_t] + c) \end{aligned}$$
Here $h^{(1)}$ runs forward through the sequence and $h^{(2)}$ runs backward (note the dependence on $h^{(2)}_{t+1}$); the output at each step combines both directions.
Stacking several bidirectional layers turns a BiRNN into a deep bidirectional RNN; a forward-pass sketch follows below.
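A numpy sketch of a single bidirectional layer's forward pass under these equations; the sizes, tanh as $f$, identity as $g$, and random weights are all illustrative assumptions:

```python
# Bidirectional RNN: one pass left-to-right, one right-to-left, concatenated per step.
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h, d_y = 6, 5, 4, 3
X = rng.normal(size=(T, d_x))            # input word vectors x_1..x_T

def init(shape):
    return rng.normal(scale=0.1, size=shape)

W1, V1, b1 = init((d_h, d_x)), init((d_h, d_h)), np.zeros(d_h)   # forward direction
W2, V2, b2 = init((d_h, d_x)), init((d_h, d_h)), np.zeros(d_h)   # backward direction
U, c = init((d_y, 2 * d_h)), np.zeros(d_y)

h_fwd = np.zeros((T, d_h))
h_bwd = np.zeros((T, d_h))

h_prev = np.zeros(d_h)
for t in range(T):                        # left to right: depends on h_{t-1}
    h_prev = np.tanh(W1 @ X[t] + V1 @ h_prev + b1)
    h_fwd[t] = h_prev

h_next = np.zeros(d_h)
for t in reversed(range(T)):              # right to left: depends on h_{t+1}
    h_next = np.tanh(W2 @ X[t] + V2 @ h_next + b2)
    h_bwd[t] = h_next

# y_t = g(U [h_fwd_t ; h_bwd_t] + c); g is the identity here for simplicity
Y = np.concatenate([h_fwd, h_bwd], axis=1) @ U.T + c
print(Y.shape)                            # (T, d_y)
```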
