Lecture 8: RNN & LM
Language Model
A language model computes a probability for a sequence of words: $P(w_1, w_2, \dots, w_T)$.
- Useful for machine translation
- Word ordering: p(the cat is small) > p(small the is cat)
- Word choice: p(walking home after school) > p(walking house after school)
The probability is usually conditioned on a window of the previous n words: an incorrect but necessary Markov assumption!
$$P(w_1, w_2, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n}, w_{i-n+1}, \dots, w_{i-1})$$
To estimate these probabilities, we can simply count words (n-grams):
$$p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}$$
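A minimal sketch of this count-based estimate and of the Markov factorization above, assuming a toy pre-tokenized corpus (the corpus and all names here are illustrative, not from the lecture):

```python
from collections import Counter

# Toy pre-tokenized corpus (an illustration only).
corpus = [["the", "cat", "is", "small"], ["the", "dog", "is", "small"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood estimate p(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sentence_prob(sent):
    """Markov (bigram) factorization: product of p(w_i | w_{i-1}),
    starting from the second word (p(w_1) itself is not modeled here)."""
    prob = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(p_bigram("cat", "the"))                        # 0.5
print(sentence_prob(["the", "cat", "is", "small"]))  # 0.5 * 1.0 * 1.0 = 0.5
```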
Recurrent Neural Networks
At a single time step:
$$h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]})$$
$$\hat{y}_t = \mathrm{softmax}(W^{(S)} h_t)$$
$$\hat{P}(x_{t+1} = v_j \mid x_t, \dots, x_1) = \hat{y}_{t,j}$$
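A minimal numpy sketch of one such time step, assuming sigmoid for $\sigma$ and tiny random weights (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                 # vocabulary size |V|, hidden size (toy)

W_hh = rng.normal(scale=0.1, size=(d, d))    # hidden-to-hidden
W_hx = rng.normal(scale=0.1, size=(d, d))    # input-to-hidden (word vectors are d-dim here)
W_s = rng.normal(scale=0.1, size=(V, d))     # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t), y_hat = softmax(W_s h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    y_hat = softmax(W_s @ h_t)               # y_hat[j] = P(x_{t+1} = v_j | x_t, ..., x_1)
    return h_t, y_hat

h = np.zeros(d)                              # h_0
x = rng.normal(size=d)                       # word vector for the current word
h, y_hat = rnn_step(h, x)
print(y_hat.sum())                           # ~1.0: a distribution over the vocabulary
```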
Loss function (cross-entropy at time step $t$):
$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
RNN LM
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$
Perplexity: $2^{J}$
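A quick numerical check of the loss and perplexity on made-up distributions. Note $2^J$ presumes $J$ uses log base 2; with the natural log, perplexity is $e^J$, and the sketch shows the two agree:

```python
import numpy as np

# Toy values: model distributions over |V| words at each of T steps,
# plus the indices of the true target words.
rng = np.random.default_rng(1)
T, V = 5, 10
y_hats = rng.dirichlet(np.ones(V), size=T)   # each row sums to 1
targets = rng.integers(0, V, size=T)

# One-hot y_{t,j} picks a single term per step:
# J = -(1/T) * sum_t log y_hat[t, target_t]
J_nats = -np.mean(np.log(y_hats[np.arange(T), targets]))
J_bits = -np.mean(np.log2(y_hats[np.arange(T), targets]))

print(2.0 ** J_bits, np.exp(J_nats))         # identical perplexity either way
```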
Vanishing gradient problem
$$\begin{aligned} \frac{\partial E_t}{\partial W} &= \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W} \\ \frac{\partial h_t}{\partial h_k} &= \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \end{aligned}$$
When we compute the derivative $\frac{\partial h_j}{\partial h_{j-1}}$, note that the sigmoid activation has range $(0,1)$, so its derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is at most $1/4$. The chain rule therefore multiplies many such factors smaller than one, so the product shrinks exponentially and the gradient vanishes.
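A small numpy demonstration of this effect: iterating the hidden state and accumulating the Jacobian product $\prod_j \mathrm{diag}(\sigma'(a_j))\, W^{(hh)}$ shows its norm collapsing. Weights and sizes are toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_hh = rng.normal(scale=0.5, size=(d, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Iterate the hidden state alone (dropping the input term only changes the
# pre-activations a_j, not the Jacobian structure) and accumulate
# prod_j dh_j/dh_{j-1}, where dh_j/dh_{j-1} = diag(sigma'(a_j)) @ W_hh
# and sigma'(a) = s(1 - s) <= 1/4.
h = rng.normal(size=d)
jac = np.eye(d)
for t in range(1, 31):
    a = W_hh @ h
    h = sigmoid(a)
    jac = (np.diag(h * (1.0 - h)) @ W_hh) @ jac
    if t % 10 == 0:
        print(t, np.linalg.norm(jac))        # norm shrinks roughly geometrically
```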
Trick for exploding gradients: clip the gradient back to a fixed norm whenever it exceeds a threshold:
$$\hat{g} \leftarrow \frac{\text{threshold}}{\lVert \hat{g} \rVert}\, \hat{g} \qquad \text{when } \lVert \hat{g} \rVert > \text{threshold}$$
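A sketch of this clipping rule (the threshold value here is an arbitrary illustration):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g to have norm `threshold` whenever its norm exceeds it;
    gradients already inside the ball are left untouched."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([30.0, 40.0])                   # ||g|| = 50
print(clip_gradient(g))                      # [3. 4.], rescaled to norm 5
```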
The softmax over the full vocabulary is too slow; factor it through word classes:
$$\begin{aligned} p(w_t \mid \text{history}) &= p(c_t \mid \text{history})\, p(w_t \mid c_t) \\ &= p(c_t \mid h_t)\, p(w_t \mid c_t) \end{aligned}$$
More classes -> better perplexity but worse speed.
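A sketch of the class-factored softmax, with a made-up fixed word-to-class assignment and toy sizes; it checks that the factorization still yields a proper distribution over the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(3)
V, C, d = 12, 3, 4                           # vocab size, #classes, hidden size (toy)
word_class = np.arange(V) % C                # fixed word -> class map (illustrative)

W_c = rng.normal(scale=0.1, size=(C, d))     # class scores from h_t
W_w = rng.normal(scale=0.1, size=(V, d))     # within-class word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_word(h_t, w):
    """p(w_t | h_t) = p(c_t | h_t) * p(w_t | c_t): two small softmaxes
    (over C classes, then over the words in one class) instead of one over |V|."""
    c = word_class[w]
    p_class = softmax(W_c @ h_t)[c]
    members = np.flatnonzero(word_class == c)            # words belonging to class c
    p_in_class = softmax(W_w[members] @ h_t)[np.where(members == w)[0][0]]
    return p_class * p_in_class

h = rng.normal(size=d)
print(sum(p_word(h, w) for w in range(V)))   # ~1.0: still a valid distribution
```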
Bidirectional RNNs
$$\begin{aligned} h^{(1)}_t &= f(W^{(1)} x_t + V^{(1)} h^{(1)}_{t-1} + b^{(1)}) \\ h^{(2)}_t &= f(W^{(2)} x_t + V^{(2)} h^{(2)}_{t+1} + b^{(2)}) \\ y_t &= g(U[h^{(1)}_t; h^{(2)}_t] + c) \end{aligned}$$
Here the forward states $h^{(1)}_t$ read the sequence left to right, while the backward states $h^{(2)}_t$ read it right to left, which is why they recur on $h^{(2)}_{t+1}$.
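A numpy sketch of the two passes, assuming tanh for $f$ and taking $g$ as the identity for simplicity (all sizes and names are toy choices):

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_in, d_h = 6, 3, 4                       # sequence length, input dim, hidden dim
xs = rng.normal(size=(T, d_in))

def init(shape):
    return rng.normal(scale=0.1, size=shape)

W1, V1, b1 = init((d_h, d_in)), init((d_h, d_h)), np.zeros(d_h)   # forward params
W2, V2, b2 = init((d_h, d_in)), init((d_h, d_h)), np.zeros(d_h)   # backward params
U, c = init((2, 2 * d_h)), np.zeros(2)                            # 2 output classes (toy)

f = np.tanh

# Forward pass: h1_t depends on the past; backward pass: h2_t depends on the future.
h1 = np.zeros((T, d_h))
h2 = np.zeros((T, d_h))
for t in range(T):
    prev = h1[t - 1] if t > 0 else np.zeros(d_h)
    h1[t] = f(W1 @ xs[t] + V1 @ prev + b1)
for t in reversed(range(T)):
    nxt = h2[t + 1] if t < T - 1 else np.zeros(d_h)
    h2[t] = f(W2 @ xs[t] + V2 @ nxt + b2)

# Each y_t sees both left and right context via the concatenation [h1_t; h2_t].
ys = np.array([U @ np.concatenate([h1[t], h2[t]]) + c for t in range(T)])
print(ys.shape)                              # (T, 2)
```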
Stacking several BiRNN layers gives deep bidirectional RNNs.