Language Model
probability of a sequence of words
- P(w1, w2, …, wT)
Useful for machine translation:
word ordering
- p(the cat is small) > p(small the is cat)
word choice
- p(walking home after school) > p(walking house after school)
Traditional Language Model
Conditional probability, with window size = n
assumption
$P(w_1, w_2, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$
n-gram
- bigram: $P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$
- trigram: $P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}$
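These count-based estimates can be sketched with a toy corpus (the sentences and counts here are illustrative, not from the notes):

```python
from collections import Counter

# A tiny toy corpus; "." is treated as an ordinary token.
corpus = "the cat is small . the cat is black . the dog is small .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_bigram(w1, w2):
    # p(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

def p_trigram(w1, w2, w3):
    # p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# 2 of the 3 occurrences of "the" are followed by "cat"
print(p_bigram("the", "cat"))
print(p_trigram("the", "cat", "is"))
```
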
n-gram models consume a lot of memory
RNN
- weights are shared across all time steps
- each prediction is conditioned on all previous words
- memory requirement scales only with the number of words
$h_t = \sigma(W^{hh} h_{t-1} + W^{hx} x_t)$
$\hat{y}_t = \text{softmax}(W^{S} h_t)$
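A minimal forward pass matching these two equations, sketched in NumPy (the sizes, initialization scale, and the sigmoid choice for σ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 4  # vocabulary size and hidden size (toy values)

# Weight names follow the equations: W^{hh}, W^{hx}, W^{S}.
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hx = rng.normal(scale=0.1, size=(H, V))
W_s = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t), y_t = softmax(W_s h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    y_t = softmax(W_s @ h_t)
    return h_t, y_t

h = np.zeros(H)
for word_id in [3, 1, 7]:  # an arbitrary word-id sequence
    x = np.zeros(V)
    x[word_id] = 1.0  # one-hot input vector
    h, y_hat = rnn_step(h, x)

print(y_hat.shape, y_hat.sum())  # a distribution over the vocabulary
```
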
Training RNNs is hard
vanishing / exploding gradient problem
total error
$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k}{\partial W}$
where
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$
Since $h_j = f(W^{hh} h_{j-1} + W^{hx} x_j)$, each factor is the Jacobian
$\frac{\partial h_j}{\partial h_{j-1}} = W^\top \text{diag}(f'(h_{j-1}))$
hence
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W^\top \text{diag}(f'(h_{j-1}))$
Bounding the norms with $\|W^\top\| \le \beta_W$ and $\|\text{diag}(f'(h_{j-1}))\| \le \beta_h$:
$\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \le \|W^\top\| \cdot \|\text{diag}(f'(h_{j-1}))\| \le \beta_W \beta_h$
$\left\|\frac{\partial h_t}{\partial h_k}\right\| = \left\|\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right\| \le (\beta_W \beta_h)^{t-k}$
which can become very large or very small very quickly.
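The bound above can be observed numerically: multiplying Jacobians $W^\top \text{diag}(f'(h))$ over many steps shrinks or explodes depending on the weight scale. A sketch with an assumed hidden size, random states, and a sigmoid nonlinearity:

```python
import numpy as np

H = 50  # hidden size (an assumed toy value)

def grad_norm_after(steps, scale, seed=1):
    """Norm of prod_j W^T diag(f'(h_{j-1})) for a sigmoid RNN with random states."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale / np.sqrt(H), size=(H, H))
    J = np.eye(H)
    for _ in range(steps):
        h = rng.uniform(-1, 1, H)
        s = 1.0 / (1.0 + np.exp(-h))   # sigmoid(h)
        J = J @ (W.T * (s * (1 - s)))  # multiply by W^T diag(f'(h))
    return np.linalg.norm(J)

print(grad_norm_after(50, scale=1.0))   # vanishes: far below 1
print(grad_norm_after(50, scale=10.0))  # explodes: far above 1
```

Since sigmoid' is at most 1/4, small weights drive the product toward zero, while large weights make it blow up, matching the $(\beta_W \beta_h)^{t-k}$ behavior.
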
The vanishing gradient problem means that words many steps back have almost no influence on the current training step
exploding gradient -> clip the gradient
vanishing gradient -> Initialization + ReLus
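Gradient clipping, the fix for exploding gradients noted above, can be sketched as simple norm rescaling (the threshold value here is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])  # norm 50, well past the threshold
print(clip_gradient(g))     # rescaled to norm 5: [3. 4.]
```

The direction of the gradient is preserved; only its magnitude is capped.
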
softmax over the full vocabulary is huge and slow
- class-based trick: factor the softmax through word classes
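One version of the class-based trick factors the softmax as $p(w \mid h) = p(\text{class}(w) \mid h) \cdot p(w \mid \text{class}(w), h)$, so only $|C| + |V|/|C|$ scores are needed per prediction instead of $|V|$. A sketch with assumed toy sizes and a fixed word-to-class map:

```python
import numpy as np

rng = np.random.default_rng(2)
H, V, C = 8, 12, 3  # hidden size, vocab size, number of classes (toy values)
word_class = np.arange(V) // (V // C)  # a fixed, arbitrary word -> class map

W_class = rng.normal(size=(C, H))  # scores the classes
W_word = rng.normal(size=(V, H))   # scores words within their class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_prob(h, w):
    """p(w|h) = p(class(w)|h) * p(w | class(w), h)."""
    c = word_class[w]
    p_class = softmax(W_class @ h)[c]
    members = np.where(word_class == c)[0]  # words sharing w's class
    p_word = softmax(W_word[members] @ h)[list(members).index(w)]
    return p_class * p_word

h = rng.normal(size=H)
total = sum(class_based_prob(h, w) for w in range(V))
print(total)  # the factored probabilities still sum to 1 over the vocabulary
```
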
Bidirectional RNN
- both the preceding and the following words influence the current prediction
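The bidirectional idea can be sketched as two independent passes, left-to-right and right-to-left, whose hidden states are concatenated per position (sizes, initialization, and the tanh nonlinearity are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, T = 4, 5, 6  # hidden size, input dim, sequence length (toy values)
xs = rng.normal(size=(T, D))  # a random input sequence

# Separate weights for the forward and backward directions.
Wf = rng.normal(scale=0.3, size=(H, H)); Uf = rng.normal(scale=0.3, size=(H, D))
Wb = rng.normal(scale=0.3, size=(H, H)); Ub = rng.normal(scale=0.3, size=(H, D))

def run(inputs, W, U, order):
    """Run a simple tanh RNN over inputs in the given time order."""
    h = np.zeros(H)
    out = [None] * T
    for t in order:
        h = np.tanh(W @ h + U @ inputs[t])
        out[t] = h
    return out

fwd = run(xs, Wf, Uf, range(T))            # left-to-right pass
bwd = run(xs, Wb, Ub, reversed(range(T)))  # right-to-left pass

# The representation of position t sees both its left and right context.
h_t = np.concatenate([fwd[2], bwd[2]])
print(h_t.shape)  # (2*H,)
```
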
Deep bidirectional RNN
F1 measure
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2(precision · recall)/(precision + recall)
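A direct translation of the three formulas into code, with an illustrative confusion-count example:

```python
def f1_score(tp, fp, fn):
    """F1: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # tp / (tp + fp)
    recall = tp / (tp + fn)     # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 8/12, F1 = 8/11
print(f1_score(8, 2, 4))
```
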