Language Model
probability of a sequence of words
- P(w1, w2, …, wT)
Useful for machine translation:
word ordering
- p(the cat is small) > p(small the is cat)
word choice
- p(walking home after school) > p(walking house after school)
Traditional Language Model
Conditional probability, with window size = n
assumption
$P(w_1, w_2, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$
n-gram
- bigram: $P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$
- trigram: $P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}$
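These count-based estimates can be sketched with a toy corpus (the sentences and counts here are illustrative, not from the notes):

```python
from collections import Counter

# A tiny toy corpus; "." is treated as an ordinary token.
corpus = "the cat is small . the cat is black . the dog is small .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_bigram(w1, w2):
    # p(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

def p_trigram(w1, w2, w3):
    # p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# 2 of the 3 occurrences of "the" are followed by "cat"
print(p_bigram("the", "cat"))
print(p_trigram("the", "cat", "is"))
```
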
n-gram models consume a lot of memory
RNN
- weights are shared across all time steps
- each prediction is conditioned on all previous words
- memory requirement scales only with the number of words
$h_t = \sigma(W^{hh} h_{t-1} + W^{hx} x_t)$
$\hat{y}_t = \text{softmax}(W^{S} h_t)$
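A minimal forward pass matching these two equations, sketched in NumPy (the sizes, initialization scale, and the sigmoid choice for σ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 4  # vocabulary size and hidden size (toy values)

# Weight names follow the equations: W^{hh}, W^{hx}, W^{S}.
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hx = rng.normal(scale=0.1, size=(H, V))
W_s = rng.normal(scale=0.1, size=(V, H))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One step: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t), y_t = softmax(W_s h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    y_t = softmax(W_s @ h_t)
    return h_t, y_t

h = np.zeros(H)
for word_id in [3, 1, 7]:  # an arbitrary word-id sequence
    x = np.zeros(V)
    x[word_id] = 1.0  # one-hot input vector
    h, y_hat = rnn_step(h, x)

print(y_hat.shape, y_hat.sum())  # a distribution over the vocabulary
```
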
Training RNNs is hard
vanishing / exploding gradient problem
total error
$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k}{\partial W}$
where
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$
Since $h_j = f(W^{hh} h_{j-1} + W^{hx} x_j)$, each factor is the Jacobian
$\frac{\partial h_j}{\partial h_{j-1}} = W^\top \text{diag}(f'(h_{j-1}))$
hence
$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W^\top \text{diag}(f'(h_{j-1}))$
Bounding the norms with $\|W^\top\| \le \beta_W$ and $\|\text{diag}(f'(h_{j-1}))\| \le \beta_h$:
$\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \le \|W^\top\| \cdot \|\text{diag}(f'(h_{j-1}))\| \le \beta_W \beta_h$
$\left\|\frac{\partial h_t}{\partial h_k}\right\| = \left\|\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right\| \le (\beta_W \beta_h)^{t-k}$
which can become very large or very small very quickly.
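The bound above can be observed numerically: multiplying Jacobians $W^\top \text{diag}(f'(h))$ over many steps shrinks or explodes depending on the weight scale. A sketch with an assumed hidden size, random states, and a sigmoid nonlinearity:

```python
import numpy as np

H = 50  # hidden size (an assumed toy value)

def grad_norm_after(steps, scale, seed=1):
    """Norm of prod_j W^T diag(f'(h_{j-1})) for a sigmoid RNN with random states."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale / np.sqrt(H), size=(H, H))
    J = np.eye(H)
    for _ in range(steps):
        h = rng.uniform(-1, 1, H)
        s = 1.0 / (1.0 + np.exp(-h))   # sigmoid(h)
        J = J @ (W.T * (s * (1 - s)))  # multiply by W^T diag(f'(h))
    return np.linalg.norm(J)

print(grad_norm_after(50, scale=1.0))   # vanishes: far below 1
print(grad_norm_after(50, scale=10.0))  # explodes: far above 1
```

Since sigmoid' is at most 1/4, small weights drive the product toward zero, while large weights make it blow up, matching the $(\beta_W \beta_h)^{t-k}$ behavior.
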
The vanishing gradient problem means that words many steps back have almost no influence on the current training step
exploding gradient -> clip the gradient
vanishing gradient -> Initialization + ReLus
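Gradient clipping, the fix for exploding gradients noted above, can be sketched as simple norm rescaling (the threshold value here is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])  # norm 50, well past the threshold
print(clip_gradient(g))     # rescaled to norm 5: [3. 4.]
```

The direction of the gradient is preserved; only its magnitude is capped.
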
softmax over the full vocabulary is huge and slow
- class-based trick: factor the softmax through word classes
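One version of the class-based trick factors the softmax as $p(w \mid h) = p(\text{class}(w) \mid h) \cdot p(w \mid \text{class}(w), h)$, so only $|C| + |V|/|C|$ scores are needed per prediction instead of $|V|$. A sketch with assumed toy sizes and a fixed word-to-class map:

```python
import numpy as np

rng = np.random.default_rng(2)
H, V, C = 8, 12, 3  # hidden size, vocab size, number of classes (toy values)
word_class = np.arange(V) // (V // C)  # a fixed, arbitrary word -> class map

W_class = rng.normal(size=(C, H))  # scores the classes
W_word = rng.normal(size=(V, H))   # scores words within their class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_prob(h, w):
    """p(w|h) = p(class(w)|h) * p(w | class(w), h)."""
    c = word_class[w]
    p_class = softmax(W_class @ h)[c]
    members = np.where(word_class == c)[0]  # words sharing w's class
    p_word = softmax(W_word[members] @ h)[list(members).index(w)]
    return p_class * p_word

h = rng.normal(size=H)
total = sum(class_based_prob(h, w) for w in range(V))
print(total)  # the factored probabilities still sum to 1 over the vocabulary
```
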
Bidirectional RNN
- both the preceding and the following words influence the current prediction
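The bidirectional idea can be sketched as two independent passes, left-to-right and right-to-left, whose hidden states are concatenated per position (sizes, initialization, and the tanh nonlinearity are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, T = 4, 5, 6  # hidden size, input dim, sequence length (toy values)
xs = rng.normal(size=(T, D))  # a random input sequence

# Separate weights for the forward and backward directions.
Wf = rng.normal(scale=0.3, size=(H, H)); Uf = rng.normal(scale=0.3, size=(H, D))
Wb = rng.normal(scale=0.3, size=(H, H)); Ub = rng.normal(scale=0.3, size=(H, D))

def run(inputs, W, U, order):
    """Run a simple tanh RNN over inputs in the given time order."""
    h = np.zeros(H)
    out = [None] * T
    for t in order:
        h = np.tanh(W @ h + U @ inputs[t])
        out[t] = h
    return out

fwd = run(xs, Wf, Uf, range(T))            # left-to-right pass
bwd = run(xs, Wb, Ub, reversed(range(T)))  # right-to-left pass

# The representation of position t sees both its left and right context.
h_t = np.concatenate([fwd[2], bwd[2]])
print(h_t.shape)  # (2*H,)
```
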
Deep bidirectional RNN
F1 measure
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2(precision · recall)/(precision + recall)
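A direct translation of the three formulas into code, with an illustrative confusion-count example:

```python
def f1_score(tp, fp, fn):
    """F1: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # tp / (tp + fp)
    recall = tp / (tp + fn)     # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 8/12, F1 = 8/11
print(f1_score(8, 2, 4))
```
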