cs224n - Language Model
Language Model
Language Modeling is the task of predicting what word comes next. More formally: given a sequence of words \(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(t)}\), compute the probability distribution of the next word \(\boldsymbol{x}^{(t+1)}\) :
\[ P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right) \]
You can also think of a Language Model as a system that assigns probability to a piece of text. For example, if we have some text \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\), then the
probability of this text (according to the Language Model) is:
\[ \begin{aligned} P\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\right) &=P\left(\boldsymbol{x}^{(1)}\right) \times P\left(\boldsymbol{x}^{(2)} | \boldsymbol{x}^{(1)}\right) \times \cdots \times P\left(\boldsymbol{x}^{(T)} | \boldsymbol{x}^{(T-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \\ &=\prod_{t=1}^{T} P\left(\boldsymbol{x}^{(t)} | \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \end{aligned} \]
n-gram Language Models
Question: How to learn a Language Model?
Answer: learn an n-gram Language Model!
Definition: An n-gram is a chunk of n consecutive words.
Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
First we make a simplifying assumption: \(\boldsymbol{x}^{(t+1)}\) depends only on the preceding n-1 words. The conditional probability is:
\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)=P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)=\frac{{P\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}}{{P\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}} \]
Question: How do we get these n-gram and (n-1)-gram probabilities?
Answer: By counting them in some large corpus of text!
\[ \approx \frac{\operatorname{count}\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}{\operatorname{count}\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)} \]
Sparsity Problems with n-gram Language Models:
- What if “students opened their w” never occurred in the data? Then w has probability 0! Add a small \(\delta\) to the count for every \(w \in V\). This is called smoothing.
- What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w! Just condition on “opened their” instead. This is called backoff. (A count-based sketch of both fixes follows below.)
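Below is a minimal count-based sketch of these fixes, assuming a trigram model with add-\(\delta\) smoothing and a crude backoff to shorter contexts; the function names and the toy corpus are illustrative, not from the course.

```python
from collections import Counter

def build_counts(corpus, n):
    """Count all k-grams for k = 1..n in a tokenized corpus (a list of words)."""
    counts = Counter()
    for k in range(1, n + 1):
        counts.update(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
    return counts

def next_word_prob(word, context, counts, vocab, total_words, delta=0.1):
    """P(word | context) from counts, with add-delta smoothing and a simple backoff:
    if the context itself was never seen, condition on a shorter context instead."""
    context = tuple(context)
    while context and counts[context] == 0:
        context = context[1:]                 # backoff: drop the earliest context word
    denom = counts[context] if context else total_words
    return (counts[context + (word,)] + delta) / (denom + delta * len(vocab))

corpus = "the students opened their books and the students opened their laptops".split()
counts = build_counts(corpus, n=3)            # trigram model: condition on 2 previous words
print(next_word_prob("books", ("opened", "their"), counts, set(corpus), len(corpus)))
```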
A fixed-window neural Language Model
Advantages:
- No sparsity problem
- Don’t need to store all observed n-grams
Problems:
- Fixed window is too small;
- Enlarging the window enlarges W;
- Window can never be large enough!
- \(x^{(1)}\) and \(x^{(2)}\) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed (illustrated in the sketch after this list).
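The fixed-window model concatenates the embeddings of the window words, passes them through one hidden layer, and predicts a distribution over the next word. Here is a rough PyTorch sketch; the class name, sizes, and the tanh nonlinearity are my own assumptions, not prescribed by the notes.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of words, concatenate the embeddings, pass them
    through one hidden layer, and output logits over the next word."""
    def __init__(self, vocab_size, embed_dim=128, window=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # W grows as the window grows
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):              # window_ids: (batch, window) word indices
        e = self.embed(window_ids).flatten(1)   # (batch, window * embed_dim)
        h = torch.tanh(self.hidden(e))          # each window position hits a different slice of W
        return self.output(h)                   # logits over the next word

logits = FixedWindowLM(vocab_size=10_000)(torch.randint(10_000, (8, 4)))
print(logits.shape)                             # torch.Size([8, 10000])
```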
Note: We need a neural architecture that can process input of any length.
Recurrent Neural Networks (RNNs)
Core idea: Apply the same weights W repeatedly
Advantages:
- Can process input of any length;
- Computation for step t can (in theory) use information from many steps back;
- Model size doesn’t increase for longer input;
- The same weights are applied on every timestep, so there is symmetry in how inputs are processed.
Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
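To make the core idea (the same weights applied at every timestep) concrete, here is a minimal numpy sketch of the hidden-state recurrence \(h^{(t)}=\tanh\left(W_{h} h^{(t-1)}+W_{e} e^{(t)}+b\right)\); the dimensions and variable names are illustrative assumptions.

```python
import numpy as np

def rnn_forward(embeddings, W_h, W_e, b, h0):
    """Apply the same weights (W_h, W_e, b) at every timestep:
    h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b)."""
    h, hidden_states = h0, []
    for e_t in embeddings:                         # a sequence of word embeddings, any length
        h = np.tanh(W_h @ h + W_e @ e_t + b)       # same W_h and W_e reused at every step
        hidden_states.append(h)
    return hidden_states                           # one hidden state per input word

# Toy dimensions (hidden size 3, embedding size 2), purely illustrative.
rng = np.random.default_rng(0)
W_h, W_e, b = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), np.zeros(3)
states = rnn_forward([rng.normal(size=2) for _ in range(5)], W_h, W_e, b, np.zeros(3))
print(len(states))                                 # 5: model size did not grow with input length
```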
Training an RNN Language Model
- Get a big corpus of text, which is a sequence of words \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\);
- Feed into RNN-LM; compute output distribution \(\hat{\boldsymbol{y}}^{(t)}\) for every step t;
- The loss function on step t is the cross-entropy between the predicted probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) and the true next word \(\boldsymbol{y}^{(t)}\) (the one-hot vector for \(\boldsymbol{x}^{(t+1)}\)):
\[ J^{(t)}(\theta)=C E\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)=-\sum_{w \in V} \boldsymbol{y}_{w}^{(t)} \log \hat{\boldsymbol{y}}_{w}^{(t)}=-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
Average this to get the overall loss for the entire training set:
\[ J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)=\frac{1}{T} \sum_{t=1}^{T}-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
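A rough PyTorch sketch of this objective, using the built-in nn.RNN and nn.CrossEntropyLoss (which computes exactly the average of \(-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)}\) over positions); the vocabulary size, dimensions, and random batch are made-up placeholders.

```python
import torch
import torch.nn as nn

# A hypothetical RNN language model: embed the words, run an RNN over them,
# and project every hidden state to logits over the vocabulary.
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
proj = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()                      # mean of -log p(true next word) over steps

x = torch.randint(vocab_size, (8, 35))               # a batch of word-id sequences
inputs, targets = x[:, :-1], x[:, 1:]                # predict x^(t+1) from x^(1..t)

hidden, _ = rnn(embed(inputs))                       # (batch, T, hidden_dim)
logits = proj(hidden)                                # y-hat^(t) for every step t
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))   # J(theta)
loss.backward()                                      # backpropagation through time
```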
Backpropagation for RNNs
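The key fact to record here is the standard backpropagation-through-time argument: because the same weight matrix is used at every timestep, the gradient of the loss with respect to that repeated weight (written \(\boldsymbol{W}_{h}\) below, a symbol not defined elsewhere in these notes) is the sum of its gradient at each timestep where it appears:
\[ \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}=\sum_{i=1}^{t}\left.\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}\right|_{(i)} \]
In practice this is computed by accumulating the gradients as you backpropagate through the timesteps.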
Evaluating Language Models
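The standard evaluation metric for language models is perplexity: the inverse probability of the corpus, normalized by the number of words (writing \(P_{\mathrm{LM}}\) for the probability the model assigns). It is equal to the exponential of the cross-entropy loss \(J(\theta)\) defined above, so lower is better:
\[ \text{perplexity}=\prod_{t=1}^{T}\left(\frac{1}{P_{\mathrm{LM}}\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)}\right)^{1 / T}=\exp (J(\theta)) \]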
Why should we care about Language Modeling?
- Language Modeling is a benchmark task that helps us measure our progress on understanding language
- Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text