cs224n - Language Model
Language Model
Language Modeling is the task of predicting what word comes next. More formally: given a sequence of words \(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(t)}\), compute the probability distribution of the next word \(\boldsymbol{x}^{(t+1)}\) :
\[ P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right) \]
You can also think of a Language Model as a system that assigns probability to a piece of text. For example, if we have some text \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\), then the
probability of this text (according to the Language Model) is:
\[ \begin{aligned} P\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\right) &=P\left(\boldsymbol{x}^{(1)}\right) \times P\left(\boldsymbol{x}^{(2)} | \boldsymbol{x}^{(1)}\right) \times \cdots \times P\left(\boldsymbol{x}^{(T)} | \boldsymbol{x}^{(T-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \\ &=\prod_{t=1}^{T} P\left(\boldsymbol{x}^{(t)} | \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \end{aligned} \]
n-gram Language Models
Question: How to learn a Language Model?
Answer: learn an n-gram Language Model!
Definition: An n-gram is a chunk of n consecutive words.
Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
First we make a simplifying assumption: \(\boldsymbol{x}^{(t+1)}\) depends only on the preceding n-1 words. The conditional probability is:
\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)=P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)=\frac{{P\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}}{{P\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}} \]
Question: How do we get these n-gram and (n-1)-gram probabilities?
Answer: By counting them in some large corpus of text!
\[ \approx \frac{\operatorname{count}\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}{\operatorname{count}\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)} \]
Sparsity Problems with n-gram Language Models:
- What if “students opened their w” never occurred in the data? Then w has probability 0! Add a small \(\delta\) to the count for every \(w \in V\). This is called smoothing.
- What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w! Just condition on “opened their” instead. This is called backoff. (A count-based sketch of both fixes follows below.)
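Below is a minimal count-based sketch of these fixes, assuming a trigram model with add-\(\delta\) smoothing and a crude backoff to shorter contexts; the function names and the toy corpus are illustrative, not from the course.

```python
from collections import Counter

def build_counts(corpus, n):
    """Count all k-grams for k = 1..n in a tokenized corpus (a list of words)."""
    counts = Counter()
    for k in range(1, n + 1):
        counts.update(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
    return counts

def next_word_prob(word, context, counts, vocab, total_words, delta=0.1):
    """P(word | context) from counts, with add-delta smoothing and a simple backoff:
    if the context itself was never seen, condition on a shorter context instead."""
    context = tuple(context)
    while context and counts[context] == 0:
        context = context[1:]                 # backoff: drop the earliest context word
    denom = counts[context] if context else total_words
    return (counts[context + (word,)] + delta) / (denom + delta * len(vocab))

corpus = "the students opened their books and the students opened their laptops".split()
counts = build_counts(corpus, n=3)            # trigram model: condition on 2 previous words
print(next_word_prob("books", ("opened", "their"), counts, set(corpus), len(corpus)))
```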
A fixed-window neural Language Model
Advantages:
- No sparsity problem
- Don’t need to store all observed n-grams
Problems:
- Fixed window is too small;
- Enlarging the window enlarges W;
- Window can never be large enough!
- \(x^{(1)}\) and \(x^{(2)}\) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed (illustrated in the sketch after this list).
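The fixed-window model concatenates the embeddings of the window words, passes them through one hidden layer, and predicts a distribution over the next word. Here is a rough PyTorch sketch; the class name, sizes, and the tanh nonlinearity are my own assumptions, not prescribed by the notes.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of words, concatenate the embeddings, pass them
    through one hidden layer, and output logits over the next word."""
    def __init__(self, vocab_size, embed_dim=128, window=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # W grows as the window grows
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):              # window_ids: (batch, window) word indices
        e = self.embed(window_ids).flatten(1)   # (batch, window * embed_dim)
        h = torch.tanh(self.hidden(e))          # each window position hits a different slice of W
        return self.output(h)                   # logits over the next word

logits = FixedWindowLM(vocab_size=10_000)(torch.randint(10_000, (8, 4)))
print(logits.shape)                             # torch.Size([8, 10000])
```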
Note: We need a neural architecture that can process input of any length.
Recurrent Neural Networks (RNNs)
Core idea: Apply the same weights W repeatedly
Advantages:
- Can process input of any length;
- Computation for step t can (in theory) use information from many steps back;
- Model size doesn’t increase for longer input;
- The same weights are applied on every timestep, so there is symmetry in how inputs are processed.
Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
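To make the core idea (the same weights applied at every timestep) concrete, here is a minimal numpy sketch of the hidden-state recurrence \(h^{(t)}=\tanh\left(W_{h} h^{(t-1)}+W_{e} e^{(t)}+b\right)\); the dimensions and variable names are illustrative assumptions.

```python
import numpy as np

def rnn_forward(embeddings, W_h, W_e, b, h0):
    """Apply the same weights (W_h, W_e, b) at every timestep:
    h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b)."""
    h, hidden_states = h0, []
    for e_t in embeddings:                         # a sequence of word embeddings, any length
        h = np.tanh(W_h @ h + W_e @ e_t + b)       # same W_h and W_e reused at every step
        hidden_states.append(h)
    return hidden_states                           # one hidden state per input word

# Toy dimensions (hidden size 3, embedding size 2), purely illustrative.
rng = np.random.default_rng(0)
W_h, W_e, b = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), np.zeros(3)
states = rnn_forward([rng.normal(size=2) for _ in range(5)], W_h, W_e, b, np.zeros(3))
print(len(states))                                 # 5: model size did not grow with input length
```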
Training an RNN Language Model
- Get a big corpus of text, which is a sequence of words \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\);
- Feed into RNN-LM; compute output distribution \(\hat{\boldsymbol{y}}^{(t)}\) for every step t;
- The loss function on step t is the cross-entropy between the predicted probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) and the true next word \(\boldsymbol{y}^{(t)}\) (the one-hot vector for \(\boldsymbol{x}^{(t+1)}\)):
\[ J^{(t)}(\theta)=C E\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)=-\sum_{w \in V} \boldsymbol{y}_{w}^{(t)} \log \hat{\boldsymbol{y}}_{w}^{(t)}=-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
Average this to get the overall loss for the entire training set:
\[ J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)=\frac{1}{T} \sum_{t=1}^{T}-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
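A rough PyTorch sketch of this objective, using the built-in nn.RNN and nn.CrossEntropyLoss (which computes exactly the average of \(-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)}\) over positions); the vocabulary size, dimensions, and random batch are made-up placeholders.

```python
import torch
import torch.nn as nn

# A hypothetical RNN language model: embed the words, run an RNN over them,
# and project every hidden state to logits over the vocabulary.
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
proj = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()                      # mean of -log p(true next word) over steps

x = torch.randint(vocab_size, (8, 35))               # a batch of word-id sequences
inputs, targets = x[:, :-1], x[:, 1:]                # predict x^(t+1) from x^(1..t)

hidden, _ = rnn(embed(inputs))                       # (batch, T, hidden_dim)
logits = proj(hidden)                                # y-hat^(t) for every step t
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))   # J(theta)
loss.backward()                                      # backpropagation through time
```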
Backpropagation for RNNs
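The key fact to record here is the standard backpropagation-through-time argument: because the same weight matrix is used at every timestep, the gradient of the loss with respect to that repeated weight (written \(\boldsymbol{W}_{h}\) below, a symbol not defined elsewhere in these notes) is the sum of its gradient at each timestep where it appears:
\[ \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}=\sum_{i=1}^{t}\left.\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}\right|_{(i)} \]
In practice this is computed by accumulating the gradients as you backpropagate through the timesteps.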
Evaluating Language Models
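The standard evaluation metric for language models is perplexity: the inverse probability of the corpus, normalized by the number of words (writing \(P_{\mathrm{LM}}\) for the probability the model assigns). It is equal to the exponential of the cross-entropy loss \(J(\theta)\) defined above, so lower is better:
\[ \text{perplexity}=\prod_{t=1}^{T}\left(\frac{1}{P_{\mathrm{LM}}\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)}\right)^{1 / T}=\exp (J(\theta)) \]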
Why should we care about Language Modeling?
- Language Modeling is a benchmark task that helps us measure our progress on understanding language
- Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text