CS224N Lecture Note (1)

Word Representation

One-hot vector

Denotational semantics: the concept of representing an idea as a symbol (a word or a one-hot vector). This representation is sparse and cannot capture similarity between words.
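
As a tiny illustration (the toy vocabulary below is an assumption for this sketch), one-hot vectors are mutually orthogonal, so their dot product carries no information about how related two words are:

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a |V|-dimensional one-hot vector.
vocab = ["hotel", "motel", "cat"]
V = len(vocab)
one_hot = {w: np.eye(V)[i] for i, w in enumerate(vocab)}

# "hotel" and "motel" are related words, but their one-hot vectors are orthogonal:
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0 -- no similarity is captured
```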

SVD Based Methods

  1. Loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix $X$
  2. Perform SVD on $X$ to get a $USV^T$ decomposition
  3. Use the rows of $U$ as the word embeddings for all words in our dictionary (see the sketch below)
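
A minimal numpy sketch of steps 2-3, assuming a co-occurrence matrix `X` (built as described below) is already available; the function name and the optional scaling by the singular values are illustrative choices:

```python
import numpy as np

def svd_embeddings(X, k):
    """Return k-dimensional word embeddings from a |V| x M co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Row i of U (truncated to the first k singular directions) is the embedding of word i.
    return U[:, :k] * S[:k]  # scaling by the singular values is optional

# Usage with toy numbers:
X = np.random.rand(5, 7)
embeddings = svd_embeddings(X, k=2)  # shape (5, 2)
```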

The following are a few choices of $X$:

Word-Document Matrix

Distributional semantics: The concept of representing the meaning of a word based on the context in which it usually appears.

Basic conjecture:

words that are related will often appear in the same documents

Building manner:
  • Loop over billions of documents and, each time word $i$ appears in document $j$, add one to entry $X_{ij}$.
Shortcomings
  • the matrix is very large ($X\in\mathbb{R}^{|V|\times M}$)
  • it scales with the number of documents ($M$)

Window based Co-occurrence Matrix

Similar to the Word-Document Matrix.

Building manner
  • the matrix $X$ stores word-word co-occurrence counts, so it is an affinity matrix
  • count the number of times each word appears inside a window of a particular size around the word of interest
  • calculate this count for all the words in the corpus (see the sketch below)
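
A minimal sketch of building such a window-based co-occurrence matrix; the toy corpus and the window size are assumptions for illustration:

```python
import numpy as np
from itertools import chain

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # how many words on each side count as context

vocab = sorted(set(chain.from_iterable(corpus)))
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        # Add one count for every word within `window` positions of w (excluding w itself).
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1
```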

Applying PCA to the co-occurrence matrix

There is not much to say beyond standard PCA itself, but a few problems have to be mentioned.

Advantages

Makes efficient use of the corpus statistics.

Shortcomings
  • The dimensions of the matrix change very often (new words are added very frequently)
  • The matrix is extremely sparse since most words do not co-occur
  • The matrix is very high dimensional in general (the size of the vocabulary is large)
  • The imbalance in word frequency is drastic
Some solutions
  • Ignore function words (stop words)
  • Apply a ramp window (i.e. weight the co-occurrence count based on the distance between the words in the document)
  • Use Pearson correlation and set negative counts to 0 instead of using just the raw counts

Iteration Based Methods - Word2vec

Basic idea
  1. design a model whose parameters are the word vectors
  2. train the model on a certain objective

The rest is just standard backpropagation: in each iteration, run a forward pass, evaluate the error against the objective, and update the parameters that caused the error.

Iteration based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly.

Basic conjecture

This model is based on the hypothesis that similar words appear in similar contexts.

What is Word2vec

Word2vec is actually a software package that includes:

  • 2 algorithms:
    • CBOW: predict a center word from the surrounding context in terms of word vectors
    • skip-gram: predict the distribution ( probability ) of context words from a center word
  • 2 training methods
    • negative sampling: define an objective by sampling negative examples
    • hierarchical softmax: define an objective using an efficient tree structure to compute probabilities for all the vocabulary

Language Models (Unigrams, Bigrams, etc.)

Unigram model

take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent
$$P(w_1,w_2,\cdots,w_n)=\prod_{i=1}^n P(w_i)$$

Shortcomings

We know that the next word is highly contingent upon the previous sequence of words.

Bigram model

let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it.
$$P(w_1,w_2,\cdots,w_n)=\prod_{i=2}^n P(w_i\mid w_{i-1})$$
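
A small sketch of estimating both kinds of probabilities by counting (maximum-likelihood estimates; the toy corpus and helper names are assumptions):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat sat".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_unigram(w):
    return unigram[w] / N

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigram[(prev, w)] / unigram[prev]

print(p_unigram("cat"))        # 2/9
print(p_bigram("cat", "the"))  # 2/3 -- the bigram model captures some local dependence
```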

CBOW (Continuous Bag of Words Model)

For each word, we want to learn 2 vectors:

  • $v$: the input vector, used when the word is in the context
  • $u$: the output vector, used when the word is the center word
Parameters
  • Known parameters:

    • $x^{(c)}$: the input context, represented by one-hot word vectors
    • $y^{(c)}$: the output (since we only have one output, we just call it $y$; it is the one-hot vector of the known center word)
  • Unknown parameters:

    • $n$: an arbitrary size which defines the size of our embedding space
    • $\mathcal V\in\mathbb{R}^{n\times|V|}$: the input word matrix, such that the $i$-th column of $\mathcal V$ is the $n$-dimensional embedded vector for word $w_i$ when it is an input of the model. We denote this $n\times 1$ vector as $v_i$.
    • $\mathcal U\in\mathbb{R}^{|V|\times n}$: the output word matrix, such that the $j$-th row of $\mathcal U$ is the $n$-dimensional embedded vector for word $w_j$ when it is an output of the model. We denote this row of $\mathcal U$ as $u_j$.
Steps
  1. We generate the one-hot word vectors for the input context of size $m$: $(x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)})$, each vector $x\in\mathbb{R}^{|V|}$
  2. We get our embedded word vectors for the context: $(v_{c-m}=\mathcal V x^{(c-m)},\, v_{c-m+1}=\mathcal V x^{(c-m+1)},\,\cdots,\, v_{c+m}=\mathcal V x^{(c+m)})$, each vector $v\in\mathbb{R}^n$
  3. Average these vectors to get $\hat v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\in\mathbb{R}^n$
  4. Generate a score vector $z=\mathcal U\hat v\in\mathbb{R}^{|V|}$. Since the dot product of similar vectors is higher, this pushes similar words close to each other in order to achieve a high score
  5. Turn the scores into probabilities $\hat y=\mathrm{softmax}(z)\in\mathbb{R}^{|V|}$
  6. We want the generated probabilities $\hat y$ to match the true probabilities $y$, which is the one-hot vector of the actual center word
Train Method
  • Loss function: cross entropy
  • Optimization method: SGD

Since the output layer is a softmax, the objective is:
$$\begin{aligned} \text{minimize } J &= -\log P(w_c\mid w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m})\\ &= -\log P(u_c\mid \hat v)\\ &= -\log\frac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\\ &= -u_c^T\hat v+\log\sum_{j=1}^{|V|}\exp(u_j^T\hat v) \end{aligned}$$
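
A minimal numpy sketch of one CBOW forward pass and this loss, with shapes following the definitions above (the toy sizes, random initialization, and word indices are assumptions):

```python
import numpy as np

V_size, n, m = 10, 4, 2            # |V|, embedding size, window radius
Vmat = np.random.randn(n, V_size)  # input word matrix  (columns are the v_i)
Umat = np.random.randn(V_size, n)  # output word matrix (rows are the u_j)

context_ids = [1, 2, 4, 5]         # indices of the 2m context words
center_id = 3                      # index of the center word

v_hat = Vmat[:, context_ids].mean(axis=1)  # steps 2-3: average the context embeddings
z = Umat @ v_hat                           # step 4: scores, shape (|V|,)
y_hat = np.exp(z) / np.exp(z).sum()        # step 5: softmax
loss = -np.log(y_hat[center_id])           # cross entropy against the one-hot target
```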

Skip-Gram Model

Parameters
  • input: the one-hot vector of the center word, represented by $x$ (since there is only one)
  • output: $y^{(j)}$
  • $\mathcal V$ and $\mathcal U$ are the same as in CBOW
Steps
  1. We generate our one-hot input vector $x\in\mathbb{R}^{|V|}$ of the center word
  2. We get our embedded word vector for the center word $v_c=\mathcal V x\in\mathbb{R}^n$
  3. Generate a score vector $z=\mathcal U v_c$
  4. Turn the score vector into probabilities, $\hat y=\mathrm{softmax}(z)$. Note that $\hat y_{c-m},\cdots,\hat y_{c-1},\hat y_{c+1},\cdots,\hat y_{c+m}$ are the probabilities of observing each context word
  5. We want our generated probability vector to match the true probabilities $y^{(c-m)},\cdots,y^{(c-1)},y^{(c+1)},\cdots,y^{(c+m)}$, the one-hot vectors of the actual context words
Train Method
  • Optimization method: SGD
  • Loss function: cross entropy

We invoke a Naive Bayes assumption (the context words are conditionally independent given the center word) to break out the probabilities, so our objective function becomes:
$$\begin{aligned} \text{minimize } J &= -\log P(w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m}\mid w_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}P(w_{c-m+j}\mid w_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}P(u_{c-m+j}\mid v_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}\frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|}\exp(u_k^T v_c)}\\ &= -\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^T v_c+2m\log\sum_{k=1}^{|V|}\exp(u_k^T v_c) \end{aligned}$$
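
A minimal numpy sketch of one Skip-Gram forward pass and this loss, under the same assumptions as the CBOW sketch (toy sizes, random initialization, chosen indices):

```python
import numpy as np

V_size, n, m = 10, 4, 2
Vmat = np.random.randn(n, V_size)    # input word matrix
Umat = np.random.randn(V_size, n)    # output word matrix

center_id = 3
context_ids = [1, 2, 4, 5]           # the 2m context words to predict

v_c = Vmat[:, center_id]             # step 2: center word embedding
z = Umat @ v_c                       # step 3: scores
y_hat = np.exp(z) / np.exp(z).sum()  # step 4: softmax over the vocabulary
# Naive Bayes assumption: sum the -log probability of each context word given the center word.
loss = -sum(np.log(y_hat[j]) for j in context_ids)
```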

Negative Sampling

Computing the loss for CBOW and Skip-Gram is costly because the vocabulary size ($|V|$) is large.

For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples. We sample from a noise distribution ($P_n(w)$) whose probabilities match the ordering of the word frequencies in the vocabulary. To augment our formulation of the problem to incorporate negative sampling, all we need to do is update the:

  • objective function
  • gradients
  • update rules
Notation

Consider a pair $(w,c)$ of a word and a context.

  • $P(D=1\mid w,c)$: the probability that $(w,c)$ came from the corpus data (i.e., it is a genuine word-context pair)
  • $P(D=0\mid w,c)$: the probability that $(w,c)$ did not come from the corpus data (i.e., it is not a genuine pair)
Train Method

The probability is modeled with a sigmoid:
$$P(D=1\mid w,c,\theta)=\sigma(v_c^T v_w)=\frac{1}{1+e^{-v_c^T v_w}}$$
Here we take $\theta$ to be the parameters of the model; in our case they are $\mathcal V$ and $\mathcal U$.

So the objective is to maximize the probability that genuine pairs are labeled as coming from the corpus and that negative pairs are labeled as not coming from it (roughly, maximizing $TP\times TN$):
$$\begin{aligned} \theta &= \mathop{\arg\max}_\theta\prod_{(w,c)\in D}P(D=1\mid w,c,\theta)\prod_{(w,c)\in\tilde D}P(D=0\mid w,c,\theta)\\ &= \mathop{\arg\max}_\theta\prod_{(w,c)\in D}P(D=1\mid w,c,\theta)\prod_{(w,c)\in\tilde D}\bigl(1-P(D=1\mid w,c,\theta)\bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log P(D=1\mid w,c,\theta)+\sum_{(w,c)\in\tilde D}\log\bigl(1-P(D=1\mid w,c,\theta)\bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^T v_c)}+\sum_{(w,c)\in\tilde D}\log\Bigl(1-\frac{1}{1+\exp(-u_w^T v_c)}\Bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^T v_c)}+\sum_{(w,c)\in\tilde D}\log\frac{1}{1+\exp(u_w^T v_c)}\\ &= \mathop{\arg\min}_\theta\sum_{(w,c)\in D}-\log\frac{1}{1+\exp(-u_w^T v_c)}-\sum_{(w,c)\in\tilde D}\log\frac{1}{1+\exp(u_w^T v_c)} \end{aligned}$$
$\tilde D$ is a "false" or "negative" corpus, containing unnatural sentences like "stock boil fish is toy" that should have a very low probability of ever occurring. We can generate $\tilde D$ on the fly by randomly sampling negative pairs from the word bank.
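
A minimal sketch of this objective for a single $(w,c)$ pair with $K$ sampled negative output vectors (the function and argument names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_w, v_c, neg_us):
    """-log sigma(u_w^T v_c) - sum_k log sigma(-u_k^T v_c) over K negative vectors."""
    pos = -np.log(sigmoid(u_w @ v_c))
    neg = -sum(np.log(sigmoid(-u_k @ v_c)) for u_k in neg_us)
    return pos + neg

# Usage: u_w and v_c are n-dimensional vectors; neg_us holds K sampled output vectors.
n, K = 4, 5
loss = neg_sampling_loss(np.random.randn(n), np.random.randn(n),
                         [np.random.randn(n) for _ in range(K)])
```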

For CBOW

Our new objective function for observing the center word $u_c$ given the context vector $\hat v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}$ would be:
$$-\log\sigma(u_c^T\hat v)-\sum_{k=1}^K\log\sigma(-\tilde u_k^T\hat v)$$

For skip-gram

Our new objective function for observing the context word $c-m+j$ given the center word $c$ would be:
$$-\log\sigma(u_{c-m+j}^T v_c)-\sum_{k=1}^K\log\sigma(-\tilde u_k^T v_c)$$
$\{\tilde u_k\mid k=1\cdots K\}$ are sampled from the noise distribution $P_n(w)$.

Tiny trick

While there is much discussion of what makes the best choice of $P_n(w)$, what seems to work best is the unigram distribution raised to the power of $\frac{3}{4}$.
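
A minimal sketch of building this noise distribution and sampling negatives from it (the toy corpus and $K=5$ are assumptions):

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat the cat sat".split()
counter = Counter(tokens)
words = list(counter.keys())
counts = np.array([counter[w] for w in words], dtype=float)

# Unigram distribution raised to the 3/4 power, then renormalized.
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

negatives = np.random.choice(words, size=5, p=p_noise)  # sample K = 5 negative words
```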

Hierarchical Softmax

In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling tends to be better for frequent words and lower dimensional vectors.

Core idea

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from the root to each leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.

In this model, the probability of a word $w$ given an input word $w_i$, $P(w\mid w_i)$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$.

Notation
  • $L(w)$: the number of nodes in the path from the root to the leaf $w$ (the root is included, the leaf $w$ is excluded)
  • $n(w,i)$: the $i$-th node on this path, with associated vector $v_{n(w,i)}$ ($i$ starts from 1, so $n(w,1)$ is the root)
  • $ch(n)$: for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left child)
Train method

$$P(w\mid w_i)=\prod_{j=1}^{L(w)-1}\sigma\bigl([n(w,j+1)=ch(n(w,j))]\cdot v_{n(w,j)}^T v_{w_i}\bigr)\\ \text{where }[x]=\begin{cases}1&\text{if }x\text{ is true}\\ -1&\text{otherwise}\end{cases}$$

explanation

  • $\prod$: the product runs over the nodes on the path from the root $n(w,1)$ to the leaf $w$.

  • $[n(w,j+1)=ch(n(w,j))]$: if we assume $ch(n)$ is always the left child of $n$, then this term returns 1 when the path goes left and -1 when it goes right. Furthermore, this term provides normalization: at a node $n$, if we sum the probabilities of going to the left and to the right child, then for any value of $v_n^T v_{w_i}$ we have $\sigma(v_n^T v_{w_i})+\sigma(-v_n^T v_{w_i})=1$ (recall the shape of the sigmoid function). The normalization also ensures that $\sum_{w=1}^{|V|}P(w\mid w_i)=1$, just as in the original softmax.

  • $v_{n(w,j)}^T v_{w_i}$: compares the similarity of our input vector $v_{w_i}$ to each inner-node vector $v_{n(w,j)}$ using a dot product.

example

[Figure: a binary tree over the vocabulary; the path from the root to the leaf $w_2$ passes through the inner nodes $n(w_2,1)$, $n(w_2,2)$, and $n(w_2,3)$.]

Taking $w_2$ as an example, we must take two left edges and then a right edge to reach $w_2$ from the root, so
$$\begin{aligned} P(w_2\mid w_i)&=p(n(w_2,1),\text{left})\cdot p(n(w_2,2),\text{left})\cdot p(n(w_2,3),\text{right})\\ &=\sigma(v_{n(w_2,1)}^T v_{w_i})\cdot\sigma(v_{n(w_2,2)}^T v_{w_i})\cdot\sigma(-v_{n(w_2,3)}^T v_{w_i}) \end{aligned}$$
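
A minimal sketch of computing $P(w\mid w_i)$ from a given root-to-leaf path, mirroring the $w_2$ example above (the node vectors and the left/right signs along the path are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_vectors, path_signs, v_wi):
    """Product of sigma(sign * v_n^T v_wi) over the inner nodes on the root-to-leaf path.
    path_signs[j] is +1 if the path goes to ch(n) (e.g. the left child) at node j, else -1."""
    p = 1.0
    for v_n, sign in zip(path_vectors, path_signs):
        p *= sigmoid(sign * (v_n @ v_wi))
    return p

# In the spirit of the w_2 example: two left edges, then a right edge.
n_dim = 4
path_vectors = [np.random.randn(n_dim) for _ in range(3)]
p_w2 = hs_probability(path_vectors, [+1, +1, -1], np.random.randn(n_dim))
```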
More details

train objective:

Our goal is still to minimize the negative log likelihood $-\log P(w\mid w_i)$. But instead of updating an output vector per word, we update the vectors of the nodes in the binary tree that lie on the path from the root to the leaf node.

speed of training:

Determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree (see the sketch below).
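
A minimal sketch of building such a Huffman coding from word frequencies with `heapq`, returning each word's path length; the word counts are toy numbers, and word2vec's actual tree-building code differs in detail:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return the path (code) length of each word in a Huffman tree built from its frequency."""
    tie = itertools.count()  # unique tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one edge above every word inside them.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

print(huffman_code_lengths({"the": 100, "cat": 10, "sat": 10, "mat": 1}))
# Frequent words ("the") get shorter paths; rare words ("mat") get longer ones.
```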
