Lecture 2: word2vec
1. Word meaning
Meaning: the idea that is represented by a word, phrase, writing, art, etc.
How do we have usable meaning in a computer?
Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.
Problems with a taxonomy:
- missing nuances, e.g. "proficient" fits an expert better than "good", but the taxonomy just lists them as synonyms
- missing new words
- subjective
- requires human labor to create and adapt
- hard to compute accurate word similarity
Problems with a discrete representation: a one-hot vector has as many dimensions as there are words in the vocabulary,
[0,0,0,...,1,...,0]
and one-hot vectors give no notion of the relation/similarity between words.
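As a quick illustration (a toy example, not from the lecture): the dot product of any two distinct one-hot vectors is zero, so this representation carries no similarity information at all.

```python
import numpy as np

# Toy vocabulary (made up for illustration); each word gets a one-hot row.
vocab = ["hotel", "motel", "banking"]
one_hot = np.eye(len(vocab))

hotel, motel = one_hot[0], one_hot[1]
# The dot product of two different one-hot vectors is always 0,
# so "hotel" and "motel" look no more similar than any other pair.
print(hotel @ motel)  # 0.0
```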
Distributional similarity: you can get a lot of value from representing a word by means of its neighbors.
Next, we want to use vectors to represent words.
distributional: understand word meaning by context.
distributed: dense vectors to represent the meaning of words.
2. Word2vec intro
Basic idea of learning Neural Network word embeddings
We define a model that predicts between a center word $w_t$ and its context words in terms of word vectors, $p(context \mid w_t)$,
which has a loss function like
$$J = 1 - p(w_{-t} \mid w_t)$$
where $w_{-t}$ denotes the neighbor (context) words of $w_t$, i.e. everything in the window except $w_t$ itself.
Main idea of word2vec: Predict between every word and its context words.
Two algorithms.
- Skip-gram (SG)
Predict context words given the target word (position independent).
… turning into banking crises as …
banking: center word
turning: $p(w_{t-2} \mid w_t)$
For each word $t = 1, \dots, T$, we predict the surrounding words in a window of "radius" $m$ around it:
$$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)$$
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)$$
hyperparameter: window size m
We use
$$p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)},$$
where $o$ is the index of the outside (context) word, $c$ is the index of the center word, $u$ are the "outside" vectors and $v$ are the "center" vectors. The dot product is larger when two words are more similar, and the softmax maps the scores to a probability distribution (a numpy sketch follows this list).
- Continuous Bag of Words (CBOW)
Predict the target word from a bag-of-words context.
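To make the skip-gram probability and objective above concrete, here is a minimal numpy sketch; the array names `U_vecs` and `V_vecs`, the toy sizes, and the random initialization are all assumptions for illustration, not the lecture's code.

```python
import numpy as np

np.random.seed(0)
V_size, d = 10, 4                            # toy vocabulary size and vector dimension
V_vecs = np.random.randn(V_size, d) * 0.01   # center vectors v_w
U_vecs = np.random.randn(V_size, d) * 0.01   # outside vectors u_w

def p_outside_given_center(c):
    """Softmax p(o | c) over the whole vocabulary for center word index c."""
    scores = U_vecs @ V_vecs[c]                  # u_w^T v_c for every w
    exp_scores = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp_scores / exp_scores.sum()

def window_nll(center, context):
    """Negative log likelihood of the observed context words around one center word."""
    probs = p_outside_given_center(center)
    return -sum(np.log(probs[o]) for o in context)

# Center word at index 3 with a radius-2 window of context word indices.
print(window_nll(3, [1, 2, 4, 5]))
```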
3. Research highlight
omitted
4. Word2vec objective function gradients
All parameters in the model:
$$\theta = \begin{bmatrix} v_a \\ \vdots \\ v_{zebra} \\ u_a \\ \vdots \\ u_{zebra} \end{bmatrix}$$
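A minimal sketch of how $\theta$ is assembled, assuming a toy vocabulary size and vector dimension (both made up here): every word contributes one center vector and one outside vector, so the stacked parameter vector has 2·d·V entries.

```python
import numpy as np

V_size, d = 5, 3                      # toy vocabulary size and vector dimension
V_vecs = np.random.randn(V_size, d)   # center vectors v_a ... v_zebra
U_vecs = np.random.randn(V_size, d)   # outside vectors u_a ... u_zebra

# theta stacks every center vector and every outside vector: 2 * d * V parameters.
theta = np.concatenate([V_vecs.ravel(), U_vecs.ravel()])
print(theta.shape)  # (30,)
```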
We optimize these parameters by training the model with gradient descent.
The gradient of $\log p(o \mid c)$ with respect to the center vector $v_c$ is
$$\begin{aligned} \frac{\partial}{\partial v_c}\left[\log \exp(u_o^T v_c) - \log \sum_{w=1}^{V} \exp(u_w^T v_c)\right] &= u_o - \frac{\sum_{x=1}^{V} u_x \exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \\ &= u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x \end{aligned}$$
i.e. the observed outside vector minus the expected outside vector under the model.
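As a sanity check (not part of the lecture), the closed-form gradient can be compared against a finite-difference estimate; the toy sizes and random vectors below are assumptions for illustration.

```python
import numpy as np

np.random.seed(1)
V_size, d = 8, 3
U = np.random.randn(V_size, d)   # outside vectors
v_c = np.random.randn(d)         # center vector
o = 2                            # index of the observed outside word

def log_prob(v):
    """log p(o | c) for a given center vector v."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: u_o - sum_x p(x|c) u_x
scores = U @ v_c
probs = np.exp(scores) / np.exp(scores).sum()
analytic = U[o] - probs @ U

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```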
5. Optimization refresher
We compute the gradient at the current point, then take a step along the negative gradient.
$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
$\alpha$: step size (learning rate).
In matrix notation for all parameters:
$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$
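In code, one update is a single line. The sketch below uses a made-up quadratic objective just to show the step; `grad_J` stands for whatever routine computes $\nabla_\theta J(\theta)$ and is an assumption here.

```python
import numpy as np

def gd_step(theta, grad_J, alpha):
    """One gradient descent update: theta_new = theta_old - alpha * grad J(theta_old)."""
    return theta - alpha * grad_J(theta)

# Toy usage: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    theta = gd_step(theta, lambda t: t, alpha=0.1)
print(theta)  # close to the minimum at the zero vector
```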
Stochastic Gradient Descent (SGD):
- a global (full-corpus) update -> too much time per step
- a mini-batch (or single-window) update -> also a good idea, and much cheaper (see the sketch below)
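A minimal sketch of the stochastic variant (the sampling routine and per-example gradient are placeholders, not the lecture's implementation): instead of the full-corpus gradient, $\theta$ is updated from one sampled example or mini-batch at a time.

```python
import numpy as np

def sgd(theta, sample_example, grad_example, alpha=0.025, steps=2000):
    """SGD: update theta from the gradient of one sampled example (or mini-batch) at a time."""
    for _ in range(steps):
        example = sample_example()
        theta = theta - alpha * grad_example(theta, example)
    return theta

# Toy usage: minimize E[(theta - x)^2] over noisy samples x ~ N(1, 0.1).
rng = np.random.default_rng(0)
theta = sgd(np.zeros(2),
            sample_example=lambda: rng.normal(1.0, 0.1, size=2),
            grad_example=lambda t, x: 2 * (t - x))
print(theta)  # close to [1, 1]
```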