CS224N notes_chapter2_word2vec

Lecture 2: word2vec

1. Word meaning

Meaning: the idea that is represented by a word, phrase, writing, art, etc.
How do we have usable meaning in a computer?
Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.
Problems with a taxonomy:

  • missing nuances: e.g. proficient describes an expert better than good does, but the taxonomy lists them as synonyms
  • missing new words
  • subjective
  • requires human labor to create and adapt
  • Hard to compute accurate word similarity

Problems with a discrete representation: a one-hot vector has one dimension per vocabulary word, so the vectors are huge,
[0,0,0,...,1,...,0]
and one-hot vectors give no notion of the relation/similarity between words, as the sketch below shows.
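As a concrete illustration, here is a minimal numpy sketch of why one-hot vectors carry no similarity information (the toy vocabulary is made up for this example):

```python
import numpy as np

# Toy vocabulary; the words and their order are made up for illustration.
vocab = ["hotel", "motel", "banking"]
V = len(vocab)

def one_hot(word):
    """Return the V-dimensional one-hot vector for a word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# "hotel" and "motel" are near-synonyms, but their one-hot vectors are
# orthogonal: the dot product is 0, exactly as for two unrelated words.
print(one_hot("hotel") @ one_hot("motel"))    # 0.0
print(one_hot("hotel") @ one_hot("banking"))  # 0.0
```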
Distributional similarity: you can get a lot of value by representing a word through its neighbors.
Next, we want to use vectors to represent words.
distributional: understand word meaning from its context.
distributed: dense vectors that represent the meaning of the words.

2. Word2vec intro

Basic idea of learning neural-network word embeddings:
We define a model that predicts the context words given the center word $w_t$, in terms of word vectors:
$$p(context \mid w_t)$$
which has a loss function like
$$J = 1 - p(w_{-t} \mid w_t)$$
where $w_{-t}$ denotes the neighbors of $w_t$, i.e., the words around $w_t$ but not $w_t$ itself.
Main idea of word2vec: Predict between every word and its context words.
Two algorithms.

  1. Skip-grams (SG)
    Predict context words given the target (position independent).
    … turning into banking crises as …
    banking: center word $w_t$
    turning: $p(w_{t-2} \mid w_t)$
    For each word $t = 1, \dots, T$, we predict the surrounding words in a window of "radius" $m$ around it:
    $$J'(\theta)=\prod_{t=1}^T \prod_{-m\leq j \leq m,\ j\neq 0} P(w_{t+j}\mid w_t;\theta)$$
    $$J(\theta)=-\frac{1}{T}\sum_{t=1}^T \sum_{-m\leq j \leq m,\ j\neq 0} \log P(w_{t+j}\mid w_t;\theta)$$
    hyperparameter: window size m
    we use
    $$p(w_{t+j}\mid w_t)=\frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)},$$
    where $o$ indexes the outside (context) word and $c$ the center word. The dot product $u_o^T v_c$ is larger when two words are more similar, and the softmax maps the scores to a probability distribution (see the numeric sketch after the two algorithms below).

  2. Continuous Bag of Words (CBOW)
    Predict target word from bag-of-words context.
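To make the skip-gram softmax above concrete, here is a minimal numpy sketch of $p(o \mid c)$; the vocabulary size, embedding dimension, and random (untrained) vectors are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                  # toy vocabulary size and embedding dimension (made up)
U = rng.normal(size=(V, d))   # outside ("context") vectors u_w, one row per word
Vc = rng.normal(size=(V, d))  # center vectors v_w, one row per word

def p_outside_given_center(o, c):
    """p(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)."""
    scores = U @ Vc[c]        # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()    # shift for numerical stability (softmax is shift-invariant)
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Loss contribution of one (center, outside) pair: -log p(o | c)
c, o = 3, 7
print(-np.log(p_outside_given_center(o, c)))
```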

3. Research highlight

(omitted)

4. Word2vec objective function gradients

All the parameters in the model, i.e., the center vector $v$ and the outside vector $u$ of every vocabulary word:
$$\theta=\begin{bmatrix} v_{a} \\ \vdots \\ v_{zebra} \\ u_{a} \\ \vdots \\ u_{zebra} \end{bmatrix}$$
We optimize these parameters by training the model with gradient descent.
Taking the gradient of $\log p(o \mid c)$ with respect to the center vector $v_c$:
$$\begin{aligned} \frac{\partial}{\partial v_c}\left[\log \exp(u_o^T v_c)-\log \sum_{w=1}^V \exp(u_w^T v_c)\right] &= u_o - \frac{\sum_{x=1}^{V}u_x \exp(u_x^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)} \\ &= u_o - \sum_{x=1}^{V}p(x\mid c)\,u_x \end{aligned}$$
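A small numpy sketch to sanity-check this derivation, comparing the analytic gradient against a finite-difference estimate; the toy sizes and random vectors are assumptions, not part of the course code:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 10, 4                  # toy vocabulary size and embedding dimension (made up)
U = rng.normal(size=(V, d))   # outside vectors u_w
v_c = rng.normal(size=d)      # the center vector v_c
o = 7                         # index of the observed outside word

def neg_log_prob(v):
    """-log p(o | c) as a function of the center vector."""
    scores = U @ v
    return -(scores[o] - np.log(np.exp(scores).sum()))

# Analytic gradient of -log p(o|c) w.r.t. v_c: -(u_o - sum_x p(x|c) u_x)
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = -(U[o] - probs @ U)

# Finite-difference estimate of the same gradient, coordinate by coordinate
eps = 1e-6
numeric = np.array([
    (neg_log_prob(v_c + eps * np.eye(d)[i]) - neg_log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-9): the derivation checks out
```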

5. Optimization refresher

We have the gradient at the current point and take a step along the negative gradient:
$$\theta_j^{new}=\theta_j^{old} - \alpha\frac{\partial}{\partial \theta_j^{old}}J(\theta)$$
$\alpha$: step size (learning rate).
In matrix notation, for all parameters at once:
$$\theta^{new}=\theta^{old} - \alpha\nabla_\theta J(\theta)$$
Stochastic Gradient Descent:

  • a full (global) update over the whole corpus per step takes too much time
  • updating from one window, or a mini-batch of windows, is much cheaper and also works well (see the sketch below)
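A minimal sketch of the update rule, assuming a generic parameter vector; grad_J here is a hypothetical placeholder for the gradient, whether computed from the full corpus, one window, or a mini-batch:

```python
import numpy as np

def sgd_step(theta, grad_J, alpha=0.05):
    """One update: theta_new = theta_old - alpha * grad_J(theta_old).

    In full gradient descent, grad_J sums over the whole corpus; in SGD it
    comes from a single window (or a small mini-batch), so each step is
    cheap but noisy.
    """
    return theta - alpha * grad_J(theta)

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, lambda t: 2 * t)
print(theta)  # close to [0, 0]
```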

6. Assignment 1 notes

7. Usefulness of Word2vec
