Lecture 2: word2vec
1. Word meaning
Meaning: the idea that is represented by a word, phrase, writing, art, etc.
How do we have usable meaning in a computer?
Common answer: a taxonomy like WordNet that has hypernym (is-a) relations and synonym sets.
Problems with a taxonomy:
- missing nuances, e.g. "proficient" fits an expert better than "good", but the taxonomy just lists them as synonyms
- missing new words
- subjective
- requires human labor to create and adapt
- hard to compute accurate word similarity
Problems with a discrete representation: a one-hot vector has as many dimensions as there are words in the vocabulary,
[0,0,0,...,1,...,0]
and one-hot vectors give no notion of the relation/similarity between words.
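As a quick illustration (a toy example, not from the lecture): the dot product of any two distinct one-hot vectors is zero, so this representation carries no similarity information at all.

```python
import numpy as np

# Toy vocabulary (made up for illustration); each word gets a one-hot row.
vocab = ["hotel", "motel", "banking"]
one_hot = np.eye(len(vocab))

hotel, motel = one_hot[0], one_hot[1]
# The dot product of two different one-hot vectors is always 0,
# so "hotel" and "motel" look no more similar than any other pair.
print(hotel @ motel)  # 0.0
```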
Distributional similarity: you can get a lot of value from representing a word by means of its neighbors.
Next, we want to use vectors to represent words.
distributional: understand word meaning by context.
distributed: dense vectors to represent the meaning of words.
2. Word2vec intro
Basic idea of learning Neural Network word embeddings
We define a model that predicts between a center word $w_t$ and its context words in terms of word vectors, $p(context \mid w_t)$,
which has a loss function like
$$J = 1 - p(w_{-t} \mid w_t)$$
where $w_{-t}$ denotes the neighbor (context) words of $w_t$, i.e. everything in the window except $w_t$ itself.
Main idea of word2vec: Predict between every word and its context words.
Two algorithms.
- Skip-gram (SG)
Predict context words given the target word (position independent).
… turning into banking crises as …
banking: center word
turning: $p(w_{t-2} \mid w_t)$
For each word $t = 1, \dots, T$, we predict the surrounding words in a window of "radius" $m$ around it:
$$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)$$
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)$$
hyperparameter: window size m
We use
$$p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)},$$
where $o$ is the index of the outside (context) word, $c$ is the index of the center word, $u$ are the "outside" vectors and $v$ are the "center" vectors. The dot product is larger when two words are more similar, and the softmax maps the scores to a probability distribution (a numpy sketch follows this list).
- Continuous Bag of Words (CBOW)
Predict the target word from a bag-of-words context.
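To make the skip-gram probability and objective above concrete, here is a minimal numpy sketch; the array names `U_vecs` and `V_vecs`, the toy sizes, and the random initialization are all assumptions for illustration, not the lecture's code.

```python
import numpy as np

np.random.seed(0)
V_size, d = 10, 4                            # toy vocabulary size and vector dimension
V_vecs = np.random.randn(V_size, d) * 0.01   # center vectors v_w
U_vecs = np.random.randn(V_size, d) * 0.01   # outside vectors u_w

def p_outside_given_center(c):
    """Softmax p(o | c) over the whole vocabulary for center word index c."""
    scores = U_vecs @ V_vecs[c]                  # u_w^T v_c for every w
    exp_scores = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp_scores / exp_scores.sum()

def window_nll(center, context):
    """Negative log likelihood of the observed context words around one center word."""
    probs = p_outside_given_center(center)
    return -sum(np.log(probs[o]) for o in context)

# Center word at index 3 with a radius-2 window of context word indices.
print(window_nll(3, [1, 2, 4, 5]))
```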
3. Research highlight
omitted
4. Word2vec objective function gradients
All parameters in the model:
$$\theta = \begin{bmatrix} v_a \\ \vdots \\ v_{zebra} \\ u_a \\ \vdots \\ u_{zebra} \end{bmatrix}$$
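A minimal sketch of how $\theta$ is assembled, assuming a toy vocabulary size and vector dimension (both made up here): every word contributes one center vector and one outside vector, so the stacked parameter vector has 2·d·V entries.

```python
import numpy as np

V_size, d = 5, 3                      # toy vocabulary size and vector dimension
V_vecs = np.random.randn(V_size, d)   # center vectors v_a ... v_zebra
U_vecs = np.random.randn(V_size, d)   # outside vectors u_a ... u_zebra

# theta stacks every center vector and every outside vector: 2 * d * V parameters.
theta = np.concatenate([V_vecs.ravel(), U_vecs.ravel()])
print(theta.shape)  # (30,)
```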
We optimize these parameters by training the model with gradient descent.
The gradient of $\log p(o \mid c)$ with respect to the center vector $v_c$ is
$$\begin{aligned} \frac{\partial}{\partial v_c}\left[\log \exp(u_o^T v_c) - \log \sum_{w=1}^{V} \exp(u_w^T v_c)\right] &= u_o - \frac{\sum_{x=1}^{V} u_x \exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \\ &= u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x \end{aligned}$$
i.e. the observed outside vector minus the expected outside vector under the model.
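As a sanity check (not part of the lecture), the closed-form gradient can be compared against a finite-difference estimate; the toy sizes and random vectors below are assumptions for illustration.

```python
import numpy as np

np.random.seed(1)
V_size, d = 8, 3
U = np.random.randn(V_size, d)   # outside vectors
v_c = np.random.randn(d)         # center vector
o = 2                            # index of the observed outside word

def log_prob(v):
    """log p(o | c) for a given center vector v."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: u_o - sum_x p(x|c) u_x
scores = U @ v_c
probs = np.exp(scores) / np.exp(scores).sum()
analytic = U[o] - probs @ U

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(d)[i]) - log_prob(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```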
5. Optimization refresher
We compute the gradient at the current point, then take a step along the negative gradient.
$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$$
$\alpha$: step size (learning rate).
In matrix notation for all parameters:
$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$
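In code, one update is a single line. The sketch below uses a made-up quadratic objective just to show the step; `grad_J` stands for whatever routine computes $\nabla_\theta J(\theta)$ and is an assumption here.

```python
import numpy as np

def gd_step(theta, grad_J, alpha):
    """One gradient descent update: theta_new = theta_old - alpha * grad J(theta_old)."""
    return theta - alpha * grad_J(theta)

# Toy usage: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    theta = gd_step(theta, lambda t: t, alpha=0.1)
print(theta)  # close to the minimum at the zero vector
```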
Stochastic Gradient Descent (SGD):
- a global (full-corpus) update -> too much time per step
- a mini-batch (or single-window) update -> also a good idea, and much cheaper (see the sketch below)
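A minimal sketch of the stochastic variant (the sampling routine and per-example gradient are placeholders, not the lecture's implementation): instead of the full-corpus gradient, $\theta$ is updated from one sampled example or mini-batch at a time.

```python
import numpy as np

def sgd(theta, sample_example, grad_example, alpha=0.025, steps=2000):
    """SGD: update theta from the gradient of one sampled example (or mini-batch) at a time."""
    for _ in range(steps):
        example = sample_example()
        theta = theta - alpha * grad_example(theta, example)
    return theta

# Toy usage: minimize E[(theta - x)^2] over noisy samples x ~ N(1, 0.1).
rng = np.random.default_rng(0)
theta = sgd(np.zeros(2),
            sample_example=lambda: rng.normal(1.0, 0.1, size=2),
            grad_example=lambda t, x: 2 * (t - x))
print(theta)  # close to [1, 1]
```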