Lecture 1
Main topics of this lecture:
- word representation
- word2vec
How to represent words?
Problems with resources like WordNet
1. Great as a resource but missing nuance
2. Missing new meanings of words
   - Impossible to keep up-to-date
3. Subjective
4. Requires human labor to create and adapt
5. Can't compute accurate word similarity
Problems with words as discrete symbols (one-hot vectors)
1. No natural notion of similarity: any two distinct one-hot vectors are orthogonal
2. The vector dimension equals the vocabulary size, which is huge
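A small Python sketch makes both points concrete (the three-word vocabulary is an illustrative assumption): distinct one-hot vectors always have dot product 0, and each vector is as long as the vocabulary.

```python
import numpy as np

# Toy vocabulary; a real vocabulary easily has 10^5-10^6 entries,
# so one-hot vectors are both huge and sparse.
vocab = ["motel", "hotel", "banking"]
V = len(vocab)

def one_hot(word: str) -> np.ndarray:
    """Return the |V|-dimensional one-hot vector for `word`."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# Any two distinct one-hot vectors are orthogonal: their dot product is 0,
# so "motel" looks exactly as unrelated to "hotel" as to "banking".
print(one_hot("motel") @ one_hot("hotel"))    # 0.0
print(one_hot("motel") @ one_hot("banking"))  # 0.0
```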
Word2vec
Word2Vec: objective function
We derive the objective by maximum likelihood.
For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.
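To make the setup concrete, here is a short Python sketch of how the (center, context) training pairs can be enumerated; the corpus, the window size, and the helper name `skipgram_pairs` are illustrative assumptions, not part of the lecture.

```python
def skipgram_pairs(tokens, m):
    """Yield (center, context) pairs for every position t = 1..T,
    taking context words within a window of size m around w_t."""
    T = len(tokens)
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip the center word itself and out-of-range positions
            yield tokens[t], tokens[t + j]

# Example with window size m = 2
corpus = "problems turning into banking crises".split()
for center, context in skipgram_pairs(corpus, m=2):
    print(center, "->", context)
```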
The likelihood is:
$$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T}\ \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$
The loss function is the average negative log-likelihood:
$$J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$
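As a sanity check, the objective can be evaluated directly on a toy example. The sketch below uses randomly initialized toy embeddings and the softmax form of $P(o \mid c)$ defined in the next subsection; the names `log_p` and `loss` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "problems turning into banking crises".split()
vocab = sorted(set(corpus))
V, d = len(vocab), 8
idx = {w: i for i, w in enumerate(vocab)}

# theta packs two toy matrices: center vectors v_w and context vectors u_w.
v = rng.normal(scale=0.1, size=(V, d))  # row i is v_w for vocab[i] (center role)
u = rng.normal(scale=0.1, size=(V, d))  # row i is u_w for vocab[i] (context role)

def log_p(o: str, c: str) -> float:
    """log P(o | c) under the softmax parameterization."""
    scores = u @ v[idx[c]]                      # u_w^T v_c for every w in V
    return scores[idx[o]] - np.log(np.exp(scores).sum())

def loss(tokens, m):
    """J(theta): average negative log-probability of each in-window context word."""
    T, total = len(tokens), 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_p(tokens[t + j], tokens[t])
    return -total / T

print(loss(corpus, m=2))  # a positive number; smaller is better
```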
How do we compute $P(w_{t+j} \mid w_t; \theta)$?
Define the notation as follows; each word $w$ gets two vectors:
- $v_w$ when $w$ is a center word
- $u_w$ when $w$ is a context word
Then, for a center word $c$ and a context word $o$,
$$P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum\limits_{w \in V} \exp(u_w^T v_c)}$$
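Here is a compact sketch of this softmax, assuming toy random vectors; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the distribution.

```python
import numpy as np

def p_dist(u: np.ndarray, v_c: np.ndarray) -> np.ndarray:
    """P(. | c): softmax over the scores u_w^T v_c for all w in V.
    Subtracting the max score avoids overflow in exp() and leaves
    the resulting probabilities unchanged."""
    scores = u @ v_c            # shape (|V|,), entry w is u_w^T v_c
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

# The probabilities are positive and sum to 1 by construction.
rng = np.random.default_rng(0)
u = rng.normal(size=(5, 8))    # toy context vectors, |V| = 5, d = 8
v_c = rng.normal(size=8)       # toy center vector
print(p_dist(u, v_c).sum())    # 1.0
```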
Let $f(\theta) = \log P(o \mid c)$ and take the partial derivative with respect to $v_c$:
$$\begin{aligned}
\frac{\partial f(\theta)}{\partial v_c}
&= \frac{\partial}{\partial v_c}\log\frac{\exp(u_o^T v_c)}{\sum\limits_{w\in V}\exp(u_w^T v_c)} \\
&= \frac{\partial}{\partial v_c}\log\exp(u_o^T v_c) - \frac{\partial}{\partial v_c}\log\sum\limits_{w\in V}\exp(u_w^T v_c)
\end{aligned}$$
The first term (the $\log$ and $\exp$ cancel):
$$f_1(\theta) = \frac{\partial}{\partial v_c}\log\exp(u_o^T v_c) = \frac{\partial}{\partial v_c}\,u_o^T v_c = u_o$$
The second term, by the chain rule (differentiate the $\log$, then the sum of exponentials, with $x$ as a dummy index over the vocabulary):
$$\begin{aligned}
f_2(\theta)
&= \frac{\partial}{\partial v_c}\log\sum\limits_{w\in V}\exp(u_w^T v_c) \\
&= \frac{1}{\sum\limits_{w\in V}\exp(u_w^T v_c)}\sum\limits_{x\in V}\exp(u_x^T v_c)\cdot u_x \\
&= \sum\limits_{x\in V}\frac{\exp(u_x^T v_c)}{\sum\limits_{w\in V}\exp(u_w^T v_c)}\cdot u_x \\
&= \sum\limits_{x\in V}P(x \mid c)\cdot u_x
\end{aligned}$$
Combining the two terms:
$$\frac{\partial f(\theta)}{\partial v_c} = f_1(\theta) - f_2(\theta) = u_o - \sum\limits_{x\in V}P(x \mid c)\cdot u_x$$
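This analytic gradient can be checked numerically with central differences on a toy example; the sizes and random vectors below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8
u = rng.normal(size=(V, d))   # context vectors u_w
v_c = rng.normal(size=d)      # center vector v_c
o = 2                         # index of the observed context word

def f(v_c):
    """f(theta) = log P(o | c)."""
    scores = u @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - sum_x P(x|c) u_x
p = np.exp(u @ v_c); p /= p.sum()
analytic = u[o] - p @ u

# Numerical gradient via central differences
eps = 1e-6
numeric = np.array([
    (f(v_c + eps * e) - f(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```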
From probability theory, $\sum\limits_{x\in V} P(x \mid c)\cdot u_x$ is the expectation of the context vector under the model's current distribution $P(\cdot \mid c)$. The gradient $\frac{\partial f}{\partial v_c} = u_o - \sum\limits_{x\in V}P(x \mid c)\cdot u_x$ is therefore "observed minus expected": the difference between the actually observed context vector $u_o$ and the model's expected context vector. Updating $v_c$ along this gradient moves the model's expectation toward the observed context word, which is exactly what maximizing $\log P(o \mid c)$ should do.
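Assuming a plain gradient-ascent update (a sketch, not the lecture's full training loop; the helper name `sgd_step_vc` and all toy data are hypothetical), one step on a single (center, context) pair moves $v_c$ along "observed minus expected" and increases $P(o \mid c)$:

```python
import numpy as np

def sgd_step_vc(u: np.ndarray, v_c: np.ndarray, o: int, lr: float = 0.05) -> np.ndarray:
    """One hypothetical SGD step on v_c for a single (center, context) pair:
    ascend log P(o|c) along u_o - sum_x P(x|c) u_x ('observed minus expected')."""
    p = np.exp(u @ v_c)
    p /= p.sum()                       # P(. | c)
    grad = u[o] - p @ u                # d log P(o|c) / d v_c
    return v_c + lr * grad             # ascent on the log-likelihood

# After the step, P(o|c) should increase.
rng = np.random.default_rng(0)
u, v_c, o = rng.normal(size=(5, 8)), rng.normal(size=8), 2
before = np.exp(u @ v_c)[o] / np.exp(u @ v_c).sum()
v_c = sgd_step_vc(u, v_c, o)
after = np.exp(u @ v_c)[o] / np.exp(u @ v_c).sum()
print(before < after)  # True
```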