CS224n NLP with Deep Learning | Winter 2019
Lecture 1
word2vec
Each word can be turned into a vector, and similar words have similar vectors.
$word = \begin{bmatrix} 0.23 \\ 0.52 \\ -0.41 \\ -0.31 \\ 0.27 \\ 0.48 \end{bmatrix}$
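As a quick illustration of "similar words have similar vectors", here is a minimal sketch (my own toy example, not from the lecture; the vector values are made up) that measures similarity with cosine similarity:

```python
import numpy as np

# Hypothetical word vectors (values invented purely for illustration).
v_king  = np.array([0.23, 0.52, -0.41, -0.31, 0.27, 0.48])
v_queen = np.array([0.20, 0.55, -0.38, -0.29, 0.30, 0.45])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1 means similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(v_king, v_queen))  # near 1.0 for similar words
```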
$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t; \theta)$
Here $m$ is the radius of the sliding window: the number of context words taken on each side of the center word $w_t$ ($j$ is the offset within the window).
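A small sketch (my own, assuming a whitespace-tokenized toy corpus) of the (center, context) pairs that the double sum in $J(\theta)$ iterates over:

```python
# Enumerate (center, context) pairs with a window of radius m.
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # window radius: up to m context words on each side of the center word

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue  # skip the center word itself and out-of-range positions
        pairs.append((center, corpus[t + j]))

print(pairs[:5])  # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```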
$\theta = \begin{bmatrix} v_{apple} \\ v_{banana} \\ \vdots \\ v_{zebra} \\ u_{apple} \\ u_{banana} \\ \vdots \\ u_{zebra} \end{bmatrix} \in \mathbb{R}^{2dV}$
Every word appears twice in $\theta$ because each word has two representations: a center-word vector $v$ and a context-word vector $u$ (so $\theta$ has length $2dV$ for vocabulary size $V$ and embedding dimension $d$).
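A minimal sketch of this parameterization (toy sizes, my own example): two separate embedding matrices, one for center vectors and one for context vectors, flattened into a single parameter vector $\theta$:

```python
import numpy as np

# Toy vocabulary size and embedding dimension (placeholders).
V, d = 3, 6
rng = np.random.default_rng(0)

center_vecs  = rng.normal(scale=0.1, size=(V, d))  # v_w, used when w is the center word
context_vecs = rng.normal(scale=0.1, size=(V, d))  # u_w, used when w is a context word

theta = np.concatenate([center_vecs.ravel(), context_vecs.ravel()])
print(theta.shape)  # (2 * d * V,) = (36,)
```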
Derivative of $\log p(o \mid c)$ with respect to the center vector $v_c$:

$\frac{\partial}{\partial v_c} \log \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} = u_o - \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c)$

For the second term:

$\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c) = \frac{1}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \sum_{x=1}^{V} \exp(u_x^T v_c)\, \frac{\partial}{\partial v_c} u_x^T v_c = \sum_{x=1}^{V} \frac{\exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}\, u_x = \sum_{x=1}^{V} p(x \mid c)\, u_x$

Putting the two together:

$\frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x$
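A quick numerical check of this result (my own sketch with random toy values): compare the analytic gradient $u_o - \sum_x p(x \mid c)\,u_x$ against a finite-difference estimate of $\partial \log p(o \mid c) / \partial v_c$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, o = 5, 4, 2                     # toy sizes; o is the index of the observed outside word
U = rng.normal(size=(V, d))           # rows are context vectors u_w
v_c = rng.normal(size=d)              # center vector

def log_prob(v):
    """log p(o|c) for center vector v under the softmax model."""
    scores = U @ v                    # u_w^T v for every w
    return scores[o] - np.log(np.sum(np.exp(scores)))

p = np.exp(U @ v_c); p /= p.sum()     # softmax p(x|c)
analytic = U[o] - p @ U               # u_o - sum_x p(x|c) u_x

eps = 1e-6
numeric = np.array([(log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```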
https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
Question: in that tutorial, does $W_I$ refer to $u_w$ and $W_O$ to $v_w$? (In the lecture's notation, $v_w$ is the center-word vector and $u_w$ the context/outside-word vector.)
Lecture 2
Gradient descent
$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate.
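A minimal sketch of this update rule on a toy objective (my own example; $J(\theta) = \lVert\theta\rVert^2$ instead of the word2vec loss):

```python
import numpy as np

alpha = 0.1                       # learning rate
theta = np.array([3.0, -2.0])

def grad_J(th):
    return 2 * th                 # gradient of J(theta) = ||theta||^2

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_new = theta_old - alpha * grad

print(theta)                      # approaches the minimizer [0, 0]
```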
Stochastic Gradient Descent and negative sampling
$P(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$
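A sketch of computing $P(o \mid c)$ with the full softmax (toy random vectors, my own example). The denominator sums over the entire vocabulary, which is exactly the cost that negative sampling is introduced to avoid:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 100                        # toy vocabulary size and embedding dimension
U = rng.normal(scale=0.1, size=(V, d))   # context vectors u_w
v_c = rng.normal(scale=0.1, size=d)      # center vector
o = 42                                   # index of the outside word

scores = U @ v_c                         # u_w^T v_c for all w: O(V * d) work per prediction
p_o_given_c = np.exp(scores[o]) / np.sum(np.exp(scores))
print(p_o_given_c)
```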