CS224N Lecture Note (1)

Word Representation

One-hot vector

Denotational semantics: the concept of representing an idea as a symbol (a word or a one-hot vector). This representation is sparse and cannot capture similarity between words.
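
As a tiny illustration (the toy vocabulary below is an assumption for this sketch), one-hot vectors are mutually orthogonal, so their dot product carries no information about how related two words are:

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a |V|-dimensional one-hot vector.
vocab = ["hotel", "motel", "cat"]
V = len(vocab)
one_hot = {w: np.eye(V)[i] for i, w in enumerate(vocab)}

# "hotel" and "motel" are related words, but their one-hot vectors are orthogonal:
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0 -- no similarity is captured
```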

SVD Based Methods

  1. Loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix $X$
  2. Perform SVD on $X$ to get a $USV^T$ decomposition
  3. Use the rows of $U$ as the word embeddings for all words in our dictionary (see the sketch below)
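
A minimal numpy sketch of steps 2-3, assuming a co-occurrence matrix `X` (built as described below) is already available; the function name and the optional scaling by the singular values are illustrative choices:

```python
import numpy as np

def svd_embeddings(X, k):
    """Return k-dimensional word embeddings from a |V| x M co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Row i of U (truncated to the first k singular directions) is the embedding of word i.
    return U[:, :k] * S[:k]  # scaling by the singular values is optional

# Usage with toy numbers:
X = np.random.rand(5, 7)
embeddings = svd_embeddings(X, k=2)  # shape (5, 2)
```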

The following are a few choices of $X$:

Word-Document Matrix

Distributional semantics: The concept of representing the meaning of a word based on the context in which it usually appears.

Basic conjecture:

words that are related will often appear in the same documents

Building manner:
  • Loop over billions of documents and, each time word $i$ appears in document $j$, add one to entry $X_{ij}$.
Shortcomings
  • the matrix is very large ($X\in\mathbb{R}^{|V|\times M}$)
  • it scales with the number of documents ($M$)

Window based Co-occurrence Matrix

Similar to the Word-Document Matrix.

Building manner
  • the matrix $X$ stores word-word co-occurrence counts, so it is an affinity matrix
  • count the number of times each word appears inside a window of a particular size around the word of interest
  • calculate this count for all the words in the corpus (see the sketch below)
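
A minimal sketch of building such a window-based co-occurrence matrix; the toy corpus and the window size are assumptions for illustration:

```python
import numpy as np
from itertools import chain

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # how many words on each side count as context

vocab = sorted(set(chain.from_iterable(corpus)))
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        # Add one count for every word within `window` positions of w (excluding w itself).
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1
```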

Applying PCA to the co-occurrence matrix

There is not much to say beyond standard PCA itself, but a few problems have to be mentioned.

Advantages

Makes efficient use of the corpus statistics.

Shortcomings
  • The dimensions of the matrix change very often (new words are added very frequently)
  • The matrix is extremely sparse since most words do not co-occur
  • The matrix is very high dimensional in general (the size of the vocabulary is large)
  • The imbalance in word frequency is drastic
Some solutions
  • Ignore function words (stop words)
  • Apply a ramp window (i.e. weight the co-occurrence count based on the distance between the words in the document)
  • Use Pearson correlation and set negative counts to 0 instead of using just the raw counts

Iteration Based Methods - Word2vec

Basic idea
  1. design a model whose parameters are the word vectors
  2. train the model on a certain objective

The rest is just standard backpropagation: in each iteration, run a forward pass, evaluate the error against the objective, and update the parameters that caused the error.

Iteration based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly.

Basic conjecture

This model is based on the hypothesis that similar words appear in similar contexts.

What is Word2vec

Word2vec is actually a software package that includes:

  • 2 algorithms:
    • CBOW: predict a center word from the surrounding context in terms of word vectors
    • skip-gram: predict the distribution ( probability ) of context words from a center word
  • 2 training methods
    • negative sampling: define an objective by sampling negative examples
    • hierarchical softmax: define an objective using an efficient tree structure to compute probabilities for all the vocabulary

Language Models (Unigrams, Bigrams, etc.)

Unigram model

take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent
$$P(w_1,w_2,\cdots,w_n)=\prod_{i=1}^n P(w_i)$$

Shortcomings

We know that the next word is highly contingent upon the previous sequence of words.

Bigram model

let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it.
$$P(w_1,w_2,\cdots,w_n)=\prod_{i=2}^n P(w_i\mid w_{i-1})$$
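
A small sketch of estimating both kinds of probabilities by counting (maximum-likelihood estimates; the toy corpus and helper names are assumptions):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat sat".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_unigram(w):
    return unigram[w] / N

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigram[(prev, w)] / unigram[prev]

print(p_unigram("cat"))        # 2/9
print(p_bigram("cat", "the"))  # 2/3 -- the bigram model captures some local dependence
```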

CBOW (Continuous Bag of Words Model)

For each word, we want to learn 2 vectors:

  • $v$: the input vector, used when the word is in the context
  • $u$: the output vector, used when the word is the center word
Parameters
  • Known parameters:

    • $x^{(c)}$: the input context, represented by one-hot word vectors
    • $y^{(c)}$: the output (since we only have one output, we just call it $y$; it is the one-hot vector of the known center word)
  • Unknown parameters:

    • $n$: an arbitrary size which defines the size of our embedding space
    • $\mathcal V\in\mathbb{R}^{n\times|V|}$: the input word matrix, such that the $i$-th column of $\mathcal V$ is the $n$-dimensional embedded vector for word $w_i$ when it is an input of the model. We denote this $n\times 1$ vector as $v_i$.
    • $\mathcal U\in\mathbb{R}^{|V|\times n}$: the output word matrix, such that the $j$-th row of $\mathcal U$ is the $n$-dimensional embedded vector for word $w_j$ when it is an output of the model. We denote this row of $\mathcal U$ as $u_j$.
Steps
  1. We generate the one-hot word vectors for the input context of size $m$: $(x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)})$, each vector $x\in\mathbb{R}^{|V|}$
  2. We get our embedded word vectors for the context: $(v_{c-m}=\mathcal V x^{(c-m)},\, v_{c-m+1}=\mathcal V x^{(c-m+1)},\,\cdots,\, v_{c+m}=\mathcal V x^{(c+m)})$, each vector $v\in\mathbb{R}^n$
  3. Average these vectors to get $\hat v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\in\mathbb{R}^n$
  4. Generate a score vector $z=\mathcal U\hat v\in\mathbb{R}^{|V|}$. Since the dot product of similar vectors is higher, this pushes similar words close to each other in order to achieve a high score
  5. Turn the scores into probabilities $\hat y=\mathrm{softmax}(z)\in\mathbb{R}^{|V|}$
  6. We want the generated probabilities $\hat y$ to match the true probabilities $y$, which is the one-hot vector of the actual center word
Train Method
  • Loss function: cross entropy
  • Optimization method: SGD

Since the output layer is a softmax, the objective is:
$$\begin{aligned} \text{minimize } J &= -\log P(w_c\mid w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m})\\ &= -\log P(u_c\mid \hat v)\\ &= -\log\frac{\exp(u_c^T\hat v)}{\sum_{j=1}^{|V|}\exp(u_j^T\hat v)}\\ &= -u_c^T\hat v+\log\sum_{j=1}^{|V|}\exp(u_j^T\hat v) \end{aligned}$$
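
A minimal numpy sketch of one CBOW forward pass and this loss, with shapes following the definitions above (the toy sizes, random initialization, and word indices are assumptions):

```python
import numpy as np

V_size, n, m = 10, 4, 2            # |V|, embedding size, window radius
Vmat = np.random.randn(n, V_size)  # input word matrix  (columns are the v_i)
Umat = np.random.randn(V_size, n)  # output word matrix (rows are the u_j)

context_ids = [1, 2, 4, 5]         # indices of the 2m context words
center_id = 3                      # index of the center word

v_hat = Vmat[:, context_ids].mean(axis=1)  # steps 2-3: average the context embeddings
z = Umat @ v_hat                           # step 4: scores, shape (|V|,)
y_hat = np.exp(z) / np.exp(z).sum()        # step 5: softmax
loss = -np.log(y_hat[center_id])           # cross entropy against the one-hot target
```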

Skip-Gram Model

Parameters
  • input: the one-hot vector of the center word, represented by $x$ (since there is only one)
  • output: $y^{(j)}$
  • $\mathcal V$ and $\mathcal U$ are the same as in CBOW
Steps
  1. We generate our one-hot input vector $x\in\mathbb{R}^{|V|}$ of the center word
  2. We get our embedded word vector for the center word $v_c=\mathcal V x\in\mathbb{R}^n$
  3. Generate a score vector $z=\mathcal U v_c$
  4. Turn the score vector into probabilities, $\hat y=\mathrm{softmax}(z)$. Note that $\hat y_{c-m},\cdots,\hat y_{c-1},\hat y_{c+1},\cdots,\hat y_{c+m}$ are the probabilities of observing each context word
  5. We want our generated probability vector to match the true probabilities $y^{(c-m)},\cdots,y^{(c-1)},y^{(c+1)},\cdots,y^{(c+m)}$, the one-hot vectors of the actual context words
Train Method
  • Optimization method: SGD
  • Loss function: cross entropy

We invoke a Naive Bayes assumption (the context words are conditionally independent given the center word) to break out the probabilities, so our objective function becomes:
$$\begin{aligned} \text{minimize } J &= -\log P(w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m}\mid w_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}P(w_{c-m+j}\mid w_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}P(u_{c-m+j}\mid v_c)\\ &= -\log\prod_{j=0,j\neq m}^{2m}\frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|}\exp(u_k^T v_c)}\\ &= -\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^T v_c+2m\log\sum_{k=1}^{|V|}\exp(u_k^T v_c) \end{aligned}$$
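
A minimal numpy sketch of one Skip-Gram forward pass and this loss, under the same assumptions as the CBOW sketch (toy sizes, random initialization, chosen indices):

```python
import numpy as np

V_size, n, m = 10, 4, 2
Vmat = np.random.randn(n, V_size)    # input word matrix
Umat = np.random.randn(V_size, n)    # output word matrix

center_id = 3
context_ids = [1, 2, 4, 5]           # the 2m context words to predict

v_c = Vmat[:, center_id]             # step 2: center word embedding
z = Umat @ v_c                       # step 3: scores
y_hat = np.exp(z) / np.exp(z).sum()  # step 4: softmax over the vocabulary
# Naive Bayes assumption: sum the -log probability of each context word given the center word.
loss = -sum(np.log(y_hat[j]) for j in context_ids)
```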

Negative Sampling

Computing the loss for CBOW and Skip-Gram is costly because the vocabulary size ($|V|$) is large.

For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples. We sample from a noise distribution ($P_n(w)$) whose probabilities match the ordering of the word frequencies in the vocabulary. To augment our formulation of the problem to incorporate negative sampling, all we need to do is update the:

  • objective function
  • gradients
  • update rules
Notation

Consider a pair $(w,c)$ of a word and a context.

  • $P(D=1\mid w,c)$: the probability that $(w,c)$ came from the corpus data (i.e., it is a genuine word-context pair)
  • $P(D=0\mid w,c)$: the probability that $(w,c)$ did not come from the corpus data (i.e., it is not a genuine pair)
Train Method

The probability is modeled with a sigmoid:
$$P(D=1\mid w,c,\theta)=\sigma(v_c^T v_w)=\frac{1}{1+e^{-v_c^T v_w}}$$
Here we take $\theta$ to be the parameters of the model; in our case they are $\mathcal V$ and $\mathcal U$.

So the objective is to maximize the probability that genuine pairs are labeled as coming from the corpus and that negative pairs are labeled as not coming from it (roughly, maximizing $TP\times TN$):
$$\begin{aligned} \theta &= \mathop{\arg\max}_\theta\prod_{(w,c)\in D}P(D=1\mid w,c,\theta)\prod_{(w,c)\in\tilde D}P(D=0\mid w,c,\theta)\\ &= \mathop{\arg\max}_\theta\prod_{(w,c)\in D}P(D=1\mid w,c,\theta)\prod_{(w,c)\in\tilde D}\bigl(1-P(D=1\mid w,c,\theta)\bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log P(D=1\mid w,c,\theta)+\sum_{(w,c)\in\tilde D}\log\bigl(1-P(D=1\mid w,c,\theta)\bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^T v_c)}+\sum_{(w,c)\in\tilde D}\log\Bigl(1-\frac{1}{1+\exp(-u_w^T v_c)}\Bigr)\\ &= \mathop{\arg\max}_\theta\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u_w^T v_c)}+\sum_{(w,c)\in\tilde D}\log\frac{1}{1+\exp(u_w^T v_c)}\\ &= \mathop{\arg\min}_\theta\sum_{(w,c)\in D}-\log\frac{1}{1+\exp(-u_w^T v_c)}-\sum_{(w,c)\in\tilde D}\log\frac{1}{1+\exp(u_w^T v_c)} \end{aligned}$$
$\tilde D$ is a "false" or "negative" corpus, containing unnatural sentences like "stock boil fish is toy" that should have a very low probability of ever occurring. We can generate $\tilde D$ on the fly by randomly sampling negative pairs from the word bank.
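
A minimal sketch of this objective for a single $(w,c)$ pair with $K$ sampled negative output vectors (the function and argument names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_w, v_c, neg_us):
    """-log sigma(u_w^T v_c) - sum_k log sigma(-u_k^T v_c) over K negative vectors."""
    pos = -np.log(sigmoid(u_w @ v_c))
    neg = -sum(np.log(sigmoid(-u_k @ v_c)) for u_k in neg_us)
    return pos + neg

# Usage: u_w and v_c are n-dimensional vectors; neg_us holds K sampled output vectors.
n, K = 4, 5
loss = neg_sampling_loss(np.random.randn(n), np.random.randn(n),
                         [np.random.randn(n) for _ in range(K)])
```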

For CBOW

Our new objective function for observing the center word $u_c$ given the context vector $\hat v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}$ would be:
$$-\log\sigma(u_c^T\hat v)-\sum_{k=1}^K\log\sigma(-\tilde u_k^T\hat v)$$

For skip-gram

Our new objective function for observing the context word $c-m+j$ given the center word $c$ would be:
$$-\log\sigma(u_{c-m+j}^T v_c)-\sum_{k=1}^K\log\sigma(-\tilde u_k^T v_c)$$
$\{\tilde u_k\mid k=1\cdots K\}$ are sampled from the noise distribution $P_n(w)$.

Tiny trick

While there is much discussion of what makes the best choice of $P_n(w)$, what seems to work best is the unigram distribution raised to the power of $\frac{3}{4}$.
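
A minimal sketch of building this noise distribution and sampling negatives from it (the toy corpus and $K=5$ are assumptions):

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat the cat sat".split()
counter = Counter(tokens)
words = list(counter.keys())
counts = np.array([counter[w] for w in words], dtype=float)

# Unigram distribution raised to the 3/4 power, then renormalized.
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

negatives = np.random.choice(words, size=5, p=p_noise)  # sample K = 5 negative words
```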

Hierarchical Softmax

In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling tends to be better for frequent words and lower dimensional vectors.

Core idea

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from the root to each leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.

In this model, the probability of a word $w$ given an input word $w_i$, $P(w\mid w_i)$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$.

Notation
  • $L(w)$: the number of nodes in the path from the root to the leaf $w$ (the root is included, the leaf $w$ is excluded)
  • $n(w,i)$: the $i$-th node on this path, with associated vector $v_{n(w,i)}$ ($i$ starts from 1, so $n(w,1)$ is the root)
  • $ch(n)$: for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left child)
Train method

$$P(w\mid w_i)=\prod_{j=1}^{L(w)-1}\sigma\bigl([n(w,j+1)=ch(n(w,j))]\cdot v_{n(w,j)}^T v_{w_i}\bigr)\\ \text{where }[x]=\begin{cases}1&\text{if }x\text{ is true}\\ -1&\text{otherwise}\end{cases}$$

explanation

  • $\prod$: the product runs over the nodes on the path from the root $n(w,1)$ to the leaf $w$.

  • $[n(w,j+1)=ch(n(w,j))]$: if we assume $ch(n)$ is always the left child of $n$, then this term returns 1 when the path goes left and -1 when it goes right. Furthermore, this term provides normalization: at a node $n$, if we sum the probabilities of going to the left and to the right child, then for any value of $v_n^T v_{w_i}$ we have $\sigma(v_n^T v_{w_i})+\sigma(-v_n^T v_{w_i})=1$ (recall the shape of the sigmoid function). The normalization also ensures that $\sum_{w=1}^{|V|}P(w\mid w_i)=1$, just as in the original softmax.

  • $v_{n(w,j)}^T v_{w_i}$: compares the similarity of our input vector $v_{w_i}$ to each inner-node vector $v_{n(w,j)}$ using a dot product.

example

[Figure: a binary tree over the vocabulary; the path from the root to the leaf $w_2$ passes through the inner nodes $n(w_2,1)$, $n(w_2,2)$, and $n(w_2,3)$.]

Taking $w_2$ as an example, we must take two left edges and then a right edge to reach $w_2$ from the root, so
$$\begin{aligned} P(w_2\mid w_i)&=p(n(w_2,1),\text{left})\cdot p(n(w_2,2),\text{left})\cdot p(n(w_2,3),\text{right})\\ &=\sigma(v_{n(w_2,1)}^T v_{w_i})\cdot\sigma(v_{n(w_2,2)}^T v_{w_i})\cdot\sigma(-v_{n(w_2,3)}^T v_{w_i}) \end{aligned}$$
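
A minimal sketch of computing $P(w\mid w_i)$ from a given root-to-leaf path, mirroring the $w_2$ example above (the node vectors and the left/right signs along the path are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_vectors, path_signs, v_wi):
    """Product of sigma(sign * v_n^T v_wi) over the inner nodes on the root-to-leaf path.
    path_signs[j] is +1 if the path goes to ch(n) (e.g. the left child) at node j, else -1."""
    p = 1.0
    for v_n, sign in zip(path_vectors, path_signs):
        p *= sigmoid(sign * (v_n @ v_wi))
    return p

# In the spirit of the w_2 example: two left edges, then a right edge.
n_dim = 4
path_vectors = [np.random.randn(n_dim) for _ in range(3)]
p_w2 = hs_probability(path_vectors, [+1, +1, -1], np.random.randn(n_dim))
```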
More details

train objective:

Our goal is still to minimize the negative log likelihood $-\log P(w\mid w_i)$. But instead of updating an output vector per word, we update the vectors of the nodes in the binary tree that lie on the path from the root to the leaf node.

speed of training:

Determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree (see the sketch below).
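
A minimal sketch of building such a Huffman coding from word frequencies with `heapq`, returning each word's path length; the word counts are toy numbers, and word2vec's actual tree-building code differs in detail:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return the path (code) length of each word in a Huffman tree built from its frequency."""
    tie = itertools.count()  # unique tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one edge above every word inside them.
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

print(huffman_code_lengths({"the": 100, "cat": 10, "sat": 10, "mat": 1}))
# Frequent words ("the") get shorter paths; rare words ("mat") get longer ones.
```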
