1. SVD-Based Methods
1.1 Word-Document Matrix
In this method we loop over the documents in the corpus and, each time word i appears in document j, add one to entry X_ij of a |V| × M word-document matrix (where M is the number of documents).
1.2 Window-Based Co-occurrence Matrix
In this method we count the number of times each word appears inside a window of a particular size around the word of interest, and we compute this count for every word in the corpus.
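As a concrete illustration, here is a minimal Python sketch of building such a window-based co-occurrence matrix; the toy corpus, the window size, and the variable names are illustrative assumptions, not part of the method description.

```python
import numpy as np

# Toy corpus and window size (illustrative assumptions)
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # count words within 1 position of the word of interest

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i][j] counts how often word j appears within the window around word i
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[word], idx[sent[j]]] += 1
```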
1.3 Advantages: Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic information about words.
1.4 Shortcomings:
★ The dimensions of the matrix change very often (new words are added frequently and the corpus changes in size)
★ The matrix is extremely sparse since most words do not co-occur
★ The matrix is very high dimensional in general
★ Quadratic cost to train (i.e., to perform SVD; sketched after this list)
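Despite these shortcomings, the SVD step itself is short to express. Below is a minimal sketch of reducing the co-occurrence matrix to k-dimensional word vectors; the stand-in matrix and the choice of k are assumptions for illustration.

```python
import numpy as np

# Stand-in for the |V|×|V| co-occurrence matrix X from the sketch above,
# so this snippet runs on its own (illustrative assumption)
X = np.random.rand(8, 8)

# Full SVD of X; singular values in S come sorted in decreasing order
U_svd, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                        # number of singular dimensions to keep (assumption)
word_vectors = U_svd[:, :k]  # row i is the k-dimensional vector for word i
```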
2. Iteration-Based Methods
2.1 CBOW Model
▴ key idea: Predicting a center word from the surrounding context
▴ unknowns: two matrices, V ∈ R^{n×|V|} and U ∈ R^{|V|×n}
▴ Notation for CBOW Model:
w_i : word i from vocabulary V
V ∈ R^{n×|V|} : input word matrix
v_i : i-th column of V, the input vector representation of word w_i
U ∈ R^{|V|×n} : output word matrix
u_i : i-th row of U, the output vector representation of word w_i
▴ Steps:
We generate our one-hot word vectors (x^{(c−m)}, …, x^{(c−1)}, x^{(c+1)}, …, x^{(c+m)}) for the input context of size m.
We get our embedded word vectors for the context: (v_{c−m} = Vx^{(c−m)}, v_{c−m+1} = Vx^{(c−m+1)}, …, v_{c+m} = Vx^{(c+m)})
Average these vectors to get v̂ = (v_{c−m} + v_{c−m+1} + … + v_{c+m}) / (2m)
Generate a score vector z = Uv̂
Turn the scores into probabilities ŷ = softmax(z)
We desire the probabilities we generate, ŷ, to match the true probabilities, y, which is the one-hot vector of the actual center word (these steps are sketched in code below).
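Putting the steps together, here is a minimal numpy sketch of one CBOW forward pass; the dimensions, the random initialization, and the example indices are illustrative assumptions.

```python
import numpy as np

n, vocab_size = 50, 1000                    # embedding dim and |V| (assumptions)
V = np.random.randn(n, vocab_size) * 0.01   # input word matrix,  V ∈ R^{n×|V|}
U = np.random.randn(vocab_size, n) * 0.01   # output word matrix, U ∈ R^{|V|×n}

context_idx = [3, 17, 42, 8]   # indices of the 2m context words (assumption)
center_idx = 5                 # index of the true center word (assumption)

# Steps 1-3: multiplying V by a one-hot vector just selects a column of V,
# so we index the columns directly, then average the context embeddings
v_hat = V[:, context_idx].mean(axis=1)

# Step 4: score vector z = U v̂
z = U @ v_hat

# Step 5: softmax turns the scores into probabilities ŷ
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# Step 6: cross-entropy loss between ŷ and the one-hot true center word
loss = -np.log(y_hat[center_idx])
```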
2.2 Skip-Gram Model
▴ key idea: Predicting surrounding context words given a center word
▴ steps:
We generate our one-hot input vector x for the center word
We get our embedded word vector for the center word: v_c = Vx
Since there is no averaging, just set v̂ = v_c
Generate 2m score vectors, u_{c−m}, …, u_{c−1}, u_{c+1}, …, u_{c+m}, using u = Uv_c
Turn each of the scores into probabilities, ŷ = softmax(u)
We desire the probability vector we generate to match the true probabilities, y^{(c−m)}, …, y^{(c−1)}, y^{(c+1)}, …, y^{(c+m)}, the one-hot vectors of the actual output words (sketched in code below).
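Analogously, here is a minimal numpy sketch of one skip-gram forward pass; as in the CBOW sketch, the dimensions, initialization, and example indices are illustrative assumptions.

```python
import numpy as np

n, vocab_size = 50, 1000                    # embedding dim and |V| (assumptions)
V = np.random.randn(n, vocab_size) * 0.01   # input word matrix
U = np.random.randn(vocab_size, n) * 0.01   # output word matrix

center_idx = 5                 # index of the center word (assumption)
context_idx = [3, 17, 42, 8]   # indices of the 2m true context words

# Steps 1-3: look up the center word's embedding; no averaging, so v̂ = v_c
v_hat = V[:, center_idx]

# Step 4: the same score vector u = U v̂ is shared by all 2m context positions
u = U @ v_hat

# Step 5: softmax turns the scores into probabilities ŷ
y_hat = np.exp(u - u.max())
y_hat /= y_hat.sum()

# Step 6: sum of cross-entropy losses over the 2m true context words
loss = -sum(np.log(y_hat[j]) for j in context_idx)
```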