Lecture 1: Word Vectors


1. SVD Based Methods:

  • Word-Document Co-occurrence Matrix: gives general topics, leading to "Latent Semantic Analysis"; the matrix is large, \(\mathbb{R}^{V\times M}\), and scales with the number of documents \(M\)
  • Window-based Co-occurrence Matrix: a window around each word captures both syntactic (POS) and semantic information
    The Methods:
  • perform SVD on the matrix \(X=USV^T\)
  • cut the singular values off at some index k based on the desired percentage variance captured
  • take the submatrix \(U_{1:V,1:k}\) as the word embedding matrix
  • this gives a k-dimensional representation of every word (see the code sketch below)
    The Problems:
  • the dimensions of the matrix change very often (new words are added and the corpus changes in size)
  • the matrix is extremely sparse
  • quadratic cost to train (i.e. to perform the SVD)
  • requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency
    Some Solutions:
  • ignore function words
  • apply a ramp window, i.e. weight the co-occurrence count based on distance between the words
  • use Pearson correlations instead of counts, then set negative counts to 0
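
A minimal sketch of the window-based co-occurrence plus truncated-SVD procedure, assuming a tiny toy corpus, a window of size 1, and \(k=2\) (all illustrative choices, not prescribed above):

```python
import numpy as np

# Toy corpus and vocabulary (illustrative only).
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 1

# Build the window-based co-occurrence matrix X.
X = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# X = U S V^T; cut the singular values off at k and keep U_{1:V,1:k}.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k]            # k-dimensional representation of every word
print(embeddings[idx["deep"]])
```

In practice the fixes above (ignoring function words, ramp windows, Pearson correlations) would be applied to \(X\) before taking the SVD.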

2. Iteration-Based Methods:

Backpropagation:
For each iteration (see the sketch after this list), we

  • run model
  • evaluate errors
  • update model
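
A minimal sketch of this run/evaluate/update loop, using a toy one-parameter model (the model, target, and learning rate are illustrative assumptions, not from the notes):

```python
# Fit a single weight w so that w * 3 ≈ 6 by gradient descent.
w, lr = 0.0, 0.1
for step in range(100):
    pred = w * 3.0        # run model
    err = pred - 6.0      # evaluate the error
    grad = err * 3.0      # gradient of 0.5 * err**2 with respect to w
    w -= lr * grad        # update the model
print(w)                  # converges to 2.0
```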

2.1: Bigram model:
\[p(w_1,w_2,\cdots,w_n)=\Pi_{i=2}^np(w_i|w_{i-1})\]
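As a concrete instance of this factorization (an illustrative three-word sentence):
\[p(\text{the},\text{cat},\text{sat})=p(\text{cat}\mid\text{the})\,p(\text{sat}\mid\text{cat})\]
as in the product above, the marginal \(p(w_1)\) is dropped.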

Continuous Bag of Words Model (CBOW):

  • known parameters: the sentence represented by one-hot vectors; the context words are the input \(x^{(c)}\) and the center word is the output \(y^{(c)}\)
  • unknowns:
    * \(w_i\): word \(i\) from vocabulary \(V\)
    * \(\mathcal{V}\in\mathbb{R}^{n\times V}\): input word matrix
    * \(v_i\): \(i\)-th column of \(\mathcal{V}\), the input vector representation of word \(w_i\)
    * \(\mathcal{U}\in\mathbb{R}^{V\times n}\): output word matrix
    * \(u_i\): \(i\)-th row of \(\mathcal{U}\), the output vector representation of word \(w_i\)

Procedure:

  • generate one hot vectors \((x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)})\) for the input context of size \(m\)
  • get the embedded word vectors for the context \((v_i=\mathcal{V}x^{(i)})\)
  • average these vectors to get \(\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\)
  • generate a score vector \(z=\mathcal{U}\hat{v}\)
  • turn the scores into probabilities: \(\hat{y}=softmax(z)\)
  • we desire the generated probabilities \(\hat{y}\) to match the true probabilities \(y\), which happens to be the one hot vector of the actual word

Learn Weight Matrices:

  • use cross entropy \(H(\hat{y},y)=-\sum_{j=1}^Vy_j\log(\hat{y}_j)\) as the loss function
  • formulate the optimization objective as: \(\min J=-\log p(w_c|w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m})=-\log p(u_c|\hat{v})=-\log\frac{exp(u_c^T\hat{v})}{\sum_{j=1}^Vexp(u_j^T\hat{v})}\)
    \(=-u_c^T\hat{v}+\log\sum_{j=1}^Vexp(u_j^T\hat{v})\)
  • use stochastic gradient descent to update \(u\) and \(v\)
  • because the gradient is very sparse, either keep a hash map for the word vectors that actually appear or update only the relevant columns of \(\mathcal{V}\) and rows of \(\mathcal{U}\) (see the sketch below)
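
A minimal sketch of one CBOW forward pass, its cross-entropy loss, and a single SGD update, assuming toy sizes \(V=10\), \(n=4\), \(m=2\), a learning rate of 0.05, and hypothetical word indices (none of these values come from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m, lr = 10, 4, 2, 0.05

Vmat = rng.normal(size=(n, V))   # input word matrix (column i is v_i)
Umat = rng.normal(size=(V, n))   # output word matrix (row i is u_i)

context_ids = [1, 3, 5, 7]       # indices of the 2m context words (hypothetical)
center_id = 4                    # index of the center word (hypothetical)

# Forward pass: one-hot lookup, average, score, softmax.
v_hat = Vmat[:, context_ids].mean(axis=1)   # \hat{v}
z = Umat @ v_hat                            # score vector
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                        # \hat{y} = softmax(z)

# Loss: J = -u_c^T \hat{v} + log sum_j exp(u_j^T \hat{v}).
J = -np.log(y_hat[center_id])

# One SGD step: every row of U gets gradient (via the softmax denominator),
# but only the 2m context columns of V do.
dz = y_hat.copy()
dz[center_id] -= 1.0                        # dJ/dz = \hat{y} - y
dv_hat = Umat.T @ dz                        # gradient reaching \hat{v}
Umat -= lr * np.outer(dz, v_hat)            # update output vectors
for i in context_ids:
    Vmat[:, i] -= lr * dv_hat / (2 * m)     # update the context input vectors
print(J)
```

Repeating this forward/loss/update step over many (context, center) pairs is exactly the iteration loop described at the start of section 2.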


2.2: Skip-Gram Model:
essentially swap \(x\) and \(y\) in CBOW:

  • generate one hot input vector \(x\)
  • get the embedded word vector for the center word \(v_c=\mathcal{V}x\)
  • not averaging, just set \(\hat{v}=v_c\)
  • generate \(2m\) score vectors: \(u_{c-m},\cdots,u_{c-1},u_{c+1},\cdots,u_{c+m}\) using \(u=\mathcal{U}v_c\)
  • turn each of the scores into probabilities: \(\hat{y}=softmax(u)\)
  • we desire the generated probability vector to match the true probabilities \(y^{(c-m)},\cdots,y^{(c-1)},y^{(c+1)},\cdots,y^{(c+m)}\), the one-hot vectors of the actual output (a code sketch follows the derivation below)

Strong Conditional Independence Assumption (given the center word, all output words are independent):
\(\min J=-\log p(w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m}|w_c)=-\log\Pi_{j=0,j\neq m}^{2m}p(w_{c-m+j}|w_c)\)
\(=-\log\Pi_{j=0,j\neq m}^{2m}p(u_{c-m+j}|v_c)=-\log\Pi_{j=0,j\neq m}^{2m}\frac{exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^Vexp(u_k^Tv_c)}\)
\(=-\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^Tv_c+2m\log\sum_{k=1}^Vexp(u_k^Tv_c)\)
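
A minimal sketch of this skip-gram objective for a single center word, with the same illustrative sizes as the CBOW sketch (\(V=10\), \(n=4\), \(m=2\)) and hypothetical word indices:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m = 10, 4, 2
Vmat = rng.normal(size=(n, V))   # input word matrix
Umat = rng.normal(size=(V, n))   # output word matrix

center_id = 4                    # the single input word (hypothetical)
context_ids = [1, 3, 5, 7]       # the 2m surrounding output words (hypothetical)

v_c = Vmat[:, center_id]         # no averaging: \hat{v} = v_c
z = Umat @ v_c                   # one score per vocabulary word
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()             # softmax over the scores

# J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
J = -np.sum(np.log(y_hat[context_ids]))
print(J)
```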

2.3: Negative Sampling

  • approximate the term \(\sum_{k=1}^Vexp(u_k^Tv_c)\)
  • instead of looping over the entire vocabulary, just sample several negative examples
  • take \(D\) to be the corpus of observed (word, context) pairs, \(\tilde{D}\) a "false" corpus of sampled negative pairs, and \(\theta\) the parameters of the model, where \(p(D=1|w,c,\theta)\) is the probability that \((w,c)\) came from the corpus data:
    \(\theta=arg\max_{\theta}\Pi_{(w,c)\in D}p(D=1|w,c,\theta)\Pi_{(w,c)\in\tilde{D}}p(D=0|w,c,\theta)\)
    \(=arg\max_{\theta}\Pi_{(w,c)\in D}p(D=1|w,c,\theta)\Pi_{(w,c)\in\tilde{D}}(1-p(D=1|w,c,\theta))\)
    \(=arg\max_{\theta}\sum_{(w,c)\in D}\log p(D=1|w,c,\theta)+\sum_{(w,c)\in\tilde{D}}\log(1-p(D=1|w,c,\theta))\)
    \(=arg\max_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+exp(-u_w^Tv_c)}+\sum_{(w,c)\in\tilde{D}}\log(\frac{1}{1+exp(u_w^Tv_c)})\)
  • objective function:
    \(\log\sigma(u_{c-m+j}^Tv_c)+\sum_{k=1}^K\log\sigma(-\tilde{u}_k^Tv_c)\)
    where \(\{\tilde{u}_k|k=1,\cdots,K\}\) are sampled from \(P_n(w)\)
  • what seems to work best for \(P_n(w)\) is the unigram distribution raised to the power of \(\frac{3}{4}\) (see the sketch below)
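
A minimal sketch of the negative-sampling loss for one (center, context) pair, drawing \(K\) negatives from a unigram distribution raised to the \(\frac{3}{4}\) power; the counts, sizes, and indices below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, K = 10, 4, 5
Vmat = rng.normal(size=(n, V))   # input word matrix
Umat = rng.normal(size=(V, n))   # output word matrix

counts = rng.integers(1, 100, size=V).astype(float)   # hypothetical unigram counts
P_n = counts ** 0.75
P_n /= P_n.sum()                                       # noise distribution P_n(w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center_id, context_id = 4, 1                           # hypothetical word indices
neg_ids = rng.choice(V, size=K, p=P_n)                 # K sampled negatives

v_c = Vmat[:, center_id]
# Negative of the objective above, i.e. the loss that is minimized:
# -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
J = -(np.log(sigmoid(Umat[context_id] @ v_c))
      + np.sum(np.log(sigmoid(-Umat[neg_ids] @ v_c))))
print(J)
```

The sum over the \(K\) sampled vectors stands in for the full \(\sum_{k=1}^Vexp(u_k^Tv_c)\) normalization, which is what makes each update cheap.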

Reposted from: https://www.cnblogs.com/cihui/p/6403591.html
