Lecture 1: Word Vectors


1. SVD Based Methods:

  • Word-Document Co-occurrence Matrix: gives general topics, leading to "Latent Semantic Analysis"; the matrix is large, \(\mathbb{R}^{V\times M}\), and scales with the number of documents \(M\)
  • Window-based Co-occurrence Matrix: a window around each word captures both syntactic (POS) and semantic information
    The Methods:
  • perform SVD on the matrix \(X=USV^T\)
  • cut the singular values off at some index k based on the desired percentage variance captured
  • take the submatrix \(U_{1:V,1:k}\) as the word embedding matrix
  • this gives a k-dimensional representation of every word (see the code sketch below)
    The Problems:
  • the dimensions of the matrix change very often (new words are added and the corpus changes in size)
  • the matrix is extremely sparse
  • quadratic cost to train (i.e. to perform the SVD)
  • requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency
    Some Solutions:
  • ignore function words
  • apply a ramp window, i.e. weight the co-occurrence count based on distance between the words
  • use Pearson correlations instead of counts, then set negative counts to 0
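
A minimal sketch of the window-based co-occurrence plus truncated-SVD procedure, assuming a tiny toy corpus, a window of size 1, and \(k=2\) (all illustrative choices, not prescribed above):

```python
import numpy as np

# Toy corpus and vocabulary (illustrative only).
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 1

# Build the window-based co-occurrence matrix X.
X = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# X = U S V^T; cut the singular values off at k and keep U_{1:V,1:k}.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k]            # k-dimensional representation of every word
print(embeddings[idx["deep"]])
```

In practice the fixes above (ignoring function words, ramp windows, Pearson correlations) would be applied to \(X\) before taking the SVD.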

2. Iteration-Based Methods:

Backpropagation:
For each iteration (see the sketch after this list), we

  • run model
  • evaluate errors
  • update model
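
A minimal sketch of this run/evaluate/update loop, using a toy one-parameter model (the model, target, and learning rate are illustrative assumptions, not from the notes):

```python
# Fit a single weight w so that w * 3 ≈ 6 by gradient descent.
w, lr = 0.0, 0.1
for step in range(100):
    pred = w * 3.0        # run model
    err = pred - 6.0      # evaluate the error
    grad = err * 3.0      # gradient of 0.5 * err**2 with respect to w
    w -= lr * grad        # update the model
print(w)                  # converges to 2.0
```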

2.1: Bigram model:
\[p(w_1,w_2,\cdots,w_n)=\Pi_{i=2}^np(w_i|w_{i-1})\]
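As a concrete instance of this factorization (an illustrative three-word sentence):
\[p(\text{the},\text{cat},\text{sat})=p(\text{cat}\mid\text{the})\,p(\text{sat}\mid\text{cat})\]
as in the product above, the marginal \(p(w_1)\) is dropped.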

Continuous Bag of Words Model (CBOW):

  • known parameters: the sentence represented by one-hot vectors; the context words are the input \(x^{(c)}\) and the center word is the output \(y^{(c)}\)
  • unknowns:
    * \(w_i\): word \(i\) from vocabulary \(V\)
    * \(\mathcal{V}\in\mathbb{R}^{n\times V}\): input word matrix
    * \(v_i\): \(i\)-th column of \(\mathcal{V}\), the input vector representation of word \(w_i\)
    * \(\mathcal{U}\in\mathbb{R}^{V\times n}\): output word matrix
    * \(u_i\): \(i\)-th row of \(\mathcal{U}\), the output vector representation of word \(w_i\)

Procedure:

  • generate one hot vectors \((x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)})\) for the input context of size \(m\)
  • get the embedded word vectors for the context \((v_i=\mathcal{V}x^{(i)})\)
  • average these vectors to get \(\hat{v}=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\)
  • generate a score vector \(z=\mathcal{U}\hat{v}\)
  • turn the scores into probabilities: \(\hat{y}=softmax(z)\)
  • we desire the generated probabilities \(\hat{y}\) to match the true probabilities \(y\), which happens to be the one hot vector of the actual word

Learn Weight Matrices:

  • use cross entropy \(H(\hat{y},y)=-\sum_{j=1}^Vy_j\log(\hat{y}_j)\) as the loss function
  • formulate the optimization objective as: \(\min J=-\log p(w_c|w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m})=-\log p(u_c|\hat{v})=-\log\frac{exp(u_c^T\hat{v})}{\sum_{j=1}^Vexp(u_j^T\hat{v})}\)
    \(=-u_c^T\hat{v}+\log\sum_{j=1}^Vexp(u_j^T\hat{v})\)
  • use stochastic gradient descent to update \(u\) and \(v\)
  • because the gradient is very sparse, either keep a hash map for the word vectors that actually appear or update only the relevant columns of \(\mathcal{V}\) and rows of \(\mathcal{U}\) (see the sketch below)
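
A minimal sketch of one CBOW forward pass, its cross-entropy loss, and a single SGD update, assuming toy sizes \(V=10\), \(n=4\), \(m=2\), a learning rate of 0.05, and hypothetical word indices (none of these values come from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m, lr = 10, 4, 2, 0.05

Vmat = rng.normal(size=(n, V))   # input word matrix (column i is v_i)
Umat = rng.normal(size=(V, n))   # output word matrix (row i is u_i)

context_ids = [1, 3, 5, 7]       # indices of the 2m context words (hypothetical)
center_id = 4                    # index of the center word (hypothetical)

# Forward pass: one-hot lookup, average, score, softmax.
v_hat = Vmat[:, context_ids].mean(axis=1)   # \hat{v}
z = Umat @ v_hat                            # score vector
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                        # \hat{y} = softmax(z)

# Loss: J = -u_c^T \hat{v} + log sum_j exp(u_j^T \hat{v}).
J = -np.log(y_hat[center_id])

# One SGD step: every row of U gets gradient (via the softmax denominator),
# but only the 2m context columns of V do.
dz = y_hat.copy()
dz[center_id] -= 1.0                        # dJ/dz = \hat{y} - y
dv_hat = Umat.T @ dz                        # gradient reaching \hat{v}
Umat -= lr * np.outer(dz, v_hat)            # update output vectors
for i in context_ids:
    Vmat[:, i] -= lr * dv_hat / (2 * m)     # update the context input vectors
print(J)
```

Repeating this forward/loss/update step over many (context, center) pairs is exactly the iteration loop described at the start of section 2.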


2.2: Skip-Gram Model:
essentially swap \(x\) and \(y\) in CBOW:

  • generate one hot input vector \(x\)
  • get the embedded word vector for the center word \(v_c=\mathcal{V}x\)
  • not averaging, just set \(\hat{v}=v_c\)
  • generate \(2m\) score vectors: \(u_{c-m},\cdots,u_{c-1},u_{c+1},\cdots,u_{c+m}\) using \(u=\mathcal{U}v_c\)
  • turn each of the scores into probabilities: \(\hat{y}=softmax(u)\)
  • we desire the generated probability vector to match the true probabilities \(y^{(c-m)},\cdots,y^{(c-1)},y^{(c+1)},\cdots,y^{(c+m)}\), the one-hot vectors of the actual output (a code sketch follows the derivation below)

Strong Conditional Independence Assumption (given the center word, all output words are independent):
\(\min J=-\log p(w_{c-m},\cdots,w_{c-1},w_{c+1},\cdots,w_{c+m}|w_c)=-\log\Pi_{j=0,j\neq m}^{2m}p(w_{c-m+j}|w_c)\)
\(=-\log\Pi_{j=0,j\neq m}^{2m}p(u_{c-m+j}|v_c)=-\log\Pi_{j=0,j\neq m}^{2m}\frac{exp(u_{c-m+j}^Tv_c)}{\sum_{k=1}^Vexp(u_k^Tv_c)}\)
\(=-\sum_{j=0,j\neq m}^{2m}u_{c-m+j}^Tv_c+2m\log\sum_{k=1}^Vexp(u_k^Tv_c)\)
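
A minimal sketch of this skip-gram objective for a single center word, with the same illustrative sizes as the CBOW sketch (\(V=10\), \(n=4\), \(m=2\)) and hypothetical word indices:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, m = 10, 4, 2
Vmat = rng.normal(size=(n, V))   # input word matrix
Umat = rng.normal(size=(V, n))   # output word matrix

center_id = 4                    # the single input word (hypothetical)
context_ids = [1, 3, 5, 7]       # the 2m surrounding output words (hypothetical)

v_c = Vmat[:, center_id]         # no averaging: \hat{v} = v_c
z = Umat @ v_c                   # one score per vocabulary word
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()             # softmax over the scores

# J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
J = -np.sum(np.log(y_hat[context_ids]))
print(J)
```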

2.3: Negative Sampling

  • approximate the term \(\sum_{k=1}^Vexp(u_k^Tv_c)\)
  • instead of looping over the entire vocabulary, just sample several negative examples
  • take \(D\) to be the corpus of observed (word, context) pairs, \(\tilde{D}\) a "false" corpus of sampled negative pairs, and \(\theta\) the parameters of the model, where \(p(D=1|w,c,\theta)\) is the probability that \((w,c)\) came from the corpus data:
    \(\theta=arg\max_{\theta}\Pi_{(w,c)\in D}p(D=1|w,c,\theta)\Pi_{(w,c)\in\tilde{D}}p(D=0|w,c,\theta)\)
    \(=arg\max_{\theta}\Pi_{(w,c)\in D}p(D=1|w,c,\theta)\Pi_{(w,c)\in\tilde{D}}(1-p(D=1|w,c,\theta))\)
    \(=arg\max_{\theta}\sum_{(w,c)\in D}\log p(D=1|w,c,\theta)+\sum_{(w,c)\in\tilde{D}}\log(1-p(D=1|w,c,\theta))\)
    \(=arg\max_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+exp(-u_w^Tv_c)}+\sum_{(w,c)\in\tilde{D}}\log(\frac{1}{1+exp(u_w^Tv_c)})\)
  • objective function:
    \(\log\sigma(u_{c-m+j}^Tv_c)+\sum_{k=1}^K\log\sigma(-\tilde{u}_k^Tv_c)\)
    where \(\{\tilde{u}_k|k=1,\cdots,K\}\) are sampled from \(P_n(w)\)
  • what seems to work best for \(P_n(w)\) is the unigram distribution raised to the power of \(\frac{3}{4}\) (see the sketch below)
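
A minimal sketch of the negative-sampling loss for one (center, context) pair, drawing \(K\) negatives from a unigram distribution raised to the \(\frac{3}{4}\) power; the counts, sizes, and indices below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, K = 10, 4, 5
Vmat = rng.normal(size=(n, V))   # input word matrix
Umat = rng.normal(size=(V, n))   # output word matrix

counts = rng.integers(1, 100, size=V).astype(float)   # hypothetical unigram counts
P_n = counts ** 0.75
P_n /= P_n.sum()                                       # noise distribution P_n(w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center_id, context_id = 4, 1                           # hypothetical word indices
neg_ids = rng.choice(V, size=K, p=P_n)                 # K sampled negatives

v_c = Vmat[:, center_id]
# Negative of the objective above, i.e. the loss that is minimized:
# -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
J = -(np.log(sigmoid(Umat[context_id] @ v_c))
      + np.sum(np.log(sigmoid(-Umat[neg_ids] @ v_c))))
print(J)
```

The sum over the \(K\) sampled vectors stands in for the full \(\sum_{k=1}^Vexp(u_k^Tv_c)\) normalization, which is what makes each update cheap.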

Reposted from: https://www.cnblogs.com/cihui/p/6403591.html
