Deep Dive: Detailed Derivation and Analysis of Skip-gram

Comparison between CBOW and Skip-gram

The major difference is that Skip-gram works better for infrequent words than CBOW in word2vec. For simplicity, suppose there is a sentence "$w_1 w_2 w_3 w_4$", and the window size is $1$.

CBOW learns to predict a word given its context, i.e., it maximizes the following probability

$$P(w_2 \mid w_1, w_3) \cdot P(w_3 \mid w_2, w_4)$$

This is an issue for infrequent words, since they do not appear very often in a given context. As a result, the model assigns them low probabilities.

Skip-gram, on the other hand, learns to predict the context given a word, i.e., it maximizes the following probability

$$P(w_2 \mid w_1) \cdot P(w_1 \mid w_2) \cdot P(w_3 \mid w_2) \cdot P(w_2 \mid w_3) \cdot P(w_4 \mid w_3) \cdot P(w_3 \mid w_4)$$

In this case, an infrequent word and a frequent word are treated the same: each appears both as a target word and as a context observation. Hence, the model learns meaningful representations even for rare words.
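
To make the contrast concrete, here is a minimal Python sketch (not from the original post; the sentence and window size are the toy values used above) that enumerates the training examples each model sees. CBOW produces one example per center word with a full two-sided context, matching the two factors above, while Skip-gram produces one example per (center, context) pair, matching the six factors above.

```python
# Toy example: enumerate CBOW vs. Skip-gram training examples
# for the sentence "w1 w2 w3 w4" with window size 1.
sentence = ["w1", "w2", "w3", "w4"]
window = 1

# CBOW: predict the center word from its full two-sided context
# (only w2 and w3 have one here, matching the two CBOW factors above).
cbow_examples = []
for i in range(window, len(sentence) - window):
    context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
    cbow_examples.append((context, sentence[i]))

# Skip-gram: predict each context word from the center word,
# which yields the six Skip-gram factors above.
skipgram_examples = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            skipgram_examples.append((center, sentence[j]))

print(cbow_examples)
# [(['w1', 'w3'], 'w2'), (['w2', 'w4'], 'w3')]
print(skipgram_examples)
# [('w1', 'w2'), ('w2', 'w1'), ('w2', 'w3'), ('w3', 'w2'), ('w3', 'w4'), ('w4', 'w3')]
```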

Skip-gram

Main idea of Skip-gram

  • Goal: The Skip-gram model aims to learn continuous feature representations for words by optimizing a neighborhood preserving likelihood objective.

  • Assumption: The Skip-gram objective is based on the distributional hypothesis which states that words in similar contexts tend to have similar meanings. That is, similar words tend to appear in similar word neighborhoods.

  • Algorithm: It scans over the words of a document and, for every word, learns an embedding such that the word’s features can predict nearby words (i.e., words inside some context window). The word feature representations are learned by optimizing the likelihood objective using SGD with negative sampling; a sketch of one such update follows this list.
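
The following is a minimal Python sketch of one SGD update on the negative-sampling objective referenced above. The vocabulary, embedding dimension, learning rate, and the uniform negative distribution are illustrative assumptions (word2vec actually samples negatives from a smoothed unigram distribution), not settings from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "am", "writing", "a", "summary", "for", "NLP", "."]
word2id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                        # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))    # input ("word") vectors v_w
W_out = rng.normal(scale=0.1, size=(V, D))   # output ("context") vectors u_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, k=3, lr=0.05):
    """One SGD step on the negative-sampling objective
    log sigma(u_o . v_c) + sum over k negatives of log sigma(-u_n . v_c)."""
    c, o = word2id[center], word2id[context]
    # Simplification: uniform negatives; word2vec samples from a smoothed
    # unigram distribution and avoids drawing the true context word.
    negatives = rng.integers(0, V, size=k)

    v_c = W_in[c]
    g = sigmoid(W_out[o] @ v_c) - 1.0        # gradient factor for the positive pair
    grad_v = g * W_out[o]
    W_out[o] -= lr * g * v_c                 # push the true context score up
    for n in negatives:                      # push negative-sample scores down
        g = sigmoid(W_out[n] @ v_c)
        grad_v += g * W_out[n]
        W_out[n] -= lr * g * v_c
    W_in[c] -= lr * grad_v                   # update the center-word vector last

# e.g. one update for the (center, context) pair ("summary", "writing")
sgd_step("summary", "writing")
```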

Skip-gram model formulation

Skip-gram learns to predict the context given a word by optimizing the likelihood objective. Suppose now we have a sentence

"I am writing a summary for NLP." \text{"I am writing a summary for NLP."} "I am writing a summary for NLP."

and the model is trying to predict context words given a target word “summary” with window size 2 2 2:

I am [ ] [ ] summary [ ] [ ] .  \text {I am [ ] [ ] summary [ ] [ ] . } I am [ ] [ ] summary [ ] [ ] . 

Then the model tries to maximize the likelihood

P ( "writing" ∣ "summary" ) ⋅ P ( "a" ∣ "summary" ) ⋅ P ( "for" ∣ "summary" ) ⋅ P ( "NLP" ∣ "summary" ) P(\text{"writing"}|\text{"summary"}) \cdot P(\text{"a"}|\text{"summary"}) \cdot P(\text{"for"}|\text{"summary"}) \cdot P(\text{"NLP"}|\text{"summary"}) P("writing""summary")P("a""summary")P("for""summary")P(&#
