CS 224n Assignment #2: word2vec (written part)


Understanding word2vec

The key insight behind word2vec is that 'a word is known by the company it keeps'. Concretely, suppose we have a 'center' word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as 'outside words'. For example, in Figure 1 we see that the center word $c$ is 'banking'. Since the context window size is 2, the outside words are 'turning', 'into', 'crises', and 'as'.
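To make the windowing concrete, here is a minimal Python sketch (the sentence, the helper name `context_pairs`, and the boundary handling are illustrative assumptions, not part of the assignment's starter code):

```python
# Minimal sketch: enumerate (center, outside) pairs for a window of size 2.
sentence = ["turning", "into", "banking", "crises", "as"]
window_size = 2

def context_pairs(tokens, window_size):
    """Yield (center, outside) word pairs for every position in the sentence."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own outside word
                yield center, tokens[j]

# For the center word 'banking', the outside words are turning, into, crises, as.
print([o for c, o in context_pairs(sentence, window_size) if c == "banking"])
```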

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an 'outside' word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.
[Figure 1: The word2vec skip-gram model with center word 'banking' and a context window of size 2]
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \qquad (1)$$
Here, $u_o$ is the 'outside' vector representing outside word $o$, and $v_c$ is the 'center' vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the 'outside' vectors $u_w$. The columns of $V$ are all of the 'center' vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.¹
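As a quick illustration of equation (1), here is a minimal numpy sketch. The helper names and the convention that $U$ stores one column per vocabulary word are assumptions made for this example, not the assignment's starter code:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def naive_softmax_prob(v_c, U):
    """Predicted distribution P(O | C = c) over the whole vocabulary.

    v_c : (d,)      center word vector
    U   : (d, |V|)  matrix whose columns are the outside vectors u_w
    Returns a (|V|,) vector whose o-th entry is P(O = o | C = c).
    """
    scores = U.T @ v_c   # u_w^T v_c for every word w in the vocabulary
    return softmax(scores)
```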

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c)$$
Another way to view this loss is as the cross-entropy² between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an 'outside word' for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).
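To connect the loss to this cross-entropy view, here is a small follow-up sketch. It reuses the hypothetical `naive_softmax_prob` helper from above; the toy dimensions and random vectors are made up purely for illustration:

```python
def naive_softmax_loss(v_c, o, U):
    """Loss for a single (c, o) pair: -log P(O = o | C = c)."""
    y_hat = naive_softmax_prob(v_c, U)   # predicted distribution over the vocab
    return -np.log(y_hat[o])

def cross_entropy(y, y_hat):
    """-sum_k y_k * log(y_hat_k) for a true distribution y and prediction y_hat."""
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(0)
d, vocab_size, o = 4, 6, 2
v_c, U = rng.normal(size=d), rng.normal(size=(d, vocab_size))

y = np.zeros(vocab_size)   # one-hot true distribution ...
y[o] = 1.0                 # ... with a 1 at the true outside word o

# The two values agree: only the o-th term of the cross-entropy sum survives.
print(naive_softmax_loss(v_c, o, U), cross_entropy(y, naive_softmax_prob(v_c, U)))
```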


¹ Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the 'outside' word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the 'center' word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.
² The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_{i} p_i \log(q_i)$.
