# Paper Notes: GloVe: Global Vectors for Word Representation


#### The GloVe Model

• The word-word co-occurrence matrix is denoted $X$.
• $X_{ij}$ is the number of times $word_j$ appears in the context window of $word_i$.
• $X_i = \sum_k X_{ik}$ is the total number of co-occurrences involving $word_i$.
• $P_{ij} = P(j \mid i) = X_{ij} / X_i$ is the probability that $word_j$ appears in the context of $word_i$.
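As a quick numeric check of these definitions, here is a tiny sketch; the counts in `X` are made up purely for illustration:

```python
import numpy as np

# A hypothetical co-occurrence matrix X for a 3-word vocabulary.
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 4.0, 0.0]])

# X_i = sum_k X_ik: total co-occurrence count per word
X_i = X.sum(axis=1, keepdims=True)

# P_ij = P(j | i) = X_ij / X_i; every row of P now sums to 1
P = X / X_i
```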

Take $word_i = ice$ and $word_j = steam$, as in the paper:

• If we pick $word_k = solid$ (related to ice but not steam), the ratio $P_{ik}/P_{jk}$ will be very large.
• If we pick $word_k = gas$ (related to steam but not ice), the ratio $P_{ik}/P_{jk}$ will be very small.
• If we pick $word_k = water$ or $word_k = fashion$ (related to both, or to neither), the ratio $P_{ik}/P_{jk}$ will be close to 1.

The model is derived by looking for a function $F$ of the word vectors that matches these ratios, restricting its form step by step. Start from the most general form:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Since vector spaces are linear, encode the two target words by their difference:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Taking a dot product keeps the argument a scalar:

$$F\big((w_i - w_j)^T \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

Requiring symmetry under exchanging words with context words forces $F$ to be a homomorphism:

$$F\big((w_i - w_j)^T \tilde{w}_k\big) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}$$

which is solved by

$$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$$

Taking $F = \exp$ and rearranging:

$$\log(X_{ik}) = w_i^T \tilde{w}_k + \log(X_i)$$

Absorbing $\log(X_i)$ into a bias $b_i$, and adding $\tilde{b}_k$ to restore symmetry:

$$\log(X_{ik}) = w_i^T \tilde{w}_k + b_i + \tilde{b}_k$$

Treating this as a weighted least-squares problem gives the cost function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2$$

with the weighting function

$$f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
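The weighting function is straightforward to implement; $x_{\max} = 100$ and $\alpha = 3/4$ are the values the paper reports working well:

```python
def weight_fn(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x/x_max)^alpha, then caps at 1.

    Rare pairs (small x) get little weight so noise does not dominate;
    very frequent pairs are capped so they cannot dominate either.
    """
    return (x / x_max) ** alpha if x < x_max else 1.0
```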

#### Walking Through an Implementation

##### Building the Vocabulary

```python
from collections import Counter
import logging

logger = logging.getLogger(__name__)


def build_vocab(corpus):
    """
    Build a vocabulary with word frequencies for an entire corpus.

    Returns a dictionary w -> (i, f), mapping word strings to pairs of
    word ID and word corpus frequency.
    """
    logger.info("Building vocab from corpus")

    vocab = Counter()
    for line in corpus:
        tokens = line.strip().split()
        vocab.update(tokens)

    logger.info("Done building vocab from corpus.")

    return {word: (i, freq) for i, (word, freq) in enumerate(vocab.items())}
```
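To see what `build_vocab` produces, here is a self-contained rerun of its core logic on a toy corpus (the sentences are made up):

```python
from collections import Counter

corpus = ["ice is solid", "steam is gas", "ice and steam are water"]

vocab = Counter()
for line in corpus:
    vocab.update(line.strip().split())

# Same comprehension as in build_vocab: word -> (word_id, frequency)
word_map = {word: (i, freq) for i, (word, freq) in enumerate(vocab.items())}
```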


##### Building the Co-occurrence Matrix

```python
import numpy as np
from scipy import sparse


def build_cooccur(vocab, corpus, window_size=10, min_count=None):
    """
    Build a word co-occurrence list for the given corpus.

    This function is a tuple generator, where each element (representing
    a cooccurrence pair) is of the form

        (i_main, i_context, cooccurrence)

    where i_main is the ID of the main word in the cooccurrence,
    i_context is the ID of the context word, and cooccurrence is the
    X_{ij} cooccurrence value as described in Pennington et al. (2014).

    If min_count is not None, cooccurrence pairs where either word
    occurs in the corpus fewer than min_count times are ignored.
    """
    vocab_size = len(vocab)
    id2word = dict((i, word) for word, (i, _) in vocab.items())

    # Collect cooccurrences internally as a sparse matrix for passable
    # indexing speed; we'll convert into a list later
    cooccurrences = sparse.lil_matrix((vocab_size, vocab_size),
                                      dtype=np.float64)

    for i, line in enumerate(corpus):
        if i % 1000 == 0:
            logger.info("Building cooccurrence matrix: on line %i", i)

        tokens = line.strip().split()
        token_ids = [vocab[word][0] for word in tokens]

        for center_i, center_id in enumerate(token_ids):
            # Collect all word IDs in left window of center word
            context_ids = token_ids[max(0, center_i - window_size) : center_i]
            contexts_len = len(context_ids)

            for left_i, left_id in enumerate(context_ids):
                # Distance from center word
                distance = contexts_len - left_i

                # Weight by inverse of distance between words
                increment = 1.0 / float(distance)

                # Build co-occurrence matrix symmetrically (pretend we
                # are calculating right contexts as well)
                cooccurrences[center_id, left_id] += increment
                cooccurrences[left_id, center_id] += increment

    # Now yield our tuple sequence (dig into the LiL-matrix internals
    # to quickly iterate through all nonzero cells)
    for i, (row, data) in enumerate(zip(cooccurrences.rows,
                                        cooccurrences.data)):
        if min_count is not None and vocab[id2word[i]][1] < min_count:
            continue
        for data_idx, j in enumerate(row):
            if min_count is not None and vocab[id2word[j]][1] < min_count:
                continue

            yield i, j, data[data_idx]
```
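The 1/distance weighting is the subtlest part of the loop, so here is a self-contained trace of the window logic on the toy token sequence `a b c a` (IDs a=0, b=1, c=2; a plain numpy array stands in for the LiL matrix):

```python
import numpy as np

tokens = ["a", "b", "c", "a"]
ids = {"a": 0, "b": 1, "c": 2}
token_ids = [ids[t] for t in tokens]

window_size = 2
cooc = np.zeros((3, 3))

for center_i, center_id in enumerate(token_ids):
    # Left window of the center word, exactly as in build_cooccur
    context_ids = token_ids[max(0, center_i - window_size):center_i]
    for left_i, left_id in enumerate(context_ids):
        increment = 1.0 / (len(context_ids) - left_i)  # 1 / distance
        cooc[center_id, left_id] += increment
        cooc[left_id, center_id] += increment

# "a" and "b" co-occur once at distance 1 and once at distance 2,
# so X_ab accumulates 1 + 0.5 = 1.5
```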


##### Initializing the Parameters

Recall the cost function to be minimized:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2$$

```python
# The top vocab_size rows of W store the main (i_main) word vectors; the
# bottom vocab_size rows store the context (i_context) word vectors.
# vector_size is the embedding dimension and can be chosen freely.
W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)

biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)

# Accumulated squared gradients, used later in the AdaGrad updates.
gradient_squared = np.ones((vocab_size * 2, vector_size), dtype=np.float64)
gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

# For every cooccurrence pair, gather views into the parameter arrays.
# The biases are sliced ([i : i + 1]) rather than indexed so that each
# entry is a writable view, not a copied scalar.
data = [(W[i_main], W[i_context + vocab_size],
         biases[i_main : i_main + 1],
         biases[i_context + vocab_size : i_context + vocab_size + 1],
         gradient_squared[i_main],
         gradient_squared[i_context + vocab_size],
         gradient_squared_biases[i_main : i_main + 1],
         gradient_squared_biases[i_context + vocab_size
                                 : i_context + vocab_size + 1],
         cooccurrence)
        for i_main, i_context, cooccurrence in cooccurrences]
```
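The reason the biases are packed as length-1 slices deserves a note: a numpy slice is a view into the original array, so in-place updates during training write straight through to `biases`. A minimal sketch:

```python
import numpy as np

biases = np.zeros(4)

b_main = biases[1:2]   # a length-1 slice is a *view*, not a copy
b_main -= 0.5          # in-place update writes through to `biases`

# biases is now [0., -0.5, 0., 0.]; plain indexing (biases[1]) would
# have produced a detached scalar instead.
```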

##### Training the Model

```python
for (v_main, v_context, b_main, b_context, gradsq_W_main,
     gradsq_W_context, gradsq_b_main, gradsq_b_context,
     cooccurrence) in data:

    weight = (cooccurrence / x_max) ** alpha if cooccurrence < x_max else 1

    # Compute inner component of cost function, which is used in
    # both overall cost calculation and in gradient calculation
    #
    #   $$J' = w_i^Tw_j + b_i + b_j - log(X_{ij})$$
    cost_inner = (v_main.dot(v_context)
                  + b_main[0] + b_context[0]
                  - log(cooccurrence))

    # Compute cost
    #
    #   $$J = f(X_{ij}) (J')^2$$
    cost = weight * (cost_inner ** 2)

    # Add weighted cost to the global cost tracker
    global_cost += 0.5 * cost

    # Compute gradients for word vector terms.
    #
    # NB: v_main is only a view into W (not a copy), so our
    # modifications here will affect the global weight matrix;
    # likewise for v_context, biases, etc.
    grad_main = weight * cost_inner * v_context
    grad_context = weight * cost_inner * v_main

    # Compute gradients for bias terms
    grad_bias_main = weight * cost_inner
    grad_bias_context = weight * cost_inner

    # Perform adaptive (AdaGrad) updates
    v_main -= learning_rate * grad_main / np.sqrt(gradsq_W_main)
    v_context -= learning_rate * grad_context / np.sqrt(gradsq_W_context)

    b_main -= learning_rate * grad_bias_main / np.sqrt(gradsq_b_main)
    b_context -= (learning_rate * grad_bias_context /
                  np.sqrt(gradsq_b_context))

    # Update squared gradient sums
    gradsq_W_main += np.square(grad_main)
    gradsq_W_context += np.square(grad_context)
    gradsq_b_main += grad_bias_main ** 2
    gradsq_b_context += grad_bias_context ** 2
```
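The update rule above is AdaGrad: each parameter's step size shrinks as its squared gradients accumulate. A one-vector sketch of the pattern (the learning rate and gradient values are made up):

```python
import numpy as np

learning_rate = 0.05                 # hypothetical value
v = np.array([0.1, -0.2])            # one parameter vector
gradsq = np.ones_like(v)             # squared-gradient sums start at 1

grad = np.array([0.4, 0.8])          # a hypothetical gradient
v -= learning_rate * grad / np.sqrt(gradsq)   # first step: a plain SGD step
gradsq += grad ** 2                  # later steps on this coordinate shrink
```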

#### Differences from and Connections to word2vec

• In skip-gram, after the final softmax we want the probabilities of the surrounding words to be as high as possible, and this is exactly what the word2vec loss expresses. In my view this ignores the distance between word pairs, whereas GloVe accounts for it, as the 1/distance weighting in the code above shows.
• Both word2vec and GloVe can be seen as models built on top of co-occurrence statistics; word2vec is a predictive model, while GloVe is a count-based model.
• word2vec, as a predictive model, computes its loss so that the words within the window_size get high probability, and we train the feed-forward network with SGD until it learns good word representations.
• GloVe, as a count-based model, first builds a very large co-occurrence matrix (the cooccurrences matrix in the code above) of shape [vocab_size, vocab_size], then reduces it to shape [vocab_size, dim]. Each row of the reduced matrix can be read as one word's representation, and the matrix is found by iteratively minimizing the reconstruction loss.
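After training, each word has both a main and a context vector. Given the layout used in the initialization code above (main vectors in the top half of `W`, context vectors in the bottom half), the paper's observation that summing the two sets typically gives a small boost can be sketched as:

```python
import numpy as np

vocab_size, vector_size = 4, 3
W = np.random.rand(vocab_size * 2, vector_size)  # stand-in for a trained W

# Sum main and context vectors; row i of `embeddings` is the final
# representation of word i.
embeddings = W[:vocab_size] + W[vocab_size:]
```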