Chapter 2: Word Vector Representations (GloVe, word2vec, skip-gram, CBOW)

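1 How should text data be represented in a model?

$\quad$ Text is unstructured data, so how can it be represented appropriately inside a model?
$\quad$ We can borrow from the speech and image domains. In speech, researchers feed the model a matrix built from the sequence of audio spectrum vectors; in vision, they flatten the matrix of image pixels into a vector. What about text? Should the input be the individual words themselves, or vectors that represent those words? These two ideas correspond to the two common families of word vector models: the former leads to one-hot encoding, the latter to distributed representations.
$\quad$ Note that the basic data in speech and images are signals, so a distance metric is enough to judge whether two signals are similar. Language, by contrast, is a high-level, abstract tool that humans developed over millions of years of evolution to express thought, so text is symbolic data. If two words differ in their surface form, the relationship between them is hard to capture: even for synonyms such as "麦克风" and "话筒" (both meaning "microphone"), nothing in the characters reveals that they mean the same thing (the semantic gap). Meaning is not simply additive over symbols, and judging whether two words are similar requires additional background knowledge.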

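1.1 One-hot encoding

$\quad$ As the name suggests, one-hot encoding turns each word into a vector in which exactly one element is 1 and all the others are 0.
$\quad$ Take the sentence "The boy is handsome." After one-hot encoding, the maps to (1,0,0,0), boy to (0,1,0,0), is to (0,0,1,0), and handsome to (0,0,0,1). When the corpus contains many distinct words, storing a full vector for every word takes a lot of space. In that case the vectors are usually stored sparsely: a hash function assigns each word an ID, for example 0 for the, 1 for boy, and so on.
$\quad$ Although one-hot encoding has been used successfully in many models, it has two major shortcomings. First, the dimension of the vectors grows with the number of words. Second, any two words are isolated from each other, so the encoding cannot express semantic relatedness between words. For example, under one-hot encoding the vector for "麦克风" (microphone) is exactly as far from the vector for its synonym "话筒" as from the vector for "水果" (fruit): both distances are $\sqrt{2}$.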
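$\quad$ The $\sqrt{2}$ claim can be checked with a few lines of plain Python. This is a minimal sketch: the three-word vocabulary uses English stand-ins for the words in the example, and the helper names one_hot and euclidean are illustrative, not from any library.

```python
import math


def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vocabulary: two synonyms and one unrelated word.
vocab = ["microphone", "mike", "fruit"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}
vectors = {word: one_hot(idx, len(vocab)) for word, idx in word_to_id.items()}

d_synonym = euclidean(vectors["microphone"], vectors["mike"])
d_unrelated = euclidean(vectors["microphone"], vectors["fruit"])
print(d_synonym, d_unrelated)  # the two distances are identical
```

Any two distinct one-hot vectors differ in exactly two positions, so the distance carries no information about meaning.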

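1.2 Distributed representation

$\quad$ The goal of a distributed representation is to fold a word's semantic information into its representation. The distributional hypothesis proposed by Harris in 1954 provides the theoretical basis: words that occur in similar contexts have similar meanings. In 1957 Firth restated and sharpened the hypothesis: the meaning of a word is determined by its context (a word is characterized by the company it keeps). Word representations based on the distributional hypothesis fall into three classes according to how they are modeled: matrix-based, clustering-based, and neural-network-based distributional representations. Although these methods use different techniques to obtain word representations, they all rest on the distributional hypothesis, so they share two core components: (1) a way to describe the context, and (2) a model that captures the relationship between a word (the target word) and its context.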

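2 Matrix-based distributional representations

$\quad$ Matrix-based distributional representations are also known as distributional semantic models. Each row of the matrix serves as the representation of the corresponding word and describes the distribution of that word's contexts. Because the distributional hypothesis holds that words with similar contexts have similar meanings, the semantic similarity of two words translates directly into the spatial distance between their two vectors.
$\quad$ The widely used Global Vectors model (GloVe) obtains word representations by factorizing a word-word matrix, so it belongs to the matrix-based family.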
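$\quad$ The idea that semantic similarity becomes a distance between row vectors can be illustrated with cosine similarity. The sketch below uses made-up context counts: the words, the four context columns, and the numbers are all hypothetical and serve only to show the computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical rows of a word-context matrix.
# Columns (contexts): "sing", "record", "eat", "sweet"
rows = {
    "microphone": [4, 5, 0, 0],
    "mike":       [3, 6, 0, 0],
    "fruit":      [0, 0, 5, 4],
}

sim_synonym = cosine_similarity(rows["microphone"], rows["mike"])
sim_unrelated = cosine_similarity(rows["microphone"], rows["fruit"])
print(sim_synonym, sim_unrelated)
```

Unlike one-hot vectors, rows that share contexts end up close together, while rows with disjoint contexts are orthogonal.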

2.1 The GloVe model

2.1.1 Paper

Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.

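2.1.2 Introduction

$\quad$ Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013). Most word vector methods rely on the distance or angle between pairs of word vectors as the primary way to evaluate the intrinsic quality of a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies, which probes the finer structure of the word vector space by examining not the scalar distance between word vectors but their various dimensions of difference. For example, the analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation king - queen = man - woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).
$\quad$ There are two main families of methods for learning word vectors: (1) global matrix factorization methods, such as LSA, and (2) local context window methods, such as the skip-gram model. Both have drawbacks. LSA makes efficient use of statistical information but does relatively poorly on the word analogy task. Skip-gram does better on the analogy task but makes poor use of the corpus statistics, because it trains on separate local context windows rather than on global co-occurrence counts.
$\quad$ The authors propose a global log-bilinear regression model, trained as a weighted least squares problem on global word-word co-occurrence counts, which makes efficient use of the statistics. The model produces a word vector space with meaningful substructure, as shown by its 75% accuracy on a word analogy dataset. The authors also show that it outperforms other existing methods on several word similarity tasks and on a common named entity recognition (NER) benchmark.
$\quad$ The authors have made the source code available online.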

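2.1.3 Related work

$\quad$ The paper reviews prior work on matrix factorization methods and on local context window methods. As in recommender systems, matrix factorization methods represent the meaning of a word through latent factors. Some of this work factorizes a word-document matrix (LSA), while other work factorizes a word-word matrix whose entries are co-occurrence counts (HAL), often after a transformation of the counts such as PPMI. Local context window methods learn word representations from the words inside a window and include skip-gram, CBOW, vLBL, and ivLBL.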

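2.1.4 The GloVe model

$\quad$ A common way to learn word vectors from a corpus without supervision is to start from word co-occurrence statistics. Many such methods exist, yet how meaning arises from these statistics, and how word vectors should represent that meaning, remains an open question. To address it, the paper proposes a new word representation model, Global Vectors (GloVe), so named because the model captures global corpus statistics directly.
$\quad$ The notation for the co-occurrence matrix $X$ is as follows:
$X_{ij}$ is the number of times word $j$ occurs in the context of word $i$;
$X_i = \sum_k X_{ik}$ is the total number of times any word occurs in the context of word $i$;
$P_{ij}=P(j|i)=\frac{X_{ij}}{X_i}$ is the probability that word $j$ appears in the context of word $i$.
$\quad$ Take the sentence "king is to queen as man is to woman" as an example. With a window size of 5 (2 words on each side), the windows are: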

Window index	Center word	Context words
0	king	is, to
1	is	king, to, queen
2	to	king, is, queen, as
3	queen	is, to, as, man
4	as	to, queen, man, is
5	man	queen, as, is, to
6	is	as, man, to, woman
7	to	man, is, woman
8	woman	is, to

$\quad$ The matrix $X$ is then:

X	king	is	to	queen	as	man	woman
king	0	1	1	0	0	0	0
is	1	0	2	1	1	1	1
to	1	2	0	1	1	1	1
queen	0	1	1	0	1	1	0
as	0	1	1	1	0	1	0
man	0	1	1	1	1	0	0
woman	0	1	1	0	0	0	0

$\quad$ From the matrix, $X_{king, is} = 1$ and $X_{king} = 2$, so $P_{king, is} = 0.5$.
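$\quad$ The counts in the table can be reproduced with a short script. This is a sketch in plain Python, assuming a symmetric window of 2 words on each side; the function name cooccurrence_counts is illustrative and is not part of the GloVe code.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, half_window):
    """Count how often each context word appears within `half_window`
    positions of each center word. Keys are (center, context) pairs."""
    counts = defaultdict(int)
    for center_pos, center in enumerate(tokens):
        start = max(0, center_pos - half_window)
        stop = min(len(tokens), center_pos + half_window + 1)
        for context_pos in range(start, stop):
            if context_pos != center_pos:
                counts[(center, tokens[context_pos])] += 1
    return counts


tokens = "king is to queen as man is to woman".split()
X = cooccurrence_counts(tokens, half_window=2)

# X_i is the row sum: all context occurrences around the word "king".
X_king = sum(count for (center, _), count in X.items() if center == "king")
P_king_is = X[("king", "is")] / X_king
print(X[("king", "is")], X_king, P_king_is)  # 1 2 0.5
```

Because the window is symmetric, the resulting matrix is symmetric as well, for example X[("is", "to")] equals X[("to", "is")].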
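$\quad$ Next, the authors turn to the ratio $P_{ik}/P_{jk}$. By the definitions above, $P_{ik}$ is the probability that word $k$ appears in the context of word $i$, and $P_{jk}$ is the probability that word $k$ appears in the context of word $j$. When word $k$ is strongly related to word $i$ (that is, $k$ often appears in the context of $i$) but only weakly related to word $j$, $P_{ik}/P_{jk}$ is large. In the opposite case, $P_{ik}/P_{jk}$ is small. When word $k$ is about equally related to both $i$ and $j$, $P_{ik}/P_{jk}$ is close to 1. The authors verify this behavior on a large corpus.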
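$\quad$ The ratio $P_{ik}/P_{jk}$ therefore separates relevant words from irrelevant ones well. Note that $P_{ik}/P_{jk}$ involves three words, $i$, $j$ and $k$, so some function $F$ of the three word vectors $\omega_i$, $\omega_j$ and $\vec{\omega_k}$ (the context vector of word $k$) should match $P_{ik}/P_{jk}$ as closely as possible:
$F(\omega_i, \omega_j, \vec{\omega_k})=\frac{P_{ik}}{P_{jk}}$
$\quad$ The authors then narrow down the form of $F$. First, $F$ measures the difference between $\omega_i$ and $\omega_j$. Since vector spaces are inherently linear, the most natural way to express that difference is the vector difference, so the equation becomes
$F(\omega_i-\omega_j, \vec{\omega_k})=\frac{P_{ik}}{P_{jk}}$
$\quad$ In this equation $P_{ik}/P_{jk}$ is a scalar, whereas the arguments of $F$ are vectors. A complicated $F$ could map them to a scalar, but that would obscure the linear structure the model is trying to capture. The authors therefore take the dot product of the arguments, which gives
$F\left[ (\omega_i-\omega_j)^\top\vec{\omega_k}\right] = \frac{P_{ik}}{P_{jk}}$
that is,
$F\left( \omega_i^\top\vec{\omega_k} -\omega_j^\top\vec{\omega_k}\right) = \frac{P_{ik}}{P_{jk}}$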

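$\quad$ Next, note that in a word-word co-occurrence matrix the distinction between a word and a context word is arbitrary, so the two roles should be freely exchangeable. To satisfy this property, the authors require $F$ to turn a difference of its arguments into a quotient:
$F\left( \omega_i^\top\vec{\omega_k} -\omega_j^\top\vec{\omega_k}\right) = \frac{\exp(\omega_i^\top\vec{\omega_k})}{\exp(\omega_j^\top\vec{\omega_k})}$
$\quad$ The exponential function is exactly the function that converts a difference into a quotient, so $F$ is the exponential function. Therefore
$\frac{\exp(\omega_i^\top\vec{\omega_k})}{\exp(\omega_j^\top\vec{\omega_k})}= \frac{P_{ik}}{P_{jk}}$
$\quad$ This holds as long as
$P_{ij} = \exp(\omega_i^\top\vec{\omega_j})$
Taking the logarithm of both sides gives
$\omega_i^\top\vec{\omega_j} = \log(P_{ij}) = \log(X_{ij}) - \log(X_i)$
However, the term $\log(X_i)$ breaks the exchange symmetry. The left-hand side should be unchanged when $i$ and $j$ swap roles, and $X_{ij} = X_{ji}$ for a symmetric window, but $\log(X_{ij}) - \log(X_i)$ and $\log(X_{ji}) - \log(X_j)$ are generally not equal. The authors therefore take
$\log(X_{ij}) = \omega_i^\top\vec{\omega_j}+ \log(X_i)$
$\quad$ absorb $\log(X_i)$, which does not depend on $j$, into a bias term $b_i$, and add a second bias term $b_j$ to restore the symmetry, which gives
$\log(X_{ij}) = \omega_i^\top\vec{\omega_j}+ b_i + b_j$
$\quad$ Because the logarithm diverges when $X_{ij}$ is 0, one option is to replace $\log(X_{ij})$ with $\log(X_{ij} + 1)$.
$\quad$ The loss function can then be written as
$\sum_{i,j}\left(\log(X_{ij}) - \omega_i^\top\vec{\omega_j} - b_i - b_j\right)^2$
$\quad$ Finally, the authors argue that different co-occurrence pairs should contribute to the loss with different weights, and introduce a weighting function $f$ into the loss:
$\sum_{i,j}f(X_{ij})\left(\log(X_{ij}) - \omega_i^\top\vec{\omega_j} - b_i - b_j\right)^2$
$\quad$ The function $f(X_{ij})$ should satisfy three conditions. First, it should be non-decreasing, because a higher co-occurrence count deserves a larger weight. Second, $f(0)$ should be 0. Third, very frequent co-occurrences should not be overweighted. The authors therefore define $f(x)$ as
$f(x)=\left\{ \begin{array}{ll} (x/x_{max})^\alpha & \text{if } x < x_{max}\\ 1 & \text{otherwise} \end{array} \right.$
$\quad$ In the authors' experiments, $x_{max}$ is set to 100 and $\alpha$ to 3/4.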
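$\quad$ As a rough illustration of the weighted loss, the sketch below evaluates $f(x)$ and the loss term of a single $(i, j)$ pair in plain Python. The function names and the example vectors are hypothetical; this is not the authors' implementation, and it only covers pairs with $X_{ij} > 0$ (pairs with a zero count receive weight $f(0) = 0$ and can be skipped).

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j_ctx, b_i, b_j, x_ij):
    """Weighted squared error of one (i, j) pair with x_ij > 0:
    f(x_ij) * (log(x_ij) - w_i . w_j_ctx - b_i - b_j)^2"""
    dot = sum(a * b for a, b in zip(w_i, w_j_ctx))
    return glove_weight(x_ij) * (math.log(x_ij) - dot - b_i - b_j) ** 2


# Hypothetical 2-dimensional vectors and a co-occurrence count of 10.
loss = glove_pair_loss([0.1, 0.2], [0.3, -0.1], 0.0, 0.0, 10.0)
print(glove_weight(0.0), glove_weight(50.0), glove_weight(500.0), loss)
```

The cap at $x_{max}$ keeps very frequent pairs from dominating the sum, while rare pairs are down-weighted smoothly instead of being discarded.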

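3 Neural-network-based distributional representations

3.1 Model overview

$\quad$ The feedforward neural network language model (FNNLM) proposed by Bengio introduced feedforward neural networks into language modeling.
$\quad$ The word2vec models introduced by Tomas Mikolov et al. are among the most widely used models for learning word embeddings, that is, vector representations of words. Internally, word2vec uses a simple neural network with a single hidden layer; the purpose of training is to learn the weights of that hidden layer, which are then used as the word vectors. word2vec provides a family of models that represent words in an n-dimensional space in which words with similar meanings lie close to one another. Compared with one-hot encoding, word2vec also reduces the size of the encoding space. The two common word2vec models are skip-gram and continuous bag-of-words (CBOW). The former uses the center word to predict its context words, while the latter uses the context (surrounding) words to predict the center word.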
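$\quad$ As an example, sample training pairs from the input text "king is a man" with a window of size 2.
$\quad$ Step 1: the center word is king, which yields the training pairs (king, is) and (king, a).
$\quad$ Step 2: the center word is is, which yields (is, king), (is, a) and (is, man).
$\quad$ Step 3: the center word is a, which yields (a, king), (a, is) and (a, man).
$\quad$ Step 4: the center word is man, which yields (man, is) and (man, a).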
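$\quad$ The four sampling steps above can be sketched as a small function. It is written in plain Python for illustration, with a fixed window; the name skipgram_pairs is hypothetical and the function is separate from the TensorFlow implementation later in this chapter.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window`
    positions of each center word."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        start = max(0, center_pos - window)
        stop = min(len(tokens), center_pos + window + 1)
        for context_pos in range(start, stop):
            if context_pos != center_pos:
                pairs.append((center, tokens[context_pos]))
    return pairs


pairs = skipgram_pairs("king is a man".split(), window=2)
for pair in pairs:
    print(pair)
```

A CBOW training example is built from the same windows, but groups all context words of one center word into a single input.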

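3.2 Subsampling of frequent words and negative sampling

$\quad$ A vocabulary consists of a large number of distinct words with very different frequencies. To choose which words to keep for modeling, one can look at how often each word occurs in the corpus relative to the total number of words. Mikolov et al. introduced subsampling in their paper, and it speeds up training significantly.
$\quad$ A survival function computes a probability score for each word, which decides whether an occurrence of the word is kept or removed. It depends on the word's frequency and on a tunable subsampling rate:
$P(\omega_i) = \left(\sqrt{\frac{z(\omega_i)}{s}}+1\right)\frac{s}{z(\omega_i)}$
$\quad$ Here $\omega_i$ is the word in question, $z(\omega_i)$ is its relative frequency in the training set or corpus, and $s$ is the subsampling rate, commonly set to 0.001.
$\quad$ In the paper, however, the subsampling formula is given as the probability of discarding a word:
$P(\omega_i) = 1 - \sqrt{\frac{t}{f(\omega_i)}}$
$\quad$ where $f(\omega_i)$ is the frequency of the word and $t$ is the chosen threshold. The implementation below uses this formula.
$\quad$ If the model is trained on positive samples only, it can end up always outputting 1, which is useless in practice. Negative samples, that is, word pairs that are not neighbors, must therefore be added to the training set. The probability of picking a word as a negative sample depends on its frequency in the corpus: the more frequent the word, the more likely it is to be picked. The paper suggests 5 to 20 negative samples for small training sets and 2 to 5 for large ones.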
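$\quad$ To compare the two subsampling formulas numerically, here is a small sketch in plain Python. The function names are illustrative, and clamping the results to the range [0, 1] is an assumption added here so that the values can be used directly as probabilities; the formulas themselves do not include it.

```python
import math


def keep_probability(z, s=0.001):
    """First formula: probability of keeping a word whose relative
    frequency is z, with subsampling rate s (clamped to at most 1)."""
    return min(1.0, (math.sqrt(z / s) + 1) * s / z)


def discard_probability(f, t):
    """Paper's formula: probability of discarding a word whose relative
    frequency is f, with threshold t (clamped to at least 0)."""
    return max(0.0, 1.0 - math.sqrt(t / f))


# A frequent word (1% of all tokens) and a rare word (0.0001%).
print(keep_probability(0.01), keep_probability(0.000001))
print(discard_probability(0.01, t=1e-4), discard_probability(0.000001, t=1e-4))
```

Both formulas treat frequent words more aggressively: the frequent word is kept less than half of the time under the first formula and discarded about 90% of the time under the second, while the rare word is always kept.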

3.3 Implementing the skip-gram and CBOW models

3.3.1 Preparation

$\quad$ Download the Wikipedia article dataset collected and cleaned by Matt Mahoney.

import os
from urllib.request import urlretrieve


def cbk(a, b, c):
    """Callback that prints the download progress.
    @a: number of blocks downloaded so far
    @b: size of one block
    @c: size of the remote file
    """
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


def data_download(dataset_link, zip_file, cbk):
    """Downloading the required file"""
    if not os.path.exists(zip_file):
        zip_file, _ = urlretrieve(dataset_link + zip_file, zip_file, cbk)
        print("File downloaded successfully!")
    return None


dataset_link = "http://mattmahoney.net/dc/"
zip_file = "text8.zip"
data_download(dataset_link, zip_file, cbk)

$\quad$ Unzip the downloaded file.

import os
import zipfile


extracted_folder = "dataset"
zip_file = "text8.zip"
if not os.path.isdir(extracted_folder):
    with zipfile.ZipFile(zip_file) as zf:
        zf.extractall(extracted_folder)

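$\quad$ The input text contains punctuation marks and other symbols. Each of them is replaced with a token that names the symbol, which lets the model treat every punctuation mark as a separate token and learn a vector for it.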

def text_processing(ft8_text):
    """Replacing punctuation marks with tokens"""
    ft8_text = ft8_text.lower()
    ft8_text = ft8_text.replace(".", " <period> ")
    ft8_text = ft8_text.replace(",", " <comma> ")
    ft8_text = ft8_text.replace("\"", " <quotation> ")
    ft8_text = ft8_text.replace(";", " <semicolon> ")
    ft8_text = ft8_text.replace("!", " <exclamation> ")
    ft8_text = ft8_text.replace("?", " <question> ")
    ft8_text = ft8_text.replace("(", " <paren_l> ")
    ft8_text = ft8_text.replace(")", " <paren_r> ")
    ft8_text = ft8_text.replace("--", " <hyphen> ")
    ft8_text = ft8_text.replace(":", " <colon> ")
    ft8_text_tokens = ft8_text.split()
    return ft8_text_tokens


with open("dataset/text8") as ft_:
    full_text = ft_.read()
ft_tokens = text_processing(full_text)

$\quad$ Remove words that occur 7 times or fewer in the dataset.

import collections
"""Shortlisting words with frequency more than 7"""
word_cnt = collections.Counter(ft_tokens)
shortlisted_words = [w for w in ft_tokens if word_cnt[w] > 7]

$\quad$ Create dictionaries that map words to integers and back.

import collections
def dict_creation(shortlisted_words):
    """The function creates a dictionary of the words present in dataset along with their frequency order"""
    counts = collections.Counter(shortlisted_words)
    vocabulary = sorted(counts, key=counts.get, reverse=True)
    print("Vocabulary size:", len(vocabulary))
    rev_dictionary_ = {ii: word for ii, word in enumerate(vocabulary)}
    dictionary_ = {word: ii for ii, word in rev_dictionary_.items()}
    return dictionary_, rev_dictionary_
dictionary_, rev_dictionary_ = dict_creation(shortlisted_words)
words_cnt = [dictionary_[word] for word in shortlisted_words]
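
$\quad$ To see what the dictionaries look like, the same steps can be run on a toy token list. This sketch repeats the logic of dict_creation() on a made-up sentence; the variable names and the sentence are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus standing in for shortlisted_words.
tokens = "the cat sat on the mat the cat".split()

counts = Counter(tokens)
# Most frequent word first, so frequent words receive small integer IDs.
vocabulary = sorted(counts, key=counts.get, reverse=True)
id_to_word = dict(enumerate(vocabulary))
word_to_id = {word: idx for idx, word in id_to_word.items()}

# The corpus as a sequence of integer IDs.
encoded = [word_to_id[word] for word in tokens]
print(word_to_id)
print(encoded)
```

Ordering the vocabulary by frequency means that an index range such as 0 to 99 always refers to the most common words, which the validation code later relies on.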

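3.3.2 TensorFlow background

$\quad$ TensorFlow (the 1.x API used here) computes with data flow graphs, so we first build a graph and then feed our data through it. Nodes in the graph represent operations, and edges represent the multidimensional arrays, or tensors, that pass between nodes. During training, tensors keep flowing from one node of the graph to another, which is where the name TensorFlow comes from.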

3.3.2.1 tf.Graph()

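$\quad$ tf.Graph() instantiates a class: a data flow graph that TensorFlow uses to represent and run a computation. Informally, the operations you add in code (the nodes of the drawing) and the data (its lines) make up a drawing, and the graph is the sheet of paper that holds it. You can create many graphs, for example one per thread, but there is only one default graph at a time.
$\quad$ tf.Graph().as_default() makes the new graph instance the default graph of the TensorFlow runtime within that context. With a single main thread you can omit it, because TensorFlow already keeps a default graph, which tf.get_default_graph() returns. With multiple threads you can create several tf.Graph() objects, like a sketchbook with many sheets, and that is when the notion of a default graph matters.
$\quad$ 1. Create a new computation graph with g = tf.Graph().
$\quad$ 2. Define the tensors and operations that belong to graph g inside a with g.as_default(): block.
$\quad$ 3. Pass graph=xxx to tf.Session() to select the graph that the session runs.
$\quad$ 4. Tensors and operations that are not explicitly assigned to a graph belong to the default graph.
$\quad$ 5. One graph can be run in several sessions, but each session is bound to a single graph.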

import tensorflow as tf
# Operations on the default graph
a = tf.constant([1.0, 2.0])
b = tf.constant([2.0, 3.0])
result = a + b

# Define two computation graphs
g1 = tf.Graph()
g2 = tf.Graph()

# Define tensors and operations in g1
with g1.as_default():
    a = tf.constant([1.0, 1.0])
    b = tf.constant([1.0, 1.0])
    result1 = a + b

# Define tensors and operations in g2
with g2.as_default():
    a = tf.constant([2.0, 2.0])
    b = tf.constant([2.0, 2.0])
    result2 = a + b

# Create the sessions
with tf.Session(graph=g1) as sess:
    out = sess.run(result1)
    print(out)

with tf.Session(graph=g2) as sess:
    out = sess.run(result2)
    print(out)

with tf.Session(graph=tf.get_default_graph()) as sess:
    out = sess.run(result)
    print(out)

Output:
[2. 2.]
[4. 4.]
[3. 5.]

3.3.2.2 tf.placeholder()

$\quad$ A placeholder reserves a slot in the graph for a value that is supplied later.
To pass data into TensorFlow from outside, declare the input with tf.placeholder() and supply the value at run time with sess.run(***, feed_dict={input: **}).

3.3.2.3 tf.Variable()

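$\quad$ In TensorFlow, a value is a variable only if it is declared as one, unlike in Python. The syntax is state = tf.Variable(). Once variables are defined, initializing them is essential: define init = tf.global_variables_initializer() after the variables. At that point the variables are still not initialized; you must also call sess.run(init) inside a session to execute the initializer.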

3.3.2.4 tf.nn.embedding_lookup()

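$\quad$ tf.nn.embedding_lookup selects the elements of a tensor at the given indices. In tf.nn.embedding_lookup(params, ids), params can be a tensor or an array, and ids holds the indices to look up; the remaining parameters are not covered here.
$\quad$ Input for the first example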

import tensorflow as tf

p = tf.Variable(tf.random_normal([10, 1]))  # a 10 x 1 tensor
b = tf.nn.embedding_lookup(p, [1, 3])  # look up rows 1 and 3 of the tensor
 
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(b))
    print(sess.run(p))

$\quad$ Output of the first example

[[0.15791859]
 [0.6468804 ]]
[[-0.2737084 ]
 [ 0.15791859]
 [-0.01315552]
 [ 0.6468804 ]
 [-1.4090979 ]
 [ 2.1583703 ]
 [ 1.4137447 ]
 [ 0.20688428]
 [-0.32815856]
 [-1.0601649 ]]

$\quad$ Input for the second example

import numpy as np
import tensorflow as tf

a = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3], [2.1, 2.2, 2.3], [3.1, 3.2, 3.3], [4.1, 4.2, 4.3]]
a = np.asarray(a)
# Pass the dtype by keyword: the second positional argument of tf.Variable is `trainable`
idx1 = tf.Variable([0, 2, 3, 1], dtype=tf.int32)
idx2 = tf.Variable([[0, 2, 3, 1], [4, 0, 2, 2]], dtype=tf.int32)
out1 = tf.nn.embedding_lookup(a, idx1)
out2 = tf.nn.embedding_lookup(a, idx2)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    print(sess.run(out1))
    print('==================')
    print(sess.run(out2))

$\quad$ Output of the second example

[[ 0.1  0.2  0.3]
 [ 2.1  2.2  2.3]
 [ 3.1  3.2  3.3]
 [ 1.1  1.2  1.3]]
==================
[[[ 0.1  0.2  0.3]
  [ 2.1  2.2  2.3]
  [ 3.1  3.2  3.3]
  [ 1.1  1.2  1.3]]

 [[ 4.1  4.2  4.3]
  [ 0.1  0.2  0.3]
  [ 2.1  2.2  2.3]
  [ 2.1  2.2  2.3]]]
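
$\quad$ Conceptually, the lookup in the second example is just row gathering: every index in ids is replaced by the corresponding row of params. The plain-Python sketch below mimics that behavior for nested index lists. It only illustrates the idea and is not how TensorFlow implements the operation; the function name lookup is hypothetical.

```python
def lookup(params, ids):
    """Replace every integer in the (possibly nested) list `ids`
    with the corresponding row of `params`."""
    if isinstance(ids, int):
        return params[ids]
    return [lookup(params, i) for i in ids]


params = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3], [2.1, 2.2, 2.3],
          [3.1, 3.2, 3.3], [4.1, 4.2, 4.3]]

out1 = lookup(params, [0, 2, 3, 1])                   # 4 rows of 3 values
out2 = lookup(params, [[0, 2, 3, 1], [4, 0, 2, 2]])   # 2 groups of 4 rows
print(out1)
print(out2)
```

The output keeps the shape of ids and appends the row dimension, which is why a 2 x 4 index array produces a 2 x 4 x 3 result.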

3.3.2.5 A simple TensorFlow example

A simple linear regression task:

import tensorflow as tf
import numpy as np

# x_data is the input, y_data is the true output
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 2 + 0.1

# Define the weight and the bias
Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
biases = tf.Variable(tf.zeros([1]))

# Build the loss function (mean squared error)
y = Weights * x_data + biases
loss = tf.reduce_mean(tf.square(y - y_data))

# Use gradient descent with a learning rate of 0.5
optimizer = tf.train.GradientDescentOptimizer(0.5)

# The objective is to minimize the loss
train = optimizer.minimize(loss)

# Initialize the variables
init = tf.global_variables_initializer()

# Run 100 iterations and print both parameters every 10 iterations
epochs = 100
with tf.Session() as sess:
    sess.run(init)
    for e in range(1, epochs + 1):
        sess.run(train)
        if e % 10 == 0:
            print("step: {}, weight: {}, bias: {}".format(e, sess.run(Weights), sess.run(biases)))

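3.3.3 skip-gram code

$\quad$ The skip-gram model handles stop words in the text by subsampling: a threshold on word frequency removes, with high probability, very frequent words that contribute little meaningful context around a center word.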

import collections
import random

import numpy as np

"""Creating the threshold and performing the subsampling"""
thresh = 0.00005
word_counts = collections.Counter(words_cnt)
# Total number of tokens (not distinct words), so that freqs holds relative frequencies
total_count = len(words_cnt)
freqs = {word: count / total_count for word, count in word_counts.items()}
p_drop = {word: 1 - np.sqrt(thresh / freqs[word]) for word in word_counts}
train_words = [word for word in words_cnt if p_drop[word] < random.random()]

$\quad$ The skip-gram model takes a center word and predicts the words around it. The skipG_target_set_generation() function builds the targets for one center word in the required format.

def skipG_target_set_generation(batch_, batch_index, word_window):
    """The function combines the words of given word_window size next to the index, for the SkipGram model"""
    random_num = np.random.randint(1, word_window + 1)
    words_start = batch_index - random_num if (batch_index - random_num) > 0 else 0
    words_stop = batch_index + random_num
    window_target = set(batch_[words_start: batch_index] + batch_[batch_index + 1: words_stop + 1])
    return list(window_target)

$\quad$ The skipG_batch_creation() function calls skipG_target_set_generation(), pairs each center word with its surrounding words as targets, and yields the result batch by batch.

def skipG_batch_creation(short_words, batch_length, word_window):
    """The function internally makes use of the skipG_target_set_generation()
    function and combines each of the label words in the shortlisted_words with
    the words of word_window size around"""
    batch_cnt = len(short_words) // batch_length
    short_words = short_words[: batch_cnt * batch_length]
    for word_index in range(0, len(short_words), batch_length):
        input_words, label_words = [], []
        word_batch = short_words[word_index: word_index + batch_length]
        for index_ in range(len(word_batch)):
            batch_input = word_batch[index_]
            batch_label = skipG_target_set_generation(word_batch, index_, word_window)
            # Appending the label and inputs to the initial list.Replicating input to the size of labels in the window
            label_words.extend(batch_label)
            input_words.extend([batch_input] * len(batch_label))
        # Yield once per batch, after every word in the batch has been processed
        yield input_words, label_words

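$\quad$ Create a TensorFlow graph for the skip-gram implementation and declare placeholders for the inputs and labels. They receive the integer IDs of the center words and of their surrounding words, and they accept batches of varying size.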

import tensorflow as tf
tf_graph = tf.Graph()
with tf_graph.as_default():
    # tf.placeholder(dtype, shape=None, name=None)
    # 1-D, length not fixed
    input_ = tf.placeholder(tf.int32, [None], name="input_")
    # 2-D, neither dimension fixed
    label_ = tf.placeholder(tf.int32, [None, None], name="label_")

$\quad$ Declare the variable for the embedding matrix, whose shape is the vocabulary size by the dimension of the word embedding vectors.

with tf_graph.as_default():
    # tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None)
    # Shape len(rev_dictionary_) x 300, initialized uniformly in [-1, 1); training then updates the values
    word_embed = tf.Variable(tf.random_uniform((len(rev_dictionary_), 300), -1.0, 1.0))
    embedding = tf.nn.embedding_lookup(word_embed, input_)

$\quad$ tf.train.AdamOptimizer uses the Adam algorithm to adapt the learning rate.

"""This code includes the following:
# Initializing weights and bias to be used in the softmax layer
# Loss function calculation using the Negative Sampling
# Usage of the Adam Optimizer
# Negative sampling on 100 words, to be included in the loss function
# 300 is the word embedding vector size
"""
vocabulary_size = len(rev_dictionary_)

with tf_graph.as_default():
    sf_weights = tf.Variable(tf.truncated_normal((vocabulary_size, 300), stddev=0.1))
    sf_bias = tf.Variable(tf.zeros(vocabulary_size))
    loss_fn = tf.nn.sampled_softmax_loss(weights=sf_weights, biases=sf_bias, 
                                         labels=label_, inputs=embedding,
                                         num_sampled=100, num_classes=vocabulary_size)
    cost_fn = tf.reduce_mean(loss_fn)
    optim =tf.train.AdamOptimizer().minimize(cost_fn)

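$\quad$ To check that the word vectors preserve semantic similarity between words, the next block builds a validation set. It picks a mix of common and less common words from the corpus and returns the words closest to each of them by cosine similarity of the word vectors.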

"""The below code performs the following operations:
# Performing validation here by making use of a random
  selection of 16 words from the dictionary of desired size
# Selecting 8 words randomly from range of 1000
# Using the cosine distance to calculate the similarity between the words
"""
with tf_graph.as_default():
    validaton_cnt = 16
    validation_dict = 100
    validation_words = np.array(random.sample(range(validation_dict), validaton_cnt // 2))
    validation_words = np.append(validation_words, random.sample(range(1000, 1000 + validation_dict), validaton_cnt // 2))
    validation_data = tf.constant(validation_words, dtype=tf.int32)
    normalization_embed = word_embed / (tf.sqrt(tf.reduce_sum(tf.square(word_embed), 1, keepdims=True)))
    validation_embed = tf.nn.embedding_lookup(normalization_embed, validation_data)
    word_similarity = tf.matmul(validation_embed, tf.transpose(normalization_embed))

$\quad$ Create a folder named model_checkpoint in the current working directory to store the model checkpoints.

"""Creating the model checkpoint directory"""
# Increase it as per computation resources. It has been kept low here for users to replicate the process,
# increase to 100 or more
import time
epochs = 2
batch_length = 1000
word_window = 10

with tf_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=tf_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer())

    for e in range(1, epochs + 1):
        batches = skipG_batch_creation(train_words, batch_length, word_window)
        start = time.time()
        for x, y in batches:
            train_loss, _ = sess.run([cost_fn, optim],
                                     feed_dict={input_: x, label_: np.array(y)[:, None]})
            loss += train_loss
            if iteration % 100 == 0:
                end = time.time()
                print("Epoch {}/{}".format(e, epochs),
                      ", Iteration: {}".format(iteration),
                      ", Avg.Training loss: {:.4f}".format(loss / 100),
                      ", Processing: {:.4f} sec/batch".format((end - start) / 100))
                loss = 0
                start = time.time()
            if iteration % 2000 == 0:
                similarity_ = word_similarity.eval()
                for i in range(validaton_cnt):
                    # Use a separate name so the validation_words array is not overwritten
                    validation_word = rev_dictionary_[validation_words[i]]
                    # number of nearest neighbors
                    top_k = 8
                    nearest = (-similarity_[i, :]).argsort()[1: top_k + 1]
                    log = "Nearest to %s:" % validation_word
                    for k in range(top_k):
                        close_word = rev_dictionary_[nearest[k]]
                        log = "%s, %s" % (log, close_word)
                    print(log)
            iteration += 1
    save_path = saver.save(sess, "D:/py3/CSDN/word2vec/model_checkpoint/skipGram_text8.ckpt")
    embed_mat = sess.run(normalization_embed)

$\quad$ Partial output of the code above

Epoch 1/2 , Iteration: 100 , Avg.Training loss: 1.6499 , Processing: 0.1321 sec/batch
Epoch 1/2 , Iteration: 200 , Avg.Training loss: 0.9029 , Processing: 0.1345 sec/batch
Epoch 1/2 , Iteration: 300 , Avg.Training loss: 1.6471 , Processing: 0.1383 sec/batch
Epoch 1/2 , Iteration: 400 , Avg.Training loss: 2.4165 , Processing: 0.1426 sec/batch
Epoch 1/2 , Iteration: 500 , Avg.Training loss: 3.6667 , Processing: 0.1461 sec/batch
Epoch 1/2 , Iteration: 600 , Avg.Training loss: 4.3952 , Processing: 0.1498 sec/batch
Epoch 1/2 , Iteration: 700 , Avg.Training loss: 6.6640 , Processing: 0.1529 sec/batch
Epoch 1/2 , Iteration: 800 , Avg.Training loss: 7.7097 , Processing: 0.1562 sec/batch
Epoch 1/2 , Iteration: 900 , Avg.Training loss: 9.0149 , Processing: 0.1601 sec/batch
Epoch 1/2 , Iteration: 1000 , Avg.Training loss: 9.4385 , Processing: 0.1637 sec/batch
...
Nearest to nine:, kittens, cooperates, imitates, axelrod, comparison, tat, step, entrepreneur
Nearest to th:, determines, louis, states, cyrillic, central, id, chickasaw, mobile
Nearest to be:, defend, plekhanov, characterization, persons, characterised, criticizes, utopian, bayonets
Nearest to first:, benefiting, should, altruists, appealing, convictions, mariano, pints, giulia
Nearest to were:, treaty, sheik, establishing, without, presume, sincerely, anonymously, simple
...

$\quad$ Visualize the embeddings with t-SNE.

from sklearn.manifold import TSNE
from matplotlib import pylab
num_points = 250
tsne = TSNE(perplexity=30, n_components=2, init="pca", n_iter=5000)
embeddings_2d = tsne.fit_transform(embed_mat[1: num_points+1, :])
def cbow_plot(embeddings, labels):
    pylab.figure(figsize=(12, 12))
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2),
                       textcoords="offset points", ha="right", va="bottom")
    pylab.show()
words = [rev_dictionary_[i] for i in range(1, num_points + 1)]
cbow_plot(embeddings_2d, words)


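3.3.4 CBOW code

$\quad$ The CBOW model predicts the center word from its surrounding words. The cbow_batch_creation() function generates the batches and labels: given the desired word_window size, it stores the target words in the label_ variable and the surrounding context words in the batch variable.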

import numpy as np
import collections

# Position in words_cnt, kept at module level so that successive calls advance through the corpus
data_index = 0


def cbow_batch_creation(batch_length, word_window):
    """The function creates a batch with the list of the label words and
    list of their corresponding words in the context of the label word
    Pulling out the centered label word, and its next word_window count of
    surrounding words
    word_window: window of words on either side of the center word
    relevant_words: length of the total words to be picked in a single batch,
                    including the center word and the word_window words on both sides
    Format: [word_window ... target ... word_window]"""
    global data_index
    relevant_words = 2 * word_window + 1
    batch = np.ndarray(shape=(batch_length, relevant_words-1), dtype=np.int32)
    label_ = np.ndarray(shape=(batch_length, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=relevant_words)
    # Queue to add/pop
    # Selecting the words of length "relevant_words" from the starting index
    for _ in range(relevant_words):
        buffer.append(words_cnt[data_index])
        data_index = (data_index + 1) % len(words_cnt)
    for i in range(batch_length):
        # Center word as the label
        target = word_window
        col_idx = 0
        for j in range(relevant_words):
            # Skip the center word and keep only the surrounding words
            if j == relevant_words // 2:
                continue
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        label_[i, 0] = buffer[target]

        # Slide the window one word to the right
        buffer.append(words_cnt[data_index])
        data_index = (data_index + 1) % len(words_cnt)
    # Step back by one window so the next call continues without skipping words
    data_index = (data_index - relevant_words) % len(words_cnt)
    return batch, label_

$\quad$ To check that cbow_batch_creation() produces input in the form the CBOW model expects, take the first batch of labels together with the surrounding words for window sizes 1 and 2, and print the result.

for num_skips, word_window in [(2, 1), (4, 2)]:
    data_index = 0  # restart from the beginning of the corpus for each demonstration
    batch, label_ = cbow_batch_creation(batch_length=8, word_window=word_window)
    print("\nwith num_skips = %d and word_window = %d:" % (num_skips, word_window))
    print("batch:", [[rev_dictionary_[bii] for bii in bi] for bi in batch])
    print("label:", [rev_dictionary_[li] for li in label_.reshape(8)])

The output is as follows:

with num_skips = 2 and word_window = 1:
batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of'], ['term', 'abuse'], ['of', 'first'], ['abuse', 'used'], ['first', 'against']]
label: ['originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used']
with num_skips = 4 and word_window = 2:
batch: [['anarchism', 'originated', 'a', 'term'], ['originated', 'as', 'term', 'of'], ['as', 'a', 'of', 'abuse'], ['a', 'term', 'abuse', 'first'], ['term', 'of', 'first', 'used'], ['of', 'abuse', 'used', 'against'], ['abuse', 'first', 'against', 'early'], ['first', 'used', 'early', 'working']]
label: ['as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
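
$\quad$ The pairing of context words and labels shown in this output can be reproduced on a short token list without NumPy or TensorFlow. The sketch below is a simplified stand-in for cbow_batch_creation() that works on words instead of integer IDs; the function name cbow_examples is hypothetical.

```python
def cbow_examples(tokens, word_window):
    """Return (context_words, center_word) for every position that has
    `word_window` words available on both sides."""
    examples = []
    for center in range(word_window, len(tokens) - word_window):
        left = tokens[center - word_window:center]
        right = tokens[center + 1:center + word_window + 1]
        examples.append((left + right, tokens[center]))
    return examples


tokens = "anarchism originated as a term of abuse".split()
examples_w1 = cbow_examples(tokens, word_window=1)
examples_w2 = cbow_examples(tokens, word_window=2)
print(examples_w1[0])  # (['anarchism', 'as'], 'originated')
print(examples_w2[0])  # (['anarchism', 'originated', 'a', 'term'], 'as')
```

Each example has 2 * word_window context words and one label, which matches the shapes of batch and label_ in the function above.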

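$\quad$ The following code declares the variables used to configure the CBOW model. The word embedding size is set to 128, and one word before and one word after the target word are used for prediction.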

num_steps = 100001
"""Initializing:
   # 128 is the length of the batch considered for CBOW
   # 128 is the word embedding vector size
   # Considering 1 word on both sides of the center label words
   # Consider the center label word 2 times to create the batches 
"""
batch_length = 128
embedding_size = 128
skip_window = 1
num_skips = 2

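$\quad$ The following code creates the TensorFlow graph for the CBOW implementation and prepares the validation words that are later used to compute cosine similarity between vectors.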

import tensorflow as tf
import random
"""The below code performs the following operations:
   # Performing validation here by making use of a random selection of 16 words
     from the dictionary of desired size
   # Selecting 8 words randomly from range of 1000
   # Using the cosine distance to calculate the similarity between the words
"""
tf_cbow_graph = tf.Graph()
with tf_cbow_graph.as_default():
    validation_cnt = 16
    validation_dict = 100
    validation_words = np.array(random.sample(range(validation_dict), validation_cnt//2))
    validation_words = np.append(validation_words, random.sample(range(1000, 1000+validation_dict), validation_cnt//2))
    train_dataset = tf.placeholder(tf.int32, shape=[batch_length, 2 * skip_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_length, 1])
    validation_data = tf.constant(validation_words, dtype=tf.int32)
"""Embeddings for all the words present in the vocabulary"""
with tf_cbow_graph.as_default():
    vocabulary_size = len(rev_dictionary_)
    word_embed = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    # Averaging embeddings across the full context into a single embedding layer
    context_embeddings = []
    for i in range(2 * skip_window):
        context_embeddings.append(tf.nn.embedding_lookup(word_embed, train_dataset[:, i]))
    embedding = tf.reduce_mean(tf.stack(axis=0, values=context_embeddings), 0, keepdims=False)

$\quad$ The following code computes a sampled softmax loss with 64 negative samples and optimizes the weights, biases, and word embeddings during training.

import math
"""The code includes the following:
   # Initializing weights and bias to be used in the softmax layer
   # Loss function calculation using the Negative Sampling
   # Usage of AdaGrad Optimizer
   # Negative sampling on 64 words, to be included in the loss function
"""
with tf_cbow_graph.as_default():
    sf_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0/math.sqrt(embedding_size)))
    sf_bias = tf.Variable(tf.zeros([vocabulary_size]))
    loss_fn = tf.nn.sampled_softmax_loss(weights=sf_weights, biases=sf_bias, inputs=embedding,
                                         labels=train_labels, num_sampled=64, num_classes=vocabulary_size)
    cost_fn = tf.reduce_mean(loss_fn)
    """Using AdaGrad as optimizer"""
    optim = tf.train.AdagradOptimizer(1.0).minimize(cost_fn)

$\quad$ Cosine similarity is then computed to check how close semantically similar words are to each other.

"""
Using the cosine distance to calculate the similarity between
the batches and embeddings of other words 
"""
with tf_cbow_graph.as_default():
    normalization_embed = word_embed / tf.sqrt(tf.reduce_sum(tf.square(word_embed), 1, keepdims=True))
    validation_embed = tf.nn.embedding_lookup(normalization_embed, validation_data)
    word_similarity = tf.matmul(validation_embed, tf.transpose(normalization_embed))

with tf.Session(graph=tf_cbow_graph) as sess:
    sess.run(tf.global_variables_initializer())
    avg_loss = 0
    for step in range(num_steps):
        batch_words, batch_label_ = cbow_batch_creation(batch_length, skip_window)
        _, l = sess.run([optim, loss_fn], feed_dict={train_dataset: batch_words, 
                                                     train_labels: batch_label_})
        avg_loss += l
        if step % 2000 == 0:
            if step > 0:
                avg_loss = avg_loss / 2000
            print("Average loss at step %d: %f" % (step, np.mean(avg_loss)))
            avg_loss = 0
        if step % 10000 == 0:
            sim = word_similarity.eval()
            for i in range(validation_cnt):
                valid_word = rev_dictionary_[validation_words[i]]
                # number of nearest neighbors
                top_k = 8
                nearest = (-sim[i, :]).argsort()[1: top_k + 1]
                log = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = rev_dictionary_[nearest[k]]
                    log = "%s, %s" % (log, close_word)
                print(log)
    final_embeddings = sess.run(normalization_embed)

Partial output of the run:

Average loss at step 0: 8.627106
Nearest to in:, political, as, first, pejorative, violent, positive, giantess, hamm
Nearest to a:, originated, up, anarchy, anarchism, most, radicals, way, is
Nearest to all:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to these:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to time:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to while:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to at:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to by:, defined, tauroctony, soured, hibbert, dimitri, celia, headstock, hamm
Nearest to quite:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to taking:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to mainly:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to cost:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to stage:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to san:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to writers:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines
Nearest to freedom:, marys, soured, hibbert, dimitri, celia, headstock, hamm, basslines

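$\quad$ Visualize with t-SNE: project the 128-dimensional vectors of 250 words (indices 1 to 250 of the frequency-ordered vocabulary) into two dimensions.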

from sklearn.manifold import TSNE
from matplotlib import pylab
num_points = 250
tsne = TSNE(perplexity=30, n_components=2, init="pca", n_iter=5000)
embeddings_2d = tsne.fit_transform(final_embeddings[1: num_points+1, :])

def cbow_plot(embeddings, labels):
    pylab.figure(figsize=(12, 12))
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2),
                       textcoords="offset points", ha="right", va="bottom")
    pylab.show()
words = [rev_dictionary_[i] for i in range(1, num_points + 1)]
cbow_plot(embeddings_2d, words)