0 Preface
The code in this article comes from Natural Language Processing with TensorFlow by Thushan Ganegedara. On top of the author's code I have added some annotations of my own (the author's original comments are in English). The code has been uploaded to GitHub; the link is here.
If there are any mistakes or parts that are not explained clearly, please leave a comment below and I will fix them once I see it.
For the theory behind Word2Vec and its two optimizations, hierarchical softmax and negative sampling, see my two earlier articles: Word2Vec原理与公式详细推导 and Word2Vec之Hierarchical Softmax与Negative Sampling.
The TensorFlow version used is 1.8.0.
1 Preparing the dataset
There is not much to say about preparing the dataset: this step simply downloads the data.
url = 'http://www.evanjones.ca/software/'
def maybe_download(filename, expected_bytes):
"""Download a file if not present, and make sure it's the right size."""
if not os.path.exists(filename):
print('Downloading file...')
filename, _ = urlretrieve(url + filename, filename)
statinfo = os.stat(filename)
if statinfo.st_size == expected_bytes:
print('Found and verified %s' % filename)
else:
print(statinfo.st_size)
raise Exception(
'Failed to verify ' + filename + '. Can you get to it with a browser?')
return filename
filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)
I am not sure why, but opening this site in a browser gives Not Found, even though the data can still be downloaded.
2 Reading the data without preprocessing
def read_data(filename):
"""Extract the first file enclosed in a zip file as a list of words"""
with bz2.BZ2File(filename) as f:
data = []
file_string = f.read().decode('utf-8')
file_string = nltk.word_tokenize(file_string)
data.extend(file_string)
return data
words = read_data(filename)
print('Data size %d' % len(words))
print('Example words (start): ',words[:10])
print('Example words (end): ',words[-10:])
This step reads the data from the downloaded file and tokenizes it. Since no preprocessing is done and there are more than ten million tokens, this line of code runs very slowly; moreover, the later sections do not use this unpreprocessed data.
The output is as follows:
Data size 11634727
Example words (start): ['Propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end): ['useless', 'for', 'cultivation', '.', 'and', 'people', 'have', 'sex', 'there', '.']
3 Reading the data with preprocessing
def read_data(filename):
"""
Extract the first file enclosed in a zip file as a list of words
and pre-processes it using the nltk python library
"""
with bz2.BZ2File(filename) as f:
data = []
file_size = os.stat(filename).st_size
chunk_size = 1024 * 1024 # reading 1 MB at a time as the dataset is moderately large
print('Reading data...')
for i in range(ceil(file_size//chunk_size)+1):
bytes_to_read = min(chunk_size,file_size-(i*chunk_size))
file_string = f.read(bytes_to_read).decode('utf-8')
file_string = file_string.lower()
# tokenizes a string to words residing in a list
file_string = nltk.word_tokenize(file_string)
data.extend(file_string)
return data
words = read_data(filename)
print('Data size %d' % len(words))
print('Example words (start): ',words[:10])
print('Example words (end): ',words[-10:])
The preprocessing here consists of two main parts: first, the file is read 1 MB at a time; second, all words are converted to lowercase.
The output is as follows:
Reading data...
Data size 3361192
Example words (start): ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end): ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']
4 Building the dictionary
This step builds the mapping between words and IDs. Taking "I like to go to school" as an example, the mappings are as follows:
- dictionary: maps words to IDs (e.g. {'I': 0, 'like': 1, 'to': 2, 'go': 3, 'school': 4}).
- reverse_dictionary: maps IDs back to words (i.e. dictionary with keys and values swapped) (e.g. {0: 'I', 1: 'like', 2: 'to', 3: 'go', 4: 'school'}).
- count: a list whose elements are tuples of a word and its frequency (e.g. [('I', 1), ('like', 1), ('to', 2), ('go', 1), ('school', 1)]).
- data: the words of the text, represented by their IDs (e.g. [0, 1, 2, 3, 2, 4]).
- UNK: rare words, i.e. every word outside the 50000 most frequent ones.
# we restrict our vocabulary size to 50000
vocabulary_size = 50000
def build_dataset(words):
count = [['UNK', -1]] # a list (not a tuple) because the -1 will be updated later
# Gets only the vocabulary_size most common words as the vocabulary
# All the other words will be replaced with UNK token
# i.e. keep the 50000 most common words and map everything else to 'UNK'
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = dict()
# Create an ID for each word by giving the current length of the dictionary
# And adding that item to the dictionary
# word: the word itself, _: its frequency (number of occurrences)
# this step builds the word-to-ID mapping
# dictionary: {'word1': 0, 'word2': 1, 'word3': 2, ...}
for word, _ in count:
dictionary[word] = len(dictionary)
data = list()
unk_count = 0 # counts how many UNK tokens occur
# Traverse through all the text we have and produce a list
# where each element corresponds to the ID of the word found at that index
# if the word is in dictionary, use its ID
# otherwise it is UNK, whose ID is 0 (UNK is also the first entry of dictionary)
for word in words:
# If word is in the dictionary use the word ID,
# else use the ID of the special token "UNK"
if word in dictionary:
index = dictionary[word]
else:
index = 0 # dictionary['UNK']
unk_count = unk_count + 1
data.append(index)
# update the count variable with the number of UNK occurrences
# i.e. fill in how many UNK tokens there were
count[0][1] = unk_count
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
# Make sure the dictionary is of size of the vocabulary
assert len(dictionary) == vocabulary_size
return data, count, dictionary, reverse_dictionary
data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words # Hint to reduce memory.
First, about count: the description above says that the elements of count are tuples, yet the first line of build_dataset defines count = [['UNK', -1]]. This is because the -1 still needs to be updated later, so count[0] is defined as a list. In other words, count ends up looking like this:
count = [
['UNK', 68751],
('the', 226893),
...
('suggested', 336)
]
Next, the way dictionary is built is quite clever: the words are taken out of count (where they are already sorted in descending order of frequency) and inserted into dictionary. Since count is produced by collections.Counter, there are no duplicate words; each word's ID is the current length of the dictionary, and because exactly one key-value pair is added per iteration, the length grows by 1 each time, so the IDs increase by 1.
After that, each word in the text is looked up in dictionary and its ID is appended to data, and reverse_dictionary is built by swapping the keys and values of dictionary.
The output is as follows:
Most common words (+UNK) [['UNK', 68751], ('the', 226893), (',', 184013), ('.', 120919), ('of', 116323)]
Sample data [1721, 9, 8, 16479, 223, 4, 5168, 4459, 26, 11597]
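To make the mapping concrete, here is a minimal sketch (not the author's build_dataset, which is hard-wired to vocabulary_size = 50000 and would fail its assert on a toy corpus) that applies the same logic to the example sentence above. Note that the IDs come out in frequency order, with UNK first, rather than in the order-of-appearance IDs used in the illustrative mapping earlier, and that the tie order among equally frequent words may vary by Python version:
import collections

toy_words = ['I', 'like', 'to', 'go', 'to', 'school']
toy_vocab_size = 6  # UNK + the 5 distinct words; chosen just for this toy example

toy_count = [['UNK', -1]]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocab_size - 1))
toy_dictionary = {word: idx for idx, (word, _) in enumerate(toy_count)}
toy_data = [toy_dictionary.get(word, 0) for word in toy_words]
toy_reverse = dict(zip(toy_dictionary.values(), toy_dictionary.keys()))

print(toy_count)       # [['UNK', -1], ('to', 2), ('I', 1), ('like', 1), ('go', 1), ('school', 1)]
print(toy_dictionary)  # {'UNK': 0, 'to': 1, 'I': 2, 'like': 3, 'go': 4, 'school': 5}
print(toy_data)        # [2, 3, 1, 4, 1, 5]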
5 Defining batches for skip-gram
Skip-gram predicts a word's context from the word itself; its network structure is shown in the figure below.
Note that the hidden layer of Word2Vec is linear, so there is no activation function in the hidden layer.
This step defines the inputs and their output labels. Let batch be the input (target) words and labels be the context words of those inputs, and let span cover the window plus the target word, of size $2 \times {\rm window\_size} + 1$. Since window_size is the window size on one side only, it has to be multiplied by 2, which means the context size is $2 \times {\rm window\_size}$.
data_index = 0
def generate_batch_skip_gram(batch_size, window_size):
# data_index is updated by 1 everytime we read a data point
# refer to the module-level variable
global data_index
# print('global data_index:', data_index)
# print('batch_size:', batch_size)
# print('window_size:', window_size)
# two numpy arrays to hold target words (batch)
# and context words (labels)
# batch: uninitialized array of shape (batch_size,)
# labels: uninitialized array of shape (batch_size, 1)
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
# span defines the total window size, where
# data we consider at an instance looks as follows.
# [ skip_window target skip_window ]
# span: 2 * window_size + 1, i.e. left/right context plus the target word
span = 2 * window_size + 1
# The buffer holds the data contained within the span
# a deque can be appended to from both the left and the right
buffer = collections.deque(maxlen=span)
# Fill the buffer and update the data_index
# append the word IDs to the buffer
# after this loop, data_index points one position past the window
# e.g. with window_size = 2, span = 5, data_index = 5 after the loop
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
# print('data_index: ', data_index)
# print('buffer before loop:', buffer)
# This is the number of context words we sample for a single target word
# size of the context
num_samples = 2*window_size
# We break the batch reading into two for loops
# The inner for loop fills in the batch and labels with
# num_samples data points using data contained within the span
# The outer for loop repeats this batch_size//num_samples times
# to produce a full batch
# assuming window_size = 2 (and batch_size = 8)
# range(batch_size // num_samples): 0, 1
for i in range(batch_size // num_samples):
k=0
# avoid the target word itself as a prediction
# fill in batch and label numpy arrays
# assuming window_size = 2
# list(range(window_size)): [0, 1]
# list(range(window_size + 1, 2 * window_size + 1)): [3, 4]
# j ranges over [0, 1, 3, 4], skipping 2, i.e. the target word
for j in list(range(window_size))+list(range(window_size+1,2*window_size+1)):
batch[i * num_samples + k] = buffer[window_size]
labels[i * num_samples + k, 0] = buffer[j]
k += 1
# Every time we read num_samples data points,
# we have created the maximum number of data points possible
# within a single span, so we need to move the span by 1
# to create a fresh new span
# since the buffer's maxlen equals span (2 * window_size + 1),
# appending a new element pushes the oldest element out,
# which slides the window one step to the right
buffer.append(data[data_index])
# print('buffer after change:', buffer)
data_index = (data_index + 1) % len(data)
return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:8]])
for window_size in [1, 2]:
data_index = 0
# batch: the target words
# labels: the context words
# labels[i] is a context word of the target word batch[i]
batch, labels = generate_batch_skip_gram(batch_size=8, window_size=window_size)
print('\nwith window_size = %d:' %window_size)
print(' batch:', [reverse_dictionary[bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
Here buffer is a queue created with collections.deque(). Such a queue can be appended to and popped from on both ends, and it can be given a fixed maximum length; once that length is reached, appending a new element pushes the oldest element out (appending on the right pops the leftmost element).
In other words, buffer stores the target word and its context words for the current iteration. Take the earlier example "I like to go to school": with window_size=1, span=3, buffer stores ['I', 'like', 'to'] in the first iteration, batch is 'like', and labels are ['I', 'to'].
There is one detail worth noting, namely this piece of code:
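As a quick illustration of this deque behaviour (a minimal sketch, independent of the training data):
import collections

buffer = collections.deque(maxlen=3)
for word in ['I', 'like', 'to']:
    buffer.append(word)
print(buffer)        # deque(['I', 'like', 'to'], maxlen=3)

buffer.append('go')  # the oldest element 'I' is pushed out on the left
print(buffer)        # deque(['like', 'to', 'go'], maxlen=3)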
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
The key is data_index = (data_index + 1) % len(data): even in the last iteration, data_index is still incremented by 1. For example, with window_size=1 (span=3), data_index is 3 after this loop finishes. Combining this with the code inside the later loop:
buffer.append(data[data_index])
# print('buffer after change:', buffer)
data_index = (data_index + 1) % len(data)
gives the sliding window: at the end of each iteration, buffer appends the element at data_index=3 and pushes out the element at data_index=0.
The output is as follows:
data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']
with window_size = 1:
batch: ['is', 'is', 'a', 'a', 'concerted', 'concerted', 'set', 'set']
labels: ['propaganda', 'a', 'is', 'concerted', 'a', 'set', 'concerted', 'of']
with window_size = 2:
batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']
labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']
In this output, the indices of batch and labels correspond one to one, i.e. labels[i] is a context word of the target word batch[i].
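For example, zipping the two arrays makes the (target, context) pairs explicit. A small sketch using the functions and globals defined above, for window_size = 1 (the expected pairs are taken from the output shown earlier):
data_index = 0
batch, labels = generate_batch_skip_gram(batch_size=8, window_size=1)
pairs = list(zip([reverse_dictionary[bi] for bi in batch],
                 [reverse_dictionary[li] for li in labels.reshape(8)]))
print(pairs)
# [('is', 'propaganda'), ('is', 'a'), ('a', 'is'), ('a', 'concerted'),
#  ('concerted', 'a'), ('concerted', 'set'), ('set', 'concerted'), ('set', 'of')]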
6 Skip-gram
6.1 Defining hyperparameters
The hyperparameters defined here are:
- batch_size: number of samples in a single batch
- embedding_size: size of the embedding vectors (the number of neurons in the hidden layer)
- window_size: context size (on each side of the target word)
- valid_size: number of validation examples
- valid_window: size of the validation window (validation indices are sampled from this window)
- num_sampled: number of negative samples
batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
window_size = 4 # How many words to consider left and right.
# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50
# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)
num_sampled = 32 # Number of negative examples to sample.
Here random.sample() draws elements from a sequence at random without modifying the order of the sequence itself.
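A quick sketch of this behaviour:
import random

pool = list(range(50))
picked = random.sample(pool, 16)      # 16 distinct elements drawn at random
print(len(picked), len(set(picked)))  # 16 16 -- sampled without replacement
print(len(pool))                      # 50 -- the original list is untouched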
6.2 Defining input and output placeholders
tf.reset_default_graph()
# Training input data (target word IDs).
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
- train_dataset: of size $128 \times 1$, the input data for each batch.
- train_labels: of size $128 \times 1$, the labels for each batch of input data.
- valid_dataset: of size $32 \times 1$, the validation data.
6.3 Defining model parameters and other variables
# Variables
# Embedding layer, contains the word embeddings
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# Softmax Weights and Biases
softmax_weights = tf.Variable(
tf.truncated_normal([vocabulary_size, embedding_size],
stddev=0.5 / math.sqrt(embedding_size))
)
softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01))
truncated_normal(): draws random numbers from a truncated normal distribution; any sample more than two standard deviations away from the mean is redrawn.
- embeddings: $W$, the input-to-hidden weight matrix, a $50000 \times 128$ tensor, uniformly distributed in $[-1, 1]$.
- softmax_weights: $W'$, the hidden-to-output weight matrix, a $50000 \times 128$ tensor, truncated normal with mean 0 and standard deviation $\frac{0.5}{\sqrt{128}}$.
- softmax_biases: $b$, the output-layer bias vector, of size $50000 \times 1$, uniformly distributed in $[0, 0.01]$.
One would normally expect softmax_weights to have size $128 \times 50000$. The reason it is $50000 \times 128$ instead lies in the tf.nn.sampled_softmax_loss function used in 6.4, whose documentation defines the weights argument as:
weights: A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-sharded) class embeddings.
If it is not obvious which of num_classes and dim is which, the definition of biases makes it clear:
biases: A Tensor of shape [num_classes]. The class biases.
So num_classes is the number of output neurons, which is why softmax_weights has size $50000 \times 128$.
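To see why the [num_classes, dim] layout is convenient, here is a rough numpy sketch of the shape flow only; it is not the real tf.nn.sampled_softmax_loss, which also includes the true class and corrects the logits for the sampling probabilities:
import numpy as np

batch_size, dim, num_classes, num_sampled = 128, 128, 50000, 32

embed = np.random.randn(batch_size, dim)             # hidden-layer outputs for the batch
softmax_weights = np.random.randn(num_classes, dim)  # [num_classes, dim]
softmax_biases = np.zeros(num_classes)

sampled = np.random.choice(num_classes, num_sampled, replace=False)  # sampled (negative) classes
w_sampled = softmax_weights[sampled]                  # rows can be gathered directly
logits = embed @ w_sampled.T + softmax_biases[sampled]
print(logits.shape)                                   # (128, 32)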
6.4 Defining the model computation
For the model computation, the lookup function embedding_lookup() is first used to obtain the hidden-layer vector for each given input, and the negative-sampling loss tf.nn.sampled_softmax_loss is defined as well.
# Model.
# Look up embeddings for a batch of inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)
# Compute the softmax loss, using a sample of the negative labels each time.
# computes the mean loss over the batch
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(
weights=softmax_weights, biases=softmax_biases, inputs=embed,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)
)
Why the input IDs can be used directly to look up vectors in embeddings is a question I will leave open for now; it is explained below together with running the model.
6.5 Computing word similarity
Cosine similarity is used here to measure how similar two words are.
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
The formula for cosine similarity is:
$$\cos\theta = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \times |\vec{B}|}$$
- norm: the norm of each row of the matrix (an $N \times 1$ vector).
- normalized_embeddings: this is TensorFlow's broadcasting division: if matrix $\mathbf{A}$ has size $N \times M$, the divisor $\vec{v}$ must have size $N \times 1$, and every element in row $i$ of $\mathbf{A}$ is divided by element $i$ of $\vec{v}$. Here every element of embeddings is divided by the norm of its own row, which is simply L2-normalizing each row.
- valid_embeddings: extracts the rows of normalized_embeddings that correspond to the validation words.
- similarity: since each row has been L2-normalized, every row has norm 1, i.e. $|\vec{A}| \times |\vec{B}| = 1$, so the cosine similarity reduces to $\cos\theta = \vec{A} \cdot \vec{B}$; all that is left to do is take the dot product of the two vectors.
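The same computation in plain numpy terms (a minimal sketch with random vectors, just to show that normalizing the rows first turns the dot product into a cosine similarity):
import numpy as np

emb = np.random.randn(5, 4)                              # 5 word vectors of dimension 4
norm = np.sqrt(np.sum(emb ** 2, axis=1, keepdims=True))  # per-row norms, shape (5, 1)
normalized = emb / norm                                   # every row now has norm 1

sim = normalized @ normalized.T                           # all pairwise cosine similarities
cos_01 = emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(np.allclose(sim[0, 1], cos_01))                     # True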
6.6 Optimizer
The optimizer used is Adagrad, with the learning rate set to 1.0. Adagrad addresses the issue that different parameters should be updated at different rates: it adaptively assigns each parameter its own learning rate.
# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
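As a reminder of what Adagrad does, here is a minimal numpy sketch of its per-parameter update rule, with made-up gradients; this is a simplified illustration, not TensorFlow's exact implementation (which, for instance, starts the accumulator at a small non-zero value):
import numpy as np

def adagrad_step(theta, grad, accum, lr=1.0, eps=1e-8):
    # accumulate squared gradients, then scale each parameter's step by 1/sqrt(accumulator)
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta, accum = np.zeros(3), np.zeros(3)
theta, accum = adagrad_step(theta, np.array([1.0, 0.1, 0.0]), accum)  # both non-zero gradients move by about -1
theta, accum = adagrad_step(theta, np.array([1.0, 0.0, 0.0]), accum)  # only the first parameter has a gradient now
print(theta)  # approx. [-1.707, -1., 0.]: the frequently-updated parameter's steps shrink over time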
6.7 Running skip-gram
num_steps = 100001
skip_losses = []
# ConfigProto is a way of providing various configuration settings
# required to execute the graph
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
# Initialize the variables in the graph
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
# Train the Word2vec model for num_step iterations
for step in range(num_steps):
# Generate a single batch of data
batch_data, batch_labels = generate_batch_skip_gram(
batch_size, window_size)
# Populate the feed_dict and run the optimizer (minimize loss)
# and compute the loss
# _: discarded; it just runs the optimizer op
# l: the loss value
feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
# Update the average loss variable
average_loss += l
# report the average loss every 2000 steps
if (step + 1) % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
skip_losses.append(average_loss)
# The average loss is an estimate of the loss over the last 2000 batches.
print('Average loss at step %d: %f' % (step + 1, average_loss))
average_loss = 0
# Evaluating validation set word similarities
if (step + 1) % 10000 == 0:
sim = similarity.eval()
# Here we compute the top_k closest words for a given validation word
# in terms of the cosine distance
# We do this for all the words in the validation set
# Note: This is an expensive step
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]] # the validation word itself
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k + 1] # indices sorted by similarity, descending
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
skip_gram_final_embeddings = normalized_embeddings.eval()
# We will save the word vectors learned and the loss over time
# as this information is required later for comparisons
np.save('skip_embeddings',skip_gram_final_embeddings)
with open('skip_losses.csv', 'wt') as f:
writer = csv.writer(f, delimiter=',')
writer.writerow(skip_losses)
The main difficulty here, I think, is the question raised in 6.4: why can the input IDs be used directly to look up rows of the input-to-hidden weight matrix $W$?
On the one hand you could say the author is showing off by streamlining the code; on the other hand you could say it simply makes the code less readable. Looking at the code: train_dataset receives batch_data, which (as explained in section 5) contains the IDs of the input words. That train_dataset is fed into embed in 6.4 to look up vectors in embeddings, and the result is then passed into loss. Since the input layer of Word2Vec receives the one-hot encoding of each word, the hidden layer receives
$$\mathbf{h} = \mathbf{W}^{\rm T}\mathbf{x} = \mathbf{W}_{k,\cdot}^{\rm T},$$
i.e. the $k$-th row of the matrix $\mathbf{W}$. So directly looking up the row of $\mathbf{W}$ that corresponds to the input word yields exactly $\mathbf{W}_{k,\cdot}^{\rm T}$.
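A small numpy check of this equivalence (a sketch with a toy W; the only point is that multiplying by a one-hot vector selects a row, which is exactly what embedding_lookup does):
import numpy as np

V, N = 6, 4                  # toy vocabulary and embedding sizes
W = np.random.randn(V, N)    # input-to-hidden weights (the embeddings)

k = 3                        # word ID
x = np.zeros(V)
x[k] = 1.0                   # one-hot encoding of word k

h = W.T @ x                  # hidden-layer activation: W^T x
print(np.allclose(h, W[k]))  # True: same as simply looking up row k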
The rest computes word similarities and prints the loss; the comments should be enough to follow it.
6.8 Visualizing what skip-gram has learned
I only skimmed the visualization part, since the core is the training process, so here I simply paste the code along with the notes I wrote while reading it.
6.8.1 Keeping only clustered words rather than sparsely distributed ones
def find_clustered_embeddings(embeddings,distance_threshold,sample_threshold):
'''
Find only the closely clustered embeddings.
This gets rid of more sparsly distributed word embeddings and make the visualization clearer
This is useful for t-SNE visualization
distance_threshold: maximum distance between two points to qualify as neighbors
sample_threshold: number of neighbors required to be considered a cluster
'''
# calculate cosine similarity
cosine_sim = np.dot(embeddings,np.transpose(embeddings))
norm = np.dot(np.sum(embeddings**2,axis=1).reshape(-1,1),np.sum(np.transpose(embeddings)**2,axis=0).reshape(1,-1))
assert cosine_sim.shape == norm.shape
cosine_sim /= norm # cosine similarity
# make all the diagonal entries zero otherwise this will be picked as highest
# the diagonal entries are the dot products of each word vector with itself
# i.e. we do not consider a word's similarity with itself (which would always be the highest)
np.fill_diagonal(cosine_sim, -1.0) # set the diagonal entries of cosine_sim to -1
argmax_cos_sim = np.argmax(cosine_sim, axis=1)
mod_cos_sim = cosine_sim
# find the maximums in a loop to count if there are more than n items above threshold
# set the per-row maxima to -1, one per iteration
for _ in range(sample_threshold-1):
argmax_cos_sim = np.argmax(cosine_sim, axis=1) # pick one maximum per row in this iteration
mod_cos_sim[np.arange(mod_cos_sim.shape[0]),argmax_cos_sim] = -1 # set the maxima picked in this iteration to -1
max_cosine_sim = np.max(mod_cos_sim,axis=1) # after the loop, take the remaining maximum of each row
return np.where(max_cosine_sim>distance_threshold)[0]
6.8.2 Computing a t-SNE visualization of the embeddings with sklearn
num_points = 1000 # we will use a large sample space to build the T-SNE manifold and then prune it using cosine similarity
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
print('Fitting embeddings to T-SNE. This can take some time ...')
# get the T-SNE manifold
selected_embeddings = skip_gram_final_embeddings[:num_points, :] # take the first 1000 embedding vectors
two_d_embeddings = tsne.fit_transform(selected_embeddings)
print('Pruning the T-SNE embeddings')
# prune the embeddings by getting ones only more than n-many sample above the similarity threshold
# this unclutters the visualization
selected_ids = find_clustered_embeddings(selected_embeddings,.25,10) # get the IDs of the clustered words
two_d_embeddings = two_d_embeddings[selected_ids,:]
print('Out of ',num_points,' samples, ', selected_ids.shape[0],' samples were selected by pruning')
6.8.3 Plotting the t-SNE result with matplotlib
def plot(embeddings, labels):
n_clusters = 20 # number of clusters
# automatically build a discrete set of colors, each for cluster
label_colors = [pylab.cm.Spectral(float(i) /n_clusters) for i in range(n_clusters)]
assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
# Define K-Means
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(embeddings)
kmeans_labels = kmeans.labels_
pylab.figure(figsize=(15,15)) # in inches
# plot all the embeddings and their corresponding words
for i, (label,klabel) in enumerate(zip(labels,kmeans_labels)):
x, y = embeddings[i,:]
pylab.scatter(x, y, c=label_colors[klabel])
pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
ha='right', va='bottom',fontsize=10)
# use for saving the figure if needed
#pylab.savefig('word_embeddings.png')
pylab.show()
words = [reverse_dictionary[i] for i in selected_ids]
plot(two_d_embeddings, words)
The resulting figure is shown below:
7 CBOW
7.1 Changing the data generation process
Since the network structure of CBOW differs from that of skip-gram, the data generation process has to be changed. The network structure of CBOW is shown below.
Because CBOW has multiple inputs, the size of the input array changes from ${\rm batch\_size} \times 1$ to ${\rm batch\_size} \times ({\rm context\_window} \times 2)$.
data_index = 0
def generate_batch_cbow(batch_size, window_size):
# window_size is the amount of words we're looking at from each side of a given word
# creates a single batch
# the input is the context of word i, with window_size words on each side
# data_index is updated by 1 everytime we read a set of data point
global data_index
# span defines the total window size, where
# data we consider at an instance looks as follows.
# [ skip_window target skip_window ]
# e.g if skip_window = 2 then span = 5
# left and right context words plus the target word
span = 2 * window_size + 1 # [ skip_window target skip_window ]
# two numpy arrays to hold target words (batch)
# and context words (labels)
# Note that batch has span-1=2*window_size columns
# batch: the context words of the target word
# labels: the target word
batch = np.ndarray(shape=(batch_size,span-1), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
# The buffer holds the data contained within the span
buffer = collections.deque(maxlen=span)
# Fill the buffer and update the data_index
for _ in range(span):
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
# Here we do the batch reading
# We iterate through each batch index
# For each batch index, we iterate through span elements
# to fill in the columns of batch array
# i: index of the current target word within the batch
for i in range(batch_size):
target = window_size # target label at the center of the buffer
target_to_avoid = [ window_size ] # we only need to know the words around a given word, not the word itself
# add selected target to avoid_list for next time
col_idx = 0 # column index of the context word within batch
# j: index of the context word within buffer
for j in range(span):
# ignore the target word when creating the batch
if j==span//2:
continue
batch[i,col_idx] = buffer[j]
col_idx += 1
labels[i, 0] = buffer[target]
# Everytime we read a data point,
# we need to move the span by 1
# to create a fresh new span
# slide the window
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
return batch, labels
print('data:', [reverse_dictionary[di] for di in data[:8]])
for window_size in [1,2]:
data_index = 0
batch, labels = generate_batch_cbow(batch_size=8, window_size=window_size)
print('\nwith window_size = %d:' % (window_size))
print(' batch:', [[reverse_dictionary[bii] for bii in bi] for bi in batch])
print(' labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
This is similar to generate_batch_skip_gram() from section 5; the difference lies in batch and labels.
for i in range(batch_size):
target = window_size # target label at the center of the buffer
target_to_avoid = [ window_size ] # we only need to know the words around a given word, not the word itself
# add selected target to avoid_list for next time
col_idx = 0 # column index of the context word within batch
# j: index of the context word within buffer
for j in range(span):
# ignore the target word when creating the batch
if j==span//2:
continue
batch[i,col_idx] = buffer[j]
col_idx += 1
labels[i, 0] = buffer[target]
# Everytime we read a data point,
# we need to move the span by 1
# to create a fresh new span
# slide the window
buffer.append(data[data_index])
data_index = (data_index + 1) % len(data)
Here span // 2 is used to skip the target word. Since CBOW predicts the target word from its context, the target word becomes labels and the inputs become the target word's context. Because buffer holds the target word together with its context, the loop has to skip the index of the target word. In each iteration labels records the word to predict and batch records the context words.
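A tiny sketch of how the loop picks the context positions and skips the centre (assuming window_size = 2, so span = 5):
window_size = 2
span = 2 * window_size + 1          # 5
context_positions = [j for j in range(span) if j != span // 2]
print(context_positions)            # [0, 1, 3, 4] -- position 2 (the target word) is skipped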
The output is as follows:
data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']
with window_size = 1:
batch: [['propaganda', 'a'], ['is', 'concerted'], ['a', 'set'], ['concerted', 'of'], ['set', 'messages'], ['of', 'aimed'], ['messages', 'at'], ['aimed', 'influencing']]
labels: ['is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']
with window_size = 2:
batch: [['propaganda', 'is', 'concerted', 'set'], ['is', 'a', 'set', 'of'], ['a', 'concerted', 'of', 'messages'], ['concerted', 'set', 'messages', 'aimed'], ['set', 'of', 'aimed', 'at'], ['of', 'messages', 'at', 'influencing'], ['messages', 'aimed', 'influencing', 'the'], ['aimed', 'at', 'the', 'opinions']]
labels: ['a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
7.2 Defining hyperparameters
batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
# How many words to consider left and right.
# Skip gram by design does not require to have all the context words in a given step
# However, for CBOW that's a requirement, so we limit the window size
window_size = 2
# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50
# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)
num_sampled = 32 # Number of negative examples to sample.
- batch_size: number of samples in a single batch
- embedding_size: size of the embedding vectors (the hidden layer)
- window_size: context size
7.3 Defining inputs and outputs
tf.reset_default_graph()
# Training input data (target word IDs). Note that it has 2*window_size columns
train_dataset = tf.placeholder(tf.int32, shape=[batch_size,2*window_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
- train_dataset: the training data fed in each batch, of size $128 \times 4$ (each input consists of 4 context words).
- train_labels: the input labels, of size $128 \times 1$ (each group of 4 context words has a single output word).
- valid_dataset: the validation set.
7.4 Defining model parameters and other variables
- embeddings: the input-to-hidden weight matrix $W$, of size $V \times N$, uniformly distributed in $[-1, 1]$.
- softmax_weights: the hidden-to-output weight matrix $W'$, of size $V \times N$, truncated normal with mean 0 and standard deviation $\frac{0.5}{\sqrt{128}}$.
- softmax_biases: the output-layer bias $b$, of size $V$, uniformly distributed in $[0, 0.01]$.
# Variables.
# Embedding layer, contains the word embeddings
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0,dtype=tf.float32))
# Softmax Weights and Biases
softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
stddev=0.5 / math.sqrt(embedding_size),dtype=tf.float32))
softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01))
7.5 Defining the model computation
# Model.
# Look up embeddings for a batch of inputs.
# Here we do embedding lookups for each column in the input placeholder
# and then average them to produce an embedding_size word vector
stacked_embedings = None # will hold the stacked context vectors of each target word
print('Defining %d embedding lookups representing each word in the context'%(2*window_size))
for i in range(2*window_size):
embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:,i]) # look up the embeddings of the words in column i of train_dataset
x_size,y_size = embedding_i.get_shape().as_list()
if stacked_embedings is None:
stacked_embedings = tf.reshape(embedding_i,[x_size,y_size,1])
else:
stacked_embedings = tf.concat(axis=2,values=[stacked_embedings,tf.reshape(embedding_i,[x_size,y_size,1])])
assert stacked_embedings.get_shape().as_list()[2]==2*window_size
print("Stacked embedding size: %s"%stacked_embedings.get_shape().as_list())
mean_embeddings = tf.reduce_mean(stacked_embedings,2,keepdims=False) # average the stacked context vectors
print("Reduced mean embedding size: %s"%mean_embeddings.get_shape().as_list())
# Compute the softmax loss, using a sample of the negative labels each time.
# inputs are embeddings of the train words
# with this loss we optimize weights, biases, embeddings
loss = tf.reduce_mean(
tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=mean_embeddings,
labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))
The output is as follows:
Defining 4 embedding lookups representing each word in the context
Stacked embedding size: [128, 128, 4]
Reduced mean embedding size: [128, 128]
This step mainly computes the input vector. Since CBOW takes multiple context words as input, their vectors have to be averaged before being handed to the hidden layer. The for loop stacks the input vectors together, and tf.reduce_mean() then averages them.
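In plain numpy, the stacking and averaging amounts to the following (a minimal sketch of the shape manipulation only, with random placeholder values, not the TensorFlow graph itself):
import numpy as np

batch_size, embedding_size, context_size = 128, 128, 4  # context_size = 2 * window_size

# one embedding matrix per context position, each of shape (batch_size, embedding_size, 1)
context_embeddings = [np.random.randn(batch_size, embedding_size, 1)
                      for _ in range(context_size)]
stacked = np.concatenate(context_embeddings, axis=2)    # (128, 128, 4)
mean_embeddings = stacked.mean(axis=2)                  # (128, 128), fed to the loss
print(stacked.shape, mean_embeddings.shape)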
The loss function is defined in the same way as in the skip-gram model, so I will not go over it again.
7.6 Optimizer for the model parameters
Adagrad is again used as the optimizer.
# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
7.7 Computing word similarity
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
This is exactly the same as defined earlier, so I will not repeat it.
7.8 Running the CBOW model
num_steps = 100001
cbow_losses = []
# ConfigProto is a way of providing various configuration settings
# required to execute the graph
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
# Initialize the variables in the graph
tf.global_variables_initializer().run()
print('Initialized')
average_loss = 0
# Train the Word2vec model for num_step iterations
for step in range(num_steps):
# Generate a single batch of data
batch_data, batch_labels = generate_batch_cbow(batch_size, window_size)
# Populate the feed_dict and run the optimizer (minimize loss)
# and compute the loss
feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
_, l = session.run([optimizer, loss], feed_dict=feed_dict)
# Update the average loss variable
average_loss += l
if (step+1) % 2000 == 0:
if step > 0:
average_loss = average_loss / 2000
# The average loss is an estimate of the loss over the last 2000 batches.
cbow_losses.append(average_loss)
print('Average loss at step %d: %f' % (step+1, average_loss))
average_loss = 0
# Evaluating validation set word similarities
if (step+1) % 10000 == 0:
sim = similarity.eval()
# Here we compute the top_k closest words for a given validation word
# in terms of the cosine distance
# We do this for all the words in the validation set
# Note: This is an expensive step
for i in range(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log = 'Nearest to %s:' % valid_word
for k in range(top_k):
close_word = reverse_dictionary[nearest[k]]
log = '%s %s,' % (log, close_word)
print(log)
cbow_final_embeddings = normalized_embeddings.eval()
np.save('cbow_embeddings',cbow_final_embeddings)
with open('cbow_losses.csv', 'wt') as f:
writer = csv.writer(f, delimiter=',')
writer.writerow(cbow_losses)
The only difference from skip-gram is that the generated batch_data and batch_labels are different, so the arguments passed into loss differ as well.
7.9 Visualizing what CBOW has learned
7.9.1 Computing a t-SNE visualization of the embeddings with sklearn
num_points = 1000 # we will use a large sample space to build the T-SNE manifold and then prune it using cosine similarity
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
print('Fitting embeddings to T-SNE. This can take some time ...')
# get the T-SNE manifold
selected_embeddings = cbow_final_embeddings[:num_points, :] # take the first 1000 embedding vectors
two_d_embeddings = tsne.fit_transform(selected_embeddings)
print('Pruning the T-SNE embeddings')
# prune the embeddings by getting ones only more than n-many sample above the similarity threshold
# this unclutters the visualization
selected_ids = find_clustered_embeddings(selected_embeddings,.25,10) # get the IDs of the clustered words
two_d_embeddings = two_d_embeddings[selected_ids,:]
print('Out of ',num_points,' samples, ', selected_ids.shape[0],' samples were selected by pruning')
7.9.2 Plotting the t-SNE result with matplotlib
def plot(embeddings, labels):
n_clusters = 20 # number of clusters
# automatically build a discrete set of colors, each for cluster
label_colors = [pylab.cm.Spectral(float(i) /n_clusters) for i in range(n_clusters)]
assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
# Define K-Means
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(embeddings)
kmeans_labels = kmeans.labels_
pylab.figure(figsize=(15,15)) # in inches
# plot all the embeddings and their corresponding words
for i, (label,klabel) in enumerate(zip(labels,kmeans_labels)):
x, y = embeddings[i,:]
pylab.scatter(x, y, c=label_colors[klabel])
pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
ha='right', va='bottom',fontsize=10)
# use for saving the figure if needed
#pylab.savefig('word_embeddings.png')
pylab.show()
words = [reverse_dictionary[i] for i in selected_ids]
plot(two_d_embeddings, words)
8 References
[1] Thushan Ganegedara. Natural Language Processing with TensorFlow (TensorFlow自然语言处理)[M]. 北京: 机械工业出版社, 2019: 42-46.
[2] 野指针小李. Word2Vec原理与公式详细推导[EB/OL]. (2021-04-28)[2021-06-18]. https://blog.csdn.net/qq_35357274/article/details/116240180
[3] 野指针小李. Word2Vec之Hierarchical Softmax与Negative Sampling[EB/OL]. (2021-05-03)[2021-06-18]. https://blog.csdn.net/qq_35357274/article/details/116381205
[4] 101欢欢鱼. python——random.sample()的用法[EB/OL]. (2019-08-12)[2021-06-18]. https://www.cnblogs.com/fish-101/p/11339909.html
[5] TaoTao Yu. embedding_lookup的学习笔记[EB/OL]. (2019-08-04)[2021-06-18]. https://blog.csdn.net/hit0803107/article/details/98377030
[6] 阿常呓语. collections中 deque的使用[EB/OL]. (2018-06-21)[2021-06-18]. https://blog.csdn.net/u010339879/article/details/80767293