0 Preface
The code in this article comes from Natural Language Processing with TensorFlow (《TensorFlow自然语言处理》) by Thushan Ganegedara. On top of the author's code I have added some comments of my own (the author's comments are in English; the ones I added are in Chinese in the GitHub version). The code has been uploaded to GitHub; here is the link.
If anything is wrong or not explained clearly, please leave a comment below and I will fix it once I see it.
If you have questions about how GloVe works, you can refer to my earlier article: GloVe原理与公式讲解 (GloVe: principles and formulas explained).
The TensorFlow version used is 1.8.0.
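The code blocks below come from a single notebook, so the shared imports are not repeated in every snippet. Based on the calls they make, a minimal set of imports would look roughly like this (my own reconstruction, not the author's import cell):

```python
# Shared imports assumed by the snippets below (reconstructed, not the author's original import cell)
import os
import bz2
import random
import collections
from math import ceil
from urllib.request import urlretrieve

import numpy as np
import nltk  # nltk.word_tokenize may additionally require nltk.download('punkt')
import tensorflow as tf  # this post uses TensorFlow 1.8.0
from scipy.sparse import lil_matrix
```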
1 Downloading the Dataset
url = 'http://www.evanjones.ca/software/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)
If you would rather not download it this way, you can also fetch the file directly from http://www.evanjones.ca/software/wikipedia2text-extracted.txt.bz2.
2 Reading the Dataset
This step mainly consists of reading the data into a string, converting it to lowercase, and tokenizing it into words. The data is read 1 MB at a time.
def read_data(filename):
    """
    Extract the first file enclosed in a zip file as a list of words
    and pre-processes it using the nltk python library
    """
    with bz2.BZ2File(filename) as f:
        data = []
        file_size = os.stat(filename).st_size
        chunk_size = 1024 * 1024  # reading 1 MB at a time as the dataset is moderately large
        print('Reading data...')
        for i in range(ceil(file_size//chunk_size)+1):
            bytes_to_read = min(chunk_size, file_size-(i*chunk_size))
            file_string = f.read(bytes_to_read).decode('utf-8')
            file_string = file_string.lower()  # convert the text to lowercase
            # tokenizes a string to words residing in a list
            file_string = nltk.word_tokenize(file_string)  # tokenization
            data.extend(file_string)
    return data

words = read_data(filename)
print('Data size %d' % len(words))
token_count = len(words)
print('Example words (start): ', words[:10])
print('Example words (end): ', words[-10:])
Output:
Reading data...
Data size 3361192
Example words (start): ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end): ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']
3 Building the Vocabulary
The vocabulary is built according to the following rules. To make the variables below easier to understand, take "I like to go to school" as an example.

- `dictionary`: maps each word to an ID (e.g. {'I': 0, 'like': 1, 'to': 2, 'go': 3, 'school': 4})
- `reverse_dictionary`: maps each ID back to its word (e.g. {0: 'I', 1: 'like', 2: 'to', 3: 'go', 4: 'school'})
- `count`: a list of (word, frequency) tuples (e.g. [('I', 1), ('like', 1), ('to', 2), ('go', 1), ('school', 1)])
- `data`: the text with every word replaced by its ID (e.g. [0, 1, 2, 3, 2, 4])

Rare words are replaced with the special token `UNK`. Only the 50000 most common words are kept in the vocabulary.
# we restrict our vocabulary size to 50000
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    # Gets only the vocabulary_size most common words as the vocabulary
    # All the other words will be replaced with UNK token
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    # Create an ID for each word by giving the current length of the dictionary
    # And adding that item to the dictionary
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    # Traverse through all the text we have and produce a list
    # where each element corresponds to the ID of the word found at that index
    for word in words:
        # If word is in the dictionary use the word ID,
        # else use the ID of the special token "UNK"
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    # update the count variable with the number of UNK occurences
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    # Make sure the dictionary is of size of the vocabulary
    assert len(dictionary) == vocabulary_size
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.
Since the author is the same person and this is the same code, I already walked through how this block works in my earlier Word2Vec post, 《TensorFlow学习笔记(3)——TensorFlow实现Word2Vec》, Part 4. If this code is unclear, you can jump over there.
Output:
Most common words (+UNK) [['UNK', 68751], ('the', 226893), (',', 184013), ('.', 120919), ('of', 116323)]
Sample data [1721, 9, 8, 16479, 223, 4, 5168, 4459, 26, 11597]
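As a quick sanity check (not part of the original code), the mappings can be exercised directly; the IDs below follow the output printed above:

```python
# dictionary maps words to IDs, reverse_dictionary maps IDs back to words
print(dictionary['the'])      # 1, since 'the' is the most frequent real word above ('UNK' takes ID 0)
print(reverse_dictionary[1])  # 'the'
# words outside the 50000-word vocabulary come back as 'UNK'
print([reverse_dictionary[idx] for idx in data[:5]])
```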
4 Generating GloVe Batch Data
`batch` holds the center words and `labels` holds the words in each center word's context window. For every center word we read `2 * window_size + 1` words at a time, which we call a `span`. Within each `span` there is 1 center word and `2 * window_size` context words. The function keeps going like this until `batch_size` data points have been created. Whenever we reach the end of the word sequence, we start again from the beginning.

- `batch`: a $1 \times 8$ vector of center words;
- `labels`: an $8 \times 1$ vector of context words;
- `weights`: a $1 \times 8$ vector holding, for each co-occurrence of word $i$ with word $j$, the weight $\frac{1}{d}$, where $d$ is the distance between the two words.
data_index = 0

def generate_batch(batch_size, window_size):
    # data_index is updated by 1 every time we read a data point
    global data_index

    # two numpy arrays to hold target words (batch)
    # and context words (labels)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    weights = np.ndarray(shape=(batch_size), dtype=np.float32)

    # span defines the total window size, where
    # data we consider at an instance looks as follows.
    # [ skip_window target skip_window ]
    span = 2 * window_size + 1

    # The buffer holds the data contained within the span
    buffer = collections.deque(maxlen=span)

    # Fill the buffer and update the data_index
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    # This is the number of context words we sample for a single target word
    num_samples = 2*window_size

    # We break the batch reading into two for loops
    # The inner for loop fills in the batch and labels with
    # num_samples data points using data contained within the span
    # The outer for loop repeats this for batch_size//num_samples times
    # to produce a full batch
    for i in range(batch_size // num_samples):
        k = 0
        # avoid the target word itself as a prediction
        # fill in batch and label numpy arrays
        for j in list(range(window_size)) + list(range(window_size + 1, 2 * window_size + 1)):
            batch[i * num_samples + k] = buffer[window_size]
            labels[i * num_samples + k, 0] = buffer[j]
            # since j skips over window_size, j - window_size is never 0
            weights[i * num_samples + k] = abs(1.0 / (j - window_size))
            k += 1

        # Every time we read num_samples data points,
        # we have created the maximum number of datapoints possible
        # within a single span, so we need to move the span by 1
        # to create a fresh new span
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels, weights

print('data:', [reverse_dictionary[di] for di in data[:9]])

for window_size in [2, 4]:
    data_index = 0
    batch, labels, weights = generate_batch(batch_size=8, window_size=window_size)
    print('\nwith window_size = %d:' % window_size)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    print('    weights:', [w for w in weights])
Here `weights` reflects what the paper says: "In all cases we use a decreasing weighting function, so that word pairs that are $d$ words apart contribute $1/d$ to the total count."
Output:
data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']
with window_size = 2:
batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']
labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']
weights: [0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5]
with window_size = 4:
batch: ['set', 'set', 'set', 'set', 'set', 'set', 'set', 'set']
labels: ['propaganda', 'is', 'a', 'concerted', 'of', 'messages', 'aimed', 'at']
weights: [0.25, 0.33333334, 0.5, 1.0, 1.0, 0.5, 0.33333334, 0.25]
Looking at the output for `window_size = 4`: in this window the center word is `set`, its left context is `['propaganda', 'is', 'a', 'concerted']` and its right context is `['of', 'messages', 'aimed', 'at']`. The entries of `batch` and `labels` correspond one-to-one (e.g. `labels[0]` is a context word of `batch[0]`). Taking `propaganda` as an example, its distance from `set` is 4 ($4 - 0 = 4$), so `weights[0] = 1/4 = 0.25`. The `window_size = 2` case works the same way.
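To double-check the $1/d$ weighting, the expected weights for the `window_size = 4` example can be recomputed by hand (a small check of my own, not the author's code):

```python
# Context offsets around the centre word 'set' are -4..-1 and +1..+4,
# so each weight should be 1/|d| for distance d
window_size = 4
offsets = list(range(-window_size, 0)) + list(range(1, window_size + 1))
print([1.0 / abs(d) for d in offsets])
# [0.25, 0.333..., 0.5, 1.0, 1.0, 0.5, 0.333..., 0.25] -- matches the printed weights
```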
5 Generating the Co-occurrence Matrix
# We are creating the co-occurrence matrix as a compressed sparse column matrix from scipy.
cooc_data_index = 0
dataset_size = len(data)  # We iterate through the full text
skip_window = 4  # How many words to consider left and right.

# The sparse matrix that stores the word co-occurences
cooc_mat = lil_matrix((vocabulary_size, vocabulary_size), dtype=np.float32)
print(cooc_mat.shape)

def generate_cooc(batch_size, skip_window):
    '''
    Generate co-occurence matrix by processing batches of data
    '''
    data_index = 0
    print('Running %d iterations to compute the co-occurance matrix'%(dataset_size//batch_size))
    for i in range(dataset_size//batch_size):
        # Printing progress
        if i>0 and i%100000==0:
            print('\tFinished %d iterations'%i)
        # Generating a single batch of data
        batch, labels, weights = generate_batch(batch_size, skip_window)
        labels = labels.reshape(-1)
        # Incrementing the sparse matrix entries accordingly
        # inp: ID of the center word i
        # lbl: ID of the context word j
        # w:   weighted co-occurrence of i and j
        for inp, lbl, w in zip(batch, labels, weights):
            cooc_mat[inp, lbl] += (1.0*w)

# Generate the matrix
generate_cooc(8, skip_window)

# Just printing some parts of co-occurance matrix
print('Sample chunks of co-occurance matrix')

# Basically calculates the highest cooccurance of several chosen word
for i in range(10):
    idx_target = i

    # get the ith row of the sparse matrix and make it dense
    ith_row = cooc_mat.getrow(idx_target)
    ith_row_dense = ith_row.toarray('C').reshape(-1)  # co-occurrence counts; entries missing from ith_row become 0

    # select target words only with a reasonable words around it.
    # keep sampling until we find a word whose total count X_i lies between 10 and 50000
    while np.sum(ith_row_dense)<10 or np.sum(ith_row_dense)>50000:
        # Choose a random word
        idx_target = np.random.randint(0, vocabulary_size)
        # get the ith row of the sparse matrix and make it dense
        ith_row = cooc_mat.getrow(idx_target)
        ith_row_dense = ith_row.toarray('C').reshape(-1)

    print('\nTarget Word: "%s"'%reverse_dictionary[idx_target])

    # sort_indices sorts ith_row_dense in ascending order of count; the result is an array of indices
    sort_indices = np.argsort(ith_row_dense).reshape(-1)  # indices with highest count of ith_row_dense
    # reverse so that the counts come out in descending order
    sort_indices = np.flip(sort_indices, axis=0)  # reverse the array (to get max values to the start)

    # printing several context words to make sure cooc_mat is correct
    print('Context word:', end='')
    for j in range(10):
        idx_context = sort_indices[j]
        print('"%s"(id:%d,count:%.2f), '%(reverse_dictionary[idx_context],idx_context,ith_row_dense[idx_context]),end='')
    print()
Here the author uses `lil_matrix` from `scipy.sparse`. As the original paper points out, the co-occurrence matrix is sparse, so using `lil_matrix` saves memory. `lil_matrix(arg1, shape=None, dtype=None, copy=False)` is a row-based linked-list sparse matrix: it keeps the non-zero entries in two lists, `data` holding the non-zero values of each row and `rows` holding the columns where those values sit. This format is also well suited to adding elements one at a time and to fast row access [4]. In short, `lil_matrix` only stores the rows, columns, and values of the non-zero elements; every other position is 0.
Partial output:
(50000, 50000)
Running 420149 iterations to compute the co-occurance matrix
Finished 100000 iterations
Finished 200000 iterations
Finished 300000 iterations
Finished 400000 iterations
Sample chunks of co-occurance matrix
...
Target Word: "to"
Context word:"the"(id:1,count:2481.16), ","(id:2,count:989.33), "."(id:3,count:689.00), "a"(id:8,count:579.83), "and"(id:5,count:573.08), "be"(id:30,count:553.83), "of"(id:4,count:470.50), "UNK"(id:0,count:470.00), "in"(id:6,count:412.25), "is"(id:9,count:283.42),
The logic here is simple: each call grabs 8 data points (`batch_size`), so covering the whole text takes 420149 calls. Each call yields one center word together with its 8 context words (`window_size = 4`), plus the weighted co-occurrences of the center word with those context words within the window, and these are then added into the co-occurrence matrix.
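Once `generate_cooc` has finished, individual entries of `cooc_mat` can be read directly by indexing with word IDs. A small inspection of my own (assuming both words are inside the 50000-word vocabulary) would look like this:

```python
# Weighted co-occurrence X_ij accumulated for a specific (centre word, context word) pair
i, j = dictionary['propaganda'], dictionary['is']
print('X_ij for ("propaganda", "is"): %.2f' % cooc_mat[i, j])
```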
6 The GloVe Algorithm
6.1 Defining Hyperparameters
- `batch_size`: number of samples in a single batch;
- `embedding_size`: dimension of the embedding vectors;
- `window_size`: size of the context window;
- `valid_examples`: randomly chosen validation words (fixed as constants once chosen);
- `epsilon`: keeps the $\log$ in the loss from diverging.
batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
window_size = 4 # How many words to consider left and right.
# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50
# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)
num_sampled = 32 # Number of negative examples to sample.
epsilon = 1 # used for the stability of log in the loss function
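To see which words actually ended up in the validation set, `valid_examples` can be mapped back to words using the vocabulary from section 3 (a quick check, not part of the original code):

```python
# 16 IDs sampled from the 50 most frequent words, plus 16 sampled from IDs 1000-1049
print([reverse_dictionary[idx] for idx in valid_examples])
```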
6.2 Defining Inputs and Outputs
For each batch we create `placeholders` for the training inputs and labels, and a constant tensor for the validation data.
tf.reset_default_graph()
# Training input data (target word IDs).
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
The `valid_dataset` here corresponds to `valid_examples` from 6.1. `train_dataset` and `train_labels` are used to look up the word vectors for each batch; see 6.4 for details.
6.3 Defining Model Parameters and Other Variables
- `in_embeddings`: $W$, of shape $50000 \times 128$;
- `in_bias_embeddings`: $b$, a vector of length $50000$;
- `out_embeddings`: $\tilde{W}$, of shape $50000 \times 128$;
- `out_bias_embeddings`: $\tilde{b}$, a vector of length $50000$.

The word embeddings are initialized from a uniform distribution over $[-1, 1]$, and the biases from a uniform distribution over $[0, 0.01]$.
# Variables.
in_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), name='embeddings')
in_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size], 0.0, 0.01, dtype=tf.float32), name='embeddings_bias')
out_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), name='embeddings')
out_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size], 0.0, 0.01, dtype=tf.float32), name='embeddings_bias')
This defines the word embeddings $W$ and $\tilde{W}$, as well as the bias terms $b$ and $\tilde{b}$ that appear in the loss function.
6.4 Defining the Model Computation
Four lookup operations are defined: `embed_in`, `embed_out`, `embed_bias_in`, and `embed_bias_out`.

- `weights_x`: a `batch_size`-length vector holding the weighting function values $f(X_{ij})$;
- `x_ij`: a `batch_size`-length vector holding the co-occurrence counts $X_{ij}$ of word $i$ with word $j$.

Loss function: $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(1 + X_{ij}) \right)^2$
# Look up embeddings for inputs and outputs
# Have two seperate embedding vector spaces for inputs and outputs
embed_in = tf.nn.embedding_lookup(in_embeddings, train_dataset)
embed_out = tf.nn.embedding_lookup(out_embeddings, train_labels)
embed_bias_in = tf.nn.embedding_lookup(in_bias_embeddings,train_dataset)
embed_bias_out = tf.nn.embedding_lookup(out_bias_embeddings,train_labels)
# weights used in the cost function
weights_x = tf.placeholder(tf.float32,shape=[batch_size],name='weights_x')
# Cooccurence value for that position
x_ij = tf.placeholder(tf.float32,shape=[batch_size],name='x_ij')
# Compute the loss defined in the paper. Note that
# I'm not following the exact equation given (which is computing a pair of words at a time)
# I'm calculating the loss for a batch at one time, but the calculations are identical.
# I also made an assumption about the bias, that it is a smaller type of embedding
loss = tf.reduce_mean(
    weights_x * (tf.reduce_sum(embed_in*embed_out,axis=1) + embed_bias_in + embed_bias_out - tf.log(epsilon+x_ij))**2)
Here the word vectors and bias entries for each batch are looked up with `train_dataset` and `train_labels`, and the results are plugged into the loss function. Since the original paper points out that $\log(0)$ diverges, $\log(1 + X_{ij})$ is used to avoid the problem.
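To make the objective concrete, here is a small NumPy sketch of the same batch loss (my own illustration under the definitions above; the argument names are hypothetical stand-ins for the looked-up rows):

```python
import numpy as np

def glove_batch_loss(w_in, w_out, b_in, b_out, f_x, x_ij, epsilon=1.0):
    """Weighted least-squares GloVe loss averaged over a batch.
    w_in, w_out: (batch_size, embedding_size) centre / context vectors
    b_in, b_out: (batch_size,) centre / context biases
    f_x:         (batch_size,) weighting function values f(X_ij)
    x_ij:        (batch_size,) co-occurrence counts X_ij
    """
    inner = np.sum(w_in * w_out, axis=1) + b_in + b_out   # w_i^T w~_j + b_i + b~_j
    return np.mean(f_x * (inner - np.log(epsilon + x_ij)) ** 2)
```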
6.5 Computing Similarity
This part uses cosine similarity to measure how similar two words are; the details of how it is used are in 6.7.
# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
embeddings = (in_embeddings + out_embeddings)/2.0  # X = U + V
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))  # L2 norm of each row of the matrix
normalized_embeddings = embeddings / norm  # L2 normalization
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)  # look up the validation words
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))  # cosine similarity
The cosine similarity here relies on L2-normalizing the embeddings, so that $|\vec{A}| \times |\vec{B}| = 1$ and the dot product directly gives the cosine similarity between two words.
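The same idea in plain NumPy: once every row is divided by its L2 norm, a matrix product of the normalized vectors is exactly the matrix of cosine similarities (a sketch of my own, not the author's code):

```python
import numpy as np

def cosine_similarities(valid_vecs, all_vecs):
    """Rows are unit-normalized, so dot products equal cosine similarities."""
    valid_norm = valid_vecs / np.linalg.norm(valid_vecs, axis=1, keepdims=True)
    all_norm = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    return valid_norm @ all_norm.T
```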
6.6 Defining the Optimizer
The Adagrad optimizer is used here.
# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
6.7 Running the GloVe Model
Train for `num_steps` steps. At regular intervals we evaluate the algorithm on a fixed validation set and print the words closest to each validation word.
The results show that as training progresses, the words closest to the validation words keep changing.
num_steps = 100001
glove_loss = []

average_loss = 0

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    tf.global_variables_initializer().run()
    print('Initialized')

    for step in range(num_steps):

        # generate a single batch (data,labels,co-occurance weights)
        batch_data, batch_labels, batch_weights = generate_batch(
            batch_size, skip_window)

        # since the co-occurrence matrix is already computed, the batch_weights returned above are not needed
        # Computing the weights required by the loss function
        batch_weights = []  # weighting used in the loss function
        batch_xij = []      # weighted frequency of finding i near j

        # Compute the weights for each datapoint in the batch
        for inp, lbl in zip(batch_data, batch_labels.reshape(-1)):
            # 100: x_max, 0.75: 3/4, point_weight: f(X_ij), batch_xij: co-occurrence count of words i and j
            point_weight = (cooc_mat[inp,lbl]/100.0)**0.75 if cooc_mat[inp,lbl] < 100.0 else 1.0
            batch_weights.append(point_weight)
            batch_xij.append(cooc_mat[inp,lbl])

        batch_weights = np.clip(batch_weights, -100, 1)
        batch_xij = np.asarray(batch_xij)

        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss. Specifically we provide
        # train_dataset/train_labels: training inputs and training labels
        # weights_x: measures the importance of a data point with respect to how much those two words co-occur
        # x_ij: co-occurence matrix value for the row and column denoted by the words in a datapoint
        feed_dict = {train_dataset: batch_data.reshape(-1), train_labels: batch_labels.reshape(-1),
                     weights_x: batch_weights, x_ij: batch_xij}

        _, l = session.run([optimizer, loss], feed_dict=feed_dict)

        # Update the average loss variable
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            glove_loss.append(average_loss)
            average_loss = 0

        # Here we compute the top_k closest words for a given validation word
        # in terms of the cosine distance
        # We do this for all the words in the validation set
        # Note: This is an expensive step
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)

    final_embeddings = normalized_embeddings.eval()
Partial output (showing step 0, i.e. the initial state, and step 100000):
Average loss at step 0: 8.672687
Nearest to ,: pitcher, discharges, pigs, tolerant, fuzzy, medium-, on-campus, eduskunta,
Nearest to this: mediastinal, destined, implementing, honolulu, non-mormon, juniors, tycho, powered,
Nearest to most: translating, absolute, 111, bechet, adam, aleksey, penetrators, rake,
Nearest to but: motown, ridged, beginnings, shareholder, resurfacing, english, intelligence, o'dea,
Nearest to is: higher-quality, kitchener, kelley, confronted, m15, stanislaus, depictions, buf,
Nearest to ): encyclopedic, commute, symbiotic, forecasts, 1993., 243-year, cenwealh, inclosure,
Nearest to not: toulon, discount, dunblane, vividly, recorded, olive, afrikaansche, german-speaking,
Nearest to with: tofu, expansive, penned, grids, 102, drought, merced, cunningham,
Nearest to ;: all-electric, internationally-recognised, czars, 12–16, kana, immaculate, innings, wnba,
Nearest to a: non-residents, presumption, cephas, tau, stepfather, beside, aorist, vom,
Nearest to for: bitterroots, sx-64, weekday, edificio, sousley, self-proclaimed, whoever, liquid,
Nearest to have: dissenting, barret, psilocybin, massamba-débat, kopfstein, 5.5, fillmore, innovator,
Nearest to was: ., is, most, wheelchair, 1575, warm-blooded, dynamically, 1913.,
Nearest to 's: eoka, melancholia, downs, gallipoli, reichswehr, easter, chest, construed,
Nearest to were: 1138, djuna, 3, beni, high-grade, slander, agency, séamus,
Nearest to be: knelt, horrors, assistant, hospitalised, 1802, fierce, cinemas, magnified,
...
Average loss at step 100000: 0.019544
Nearest to ,: ., the, in, a, of, and, ,, is,
Nearest to this: ), (, ``, UNK, or, ., in, ,,
Nearest to most: ., the, of, ,, and, for, a, to,
Nearest to but: ), UNK, '', or, and, ,, in, .,
Nearest to is: 's, the, of, at, world, ., in, on,
Nearest to ): were, in, ., and, ,, the, by, is,
Nearest to not: (, ``, UNK, ), '', of, 's, the,
Nearest to with: been, had, to, has, be, that, a, may,
Nearest to ;: a, such, an, ,, for, and, with, is,
Nearest to a: the, was, ., in, and, ,, to, of,
Nearest to for: are, by, and, ,, in, to, the, was,
Nearest to have: is, was, that, also, this, not, has, a,
Nearest to was: ., of, in, and, ,, 's, for, to,
Nearest to 's: it, is, has, there, this, are, was, not,
Nearest to were: a, as, is, with, and, ,, to, for,
Nearest to be: was, it, when, had, that, his, in, ,,
From these results we can see that as training progresses, the words closest to each validation word keep changing and become more and more reasonable (for example, the words initially nearest to `be` look random, while after 100000 steps they include `was`).
As for the overall logic of the code:

- Each iteration generates a batch of center words `batch_data` and their in-window context words `batch_labels`;
- Iterating over these pairs, we compute the weighting function $f(X_{ij}) = \left(\frac{X_{ij}}{x_{\rm max}}\right)^{0.75}$ and extract the co-occurrence count $X_{ij}$ (see the sketch after this list);
- `np.clip()` caps the weighting function at a maximum value of 1;
- The data is fed into the loss `loss` defined in 6.4 and trained with the optimizer `optimizer` defined in 6.6.
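The `point_weight` expression together with `np.clip` implements the paper's weighting function; pulled out as a standalone function with $x_{\rm max}=100$ and $\alpha=0.75$ as in the code above, a sketch of my own would be:

```python
def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): (X_ij / x_max)^alpha when X_ij < x_max, otherwise 1.0."""
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0

print(glove_weight(10.0))   # ~0.178
print(glove_weight(250.0))  # 1.0
```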
References
[1] Thushan Ganegedara. Natural Language Processing with TensorFlow (TensorFlow自然语言处理)[M]. 北京: 机械工业出版社, 2019: 88-90.
[2] Jeffrey Pennington, Richard Socher, Christopher D. Manning. Glove: Global Vectors for Word Representation[C]// Conference on Empirical Methods in Natural Language Processing. 2014.
[3] AI研习社-译站. 【官方】【中英】CS224n 斯坦福深度自然语言处理课 @雷锋字幕组[EB/OL]. (2019-01-22)[2021-07-06]. https://www.bilibili.com/video/BV1pt411h7aT?p=3
[4] -柚子皮-. SciPy教程 - 稀疏矩阵库scipy.sparse[EB/OL]. (2014-12-06)[2021-07-08]. https://blog.csdn.net/pipisorry/article/details/41762945
[5] TaoTao Yu. embedding_lookup的学习笔记[EB/OL]. (2019-08-04)[2021-07-08]. https://blog.csdn.net/hit0803107/article/details/98377030