Reposted from https://blog.csdn.net/weixin_42001089/article/details/81224869
word2vec comes in two flavors: the CBOW and Skip-Gram models.
CBOW predicts the center word from its context; Skip-Gram does exactly the opposite.
This post first walks through the Skip-Gram model, based on the official TensorFlow demo. The second part is a lightly modified CBOW model, mainly based on:
https://www.cnblogs.com/pinard/p/7160330.html
The two parts are separated by a line of ###########################.
OK, let's get started!
###################################################################################################
The official TensorFlow demo:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/word2vec
(1) First, import the packages; nothing special here:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import sys
import argparse
import random
from tempfile import gettempdir
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

from tensorflow.contrib.tensorboard.plugins import projector
(2) Next, get the current path and create the log directory (used later for TensorBoard visualization); by default the log directory sits under the current directory:
current_path = os.path.dirname(os.path.realpath(sys.argv[0]))

parser = argparse.ArgumentParser()
parser.add_argument(
    '--log_dir',
    type=str,
    default=os.path.join(current_path, 'log'),
    help='The log directory for TensorBoard summaries.')
FLAGS, unparsed = parser.parse_known_args()

# Create the directory for TensorBoard variables if there is not.
if not os.path.exists(FLAGS.log_dir):
    os.makedirs(FLAGS.log_dir)
sys.argv is the bridge for fetching arguments from outside the program; sys.argv[0] is its first element, the path of the current script.
For more on its usage see: https://www.cnblogs.com/aland-1415/p/6613449.html
os.path.realpath returns the script's absolute path.
parser.parse_known_args() parses a command line of unknown length. It returns two values: the first holds the arguments that were declared, the second the ones that were not.
As a concrete example, write a test.py:
import argparse
import os
import sys

current_path = os.path.dirname(os.path.realpath(sys.argv[0]))

parser = argparse.ArgumentParser()
parser.add_argument(
    '--log_dir',
    type=str,
    default=os.path.join(current_path, 'log'),
    help='The log directory for TensorBoard summaries.')
FLAGS, unparsed = parser.parse_known_args()

print(FLAGS)
print(unparsed)
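Running this with an extra flag that test.py never declares shows the split; the output looks something like the following (the --foo flag and the path are just illustrative):

$ python test.py --foo bar
Namespace(log_dir='/home/user/log')
['--foo', 'bar']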
(3) Next, download the dataset (slightly modified here):
url = 'http://mattmahoney.net/dc/'  # corpus URL, defined near the top of the official demo


def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    # Get the file's attributes
    statinfo = os.stat(filename)
    # Check whether the file size is correct
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename +
                        '. Can you get to it with a browser?')
    return filename


filename = maybe_download('text8.zip', 31344016)
Once downloaded, a compressed file named text8.zip sits in the current folder.
(4) Build the word list
# Read the data into a list of strings.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data


vocabulary = read_data(filename)
print('Data size', len(vocabulary))
f.namelist()[0] is the first file in the archive (here the archive contains only one file anyway). Its contents are split on whitespace, so vocabulary ends up as the list of words; finally we print how many words there are.
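A quick peek at the result (text8 begins with the Wikipedia article on anarchism, which also matches the "Sample data" output shown near the end of this post):

print(vocabulary[:7])
# ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']
print(len(vocabulary))  # about 17 million words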
(5) Build a dictionary of the 50000 most frequent words; words outside the dictionary are replaced by UNK
vocabulary_size = 50000


def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)
        if index == 0:  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(
    vocabulary, vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
The line below counts every word's frequency and keeps the 50000 most frequent as candidates for the dictionary (extend appends the elements of a list):
count.extend(collections.Counter(words).most_common(n_words - 1))
data is the corpus with every word replaced by its id; words missing from the dictionary get UNK's id, 0.
Like this:
i love tensorflow very much .........
2 23 UNK 3 45 .........
count records each word's frequency, e.g. [['UNK', -1], ('a', 200), ('i', 150), ...]
dictionary maps word -> id, i.e. key: word, value: id (the smaller the id, the higher the frequency, except that the first entry is always UNK).
reversed_dictionary maps id -> word, i.e. key: id, value: word (same ordering, and again the first entry is always UNK).
The first entry is always UNK because count starts as [['UNK', -1]] and extend only appends after it.
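To make the four return values concrete, here is a toy run (hypothetical input, with the vocabulary capped at 3 words instead of 50000):

toy_words = ['i', 'love', 'tensorflow', 'i', 'love', 'python', 'i']
data, count, dictionary, reversed_dictionary = build_dataset(toy_words, n_words=3)
print(count)                # [['UNK', 2], ('i', 3), ('love', 2)]
print(dictionary)           # {'UNK': 0, 'i': 1, 'love': 2}
print(reversed_dictionary)  # {0: 'UNK', 1: 'i', 2: 'love'}
print(data)                 # [1, 2, 0, 1, 2, 0, 1]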
(6) Generate labels and batches
data_index = 0


def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels


batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]], '->', labels[i, 0],
          reverse_dictionary[labels[i, 0]])
batch_size: the batch size.
num_skips: how many times each center word is reused. With num_skips=2, for the sentence "i love tensorflow very much ...", when tensorflow is picked as the target word it is used twice while generating labels:
tensorflow -> love, tensorflow -> very
skip_window: how many context words to take on each side. skip_window=1 means one word to the left and one to the right; skip_window=2 means two on each side.
span: batch generation can be pictured as a fixed-size frame (commonly called a sliding window) moving along the data; span is the size of that frame, and span = 2 * skip_window + 1.
buffer = collections.deque(maxlen=span) allocates that window (of size 3 here): the deque holds at most span words at a time.
So the process runs like this. Suppose batch_size=6, num_skips=2, skip_window=1, and data is:
( I am looking for the missing glass-shoes who has picked it up ............. )
  2 23 56 3 45 84 123 45 23 12 1 14 ...............
batch_size // num_skips = 3, so the loop runs 3 times.
i=0: ids 2, 23, 56 enter the buffer first. context_words = [w for w in range(span) if w != skip_window] selects the window positions other than the target, i.e. the context. batch[i * num_skips + j] = buffer[skip_window] takes the middle of the window as the target (skip_window=1), so batch=23; labels[i * num_skips + j, 0] = buffer[context_word] takes its context as the labels, i.e. 2 and 56.
So now batch=[23,23], labels=[2,56] (possibly [56,2], since the right side may be sampled first); then data[data_index]=3 is appended to the buffer and data_index becomes 4 (the word "the").
i=1: with the buffer now 23, 56, 3, the assignments give batch=[23,23,56,56], labels=[2,56,23,3] (possibly reordered); data[4]=45 is appended and data_index becomes 5 ("missing").
i=2: the buffer is now 56, 3, 45, and the assignments give batch=[23,23,56,56,3,3], labels=[2,56,23,3,56,45] (possibly reordered); data[5]=84 is appended and data_index becomes 6.
The loop then ends, having produced one batch of size 6:
batch=[23,23,56,56,3,3] labels=[2,56,23,3,56,45]
Finally data_index = (data_index + len(data) - span) % len(data) backtracks data_index by span positions, to 3 (the word "for"). Because data_index is a global variable, the next call to generate_batch loads the buffer starting from "for", i.e. the next batch picks up right where this one left off.
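The whole walkthrough can be verified directly by rebinding the globals to the toy ids above (a sketch, assuming it runs in the same module as generate_batch):

data = [2, 23, 56, 3, 45, 84, 123, 45, 23, 12, 1, 14]
data_index = 0

batch, labels = generate_batch(batch_size=6, num_skips=2, skip_window=1)
print(batch)           # [23 23 56 56  3  3]
print(labels.ravel())  # a permutation such as [ 2 56 23  3 56 45]
print(data_index)      # 3 -- the next window starts loading at id 3 ('for')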
(7) Define some hyperparameters:
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.
num_sampled = 64  # Number of negative examples to sample.

graph = tf.Graph()
This simply sets the sizes of the parameters discussed above.
(8) The neural network graph (the model):
with graph.as_default():

    # Input data.
    with tf.name_scope('inputs'):
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        with tf.name_scope('embeddings'):
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss
        with tf.name_scope('weights'):
            nce_weights = tf.Variable(
                tf.truncated_normal(
                    [vocabulary_size, embedding_size],
                    stddev=1.0 / math.sqrt(embedding_size)))
        with tf.name_scope('biases'):
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    # tf.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    # Explanation of the meaning of NCE loss:
    # http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.nce_loss(
                weights=nce_weights,
                biases=nce_biases,
                labels=train_labels,
                inputs=embed,
                num_sampled=num_sampled,
                num_classes=vocabulary_size))

    # Add the loss value as a scalar to summary.
    tf.summary.scalar('loss', loss)

    # Construct the SGD optimizer using a learning rate of 1.0.
    with tf.name_scope('optimizer'):
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Compute the cosine similarity between minibatch examples and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,
                                              valid_dataset)
    similarity = tf.matmul(
        valid_embeddings, normalized_embeddings, transpose_b=True)

    # Merge all summaries.
    merged = tf.summary.merge_all()

    # Add variable initializer.
    init = tf.global_variables_initializer()

    # Create a saver.
    saver = tf.train.Saver()
This can be viewed in two parts: one trains the Skip-Gram word vectors, the other computes cosine similarities. Let's take them separately.
First, the API documentation of tf.nn.embedding_lookup:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/embedding_ops.py
def embedding_lookup(
    params,
    ids,
    partition_strategy="mod",
    name=None,
    validate_indices=True,  # pylint: disable=unused-argument
    max_norm=None):
  """Looks up `ids` in a list of embedding tensors.

  This function is used to perform parallel lookups on the list of
  tensors in `params`.  It is a generalization of
  @{tf.gather}, where `params` is
  interpreted as a partitioning of a large embedding tensor.  `params` may be
  a `PartitionedVariable` as returned by using `tf.get_variable()` with a
  partitioner.

  If `len(params) > 1`, each element `id` of `ids` is partitioned between
  the elements of `params` according to the `partition_strategy`.
  In all strategies, if the id space does not evenly divide the number of
  partitions, each of the first `(max_id + 1) % len(params)` partitions will
  be assigned one more id.

  If `partition_strategy` is `"mod"`, we assign each id to partition
  `p = id % len(params)`. For instance,
  13 ids are split across 5 partitions as:
  `[[0, 5, 10], [1, 6, 11], [2, 7, 12], [3, 8], [4, 9]]`

  If `partition_strategy` is `"div"`, we assign ids to partitions in a
  contiguous manner. In this case, 13 ids are split across 5 partitions as:
  `[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]]`

  The results of the lookup are concatenated into a dense
  tensor. The returned tensor has shape `shape(ids) + shape(params)[1:]`.
  """
Note the line "The results of the lookup are concatenated into a dense tensor. The returned tensor has shape `shape(ids) + shape(params)[1:]`." That is, if params is 100*28 and the ids are [2,56,3], the result is 3*28: the rows of params numbered 3, 57 and 4 counting from 1.
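A minimal shape check of that rule (toy sizes matching the example above):

import numpy as np
import tensorflow as tf

params = tf.constant(np.random.rand(100, 28))  # a 100*28 "embedding matrix"
ids = tf.constant([2, 56, 3])
looked_up = tf.nn.embedding_lookup(params, ids)
with tf.Session() as sess:
    print(sess.run(looked_up).shape)  # (3, 28) -- rows 3, 57 and 4, counting from 1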
Reading further down the source, you'll find it mainly delegates to the _embedding_lookup_and_transform function.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now the focus: the source of tf.nn.nce_loss (the core of this demo):
Source: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_impl.py
def nce_loss(weights,
             biases,
             labels,
             inputs,
             num_sampled,
             num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):
  """Computes and returns the noise-contrastive estimation training loss.

  See [Noise-contrastive estimation: A new estimation principle for
  unnormalized statistical
  models](http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
  Also see our [Candidate Sampling Algorithms
  Reference](https://www.tensorflow.org/extras/candidate_sampling.pdf)

  A common use case is to use this method for training, and calculate the full
  sigmoid loss for evaluation or inference. In this case, you must set
  `partition_strategy="div"` for the two losses to be consistent, as in the
  following example:

  ```python
  if mode == "train":
    loss = tf.nn.nce_loss(
        weights=weights,
        biases=biases,
        labels=labels,
        inputs=inputs,
        ...,
        partition_strategy="div")
  elif mode == "eval":
    logits = tf.matmul(inputs, tf.transpose(weights))
    logits = tf.nn.bias_add(logits, biases)
    labels_one_hot = tf.one_hot(labels, n_classes)
    loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels_one_hot,
        logits=logits)
    loss = tf.reduce_sum(loss, axis=1)
  ```

  Note: By default this uses a log-uniform (Zipfian) distribution for sampling,
  so your labels must be sorted in order of decreasing frequency to achieve
  good results.  For more details, see
  @{tf.nn.log_uniform_candidate_sampler}.

  Note: In the case where `num_true` > 1, we assign to each target class
  the target probability 1 / `num_true` so that the target probabilities
  sum to 1 per-example.

  Note: It would be useful to allow a variable number of target classes per
  example.  We hope to provide this functionality in a future release.
  For now, if you have a variable number of target classes, you can pad them
  out to a constant number by either repeating them or by padding
  with an otherwise unused class.

  Args:
    weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
        objects whose concatenation along dimension 0 has shape
        [num_classes, dim].  The (possibly-partitioned) class embeddings.
    biases: A `Tensor` of shape `[num_classes]`.  The class biases.
    labels: A `Tensor` of type `int64` and shape `[batch_size,
        num_true]`. The target classes.
    inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward
        activations of the input network.
    num_sampled: An `int`.  The number of classes to randomly sample per batch.
    num_classes: An `int`. The number of possible classes.
    num_true: An `int`.  The number of target classes per training example.
    sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
        `sampled_expected_count`) returned by a `*_candidate_sampler` function.
        (if None, we default to `log_uniform_candidate_sampler`)
    remove_accidental_hits:  A `bool`.  Whether to remove "accidental hits"
        where a sampled class equals one of the target classes.  If set to
        `True`, this is a "Sampled Logistic" loss instead of NCE, and we are
        learning to generate log-odds instead of log probabilities.  See
        our [Candidate Sampling Algorithms Reference]
        (https://www.tensorflow.org/extras/candidate_sampling.pdf).
        Default is False.
    partition_strategy: A string specifying the partitioning strategy, relevant
        if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
        Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
    name: A name for the operation (optional).

  Returns:
    A `batch_size` 1-D tensor of per-example NCE losses.
  """
  logits, labels = _compute_sampled_logits(
      weights=weights,
      biases=biases,
      labels=labels,
      inputs=inputs,
      num_sampled=num_sampled,
      num_classes=num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)
First, a look at the API:
def nce_loss(weights,
             biases,
             labels,
             inputs,
             num_sampled,
             num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):
Suppose the embedding matrix is M*N (in this demo, M=50000 is the dictionary size and N=128 the number of word2vec features). Then:
weights: M*N
biases: M (one per class, per the docstring shape [num_classes])
labels: batch_size * num_true (num_true is the number of positive classes per example; 1 in this demo)
inputs: batch_size * N
num_sampled: the number of negative samples to draw
num_classes: M
sampled_values: an optional alternative sampler, a tuple (`sampled_candidates`, `true_expected_count`, `sampled_expected_count`); if None, log_uniform_candidate_sampler is used
remove_accidental_hits: whether to discard negative samples that accidentally equal a target
partition_strategy: the partitioning strategy for parallel lookup.
The return value is a 1-D tensor of the per-example NCE losses within a batch.
Its implementation consists of three parts:
_compute_sampled_logits — the sampling
sigmoid_cross_entropy_with_logits — the logistic regression
_sum_rows — the summation.
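Before reading the source, it may help to see what these three steps compute together. Here is a rough numpy sketch of my own (it ignores the subtract_log_q correction and accidental-hit removal):

import numpy as np

def nce_loss_sketch(weights, biases, labels, inputs, sampled):
    """Simplified NCE: one true class per example, pre-chosen negative samples."""
    true_w, true_b = weights[labels], biases[labels]        # [batch, dim], [batch]
    samp_w, samp_b = weights[sampled], biases[sampled]      # [num_sampled, dim], [num_sampled]
    true_logits = np.sum(inputs * true_w, axis=1) + true_b  # [batch]
    samp_logits = inputs @ samp_w.T + samp_b                # [batch, num_sampled]
    logits = np.concatenate([true_logits[:, None], samp_logits], axis=1)
    targets = np.zeros_like(logits)
    targets[:, 0] = 1.0                                     # true=1, sampled=0
    # numerically stable sigmoid cross-entropy, then sum true + sampled columns
    losses = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return losses.sum(axis=1)                               # [batch]

rng = np.random.RandomState(0)
loss = nce_loss_sketch(rng.rand(50000, 128), np.zeros(50000),
                       labels=np.array([11, 42]), inputs=rng.rand(2, 128),
                       sampled=np.array([5, 8, 13]))
print(loss.shape)  # (2,) -- one loss per example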
(1) A look at _compute_sampled_logits
def _compute_sampled_logits(weights,
                            biases,
                            labels,
                            inputs,
                            num_sampled,
                            num_classes,
                            num_true=1,
                            sampled_values=None,
                            subtract_log_q=True,
                            remove_accidental_hits=False,
                            partition_strategy="mod",
                            name=None,
                            seed=None):
  """Helper function for nce_loss and sampled_softmax_loss functions.

  Computes sampled output training logits and labels suitable for implementing
  e.g. noise-contrastive estimation (see nce_loss) or sampled softmax (see
  sampled_softmax_loss).

  Note: In the case where num_true > 1, we assign to each target class
  the target probability 1 / num_true so that the target probabilities
  sum to 1 per-example.

  Args:
    weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
        objects whose concatenation along dimension 0 has shape
        `[num_classes, dim]`.  The (possibly-partitioned) class embeddings.
    biases: A `Tensor` of shape `[num_classes]`.  The (possibly-partitioned)
        class biases.
    labels: A `Tensor` of type `int64` and shape `[batch_size,
        num_true]`. The target classes.  Note that this format differs from
        the `labels` argument of `nn.softmax_cross_entropy_with_logits_v2`.
    inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward
        activations of the input network.
    num_sampled: An `int`.  The number of classes to randomly sample per batch.
    num_classes: An `int`. The number of possible classes.
    num_true: An `int`.  The number of target classes per training example.
    sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
        `sampled_expected_count`) returned by a `*_candidate_sampler` function.
        (if None, we default to `log_uniform_candidate_sampler`)
    subtract_log_q: A `bool`.  whether to subtract the log expected count of
        the labels in the sample to get the logits of the true labels.
        Default is True.  Turn off for Negative Sampling.
    remove_accidental_hits:  A `bool`.  whether to remove "accidental hits"
        where a sampled class equals one of the target classes.  Default is
        False.
    partition_strategy: A string specifying the partitioning strategy, relevant
        if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
        Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
    name: A name for the operation (optional).
    seed: random seed for candidate sampling. Default to None, which doesn't set
        the op-level random seed for candidate sampling.

  Returns:
    out_logits: `Tensor` object with shape
        `[batch_size, num_true + num_sampled]`, for passing to either
        `nn.sigmoid_cross_entropy_with_logits` (NCE) or
        `nn.softmax_cross_entropy_with_logits_v2` (sampled softmax).
    out_labels: A Tensor object with the same shape as `out_logits`.
  """

  if isinstance(weights, variables.PartitionedVariable):
    weights = list(weights)
  if not isinstance(weights, list):
    weights = [weights]

  with ops.name_scope(name, "compute_sampled_logits",
                      weights + [biases, inputs, labels]):
    if labels.dtype != dtypes.int64:
      labels = math_ops.cast(labels, dtypes.int64)
    labels_flat = array_ops.reshape(labels, [-1])

    # Sample the negative labels.
    #   sampled shape: [num_sampled] tensor
    #   true_expected_count shape = [batch_size, 1] tensor
    #   sampled_expected_count shape = [num_sampled] tensor
    if sampled_values is None:
      sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
          true_classes=labels,
          num_true=num_true,
          num_sampled=num_sampled,
          unique=True,
          range_max=num_classes,
          seed=seed)
    # NOTE: pylint cannot tell that 'sampled_values' is a sequence
    # pylint: disable=unpacking-non-sequence
    sampled, true_expected_count, sampled_expected_count = (
        array_ops.stop_gradient(s) for s in sampled_values)
    # pylint: enable=unpacking-non-sequence
    sampled = math_ops.cast(sampled, dtypes.int64)

    # labels_flat is a [batch_size * num_true] tensor
    # sampled is a [num_sampled] int tensor
    all_ids = array_ops.concat([labels_flat, sampled], 0)

    # Retrieve the true weights and the logits of the sampled weights.

    # weights shape is [num_classes, dim]
    all_w = embedding_ops.embedding_lookup(
        weights, all_ids, partition_strategy=partition_strategy)

    # true_w shape is [batch_size * num_true, dim]
    true_w = array_ops.slice(all_w, [0, 0],
                             array_ops.stack(
                                 [array_ops.shape(labels_flat)[0], -1]))

    sampled_w = array_ops.slice(
        all_w, array_ops.stack([array_ops.shape(labels_flat)[0], 0]), [-1, -1])
    # inputs has shape [batch_size, dim]
    # sampled_w has shape [num_sampled, dim]
    # Apply X*W', which yields [batch_size, num_sampled]
    sampled_logits = math_ops.matmul(inputs, sampled_w, transpose_b=True)

    # Retrieve the true and sampled biases, compute the true logits, and
    # add the biases to the true and sampled logits.
    all_b = embedding_ops.embedding_lookup(
        biases, all_ids, partition_strategy=partition_strategy)
    # true_b is a [batch_size * num_true] tensor
    # sampled_b is a [num_sampled] float tensor
    true_b = array_ops.slice(all_b, [0], array_ops.shape(labels_flat))
    sampled_b = array_ops.slice(all_b, array_ops.shape(labels_flat), [-1])

    # inputs shape is [batch_size, dim]
    # true_w shape is [batch_size * num_true, dim]
    # row_wise_dots is [batch_size, num_true, dim]
    dim = array_ops.shape(true_w)[1:2]
    new_true_w_shape = array_ops.concat([[-1, num_true], dim], 0)
    row_wise_dots = math_ops.multiply(
        array_ops.expand_dims(inputs, 1),
        array_ops.reshape(true_w, new_true_w_shape))
    # We want the row-wise dot plus biases which yields a
    # [batch_size, num_true] tensor of true_logits.
    dots_as_matrix = array_ops.reshape(row_wise_dots,
                                       array_ops.concat([[-1], dim], 0))
    true_logits = array_ops.reshape(_sum_rows(dots_as_matrix), [-1, num_true])
    true_b = array_ops.reshape(true_b, [-1, num_true])
    true_logits += true_b
    sampled_logits += sampled_b

    if remove_accidental_hits:
      acc_hits = candidate_sampling_ops.compute_accidental_hits(
          labels, sampled, num_true=num_true)
      acc_indices, acc_ids, acc_weights = acc_hits

      # This is how SparseToDense expects the indices.
      acc_indices_2d = array_ops.reshape(acc_indices, [-1, 1])
      acc_ids_2d_int32 = array_ops.reshape(
          math_ops.cast(acc_ids, dtypes.int32), [-1, 1])
      sparse_indices = array_ops.concat([acc_indices_2d, acc_ids_2d_int32], 1,
                                        "sparse_indices")
      # Create sampled_logits_shape = [batch_size, num_sampled]
      sampled_logits_shape = array_ops.concat(
          [array_ops.shape(labels)[:1],
           array_ops.expand_dims(num_sampled, 0)], 0)
      if sampled_logits.dtype != acc_weights.dtype:
        acc_weights = math_ops.cast(acc_weights, sampled_logits.dtype)
      sampled_logits += sparse_ops.sparse_to_dense(
          sparse_indices,
          sampled_logits_shape,
          acc_weights,
          default_value=0.0,
          validate_indices=False)

    if subtract_log_q:
      # Subtract log of Q(l), prior probability that l appears in sampled.
      true_logits -= math_ops.log(true_expected_count)
      sampled_logits -= math_ops.log(sampled_expected_count)

    # Construct output logits and labels. The true labels/logits start at col 0.
    out_logits = array_ops.concat([true_logits, sampled_logits], 1)

    # true_logits is a float tensor, ones_like(true_logits) is a float
    # tensor of ones. We then divide by num_true to ensure the per-example
    # labels sum to 1.0, i.e. form a proper probability distribution.
    out_labels = array_ops.concat([
        array_ops.ones_like(true_logits) / num_true,
        array_ops.zeros_like(sampled_logits)
    ], 1)

    return out_logits, out_labels
First, the return shapes documented at the top:
Returns:
  out_logits: `Tensor` object with shape
      `[batch_size, num_true + num_sampled]`, for passing to either
      `nn.sigmoid_cross_entropy_with_logits` (NCE) or
      `nn.softmax_cross_entropy_with_logits_v2` (sampled softmax).
  out_labels: A Tensor object with the same shape as `out_logits`.
So both out_logits and out_labels have shape [batch_size, num_true + num_sampled], where num_true + num_sampled is the number of positive plus negative samples.
Now look at the very end:
out_labels = array_ops.concat([array_ops.ones_like(true_logits) / num_true, array_ops.zeros_like(sampled_logits)], 1)
array_ops.ones_like and array_ops.zeros_like fill tensors with all ones and all zeros, as the source below shows:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/array_ops.py
@tf_export("ones_like")
def ones_like(tensor, dtype=None, name=None, optimize=True):
  """Creates a tensor with all elements set to 1.

  Given a single tensor (`tensor`), this operation returns a tensor of the same
  type and shape as `tensor` with all elements set to 1. Optionally, you can
  specify a new type (`dtype`) for the returned tensor.

  For example:

  ```python
  tensor = tf.constant([[1, 2, 3], [4, 5, 6]])
  tf.ones_like(tensor)  # [[1, 1, 1], [1, 1, 1]]
  ```

  Args:
    tensor: A `Tensor`.
    dtype: A type for the returned `Tensor`. Must be `float32`, `float64`,
      `int8`, `uint8`, `int16`, `uint16`, `int32`, `int64`,
      `complex64`, `complex128` or `bool`.
    name: A name for the operation (optional).
    optimize: if true, attempt to statically determine the shape of 'tensor'
      and encode it as a constant.

  Returns:
    A `Tensor` with all elements set to 1.
  """
  with ops.name_scope(name, "ones_like", [tensor]) as name:
    tensor = ops.convert_to_tensor(tensor, name="tensor")
    ones_shape = shape_internal(tensor, optimize=optimize)
    if dtype is None:
      dtype = tensor.dtype
    ret = ones(ones_shape, dtype=dtype, name=name)
    if not context.executing_eagerly():
      ret.set_shape(tensor.get_shape())
    return ret
To sum up:
out_logits holds the logits of the true and sampled classes;
out_labels marks the positives and negatives (positives are labeled 1, negatives 0).
This is the essence of negative sampling: since there are only two outcomes, a simple binary classification suffices, replacing a prediction over the whole dictionary (here, all 50000 classes per prediction), which drastically cuts the computational cost.
Both tensors have shape [batch_size, num_true + num_sampled].
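For instance, with num_true=1 and num_sampled=3, every row of the pair looks like this (a hand-written illustration, not demo output):

out_logits[i] = [ true_logit_i, s_logit_1, s_logit_2, s_logit_3 ]   # shape [1 + 3]
out_labels[i] = [ 1.,           0.,        0.,        0.        ]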
Also, since sampled_values=None in this demo, the following branch runs:
if sampled_values is None:
  sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
      true_classes=labels,
      num_true=num_true,
      num_sampled=num_sampled,
      unique=True,
      range_max=num_classes,
      seed=seed)
So the sampler used is candidate_sampling_ops.log_uniform_candidate_sampler:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/candidate_sampling_ops.py
def log_uniform_candidate_sampler(true_classes, num_true, num_sampled, unique,
                                  range_max, seed=None, name=None):
  """Samples a set of classes using a log-uniform (Zipfian) base distribution.

  This operation randomly samples a tensor of sampled classes
  (`sampled_candidates`) from the range of integers `[0, range_max)`.

  The elements of `sampled_candidates` are drawn without replacement
  (if `unique=True`) or with replacement (if `unique=False`) from
  the base distribution.

  The base distribution for this operation is an approximately log-uniform
  or Zipfian distribution:

  `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`

  This sampler is useful when the target classes approximately follow such
  a distribution - for example, if the classes represent words in a lexicon
  sorted in decreasing order of frequency. If your classes are not ordered by
  decreasing frequency, do not use this op.

  In addition, this operation returns tensors `true_expected_count`
  and `sampled_expected_count` representing the number of times each
  of the target classes (`true_classes`) and the sampled
  classes (`sampled_candidates`) is expected to occur in an average
  tensor of sampled classes.  These values correspond to `Q(y|x)`
  defined in [this
  document](http://www.tensorflow.org/extras/candidate_sampling.pdf).
  If `unique=True`, then these are post-rejection probabilities and we
  compute them approximately.

  Args:
    true_classes: A `Tensor` of type `int64` and shape `[batch_size,
      num_true]`. The target classes.
    num_true: An `int`.  The number of target classes per training example.
    num_sampled: An `int`.  The number of classes to randomly sample.
    unique: A `bool`. Determines whether all sampled classes in a batch are
      unique.
    range_max: An `int`. The number of possible classes.
    seed: An `int`. An operation-specific seed. Default is 0.
    name: A name for the operation (optional).

  Returns:
    sampled_candidates: A tensor of type `int64` and shape `[num_sampled]`.
      The sampled classes.
    true_expected_count: A tensor of type `float`.  Same shape as
      `true_classes`. The expected counts under the sampling distribution
      of each of `true_classes`.
    sampled_expected_count: A tensor of type `float`. Same shape as
      `sampled_candidates`. The expected counts under the sampling distribution
      of each of `sampled_candidates`.
  """
  seed1, seed2 = random_seed.get_seed(seed)
  return gen_candidate_sampling_ops.log_uniform_candidate_sampler(
      true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
      seed2=seed2, name=name)
So negative samples are drawn from the probability below. Raw word frequency is not used directly as the sampling probability, because then the negatives would almost always be high-frequency words such as and, of, i, which is clearly not good. The opposite extreme, the reciprocal of the frequency, is not representative for English either. According to a paper by Mikolov, the empirically best compromise is the unigram frequency raised to the 3/4 power: P(w) = f(w)^(3/4) / sum_w f(w)^(3/4).
This sampler does not use that formula, but it also lands between the two extremes. P(class) is still a decreasing function: the smaller class is, the larger P(class). Here class is the word's id, and from (5) we know higher frequency means a smaller id (UNK aside), so frequent words remain the likelier negative samples.
P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)
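A quick sketch of how skewed this distribution is (range_max=50000 as in this demo):

import math

def p(cls, range_max=50000):
    return (math.log(cls + 2) - math.log(cls + 1)) / math.log(range_max + 1)

for cls in [0, 1, 10, 100, 10000]:
    print(cls, p(cls))
# roughly: 0 -> 0.064, 1 -> 0.037, 10 -> 0.008, 100 -> 0.0009, 10000 -> 9e-06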
(2) Next, the sigmoid_cross_entropy_with_logits function
def sigmoid_cross_entropy_with_logits(  # pylint: disable=invalid-name
    _sentinel=None,
    labels=None,
    logits=None,
    name=None):
  """Computes sigmoid cross entropy given `logits`.

  Measures the probability error in discrete classification tasks in which each
  class is independent and not mutually exclusive.  For instance, one could
  perform multilabel classification where a picture can contain both an elephant
  and a dog at the same time.

  For brevity, let `x = logits`, `z = labels`.  The logistic loss is

        z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
      = z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
      = z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
      = z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x))
      = (1 - z) * x + log(1 + exp(-x))
      = x - x * z + log(1 + exp(-x))

  For x < 0, to avoid overflow in exp(-x), we reformulate the above

        x - x * z + log(1 + exp(-x))
      = log(exp(x)) - x * z + log(1 + exp(-x))
      = - x * z + log(1 + exp(x))

  Hence, to ensure stability and avoid overflow, the implementation uses this
  equivalent formulation

      max(x, 0) - x * z + log(1 + exp(-abs(x)))

  `logits` and `labels` must have the same type and shape.

  Args:
    _sentinel: Used to prevent positional parameters. Internal, do not use.
    labels: A `Tensor` of the same type and shape as `logits`.
    logits: A `Tensor` of type `float32` or `float64`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor` of the same shape as `logits` with the componentwise
    logistic losses.

  Raises:
    ValueError: If `logits` and `labels` do not have the same shape.
  """
  # pylint: disable=protected-access
  nn_ops._ensure_xent_args("sigmoid_cross_entropy_with_logits", _sentinel,
                           labels, logits)
  # pylint: enable=protected-access

  with ops.name_scope(name, "logistic_loss", [logits, labels]) as name:
    logits = ops.convert_to_tensor(logits, name="logits")
    labels = ops.convert_to_tensor(labels, name="labels")
    try:
      labels.get_shape().merge_with(logits.get_shape())
    except ValueError:
      raise ValueError("logits and labels must have the same shape (%s vs %s)" %
                       (logits.get_shape(), labels.get_shape()))

    # The logistic loss formula from above is
    #   x - x * z + log(1 + exp(-x))
    # For x < 0, a more numerically stable formula is
    #   -x * z + log(1 + exp(x))
    # Note that these two expressions can be combined into the following:
    #   max(x, 0) - x * z + log(1 + exp(-abs(x)))
    # To allow computing gradients at zero, we define custom versions of max and
    # abs functions.
    zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
    cond = (logits >= zeros)
    relu_logits = array_ops.where(cond, logits, zeros)
    neg_abs_logits = array_ops.where(cond, -logits, logits)
    return math_ops.add(
        relu_logits - logits * labels,
        math_ops.log1p(math_ops.exp(neg_abs_logits)),
        name=name)
The key is this formula:
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
Note that z * -log(x) + (1 - z) * -log(1 - x) is just cross entropy; in other words, this function simply applies a sigmoid to its input and then computes the cross entropy.
As shown above, it simplifies to x - x * z + log(1 + exp(-x)).
Since exp(-x) may overflow when x < 0, the equivalent form -x * z + log(1 + exp(x)) is used in that case.
Combining the two cases gives:
max(x, 0) - x * z + log(1 + exp(-abs(x)))
For background on cross entropy, see:
https://blog.csdn.net/rtygbwwwerr/article/details/50778098
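A small numpy sanity check of my own: for very negative logits the naive formula overflows while the stable one stays finite:

import numpy as np

x = np.array([-1000.0, -1.0, 0.0, 1.0])  # logits
z = np.array([1.0, 0.0, 1.0, 0.0])       # labels

naive = x - x * z + np.log1p(np.exp(-x))                          # inf at x=-1000
stable = np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))  # 1000.0 there
print(naive)
print(stable)  # matches naive wherever naive does not overflow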
(3) Finally, _sum_rows
def _sum_rows(x):
  """Returns a vector summing up each row of the matrix x."""
  # _sum_rows(x) is equivalent to math_ops.reduce_sum(x, 1) when x is
  # a matrix.  The gradient of _sum_rows(x) is more efficient than
  # reduce_sum(x, 1)'s gradient in today's implementation. Therefore,
  # we use _sum_rows(x) in the nce_loss() computation since the loss
  # is mostly used for training.
  cols = array_ops.shape(x)[1]
  ones_shape = array_ops.stack([cols, 1])
  ones = array_ops.ones(ones_shape, x.dtype)
  return array_ops.reshape(math_ops.matmul(x, ones), [-1])
This one is simple: it sums every row of the matrix by multiplying with a column of ones. Given the [batch_size, num_true + num_sampled] shape above, that is exactly the true loss plus the sampled losses, i.e. the total loss of each example in the batch.
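The equivalence is easy to check in numpy:

import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])
ones = np.ones((x.shape[1], 1))
print((x @ ones).ravel())  # [ 6. 15.] -- identical to x.sum(axis=1)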
That concludes tf.nn.nce_loss, the most important part.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
As the code shows, the chosen optimizer is GradientDescentOptimizer, and normalized_embeddings = embeddings / norm normalizes the word2vec matrix.
Continuing from above, now the cosine-similarity part:
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,
                                          valid_dataset)
similarity = tf.matmul(
    valid_embeddings, normalized_embeddings, transpose_b=True)
This is straightforward: it looks up the word vectors of valid_dataset and computes cosine similarity, which measures closeness by the cosine of the angle between two vectors. At angle 0 the vectors coincide and are most similar, with the cosine at its maximum of 1. The demo randomly picks 16 words from the 100 most frequent:
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
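As a standalone sanity check of this similarity computation, here is a numpy sketch (random stand-in embeddings; the word ids 17 and 42 are arbitrary):

import numpy as np

emb = np.random.rand(50000, 128)  # stand-in for the trained embeddings
normalized = emb / np.sqrt((emb ** 2).sum(axis=1, keepdims=True))
valid = normalized[[17, 42]]      # two query words
sim = valid @ normalized.T        # [2, 50000] cosine similarities
nearest = (-sim[0]).argsort()[1:9]  # 8 nearest neighbours, skipping the word itself
print(nearest)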
(9) Create the session
num_steps = 100001

with tf.Session(graph=graph) as session:
    # Open a writer to write summaries.
    writer = tf.summary.FileWriter(FLAGS.log_dir, session.graph)

    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips,
                                                    skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # Define metadata variable.
        run_metadata = tf.RunMetadata()

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        # Also, evaluate the merged op to get all summaries from the returned "summary" variable.
        # Feed metadata variable to session for visualizing the graph in TensorBoard.
        _, summary, loss_val = session.run(
            [optimizer, merged, loss],
            feed_dict=feed_dict,
            run_metadata=run_metadata)
        average_loss += loss_val

        # Add returned summaries to writer in each step.
        writer.add_summary(summary, step)
        # Add metadata to visualize the graph for the last run.
        if step == (num_steps - 1):
            writer.add_run_metadata(run_metadata, 'step%d' % step)

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()

    # Write corresponding labels for the embeddings.
    with open(FLAGS.log_dir + '/metadata.tsv', 'w') as f:
        for i in xrange(vocabulary_size):
            f.write(reverse_dictionary[i] + '\n')

    # Save the model for checkpoints.
    saver.save(session, os.path.join(FLAGS.log_dir, 'model.ckpt'))

    # Create a configuration for visualizing embeddings with the labels in TensorBoard.
    config = projector.ProjectorConfig()
    embedding_conf = config.embeddings.add()
    embedding_conf.tensor_name = embeddings.name
    embedding_conf.metadata_path = os.path.join(FLAGS.log_dir, 'metadata.tsv')
    projector.visualize_embeddings(writer, config)

    writer.close()
This part of the source is easy to follow; in order it:
(1) trains the model;
(2) on the last training step, saves run metadata for later graph visualization;
(3) every 2000 steps, prints the average loss;
(4) every 10000 steps, prints the 8 nearest words for each of the 16 words randomly chosen in (8);
(5) writes the embedding labels to a .tsv file so TensorBoard can visualize the trained embeddings via dimensionality reduction;
(6) saves the model as a .ckpt checkpoint.
(10) 2-D visualization
def plot_with_labels(low_dim_embs, labels, filename):
    assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(
            label,
            xy=(x, y),
            xytext=(5, 2),
            textcoords='offset points',
            ha='right',
            va='bottom')

    plt.savefig(filename)


try:
    # pylint: disable=g-import-not-at-top
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    tsne = TSNE(
        perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
    plot_only = 500
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in xrange(plot_only)]
    plot_with_labels(low_dim_embs, labels, os.path.join(FLAGS.log_dir, 'tsne.png'))

except ImportError as ex:
    print('Please install sklearn, matplotlib, and scipy to show embeddings.')
    print(ex)
Boiled down, this does one thing: reduce the trained embeddings to 2 dimensions and save the result as an image for visualization.
Only the first 500 embeddings are plotted; the key is the TSNE class that performs the reduction.
Briefly, its parameters:
perplexity: float, optional (default 30). Larger datasets usually need a larger perplexity.
n_components: the target dimensionality, default 2.
init: initialization of the embedding, default 'random'. 'pca' is used here because it is usually more globally stable than random initialization; note that PCA initialization cannot be used with precomputed distances.
n_iter: the maximum number of optimization iterations.
method: the gradient computation algorithm. By default the Barnes-Hut approximation runs in O(N log N) time; method='exact' runs the slower but exact O(N^2) algorithm. Use the exact algorithm when the nearest-neighbor error must be better than 3%, but it cannot scale to millions of examples. (New in scikit-learn 0.17: optimization via the Barnes-Hut approximation.)
For more parameters see:
https://blog.csdn.net/qq_23534759/article/details/80457557
That concludes the source walkthrough. Below are parts of the run output (the original screenshots are omitted here).
The most frequent words and their counts:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Next, the first ten words of the text and their ids:
Sample data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
Then one batch of size 8, printing each target word with one of its context words:
3081 originated -> 5234 anarchism
3081 originated -> 12 as
12 as -> 6 a
12 as -> 3081 originated
6 a -> 195 term
6 a -> 12 as
195 term -> 2 of
195 term -> 6 a
This also shows each target word being used twice.
Then the training iterates; looking at the final iteration (screenshot omitted):
the words nearest to from are into, in, at, through, near, upanija, wct, polynomial;
the words nearest to five are four, three, seven, six, eight, two, zero, nine, and so on.
Overall the results are quite good.
In the 2-D visualization (figure omitted), the words circled in red were all auxiliary verbs, for example.
Finally, there is also the TensorBoard visualization (screenshot omitted).
###################################################################################################
That was word2vec's Skip-Gram model; now for CBOW.
Contrast with the Skip-Gram model above. When CBOW_window=1:
I am looking for the missing glass-shoes who has picked it up .............
batch: ['i', 'looking'], ['am', 'for'], ['looking', 'the'], ['for', 'missing'] .................
labels: ['am', 'looking', 'for', 'the']
So the first change is to replace parts (6) and (8) above with:
def generate_batch(batch_size, cbow_window):
    global data_index
    assert cbow_window % 2 == 1
    span = 2 * cbow_window + 1
    # Exclude the center word: span - 1 context words per example
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        # Cycle through data; wrap around to the start at the end
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size):
        # target at the center of span
        target = cbow_window
        # We only need context(word), not the word itself
        target_to_avoid = [cbow_window]

        col_idx = 0
        for j in range(span):
            # Skip the center element, the word itself
            if j == span // 2:
                continue
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]
        # Update the buffer
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    return batch, labels


batch, labels = generate_batch(batch_size=8, cbow_window=1)
for i in range(8):
    print(reverse_dictionary[batch[i, 0]], 'and', reverse_dictionary[batch[i, 1]], '->',
          reverse_dictionary[labels[i, 0]])
with graph.as_default():

    # Input data.
    with tf.name_scope('inputs'):
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * cbow_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        with tf.name_scope('embeddings'):
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

        # Construct the variables for the NCE loss
        with tf.name_scope('weights'):
            nce_weights = tf.Variable(
                tf.truncated_normal(
                    [vocabulary_size, embedding_size],
                    stddev=1.0 / math.sqrt(embedding_size)))
        with tf.name_scope('biases'):
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        embeds = None
        for i in range(2 * cbow_window):
            embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
            print('embedding %d shape: %s' % (i, embedding_i.get_shape().as_list()))
            emb_x, emb_y = embedding_i.get_shape().as_list()
            if embeds is None:
                embeds = tf.reshape(embedding_i, [emb_x, emb_y, 1])
            else:
                embeds = tf.concat([embeds, tf.reshape(embedding_i, [emb_x, emb_y, 1])], 2)

        print("Concat embedding size: %s" % embeds.get_shape().as_list())
        avg_embed = tf.reduce_mean(embeds, 2, keep_dims=False)
        print("Avg embedding size: %s" % avg_embed.get_shape().as_list())
        print('--------------------------------------------------------------------------------------------')
        print(avg_embed.shape)
        print(train_labels.shape)
        print('--------------------------------------------------------------------------------------------')

    # Compute the average NCE loss for the batch.
    # tf.nce_loss automatically draws a new sample of the negative labels each
    # time we evaluate the loss.
    # Explanation of the meaning of NCE loss:
    # http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.nce_loss(
                weights=nce_weights,
                biases=nce_biases,
                labels=train_labels,
                inputs=avg_embed,
                num_sampled=num_sampled,
                num_classes=vocabulary_size))