1. How to Represent Text Data
One-hot representation: each word is a vector of 0s and 1s with a single 1 at that word's index. The dimensionality equals the vocabulary size, which easily becomes very large and is computationally inefficient, and the vectors carry no semantic information, e.g. they cannot recognize synonyms.
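As a small illustrative sketch (plain numpy, not from the original notes), the one-hot vectors for a toy three-word vocabulary make the problem visible: any two distinct words have dot product 0, so "cat" and "kitten" look exactly as unrelated as "cat" and "car":

import numpy as np

vocab = ['cat', 'kitten', 'car']                               # hypothetical toy vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot['cat'])                      # [1. 0. 0.]
print(one_hot['cat'] @ one_hot['kitten'])  # 0.0 -- no notion of similarity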
Word embedding: represent a word by the contexts it appears in. Compared with the one-hot representation above, embeddings have the advantages of low dimensionality, continuous vectors, and the ability to capture semantic information. For more discussion, see the Zhihu question "有谁可以解释一下word embedding?".
word2vec is a family of models for learning word embeddings, created by a team led by Tomas Mikolov. It contains two main models: skip-gram and CBOW. Skip-gram predicts the context words from the center word, while CBOW predicts the center word from its context.
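To make the skip-gram objective concrete, here is a small sketch (a hypothetical helper, not part of word2vec itself) that enumerates the (center, context) training pairs skip-gram is trained on, using a window of 1; CBOW would simply use the reverse mapping, from context to center:

def skip_gram_pairs(tokens, window=1):
    # yield (center, context) pairs for every word within the window
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = 'the quick brown fox'.split()
print(list(skip_gram_pairs(sentence)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]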
2. Building the Skip-gram Model
Our task is to train a model that predicts context words from a center word. Given the center word in a sentence, we randomly pick a word within a window before or after it, and the network outputs, for every word in the vocabulary, the probability that it is this nearby word.
To obtain a probability distribution over the context words, we normally use softmax, which maps arbitrary scores x_i into a probability distribution p_i; here softmax(x_i) is the probability that the nearby word is word i of the vocabulary. But softmax requires exponentiating and summing over the entire vocabulary, which severely limits training speed. Two kinds of methods address this bottleneck: hierarchical softmax and sample-based softmax. Mikolov reports in his paper that, for training the skip-gram model, negative sampling works better than hierarchical softmax. Negative sampling is in fact a simplified version of Noise Contrastive Estimation (NCE), but only NCE is guaranteed to be an approximation of the softmax. Note that these methods only reduce training time; at inference the full softmax must still be computed to obtain a normalized probability distribution.
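As a rough illustration of the bottleneck (plain numpy, not from the course code), a full softmax has to exponentiate and sum over every word in the vocabulary for every training example, i.e. O(VOCAB_SIZE) work per example:

import numpy as np

def softmax(scores):
    # map arbitrary scores x_i to a probability distribution p_i
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.random.randn(50000)  # one score per vocabulary word
probs = softmax(scores)
print(probs.sum())               # ~1.0: a valid probability distribution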
3. Variable Sharing
1. Name Scope
To make the graph and its nodes easier to read in TensorBoard, we group related ops with name scopes:
with tf.name_scope(name_of_that_scope):
    # declare op_1
    # declare op_2
    # ...
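A small concrete example (assuming TF 1.x graph mode, not taken from the lecture code) of how ops get grouped; TensorBoard collapses each name scope into a single expandable node:

import tensorflow as tf

with tf.name_scope('data'):
    x = tf.placeholder(tf.float32, shape=[None, 10], name='x')

with tf.name_scope('model'):
    w = tf.Variable(tf.random_normal([10, 1]), name='weights')
    y = tf.matmul(x, w, name='prediction')

# write the graph to disk so it can be inspected with TensorBoard
writer = tf.summary.FileWriter('graphs/name_scope_demo', tf.get_default_graph())
writer.close()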
With the ops grouped this way, the TensorBoard graph shows three kinds of edges: solid gray arrows, solid orange arrows, and dashed gray arrows. Solid gray arrows indicate data flow; solid orange arrows indicate which ops can mutate which nodes (in the word2vec graph, the optimizer can modify nce_weight, nce_bias, and embed_matrix); dashed gray arrows indicate control dependencies (for example, nce_weight can only be used after init has run).
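For control dependencies specifically, here is a minimal sketch (not from the notes) of how such an edge can also be created explicitly with tf.control_dependencies:

a = tf.Variable(1.0, name='a')
with tf.control_dependencies([a.initializer]):
    b = a + 1.0  # b may only be computed after a's initializer has run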
2. Variable Scope
Build a two-layer neural network and feed it two different inputs, x1 and x2:
x1 = tf.truncated_normal([200, 100], name='x1')
x2 = tf.truncated_normal([200, 100], name='x2')
def two_hidden_layers(x):
    assert x.shape.as_list() == [200, 100]
    w1 = tf.Variable(tf.random_normal([100, 50]), name="h1_weights")
    b1 = tf.Variable(tf.zeros([50]), name="h1_biases")
    h1 = tf.matmul(x, w1) + b1
    assert h1.shape.as_list() == [200, 50]
    w2 = tf.Variable(tf.random_normal([50, 10]), name="h2_weights")
    b2 = tf.Variable(tf.zeros([10]), name="h2_biases")
    logits = tf.matmul(h1, w2) + b2
    return logits
logits1 = two_hidden_layers(x1)
logits2 = two_hidden_layers(x2)
Visualizing this in TensorBoard reveals the problem: each call to two_hidden_layers creates its own set of weights and biases, while what we actually want is for both inputs to share the same variables. The fix is to build the variables with tf.get_variable and wrap the calls in a variable scope, enabling reuse before the second call:
with tf.variable_scope('two_layers') as scope:
    logits1 = two_hidden_layers_2(x1)
    scope.reuse_variables()
    logits2 = two_hidden_layers_2(x2)
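two_hidden_layers_2 is not defined in these notes; based on the pattern above, it is assumed to be the same two-layer network rewritten with tf.get_variable, roughly as follows (a sketch, using the same shapes as before):

def two_hidden_layers_2(x):
    # assumed rewrite: tf.get_variable lets the enclosing variable scope share variables across calls
    w1 = tf.get_variable("h1_weights", [100, 50], initializer=tf.random_normal_initializer())
    b1 = tf.get_variable("h1_biases", [50], initializer=tf.constant_initializer(0.0))
    h1 = tf.matmul(x, w1) + b1
    w2 = tf.get_variable("h2_weights", [50, 10], initializer=tf.random_normal_initializer())
    b2 = tf.get_variable("h2_biases", [10], initializer=tf.constant_initializer(0.0))
    logits = tf.matmul(h1, w2) + b2
    return logits

A cleaner version factors each layer into a reusable fully_connected helper: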
def fully_connected(x, output_dim, scope):
    with tf.variable_scope(scope) as scope:
        w = tf.get_variable("weights", [x.shape[1], output_dim], initializer=tf.random_normal_initializer())
        b = tf.get_variable("biases", [output_dim], initializer=tf.constant_initializer(0.0))
        return tf.matmul(x, w) + b

def two_hidden_layers(x):
    h1 = fully_connected(x, 50, 'h1')
    h2 = fully_connected(h1, 10, 'h2')
    return h2

with tf.variable_scope('two_layers') as scope:
    logits1 = two_hidden_layers(x1)
    scope.reuse_variables()
    logits2 = two_hidden_layers(x2)
Visualizing the graph again now gives the result we want: the two calls share a single set of variables under the two_layers scope.
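As a side note not covered in the lecture, TensorFlow 1.4+ also provides tf.AUTO_REUSE, which creates the variables on the first call and reuses them afterwards without an explicit scope.reuse_variables(); a minimal sketch, assuming the fully_connected helper above:

def two_hidden_layers_auto(x):
    with tf.variable_scope('two_layers', reuse=tf.AUTO_REUSE):
        h1 = fully_connected(x, 50, 'h1')
        h2 = fully_connected(h1, 10, 'h2')
        return h2

logits1 = two_hidden_layers_auto(x1)  # first call creates the variables
logits2 = two_hidden_layers_auto(x2)  # second call reuses the same variables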
4. word2vec Source Code
""" starter code for word2vec skip-gram model with NCE loss
CS 20: "TensorFlow for Deep Learning Research"
cs20.stanford.edu
Chip Huyen (chiphuyen@cs.stanford.edu)
Lecture 04
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf
import utils
import word2vec_utils
# Model hyperparameters
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window
NUM_SAMPLED = 64 # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 100000
VISUAL_FLD = 'visualization'
SKIP_STEP = 5000
# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'
EXPECTED_BYTES = 31344016
NUM_VISUALIZE = 3000 # number of tokens to visualize
def word2vec(dataset):
    """ Build the graph for word2vec model and train it """
    # Step 1: get input, output from the dataset
    with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        center_words, target_words = iterator.get_next()

    """ Step 2 + 3: define weights and embedding lookup.
    In word2vec, it's actually the weights that we care about
    """
    with tf.name_scope('embed'):
        embed_matrix = tf.get_variable('embed_matrix',
                                       shape=[VOCAB_SIZE, EMBED_SIZE],
                                       initializer=tf.random_uniform_initializer())
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embedding')

    # Step 4: construct variables for NCE loss and define loss function
    with tf.name_scope('loss'):
        nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE],
                                     initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5)))
        nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                             biases=nce_bias,
                                             labels=target_words,
                                             inputs=embed,
                                             num_sampled=NUM_SAMPLED,
                                             num_classes=VOCAB_SIZE), name='loss')

    # Step 5: define optimizer
    with tf.name_scope('optimizer'):
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

    utils.safe_mkdir('checkpoints')

    with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0  # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph)

        for index in range(NUM_TRAIN_STEPS):
            try:
                loss_batch, _ = sess.run([loss, optimizer])
                total_loss += loss_batch
                if (index + 1) % SKIP_STEP == 0:
                    print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                    total_loss = 0.0
            except tf.errors.OutOfRangeError:
                sess.run(iterator.initializer)
        writer.close()

def gen():
    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE,
                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)

def main():
    dataset = tf.data.Dataset.from_generator(gen,
                                             (tf.int32, tf.int32),
                                             (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))
    word2vec(dataset)

if __name__ == '__main__':
    main()