[Translation] Stanford CS 20SI: TensorFlow for Deep Learning Research course notes, Lecture note 4: How to structure your model in TensorFlow

“CS 20SI: TensorFlow for Deep Learning Research”

Prepared by Chip Huyen
Reviewed by Danijar Hafner

Lecture note 4: How to structure your model in TensorFlow
This is a personal translation; some parts are abridged, so it is best read alongside the original note.

This lecture builds the word2vec model. If you are not familiar with it, see the CS224N slides or Mikolov's original paper.

Skip-gram model vs. CBOW model (Continuous Bag-of-Words):

Algorithmically the two models are similar; the difference is that CBOW predicts the center word from its context, while skip-gram does the opposite. Statistically, CBOW smooths over a lot of the distributional information by treating an entire context as one observation, which helps on smaller datasets, whereas skip-gram treats each (context, target) pair as a new observation, which works better when the dataset is larger.
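To make the difference concrete, here is a minimal sketch in plain Python of how the two models would slice the same sentence into training examples (the toy sentence and the window size of 1 are illustrative assumptions, not part of the note):

# Sketch: how skip-gram vs. CBOW turn a sentence into training examples.
sentence = ['the', 'quick', 'brown', 'fox', 'jumps']

skip_gram_pairs = []   # (center, target) pairs: one observation per pair
cbow_pairs = []        # (context list, center): one observation per center word

for i, center in enumerate(sentence):
    context = [sentence[j] for j in (i - 1, i + 1) if 0 <= j < len(sentence)]
    for target in context:
        skip_gram_pairs.append((center, target))  # skip-gram: predict context from center
    cbow_pairs.append((context, center))          # CBOW: predict center from whole context

print(skip_gram_pairs)  # e.g. ('quick', 'the'), ('quick', 'brown'), ...
print(cbow_pairs)       # e.g. (['the', 'brown'], 'quick'), ...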

Word2Vec Tutorial

To build the skip-gram model, what we really care about are the weights of the hidden layer; those weights are what we are trying to learn, and they are also called the embedding matrix.

How to structure a TensorFlow model

  • Phase 1: assemble the graph:

    1. Define placeholders for the input and output
    2. Define the weights
    3. Define the inference model
    4. Define the loss function
    5. Define the optimizer
  • Phase 2: execute the computation:

    1. Initialize the variables for the first run
    2. Feed in the training data; the samples may need to be shuffled
    3. Run inference on the training data, computing the output for the current input under the current model parameters
    4. Compute the loss
    5. Adjust the model parameters to minimize/maximize the loss

Let's follow these steps to build the word2vec skip-gram model:

Phase 1: assemble the graph

  1. Define placeholders for the input and output
    The input is the center word and the output is the target word. We use word indices instead of one-hot vectors, so both the input and the output are [BATCH_SIZE] scalars.
center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
  2. Define the weights (the embedding matrix)
    Each row is the embedding vector of one word, and each vector has length EMBED_SIZE, so the embedding matrix has shape [VOCAB_SIZE, EMBED_SIZE]. We initialize it from a uniform distribution.
  embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))
  3. Inference (the forward path of the graph)
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)

This op converts word indices into embedding vectors by looking them up in the embedding matrix:

embed = tf.nn.embedding_lookup(embed_matrix, center_words)
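Conceptually, embedding_lookup is just row selection: it is equivalent to multiplying a one-hot vector by the embedding matrix, without ever materializing the one-hot vectors. A small, self-contained sketch of that equivalence (the toy shapes and indices here are illustrative assumptions):

import numpy as np
import tensorflow as tf

ids = tf.constant([2, 0, 3])                     # a toy batch of word indices
matrix = tf.random_uniform([5, 4], -1.0, 1.0)    # toy [VOCAB_SIZE, EMBED_SIZE] matrix

looked_up = tf.nn.embedding_lookup(matrix, ids)  # shape [3, 4]: one row per index
one_hot = tf.one_hot(ids, depth=5)               # shape [3, 5]
via_matmul = tf.matmul(one_hot, matrix)          # selects the same rows, but densely

with tf.Session() as sess:
    a, b = sess.run([looked_up, via_matmul])
    print(np.allclose(a, b))                     # True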

  4. Define the loss function
NCE is complicated to implement in plain Python, so we use TensorFlow's built-in implementation:

tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, remove_accidental_hits=False, partition_strategy='mod', name='nce_loss')

We need the hidden layer's weights and biases to compute the NCE loss:

nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                              stddev=1.0 / (EMBED_SIZE ** 0.5)),
                         name='nce_weight')
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

Then define the loss:

loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                        biases=nce_bias, 
                                        labels=target_words, 
                                        inputs=embed, 
                                        num_sampled=NUM_SAMPLED, 
                                        num_classes=VOCAB_SIZE), name='loss')
  5. Define the optimizer
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

Phase 2: execute the computation

Create a session, feed the inputs and outputs to the placeholders, run the optimizer to minimize the loss, and fetch the loss value:

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # we use this to calculate the average loss over the last SKIP_STEP steps
        writer = tf.summary.FileWriter('./my_graph/no_frills/', sess.graph)
        for index in xrange(NUM_TRAIN_STEPS):
            centers, targets = batch_gen.next()
            loss_batch, _ = sess.run([loss, optimizer], 
                                    feed_dict={center_words: centers, target_words: targets})
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()
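The note does not show batch_gen. A minimal sketch of a generator that yields (centers, targets) batches from a list of word indices could look like the following; the window handling is a simplifying assumption, not the exact generator from the course's starter code:

import random
import numpy as np

def generate_batches(index_words, batch_size, skip_window):
    """Yield (centers, targets) arrays forever by sampling word pairs
    from a window around random positions in index_words (a list of
    word indices). Only a sketch, not the course's process_data code."""
    while True:
        centers = np.zeros(batch_size, dtype=np.int32)
        targets = np.zeros(batch_size, dtype=np.int32)
        for i in range(batch_size):
            pos = random.randint(skip_window, len(index_words) - skip_window - 1)
            offset = random.choice([d for d in range(-skip_window, skip_window + 1) if d != 0])
            centers[i] = index_words[pos]
            targets[i] = index_words[pos + offset]
        yield centers, targets

# batch_gen = generate_batches(index_words, BATCH_SIZE, SKIP_WINDOW)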

Name scope

Give the tensors names and view them in TensorBoard:
[Figure: the TensorBoard graph without name scopes]
It looks like a mess.

Use tf.name_scope(name) to group related nodes. Each scope is displayed as a single node, and clicking on a scope reveals its internal details. You will notice that TensorBoard draws two kinds of edges: solid lines and dashed lines. Solid edges represent data flow; dashed edges represent control dependencies, i.e. operations the node depends on.
[Figure: the full graph grouped by name scopes]
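A minimal, self-contained sketch (toy ops; the scope names are just illustrative) of what tf.name_scope does to the graph:

import tensorflow as tf

with tf.name_scope('data'):
    a = tf.placeholder(tf.float32, shape=[None], name='a')
    b = tf.placeholder(tf.float32, shape=[None], name='b')

with tf.name_scope('compute'):
    total = tf.add(a, b, name='total')

print(a.name)      # data/a:0 -- the scope becomes a prefix on the tensor name
print(total.name)  # compute/total:0

# In TensorBoard, the graph now shows two collapsible boxes, "data" and
# "compute", instead of a flat tangle of individual nodes.
writer = tf.summary.FileWriter('./my_graph/scoped', tf.get_default_graph())
writer.close()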

Earlier we built a simple, sequential model. Using Python's object-oriented features we can write a model that is easier to reuse, so let's build skip-gram as a class:

class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.lr = learning_rate
        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words')
            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words')

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
        with tf.device('/cpu:0'):
            with tf.name_scope("embed"):
                self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size, 
                                                                    self.embed_size], -1.0, 1.0), 
                                                                    name='embed_matrix')

    def _create_loss(self):
        """ Step 3 + 4: define the model + the loss function """
        with tf.device('/cpu:0'):
            with tf.name_scope("loss"):
                # Step 3: define the inference
                embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')

                # Step 4: define loss function
                # construct variables for NCE loss
                nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size],
                                                            stddev=1.0 / (self.embed_size ** 0.5)), 
                                                            name='nce_weight')
                nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name='nce_bias')

                # define loss function to be NCE loss function
                self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                                    biases=nce_bias, 
                                                    labels=self.target_words, 
                                                    inputs=embed, 
                                                    num_sampled=self.num_sampled, 
                                                    num_classes=self.vocab_size), name='loss')
    def _create_optimizer(self):
        """ Step 5: define optimizer """
        with tf.device('/cpu:0'):
            self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, 
                                                              global_step=self.global_step)

    def _create_summaries(self):
        with tf.name_scope("summaries"):
            tf.summary.scalar("loss", self.loss)
            tf.summary.histogram("histogram loss", self.loss)
            # because you have several summaries, we should merge them all
            # into one op to make it easier to manage
            self.summary_op = tf.summary.merge_all()

    def build_graph(self):
        """ Build the graph for our model """
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

def train_model(model, batch_gen, num_train_steps, weights_fld):
    saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias

    initial_step = 0
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        # if that checkpoint exists, restore from checkpoint
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)

        total_loss = 0.0 # we use this to calculate the average loss over the last SKIP_STEP steps
        writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph)
        initial_step = model.global_step.eval()
        for index in xrange(initial_step, initial_step + num_train_steps):
            centers, targets = batch_gen.next()
            feed_dict={model.center_words: centers, model.target_words: targets}
            loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op], 
                                              feed_dict=feed_dict)
            writer.add_summary(summary, global_step=index)
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
                saver.save(sess, 'checkpoints/skip-gram', index)
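
A hedged sketch of how the class and the training function might be wired together; the constants and the process_data helper are assumptions based on the course's starter code and are not defined in this note:

VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128        # dimension of the word embedding vectors
SKIP_WINDOW = 1         # the context window
NUM_SAMPLED = 64        # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 20000
SKIP_STEP = 2000

def main():
    model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)
    model.build_graph()
    # process_data is assumed to build the dataset and return a
    # (centers, targets) batch generator, as in the course's starter code.
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    train_model(model, batch_gen, NUM_TRAIN_STEPS, 'processed/')

if __name__ == '__main__':
    main()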

With t-SNE we can visualize our embedding matrix. You can see that all the numbers are clustered together in the bottom right, next to the letters and names; all the months are grouped together; the words "do, does, did" form one group; and so on.
[Figure: t-SNE visualization of the learned embeddings]

If you print the words closest to 'American':
[Figure: nearest neighbors of 'American']

t-SNE (from Wikipedia)
t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points. The t-SNE algorithm has two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects so that similar objects have a high probability of being picked, while dissimilar points have an extremely small probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.

t-SNE visualization of the MNIST dataset

We can also visualize the data with PCA.
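Before turning to TensorBoard's built-in projector below, here is a hedged sketch of doing the same projection offline with scikit-learn and matplotlib; final_embed_matrix and words are assumed to be the trained [VOCAB_SIZE, EMBED_SIZE] numpy array and the matching vocabulary list:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

num_points = 500
tsne = TSNE(n_components=2, perplexity=30, init='pca', n_iter=5000)
low_dim = tsne.fit_transform(final_embed_matrix[:num_points])

plt.figure(figsize=(18, 18))
for i, word in enumerate(words[:num_points]):
    x, y = low_dim[i]
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.savefig('tsne.png')

Swapping TSNE for sklearn.decomposition.PCA gives the PCA view instead.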

With TensorBoard's embedding projector we can visualize the embeddings in fewer than ten lines of code:

from tensorflow.contrib.tensorboard.plugins import projector
# obtain the embedding_matrix after you've trained it
final_embed_matrix = sess.run(model.embed_matrix)
# create a variable to hold your embeddings. It has to be a variable. Constants
# don't work. You also can't just use the embed_matrix we defined earlier for our model.
# Why is that so? I don't know. I get the 500 most popular words.
embedding_var = tf.Variable(final_embed_matrix[:500], name='embedding')
sess.run(embedding_var.initializer)
config = projector.ProjectorConfig()
summary_writer = tf.summary.FileWriter(LOGDIR)
# add embeddings to config
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
# link the embeddings to their metadata file. In this case, the file that contains
# the 500 most popular words in our vocabulary
embedding.metadata_path = LOGDIR + '/vocab_500.tsv'
# save a configuration file that TensorBoard will read during startup
projector.visualize_embeddings(summary_writer, config)
# save our embedding
saver_embed = tf.train.Saver([embedding_var])
saver_embed.save(sess, LOGDIR + '/skip-gram.ckpt', 1)
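
The vocab_500.tsv metadata file is not created by the snippet above. A minimal sketch of writing it, assuming words is the vocabulary list ordered by frequency, might look like this:

# Write one word per line, matching the 500 embeddings kept above.
# `words` is assumed to be the vocabulary ordered by frequency.
with open(LOGDIR + '/vocab_500.tsv', 'w') as f:
    for word in words[:500]:
        f.write(word + '\n')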

Why we still need to understand gradients

None of the models we have built so far required us to fetch the gradients of individual nodes, because TensorFlow takes care of back-propagation automatically. Still, we should know how to fetch gradients, because TensorFlow cannot tell us when gradients vanish or explode; we have to inspect the model's gradients ourselves to know whether the model is training properly.
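A minimal sketch of the two standard ways to look at gradients (tf.gradients, and splitting the optimizer into compute_gradients / apply_gradients), using the loss, embed, and placeholders from the simple model built earlier:

# Option 1: the gradient of the loss w.r.t. the looked-up batch embeddings.
grad_embed = tf.gradients(loss, [embed])[0]        # shape [BATCH_SIZE, EMBED_SIZE]

# Option 2: split minimize() into compute_gradients + apply_gradients so the
# gradients can be inspected (or clipped) before they are applied.
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)

# Inside the training loop, fetch the gradient alongside the loss:
#   grad_val, loss_val, _ = sess.run([grad_embed, loss, train_op], feed_dict=feed_dict)
# A norm near zero (vanishing) or one that keeps growing (exploding) is a warning sign:
#   print(np.linalg.norm(grad_val))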
