train_skip_gram()
The correspondence between data stored on disk as files and the variables in the code
embeddings = train_skip_gram(vocabulary_size,
                             data_folder,
                             data_folders,
                             num_data_pairs,
                             reverse_dictionary,
                             param,
                             valid_examples,
                             log_dir,
                             v_metadata_file_name,
                             embeddings_pickle,
                             ckpt_saver_file,
                             ckpt_saver_file_init,
                             ckpt_saver_file_final,
                             restore_tf_variables_from_ckpt)
Parameter walkthrough
vocabulary_size
# Get dictionary and vocabulary
print('\n\tGetting dictionary ...')
folder_vocabulary = os.path.join(data_folder, 'vocabulary')
dictionary_pickle = os.path.join(folder_vocabulary, 'dic_pickle')
with open(dictionary_pickle, 'rb') as f:
    dictionary = pickle.load(f)
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
del dictionary
vocabulary_size = len(reverse_dictionary.keys())
Most of Python's file-handling APIs take a file path as a function argument.
The code above reads the pickled dictionary from data_folder/vocabulary/dic_pickle,
inverts it into reverse_dictionary, and assigns its number of entries (len) to vocabulary_size.
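A quick toy illustration (my own, not from the repository) of the inversion idiom dict(zip(d.values(), d.keys())) used above:

# Invert a {word: id} dictionary into {id: word}
dictionary = {'add': 0, 'mul': 1, 'ret': 2}
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print(reverse_dictionary)              # {0: 'add', 1: 'mul', 2: 'ret'}
print(len(reverse_dictionary.keys()))  # 3, the vocabulary size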
Reference blog on the usage of with ... as
An example of with ... as:
from absl import app

class Sample:
    def __init__(self):
        print("In __init__()")

    def __enter__(self):
        print("In __enter__()")
        return "Foo"

    def __exit__(self, type, value, trace):
        print("In __exit__()")

def get_sample():
    return Sample()

def main(argv):
    del argv  # unused
    with get_sample() as sample:
        print("sample:", sample)

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    app.run(main)  # similar to tf.app.run()
The output of running the code above:
In __init__()
In __enter__()
sample: Foo
In __exit__()
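Note that __exit__() runs even if the body of the with block raises an exception, which is why with ... as is the idiomatic way to guarantee cleanup such as closing a file.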
data_folder & data_folders
The value of data_folder is "data".
data_folders is the return value of construct_xfg().
num_data_pairs
Take BLAS-3.8.0 inside the data folder as an example: within BLAS-3.8.0, the only parent folder of .ll files is blas itself. Therefore the file data_pairs_cw_2.rec, inside the blas_dataset_cw_2 folder of BLAS-3.8.0, is the raw file that num_data_pairs corresponds to. It is a binary file and cannot be read in a text editor.
cw is short for context_width; the 2 means the context_width parameter equals 2.
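A minimal sketch of how one might inspect such a .rec file. This rests on an assumption drawn from record_bytes=8 in the reader code further below: each 8-byte record holds two native-endian int32 word IDs forming a (target, context) pair. The path follows the description above.

import struct

rec_file = 'data/BLAS-3.8.0/blas_dataset_cw_2/data_pairs_cw_2.rec'
num_pairs = 0
with open(rec_file, 'rb') as f:
    while True:
        record = f.read(8)
        if len(record) < 8:
            break
        target, context = struct.unpack('ii', record)  # two int32 word IDs
        num_pairs += 1
print('number of data pairs:', num_pairs)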
reverse_dictionary
reverse_dictionary is in fact produced at the same time as vocabulary_size; see the code under vocabulary_size above.
param
param is a dict whose keys are the keys of FLAGS and whose values are the corresponding FLAGS[k].value.
Reference blog.
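A minimal sketch of how such a dict can be built from absl flags (my own illustration; the repository's actual construction may differ):

from absl import flags
FLAGS = flags.FLAGS
# After the flags have been parsed, iterating over FLAGS yields flag names,
# and FLAGS[name].value is the parsed value of that flag.
param = {k: FLAGS[k].value for k in FLAGS}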
valid_examples
# Validation set used to sample nearest neighbors
# Limit to the words that have a low numeric ID,
# which by construction are also the most frequent.
valid_size = 30 # Random set of words to evaluate similarity on.
valid_window = 50 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
The replace parameter controls whether the same element may be drawn more than once:
- True means repeated values are allowed;
- False means repeats are not allowed;
- the default is True.
np.random.choice(50, 30, replace=False)
draws 30 of the 50 integers 0..49, with no repeats.
np.random.choice(50, 30, replace=True)
draws 30 of the 50 integers 0..49, repeats allowed.
Detailed reference blog for np.random.choice.
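A quick check of the replace behaviour (my own snippet):

import numpy as np
sample = np.random.choice(50, 30, replace=False)
assert len(set(sample)) == 30  # no duplicates when replace=False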
log_dir
The folder that the log_dir string refers to: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5
log_dir is simply the folder that logs are written to.
v_metadata_file_name
This file was not found.
embeddings_pickle
Corresponding file: data/emb/emb_cw_2_embeddings/emb__data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5.p
ckpt_saver_file
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec.ckpt
ckpt_saver_file_init
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec-init.ckpt
ckpt_saver_file_final
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec-final.ckpt
restore_tf_variables_from_ckpt
Type: bool; its value is False.
Function purpose
Train the model (a skip-gram model).
Function flow
- Extract parameters from dictionary
- Set up for analogies
- Read data using TensorFlow's data API
- TensorFlow computational graph
  - Placeholders for inputs
  - (input) Embedding matrix
  - Normalized embedding matrix
  - (output) Embedding matrix ("output weights")
  - Optimization
- Validation block
- Summaries
- Misc.
- Training
Read data using TensorFlow’s data API
# Read data using Tensorflow's data API
data_files = get_data_pair_files(data_folders, context_width)
print('\ttraining with data from files:', data_files)
with tf.name_scope("Reader") as scope:
    random.shuffle(data_files)
    dataset_raw = tf.data.FixedLengthRecordDataset(filenames=data_files,
                                                   record_bytes=8)  # <TFRecordDataset shapes: (), types: tf.string>
    dataset = dataset_raw.map(record_parser)
    dataset = dataset.shuffle(int(1e5))
    dataset_batched = dataset.apply(tf.contrib.data.batch_and_drop_remainder(mini_batch_size))
    dataset_batched = dataset_batched.prefetch(int(100000000))
    iterator = dataset_batched.make_initializable_iterator()
    saveable_iterator = tf.contrib.data.make_saveable_from_iterator(iterator)
    next_batch = iterator.get_next()  # Tensor("Shape:0", shape=(2,), dtype=int32)
Each of these transformations (map, shuffle, apply, prefetch) returns a new `Dataset`.
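record_parser is not shown in these notes. Given that each record is 8 bytes and next_batch is indexed below as integer pairs, a plausible sketch (an assumption, not the repository's verified code) is:

def record_parser(record_bytes):
    # Decode the raw 8-byte string into a length-2 vector of int32 word IDs:
    # [target word ID, context word ID]
    return tf.decode_raw(record_bytes, tf.int32)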
TensorFlow computational graph
Placeholders for inputs
# Placeholders for inputs
with tf.name_scope("Input_Data") as scope:
    train_inputs = next_batch[:, 0]
    train_labels = tf.reshape(next_batch[:, 1], shape=[mini_batch_size, 1], name="training_labels")
The with ... as here only serves to group these ops under a name scope so the computation graph displays nicely in TensorBoard; using it or not has no effect on training (reference blog).
What exactly is train_inputs, and what shape does it have?
From next_batch alone the shape is not obvious, but combined with the later code: train_inputs holds the integer vocabulary IDs of the source words in the batch, one per mini-batch element. The embedding lookup that follows fetches rows of the embedding matrix by these IDs, which is equivalent to multiplying each word's one-hot encoding by the matrix.
(input) Embedding matrix
# (input) Embedding matrix
with tf.name_scope("Input_Layer") as scope:
    W_in = tf.Variable(tf.random_uniform([V, N], -1.0, 1.0), name="input-embeddings")
    # Look up the vector representing each source word in the batch (fetches rows of the embedding matrix)
    h = tf.nn.embedding_lookup(W_in, train_inputs, name="input_embedding_vectors")
tf.random_uniform([V, N], -1.0, 1.0)
V is the vocabulary size, i.e. vocabulary_size, the first parameter of train_skip_gram.
N is the length of the word vectors that will be produced; in this example it is 200.
tf.random_uniform generates floats drawn from a uniform distribution over [-1, 1), so W_in starts out as a random V×N matrix.
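A NumPy illustration (mine) of why looking up row word_id of W_in is equivalent to multiplying the word's one-hot vector by W_in:

import numpy as np
V, N = 5, 3
W_in = np.random.uniform(-1.0, 1.0, size=(V, N))
word_id = 2
one_hot = np.eye(V)[word_id]  # one-hot encoding of word 2
assert np.allclose(W_in[word_id], one_hot @ W_in)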
Normalized embedding matrix
# Normalized embedding matrix
with tf.name_scope("Embeddings_Normalized") as scope:
    normalized_embeddings = tf.nn.l2_normalize(W_in, name="embeddings_normalized")
This applies L2 normalization to W_in, the V×N random matrix whose elements start out in [-1, 1). Note that tf.nn.l2_normalize performs normalization (rescaling vectors to unit L2 norm), not regularization. The point of normalizing is that dot products between unit-norm vectors equal cosine similarities, which the validation block below exploits with a plain matmul.
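A NumPy sketch (my illustration) of row-wise L2 normalization: each row v is replaced by v / ||v||_2, so every row ends up with unit length:

import numpy as np
W = np.random.uniform(-1.0, 1.0, size=(4, 3))
W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)
assert np.allclose(np.linalg.norm(W_norm, axis=1), 1.0)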
(output) Embedding matrix (“output weights”)
# (output) Embedding matrix ("output weights")
with tf.name_scope("Output_Layer") as scope:
    if FLAGS.softmax:
        W_out = tf.Variable(tf.truncated_normal([N, V], stddev=1.0 / math.sqrt(N)), name="output_embeddings")
        # Biases between hidden layer and output layer
        b_out = tf.Variable(tf.zeros([V]), name="nce_bias")
FLAGS.softmax is a bool; its value here is True.
What is this block for? W_out and b_out are the output-layer parameters of the model: W_out (shape N×V) is the weight matrix between the hidden layer and the output layer, and b_out (shape V) is the bias.
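A quick shape check (my own illustration; B matches the m-64 in the folder names, N is 200 as above, and V is a made-up vocabulary size):

import numpy as np
B, N, V = 64, 200, 8000
h = np.zeros((B, N))         # hidden vectors from the embedding lookup
W_out = np.zeros((N, V))
b_out = np.zeros(V)
logits = h @ W_out + b_out   # shape [B, V]: one score per vocabulary word
assert logits.shape == (B, V)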
Optimization
# Optimization
with tf.name_scope("Optimization_Block") as scope:

    # Loss function
    if FLAGS.softmax:  # FLAGS.softmax is a bool; its value here is True
        # The dense layer is the last (output) layer of the network: it maps the batch of
        # hidden vectors h (shape [mini_batch_size, N]) to logits of shape [mini_batch_size, V]
        logits = tf.layers.dense(inputs=h, units=V)
        # One-hot encode train_labels
        onehot = tf.one_hot(train_labels, V)
        # logits are the predictions, onehot the ground truth
        loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=onehot, logits=logits)
        # (the indentation here is correct: train_loss belongs inside this branch)
        train_loss = tf.reduce_mean(loss_tensor, name="nce_loss")

    # Regularization (optional)
    # l2_reg_scale is the scale of the L2 regularization applied to the weights
    # (0: no regularization); its value here is 0.0, so only the else branch runs
    if l2_reg_scale > 0:
        pass  # regularization branch elided in these notes
    else:
        loss = train_loss

    # Optimizer
    # FLAGS.optimizer's value is 'adam'
    if FLAGS.optimizer == 'adam':
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    if FLAGS.optimizer != 'momentum':
        global_train_step = tf.Variable(0, trainable=False, dtype=tf.int32, name="global_step")
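For intuition, a NumPy version (my own) of the per-example softmax cross-entropy that softmax_cross_entropy_with_logits_v2 computes from logits and a one-hot label:

import numpy as np
logits = np.array([2.0, 0.5, -1.0])
onehot = np.array([1.0, 0.0, 0.0])                     # true class is word 0
log_softmax = logits - np.log(np.sum(np.exp(logits)))
loss = -np.sum(onehot * log_softmax)                   # cross-entropy for one example
print(loss)  # ≈ 0.24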
Validation block
# Validation block
# valid_examples is 30 distinct integers drawn at random from the 50 integers 0..49
with tf.name_scope("Validation_Block") as scope:
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32, name="validation_data_size")
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    cosine_similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
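A NumPy check (mine) that, because the rows are unit-norm, the matmul with transpose_b=True yields cosine similarities:

import numpy as np
W = np.random.uniform(-1.0, 1.0, size=(6, 4))
W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalized embeddings
valid = W_norm[[0, 2]]                                 # embedding lookup of two validation IDs
cos = valid @ W_norm.T                                 # cosine similarity matrix, shape (2, 6)
assert np.isclose(cos[0, 0], 1.0) and np.isclose(cos[1, 2], 1.0)  # each word matches itself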
Summaries
# Summaries
with tf.name_scope("Summaries") as scope:
    tf.summary.histogram("input_embeddings", W_in)
    tf.summary.histogram("input_embeddings_normalized", normalized_embeddings)
    tf.summary.histogram("output_embeddings", W_out)
    tf.summary.scalar("nce_loss", loss)
    analogy_score_tensor = tf.Variable(0, trainable=False, dtype=tf.int32, name="analogy_score")
    tf.summary.scalar("analogy_score", analogy_score_tensor)
Misc
# Misc.
restore_completed = False
init = tf.global_variables_initializer() # variables initializer
summary_op = tf.summary.merge_all() # merge summaries into one operation
Training
####################################################################################################################
# Training
with tf.Session(config=config) as sess:

    # Add TensorBoard components
    writer = tf.summary.FileWriter(log_dir)  # create summary writer
    writer.add_graph(sess.graph)
    gvars = [gvar for gvar in tf.global_variables() if 'analogy_score' not in gvar.name]
    saver = tf.train.Saver(gvars, max_to_keep=5)  # create checkpoint saver
    config = projector.ProjectorConfig()  # create projector config
    embedding = config.embeddings.add()  # add embeddings visualizer
    embedding.tensor_name = W_in.name
    embedding.metadata_path = vocab_metada_file  # link metadata
    projector.visualize_embeddings(writer, config)  # add writer and config to projector

    # Set up variables
    graph_saver = tf.train.Saver(allow_empty=True)
    init.run()
    graph_saver.save(sess, ckpt_saver_file_init, global_step=0, write_meta_graph=True)
    tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable_iterator)
    print("\tVariables initialized in TensorFlow")

    # Compute the necessary number of steps for this epoch as well as how often to print the avg loss
    num_steps = int(math.ceil(dataset_size / mini_batch_size))
    step_print_loss = int(math.ceil(num_steps / freq_print_loss))
    print('\tPrinting loss every ', step_print_loss, 'steps, i.e.', freq_print_loss, 'times per epoch')

    ################################################################################################################
    # Epoch loop
    epoch = 0
    global_step = 0
    while epoch < int(num_epochs):

        print('\n\tStarting epoch ', epoch)
        sess.run(iterator.initializer)  # initialize iterator

        ############################################################################################################
        # Loop over steps (mini batches) inside of epoch
        step = 0
        avg_loss = 0
        while True:
            try:
                # Print average loss every x steps
                if step_print_loss > 0 and step % int(step_print_loss) == 0:  # update step with logging

                    # If restoring a previous training session, set the right training epoch
                    if restore_variables and not restore_completed:
                        restore_completed = True

                    # Write global step
                    if True:
                        global_train_step.assign(global_step).eval()

                    # Perform an update
                    # print('\tStarting local step {:>6}'.format(step))  # un-comment for debugging
                    [_, loss_val, train_loss_val, global_step] = sess.run(
                        [optimizer, loss, train_loss, global_train_step], options=options,
                        run_metadata=metadata)
                    assert not np.isnan(loss_val), "Loss at step " + str(step) + " is nan"
                    assert not np.isinf(loss_val), "Loss at step " + str(step) + " is inf"
                    avg_loss += loss_val

                    if step > 0:
                        avg_loss /= step_print_loss

                    analogy_score = i2v_eval.evaluate_analogies(W_in.eval(), reverse_dictionary, analogies,
                                                                analogy_types, analogy_evaluation_file,
                                                                session=sess, print=i2v_eval.nop)
                    total_analogy_score = sum([a[0] for a in analogy_score])
                    analogy_score_tensor.assign(total_analogy_score).eval()  # for tf.summary
                    [summary, W_in_val] = sess.run([summary_op, W_in])

                    if FLAGS.savebest is not None:
                        filelist = [f for f in os.listdir(FLAGS.savebest)]
                        scorelist = [int(s.split('-')[1]) for s in filelist]
                        if len(scorelist) == 0 or total_analogy_score > sorted(scorelist)[-1]:
                            i2v_utils.safe_pickle(W_in_val, FLAGS.savebest + '/' + 'score-' +
                                                  str(total_analogy_score) + '-w.p')

                    # Display average loss
                    print('{} Avg. loss at epoch {:>6,d}, step {:>12,d} of {:>12,d}, global step {:>15} : {:>12.3f}, analogies: {})'.format(
                        str(datetime.now()), epoch, step, num_steps, global_step, avg_loss, str(analogy_score)))
                    avg_loss = 0

                    # Pickle intermediate embeddings
                    i2v_utils.safe_pickle(W_in_val, embeddings_pickle)

                    # Write to TensorBoard
                    saver.save(sess, ckpt_saver_file, global_step=global_step, write_meta_graph=False)
                    writer.add_summary(summary, global_step=global_step)

                    if step > 0 and FLAGS.extreme:
                        sys.exit(22)

                else:  # ordinary update step
                    [_, loss_val] = sess.run([optimizer, loss])
                    avg_loss += loss_val

                # Compute and print nearest neighbors every x steps
                if step_print_neighbors > 0 and step % int(step_print_neighbors) == 0:
                    print_neighbors(op=cosine_similarity, examples=valid_examples, top_k=6,
                                    reverse_dictionary=reverse_dictionary)

                # Update loop index (steps in epoch)
                step += 1
                global_step += 1

            except tf.errors.OutOfRangeError:
                # We reached the end of the epoch
                print('\n\t Writing embeddings to file ', embeddings_pickle)
                i2v_utils.safe_pickle([W_in.eval()], embeddings_pickle)  # WEIRD!
                epoch += 1  # update loop index (epochs)
                break  # from this inner loop

    ################################################################################################################
    # End of training:
    # Print the nearest neighbors at the end of the run
    if step_print_neighbors == -1:
        print_neighbors(op=cosine_similarity, examples=valid_examples, top_k=6,
                        reverse_dictionary=reverse_dictionary)

    # Save state of training and close the TensorBoard summary writer
    save_path = saver.save(sess, ckpt_saver_file_final, global_step)
    writer.add_summary(summary, global_step)
    writer.close()

    return W_in.eval()