train_skip_gram()
The correspondence between data stored on disk as files and the variables in the code
embeddings = train_skip_gram(vocabulary_size,
                             data_folder,
                             data_folders,
                             num_data_pairs,
                             reverse_dictionary,
                             param,
                             valid_examples,
                             log_dir,
                             v_metadata_file_name,
                             embeddings_pickle,
                             ckpt_saver_file,
                             ckpt_saver_file_init,
                             ckpt_saver_file_final,
                             restore_tf_variables_from_ckpt)
Parameter walkthrough
vocabulary_size
# Get dictionary and vocabulary
print('\n\tGetting dictionary ...')
folder_vocabulary = os.path.join(data_folder, 'vocabulary')
dictionary_pickle = os.path.join(folder_vocabulary, 'dic_pickle')
with open(dictionary_pickle, 'rb') as f:
    dictionary = pickle.load(f)
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
del dictionary
vocabulary_size = len(reverse_dictionary.keys())
Most of Python's file-handling APIs take a file path as a function argument.
The code above reads the pickled dictionary from data_folder/vocabulary/dic_pickle,
inverts it into reverse_dictionary, and assigns its number of entries (len) to vocabulary_size.
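A quick toy illustration (my own, not from the repository) of the inversion idiom dict(zip(d.values(), d.keys())) used above:

# Invert a {word: id} dictionary into {id: word}
dictionary = {'add': 0, 'mul': 1, 'ret': 2}
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print(reverse_dictionary)              # {0: 'add', 1: 'mul', 2: 'ret'}
print(len(reverse_dictionary.keys()))  # 3, the vocabulary size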
Reference blog on the usage of with ... as
An example of with ... as:
from absl import app

class Sample:
    def __init__(self):
        print("In __init__()")

    def __enter__(self):
        print("In __enter__()")
        return "Foo"

    def __exit__(self, type, value, trace):
        print("In __exit__()")

def get_sample():
    return Sample()

def main(argv):
    del argv  # unused
    with get_sample() as sample:
        print("sample:", sample)

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    app.run(main)  # similar to tf.app.run()
The output of running the code above:
In __init__()
In __enter__()
sample: Foo
In __exit__()
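Note that __exit__() runs even if the body of the with block raises an exception, which is why with ... as is the idiomatic way to guarantee cleanup such as closing a file.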
data_folder & data_folders
The value of data_folder is "data".
data_folders is the return value of construct_xfg().
num_data_pairs
Take BLAS-3.8.0 inside the data folder as an example: within BLAS-3.8.0, the only parent folder of .ll files is blas itself. Therefore the file data_pairs_cw_2.rec, inside the blas_dataset_cw_2 folder of BLAS-3.8.0, is the raw file that num_data_pairs corresponds to. It is a binary file and cannot be read in a text editor.
cw is short for context_width; the 2 means the context_width parameter equals 2.
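A minimal sketch of how one might inspect such a .rec file. This rests on an assumption drawn from record_bytes=8 in the reader code further below: each 8-byte record holds two native-endian int32 word IDs forming a (target, context) pair. The path follows the description above.

import struct

rec_file = 'data/BLAS-3.8.0/blas_dataset_cw_2/data_pairs_cw_2.rec'
num_pairs = 0
with open(rec_file, 'rb') as f:
    while True:
        record = f.read(8)
        if len(record) < 8:
            break
        target, context = struct.unpack('ii', record)  # two int32 word IDs
        num_pairs += 1
print('number of data pairs:', num_pairs)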
reverse_dictionary
reverse_dictionary is in fact produced at the same time as vocabulary_size; see the code under vocabulary_size above.
param
param is a dict whose keys are the keys of FLAGS and whose values are the corresponding FLAGS[k].value.
Reference blog.
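A minimal sketch of how such a dict can be built from absl flags (my own illustration; the repository's actual construction may differ):

from absl import flags
FLAGS = flags.FLAGS
# After the flags have been parsed, iterating over FLAGS yields flag names,
# and FLAGS[name].value is the parsed value of that flag.
param = {k: FLAGS[k].value for k in FLAGS}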
valid_examples
# Validation set used to sample nearest neighbors
# Limit to the words that have a low numeric ID,
# which by construction are also the most frequent.
valid_size = 30 # Random set of words to evaluate similarity on.
valid_window = 50 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
The replace parameter controls whether the same element may be drawn more than once:
- True means repeated values are allowed;
- False means repeats are not allowed;
- the default is True.
np.random.choice(50, 30, replace=False)
draws 30 of the 50 integers 0..49, with no repeats.
np.random.choice(50, 30, replace=True)
draws 30 of the 50 integers 0..49, repeats allowed.
Detailed reference blog for np.random.choice.
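A quick check of the replace behaviour (my own snippet):

import numpy as np
sample = np.random.choice(50, 30, replace=False)
assert len(set(sample)) == 30  # no duplicates when replace=False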
log_dir
The folder that the log_dir string refers to: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5
log_dir is simply the folder that logs are written to.
v_metadata_file_name
This file was not found.
embeddings_pickle
Corresponding file: data/emb/emb_cw_2_embeddings/emb__data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5.p
ckpt_saver_file
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec.ckpt
ckpt_saver_file_init
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec-init.ckpt
ckpt_saver_file_final
Corresponding string: data/emb/emb_cw_2_train/data_d-200_m-64_s-60_e-0.001_r-0.0_cw-2_N-5/inst2vec-final.ckpt
restore_tf_variables_from_ckpt
Type: bool; its value is False.
Function purpose
Train the model (a skip-gram model).
Function flow
- Extract parameters from dictionary
- Set up for analogies
- Read data using TensorFlow's data API
- TensorFlow computational graph
  - Placeholders for inputs
  - (input) Embedding matrix
  - Normalized embedding matrix
  - (output) Embedding matrix ("output weights")
  - Optimization
- Validation block
- Summaries
- Misc.
- Training
Read data using TensorFlow’s data API
# Read data using Tensorflow's data API
data_files = get_data_pair_files(data_folders, context_width)
print('\ttraining with data from files:', data_files)
with tf.name_scope("Reader") as scope:
    random.shuffle(data_files)
    dataset_raw = tf.data.FixedLengthRecordDataset(filenames=data_files,
                                                   record_bytes=8)  # <TFRecordDataset shapes: (), types: tf.string>
    dataset = dataset_raw.map(record_parser)
    dataset = dataset.shuffle(int(1e5))
    dataset_batched = dataset.apply(tf.contrib.data.batch_and_drop_remainder(mini_batch_size))
    dataset_batched = dataset_batched.prefetch(int(100000000))
    iterator = dataset_batched.make_initializable_iterator()
    saveable_iterator = tf.contrib.data.make_saveable_from_iterator(iterator)
    next_batch = iterator.get_next()  # Tensor("Shape:0", shape=(2,), dtype=int32)
Each of these transformations (map, shuffle, apply, prefetch) returns a new `Dataset`.
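record_parser is not shown in these notes. Given that each record is 8 bytes and next_batch is indexed below as integer pairs, a plausible sketch (an assumption, not the repository's verified code) is:

def record_parser(record_bytes):
    # Decode the raw 8-byte string into a length-2 vector of int32 word IDs:
    # [target word ID, context word ID]
    return tf.decode_raw(record_bytes, tf.int32)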
TensorFlow computational graph
Placeholders for inputs
# Placeholders for inputs
with tf.name_scope("Input_Data") as scope:
    train_inputs = next_batch[:, 0]
    train_labels = tf.reshape(next_batch[:, 1], shape=[mini_batch_size, 1], name="training_labels")
The with ... as here only serves to group these ops under a name scope so the computation graph displays nicely in TensorBoard; using it or not has no effect on training (reference blog).
What exactly is train_inputs, and what shape does it have?
From next_batch alone the shape is not obvious, but combined with the later code: train_inputs holds the integer vocabulary IDs of the source words in the batch, one per mini-batch element. The embedding lookup that follows fetches rows of the embedding matrix by these IDs, which is equivalent to multiplying each word's one-hot encoding by the matrix.
(input) Embedding matrix
# (input) Embedding matrix
with tf.name_scope("Input_Layer") as scope:
    W_in = tf.Variable(tf.random_uniform([V, N], -1.0, 1.0), name="input-embeddings")
    # Look up the vector representing each source word in the batch (fetches rows of the embedding matrix)
    h = tf.nn.embedding_lookup(W_in, train_inputs, name="input_embedding_vectors")
tf.random_uniform([V, N], -1.0, 1.0)
V is the vocabulary size, i.e. vocabulary_size, the first parameter of train_skip_gram.
N is the length of the word vectors that will be produced; in this example it is 200.
tf.random_uniform generates floats drawn from a uniform distribution over [-1, 1), so W_in starts out as a random V×N matrix.
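A NumPy illustration (mine) of why looking up row word_id of W_in is equivalent to multiplying the word's one-hot vector by W_in:

import numpy as np
V, N = 5, 3
W_in = np.random.uniform(-1.0, 1.0, size=(V, N))
word_id = 2
one_hot = np.eye(V)[word_id]  # one-hot encoding of word 2
assert np.allclose(W_in[word_id], one_hot @ W_in)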
Normalized embedding matrix
# Normalized embedding matrix
with tf.name_scope("Embeddings_Normalized") as scope:
    normalized_embeddings = tf.nn.l2_normalize(W_in, name="embeddings_normalized")
This applies L2 normalization to W_in, the V×N random matrix whose elements start out in [-1, 1). Note that tf.nn.l2_normalize performs normalization (rescaling vectors to unit L2 norm), not regularization. The point of normalizing is that dot products between unit-norm vectors equal cosine similarities, which the validation block below exploits with a plain matmul.
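A NumPy sketch (my illustration) of row-wise L2 normalization: each row v is replaced by v / ||v||_2, so every row ends up with unit length:

import numpy as np
W = np.random.uniform(-1.0, 1.0, size=(4, 3))
W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)
assert np.allclose(np.linalg.norm(W_norm, axis=1), 1.0)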
(output) Embedding matrix (“output weights”)
# (output) Embedding matrix ("output weights")
with tf.name_scope("Output_Layer") as scope:
    if FLAGS.softmax:
        W_out = tf.Variable(tf.truncated_normal([N, V], stddev=1.0 / math.sqrt(N)), name="output_embeddings")
        # Biases between hidden layer and output layer
        b_out = tf.Variable(tf.zeros([V]), name="nce_bias")
FLAGS.softmax is a bool; its value here is True.
What is this block for? W_out and b_out are the output-layer parameters of the model: W_out (shape N×V) is the weight matrix between the hidden layer and the output layer, and b_out (shape V) is the bias.
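A quick shape check (my own illustration; B matches the m-64 in the folder names, N is 200 as above, and V is a made-up vocabulary size):

import numpy as np
B, N, V = 64, 200, 8000
h = np.zeros((B, N))         # hidden vectors from the embedding lookup
W_out = np.zeros((N, V))
b_out = np.zeros(V)
logits = h @ W_out + b_out   # shape [B, V]: one score per vocabulary word
assert logits.shape == (B, V)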
Optimization
# Optimization
with tf.name_scope("Optimization_Block") as scope:

    # Loss function
    if FLAGS.softmax:  # FLAGS.softmax is a bool; its value here is True
        # The dense layer is the last (output) layer of the network: it maps the batch of
        # hidden vectors h (shape [mini_batch_size, N]) to logits of shape [mini_batch_size, V]
        logits = tf.layers.dense(inputs=h, units=V)
        # One-hot encode train_labels
        onehot = tf.one_hot(train_labels, V)
        # logits are the predictions, onehot the ground truth
        loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=onehot, logits=logits)
        # (the indentation here is correct: train_loss belongs inside this branch)
        train_loss = tf.reduce_mean(loss_tensor, name="nce_loss")

    # Regularization (optional)
    # l2_reg_scale is the scale of the L2 regularization applied to the weights
    # (0: no regularization); its value here is 0.0, so only the else branch runs
    if l2_reg_scale > 0:
        pass  # regularization branch elided in these notes
    else:
        loss = train_loss

    # Optimizer
    # FLAGS.optimizer's value is 'adam'
    if FLAGS.optimizer == 'adam':
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    if FLAGS.optimizer != 'momentum':
        global_train_step = tf.Variable(0, trainable=False, dtype=tf.int32, name="global_step")
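For intuition, a NumPy version (my own) of the per-example softmax cross-entropy that softmax_cross_entropy_with_logits_v2 computes from logits and a one-hot label:

import numpy as np
logits = np.array([2.0, 0.5, -1.0])
onehot = np.array([1.0, 0.0, 0.0])                     # true class is word 0
log_softmax = logits - np.log(np.sum(np.exp(logits)))
loss = -np.sum(onehot * log_softmax)                   # cross-entropy for one example
print(loss)  # ≈ 0.24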
Validation block
# Validation block
# valid_examples is 30 distinct integers drawn at random from the 50 integers 0..49
with tf.name_scope("Validation_Block") as scope:
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32, name="validation_data_size")
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    cosine_similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
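A NumPy check (mine) that, because the rows are unit-norm, the matmul with transpose_b=True yields cosine similarities:

import numpy as np
W = np.random.uniform(-1.0, 1.0, size=(6, 4))
W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalized embeddings
valid = W_norm[[0, 2]]                                 # embedding lookup of two validation IDs
cos = valid @ W_norm.T                                 # cosine similarity matrix, shape (2, 6)
assert np.isclose(cos[0, 0], 1.0) and np.isclose(cos[1, 2], 1.0)  # each word matches itself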
Summaries
# Summaries
with tf.name_scope("Summaries") as scope:
    tf.summary.histogram("input_embeddings", W_in)
    tf.summary.histogram("input_embeddings_normalized", normalized_embeddings)
    tf.summary.histogram("output_embeddings", W_out)
    tf.summary.scalar("nce_loss", loss)
    analogy_score_tensor = tf.Variable(0, trainable=False, dtype=tf.int32, name="analogy_score")
    tf.summary.scalar("analogy_score", analogy_score_tensor)
Misc
# Misc.
restore_completed = False
init = tf.global_variables_initializer() # variables initializer
summary_op = tf.summary.merge_all() # merge summaries into one operation
Training
####################################################################################################################
# Training
with tf.Session(config=config) as sess:

    # Add TensorBoard components
    writer = tf.summary.FileWriter(log_dir)  # create summary writer
    writer.add_graph(sess.graph)
    gvars = [gvar for gvar in tf.global_variables() if 'analogy_score' not in gvar.name]
    saver = tf.train.Saver(gvars, max_to_keep=5)  # create checkpoint saver
    config = projector.ProjectorConfig()  # create projector config
    embedding = config.embeddings.add()  # add embeddings visualizer
    embedding.tensor_name = W_in.name
    embedding.metadata_path = vocab_metada_file  # link metadata
    projector.visualize_embeddings(writer, config)  # add writer and config to projector

    # Set up variables
    graph_saver = tf.train.Saver(allow_empty=True)
    init.run()
    graph_saver.save(sess, ckpt_saver_file_init, global_step=0, write_meta_graph=True)
    tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable_iterator)
    print("\tVariables initialized in TensorFlow")

    # Compute the necessary number of steps for this epoch as well as how often to print the avg loss
    num_steps = int(math.ceil(dataset_size / mini_batch_size))
    step_print_loss = int(math.ceil(num_steps / freq_print_loss))
    print('\tPrinting loss every ', step_print_loss, 'steps, i.e.', freq_print_loss, 'times per epoch')

    ################################################################################################################
    # Epoch loop
    epoch = 0
    global_step = 0
    while epoch < int(num_epochs):

        print('\n\tStarting epoch ', epoch)
        sess.run(iterator.initializer)  # initialize iterator

        ############################################################################################################
        # Loop over steps (mini batches) inside of epoch
        step = 0
        avg_loss = 0
        while True:
            try:
                # Print average loss every x steps
                if step_print_loss > 0 and step % int(step_print_loss) == 0:  # update step with logging

                    # If restoring a previous training session, set the right training epoch
                    if restore_variables and not restore_completed:
                        restore_completed = True

                    # Write global step
                    if True:
                        global_train_step.assign(global_step).eval()

                    # Perform an update
                    # print('\tStarting local step {:>6}'.format(step))  # un-comment for debugging
                    [_, loss_val, train_loss_val, global_step] = sess.run(
                        [optimizer, loss, train_loss, global_train_step], options=options,
                        run_metadata=metadata)
                    assert not np.isnan(loss_val), "Loss at step " + str(step) + " is nan"
                    assert not np.isinf(loss_val), "Loss at step " + str(step) + " is inf"
                    avg_loss += loss_val

                    if step > 0:
                        avg_loss /= step_print_loss

                    analogy_score = i2v_eval.evaluate_analogies(W_in.eval(), reverse_dictionary, analogies,
                                                                analogy_types, analogy_evaluation_file,
                                                                session=sess, print=i2v_eval.nop)
                    total_analogy_score = sum([a[0] for a in analogy_score])
                    analogy_score_tensor.assign(total_analogy_score).eval()  # for tf.summary
                    [summary, W_in_val] = sess.run([summary_op, W_in])

                    if FLAGS.savebest is not None:
                        filelist = [f for f in os.listdir(FLAGS.savebest)]
                        scorelist = [int(s.split('-')[1]) for s in filelist]
                        if len(scorelist) == 0 or total_analogy_score > sorted(scorelist)[-1]:
                            i2v_utils.safe_pickle(W_in_val, FLAGS.savebest + '/' + 'score-' +
                                                  str(total_analogy_score) + '-w.p')

                    # Display average loss
                    print('{} Avg. loss at epoch {:>6,d}, step {:>12,d} of {:>12,d}, global step {:>15} : {:>12.3f}, analogies: {})'.format(
                        str(datetime.now()), epoch, step, num_steps, global_step, avg_loss, str(analogy_score)))
                    avg_loss = 0

                    # Pickle intermediate embeddings
                    i2v_utils.safe_pickle(W_in_val, embeddings_pickle)

                    # Write to TensorBoard
                    saver.save(sess, ckpt_saver_file, global_step=global_step, write_meta_graph=False)
                    writer.add_summary(summary, global_step=global_step)

                    if step > 0 and FLAGS.extreme:
                        sys.exit(22)

                else:  # ordinary update step
                    [_, loss_val] = sess.run([optimizer, loss])
                    avg_loss += loss_val

                # Compute and print nearest neighbors every x steps
                if step_print_neighbors > 0 and step % int(step_print_neighbors) == 0:
                    print_neighbors(op=cosine_similarity, examples=valid_examples, top_k=6,
                                    reverse_dictionary=reverse_dictionary)

                # Update loop index (steps in epoch)
                step += 1
                global_step += 1

            except tf.errors.OutOfRangeError:
                # We reached the end of the epoch
                print('\n\t Writing embeddings to file ', embeddings_pickle)
                i2v_utils.safe_pickle([W_in.eval()], embeddings_pickle)  # WEIRD!
                epoch += 1  # update loop index (epochs)
                break  # from this inner loop

    ################################################################################################################
    # End of training:
    # Print the nearest neighbors at the end of the run
    if step_print_neighbors == -1:
        print_neighbors(op=cosine_similarity, examples=valid_examples, top_k=6,
                        reverse_dictionary=reverse_dictionary)

    # Save state of training and close the TensorBoard summary writer
    save_path = saver.save(sess, ckpt_saver_file_final, global_step)
    writer.add_summary(summary, global_step)
    writer.close()

    return W_in.eval()