This walkthrough uses the PTB dataset and follows the experiments in the book TensorFlow实战Google深度学习框架 (2nd edition). The code is taken directly from the book; I am organizing it into a blog post to better consolidate the material.
First, download the PTB dataset from Tomas Mikolov's website, then extract it and enter the directory:
wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
tar xzvf simple-examples.tgz
cd simple-examples/
cd data/
We only care about the data itself. The data folder contains the following files, but we will use only ptb.train.txt, ptb.valid.txt and ptb.test.txt. These files are already preprocessed: the text is tokenized (adjacent words are separated by a single space), and rare or special tokens have been replaced with <unk>.
ptb.char.test.txt ptb.char.train.txt ptb.char.valid.txt ptb.test.txt ptb.train.txt ptb.valid.txt README
- Building the vocabulary
Given a text corpus, we first need to build a vocabulary: the set of all distinct words that appear in the corpus. The following generate_vocab.py builds it:
import codecs
import collections
from operator import itemgetter

RAW_DATA = "../ptb.train.txt"
VOCAB_OUTPUT = "ptb.vocab"

# Count word frequencies in the raw training text.
counter = collections.Counter()
with codecs.open(RAW_DATA, "r", "utf-8") as f:
    for line in f:
        for word in line.strip().split():
            counter[word] += 1

# Sort words by frequency, most frequent first.
sorted_word_to_cnt = sorted(counter.items(), key=itemgetter(1), reverse=True)
sorted_words = [x[0] for x in sorted_word_to_cnt]
# Prepend the sentence-end token so it also gets an id.
sorted_words = ["<eos>"] + sorted_words

with codecs.open(VOCAB_OUTPUT, "w", "utf-8") as file_output:
    for word in sorted_words:
        file_output.write(word + "\n")
Here codecs.open works much like the built-in open but guards against problems caused by inconsistent character encodings in the text. collections.Counter is a counter class that behaves much like a dict. itemgetter(1) returns a callable that fetches the element at index 1 of each item, so sorting counter.items() with key=itemgetter(1) sorts the (word, count) pairs by count. The counting gives us each word's frequency in the text.
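As a quick illustration of how Counter and itemgetter interact, here is the same pattern on a toy sentence (not real PTB data):

```python
import collections
from operator import itemgetter

# Count words in a toy sentence.
counter = collections.Counter("the cat sat on the mat the end".split())

# itemgetter(1) picks the count out of each (word, count) pair,
# so the sort is by frequency, descending.
pairs = sorted(counter.items(), key=itemgetter(1), reverse=True)
print(pairs[0])  # ('the', 3)
```

Since Python's sort is stable, words with equal counts keep their original order.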
- Mapping words to ids
To make the text data machine-processable, every word must be converted to a number. We use the vocabulary built above for this mapping: a word's id is simply its line number in ptb.vocab. Here is the implementation, generate_id.py:
import codecs

RAW_DATA = "../ptb.test.txt"
VOCAB = "ptb.vocab"
OUTPUT_DATA = "ptb.test"

# Load the vocabulary; a word's id is its line number in ptb.vocab.
with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}

def get_id(word):
    # Out-of-vocabulary words map to the id of the <unk> token.
    return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]

fin = codecs.open(RAW_DATA, "r", "utf-8")
fout = codecs.open(OUTPUT_DATA, "w", "utf-8")
for line in fin:
    # Append <eos> to mark the end of each sentence.
    words = line.strip().split() + ["<eos>"]
    out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
    fout.write(out_line)
fin.close()
fout.close()
Run this conversion for each of the ptb.xxx.txt text files (changing RAW_DATA and OUTPUT_DATA accordingly), which yields three id-encoded files: ptb.train, ptb.valid and ptb.test.
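The mapping step can be sketched on a toy vocabulary (the words and ids here are made up for illustration, not taken from the real ptb.vocab):

```python
# Toy vocabulary: a word's id is its position in the list, as in ptb.vocab.
vocab = ["<eos>", "the", "cat", "<unk>"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def get_id(word):
    # Unknown words fall back to the <unk> id.
    return word_to_id.get(word, word_to_id["<unk>"])

line = "the cat meowed"
ids = [get_id(w) for w in line.split() + ["<eos>"]]
print(ids)  # [1, 2, 3, 0] -- "meowed" is out of vocabulary
```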
- Batching
To exploit context across sentence boundaries, the entire PTB text is treated as one long sequence, which is then cut up according to batch_size. Concretely, the id list of length num_batches * batch_size * num_steps (where num_steps is the fixed unrolled sequence length) is reshaped into a [batch_size, num_batches * num_steps] matrix, and then split along the time axis into num_batches arrays of shape [batch_size, num_steps]. The code is part of the complete example below.
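A minimal numpy sketch of this batching scheme, using a made-up id list:

```python
import numpy as np

batch_size, num_steps = 2, 3
id_list = list(range(13))  # 13 ids -> only 2 full batches fit

num_batches = (len(id_list) - 1) // (batch_size * num_steps)  # = 2
data = np.array(id_list[:num_batches * batch_size * num_steps])
data = data.reshape(batch_size, num_batches * num_steps)
# Split along the time axis into num_batches pieces,
# each of shape [batch_size, num_steps].
data_batches = np.split(data, num_batches, axis=1)

print(data_batches[0])
# [[0 1 2]
#  [6 7 8]]
```

The labels are built the same way from the id list shifted right by one position, so each target word is the next word after its input.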
- Building and running the model
#coding: utf-8
import numpy as np
import tensorflow as tf
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

TRAIN_DATA = "./ptb.train"
EVAL_DATA = "./ptb.valid"
TEST_DATA = "./ptb.test"
HIDDEN_SIZE = 300
NUM_LAYERS = 2
VOCAB_SIZE = 10000
TRAIN_BATCH_SIZE = 20
TRAIN_NUM_STEP = 35
EVAL_BATCH_SIZE = 1
EVAL_NUM_STEP = 1
NUM_EPOCH = 5
LSTM_KEEP_PROB = 0.9
EMBEDDING_KEEP_PROB = 0.9
MAX_GRAD_NORM = 5
SHARE_EMB_AND_SOFTMAX = True

class PTBModel(object):
    def __init__(self, is_training, batch_size, num_steps):
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
        self.targets = tf.placeholder(tf.int32, [batch_size, num_steps])

        # Dropout is only applied during training.
        dropout_keep_prob = LSTM_KEEP_PROB if is_training else 1.0
        lstm_cells = [
            tf.nn.rnn_cell.DropoutWrapper(
                tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE),
                output_keep_prob=dropout_keep_prob)
            for _ in range(NUM_LAYERS)]
        cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
        self.initial_state = cell.zero_state(batch_size, tf.float32)

        embedding = tf.get_variable("embedding", [VOCAB_SIZE, HIDDEN_SIZE])
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)
        if is_training:
            inputs = tf.nn.dropout(inputs, EMBEDDING_KEEP_PROB)

        # Unroll the LSTM for num_steps time steps.
        outputs = []
        state = self.initial_state
        with tf.variable_scope("RNN"):
            for time_step in range(num_steps):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()
                cell_output, state = cell(inputs[:, time_step, :], state)
                outputs.append(cell_output)
        output = tf.reshape(tf.concat(outputs, 1), [-1, HIDDEN_SIZE])

        # Optionally tie the softmax weights to the embedding matrix.
        if SHARE_EMB_AND_SOFTMAX:
            weight = tf.transpose(embedding)
        else:
            weight = tf.get_variable("weight", [HIDDEN_SIZE, VOCAB_SIZE])
        bias = tf.get_variable("bias", [VOCAB_SIZE])
        logits = tf.matmul(output, weight) + bias

        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=tf.reshape(self.targets, [-1]),
            logits=logits)
        self.cost = tf.reduce_sum(loss) / batch_size
        self.final_state = state

        if not is_training:
            return
        # Clip gradients by global norm to avoid exploding gradients.
        trainable_variables = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(
            tf.gradients(self.cost, trainable_variables), MAX_GRAD_NORM)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        self.train_op = optimizer.apply_gradients(zip(grads, trainable_variables))

def run_epoch(session, model, batches, train_op, output_log, step):
    total_costs = 0.0
    iters = 0
    state = session.run(model.initial_state)
    for x, y in batches:
        # Carry the LSTM state over from batch to batch.
        cost, state, _ = session.run(
            [model.cost, model.final_state, train_op],
            {model.input_data: x, model.targets: y, model.initial_state: state})
        total_costs += cost
        iters += model.num_steps
        if output_log and step % 100 == 0:
            print("After %d steps, perplexity is %.3f" % (step, np.exp(total_costs / iters)))
        step += 1
    return step, np.exp(total_costs / iters)

def read_data(file_path):
    with open(file_path, "r") as fin:
        id_string = ' '.join([line.strip() for line in fin.readlines()])
    id_list = [int(w) for w in id_string.split()]
    return id_list

def make_batches(id_list, batch_size, num_step):
    num_batches = (len(id_list) - 1) // (batch_size * num_step)
    data = np.array(id_list[:num_batches * batch_size * num_step])
    data = np.reshape(data, [batch_size, num_batches * num_step])
    data_batches = np.split(data, num_batches, axis=1)
    # Labels are the inputs shifted one position to the right.
    label = np.array(id_list[1:num_batches * batch_size * num_step + 1])
    label = np.reshape(label, [batch_size, num_batches * num_step])
    label_batches = np.split(label, num_batches, axis=1)
    return list(zip(data_batches, label_batches))

def main():
    initializer = tf.random_uniform_initializer(-0.05, 0.05)
    with tf.variable_scope("language_model", reuse=None, initializer=initializer):
        train_model = PTBModel(True, TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)
    with tf.variable_scope("language_model", reuse=True, initializer=initializer):
        eval_model = PTBModel(False, EVAL_BATCH_SIZE, EVAL_NUM_STEP)

    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7, allow_growth=True)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    with sess as session:
        tf.global_variables_initializer().run()
        train_batches = make_batches(
            read_data(TRAIN_DATA), TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)
        eval_batches = make_batches(
            read_data(EVAL_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP)
        test_batches = make_batches(
            read_data(TEST_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP)

        step = 0
        for i in range(NUM_EPOCH):
            print("In iteration: %d" % (i + 1))
            step, train_pplx = run_epoch(session, train_model, train_batches,
                                         train_model.train_op, True, step)
            print("Epoch: %d train perplexity: %.3f" % (i + 1, train_pplx))
            _, eval_pplx = run_epoch(session, eval_model, eval_batches,
                                     tf.no_op(), False, 0)
            print("Epoch: %d eval perplexity: %.3f" % (i + 1, eval_pplx))
        _, test_pplx = run_epoch(session, eval_model, test_batches,
                                 tf.no_op(), False, 0)
        print("test perplexity: %.3f" % (test_pplx))

if __name__ == '__main__':
    main()
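In run_epoch, perplexity is the exponential of the average per-word cross-entropy, np.exp(total_costs / iters). A small numeric sketch (the cost values below are made up, not from a real run) shows the relation:

```python
import numpy as np

# Made-up per-batch costs (summed cross-entropy over the batch divided
# by batch_size), with num_steps = 35 words contributed per batch,
# mirroring how run_epoch accumulates total_costs and iters.
num_steps = 35
batch_costs = [180.0, 170.0, 160.0]

total_costs = sum(batch_costs)
iters = num_steps * len(batch_costs)
perplexity = np.exp(total_costs / iters)
# A perplexity around 129 means the model is roughly as uncertain as
# a uniform choice over ~129 words at each position.
```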
I ran this with TensorFlow 1.12; here is part of the experiment output:
After 2100 steps, perplexity is 148.493
After 2200 steps, perplexity is 145.569
After 2300 steps, perplexity is 144.454
After 2400 steps, perplexity is 142.137
After 2500 steps, perplexity is 139.358
After 2600 steps, perplexity is 136.011
Epoch: 2 train perplexity: 135.442
Epoch: 2 eval perplexity: 132.210
In iteration: 3
After 2700 steps, perplexity is 118.641
After 2800 steps, perplexity is 104.725
After 2900 steps, perplexity is 111.162
After 3000 steps, perplexity is 109.157
After 3100 steps, perplexity is 108.178
After 3200 steps, perplexity is 108.274
After 3300 steps, perplexity is 107.810
After 3400 steps, perplexity is 105.861
After 3500 steps, perplexity is 103.960
After 3600 steps, perplexity is 103.621
After 3700 steps, perplexity is 103.509
After 3800 steps, perplexity is 101.538
After 3900 steps, perplexity is 99.686
Epoch: 3 train perplexity: 99.343