A Neural Language Model Based on Recurrent Neural Networks (NLP Practice 1)

This write-up uses the PTB dataset and follows the experiment from the book TensorFlow实战Google深度学习框架 (2nd edition). The code is taken directly from the book; I have organized it into a blog post to better consolidate what I learned.

First, download the PTB dataset from Tomas Mikolov's website, then extract it and change into the data folder:

wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
tar xzvf simple-examples.tgz
cd simple-examples/
cd data/

We only care about the data itself. The data folder contains the following files, but we will only use ptb.train.txt, ptb.valid.txt, and ptb.test.txt. They have already been preprocessed: adjacent words are separated by a single space (i.e., the text is tokenized), and rare/special words have been replaced with <unk>.

ptb.char.test.txt  ptb.char.train.txt  ptb.char.valid.txt  ptb.test.txt  ptb.train.txt  ptb.valid.txt  README
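
As a quick sanity check, you can peek at the raw text to confirm the tokenized, space-separated format described above (a minimal sketch; the relative path assumes you are inside simple-examples/data/):

import codecs

# Print the first line of the training text and its token count.
# Words are separated by single spaces, and rare words appear as <unk>.
with codecs.open("ptb.train.txt", "r", "utf-8") as f:
    first_line = f.readline().strip()

print(first_line)
print(len(first_line.split()))
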
  • Building the vocabulary

Given a text corpus, we first need to build a vocabulary, i.e., a duplicate-free set of all the words that appear in it. The following generate_vocab.py builds this vocabulary:

import codecs
import collections
from operator import itemgetter

RAW_DATA = "../ptb.train.txt"
VOCAB_OUTPUT = "ptb.vocab"

counter = collections.Counter()
with codecs.open(RAW_DATA,"r","utf-8") as f:
	for line in f:
		for word in line.strip().split():
			counter[word] += 1

sorted_word_to_cnt = sorted(counter.items(),key=itemgetter(1),reverse=True)
sorted_words = [x[0] for x in sorted_word_to_cnt]
sorted_words = ["<eos>"] + sorted_words

with codecs.open(VOCAB_OUTPUT,'w','utf-8') as file_output:
	for word in sorted_words:
		file_output.write(word+"\n")

Here, codecs.open works essentially like open but guards against problems caused by inconsistent character encodings in the text. collections.Counter() is a counter class that behaves much like a dict. itemgetter is a function that extracts an element from an item; itemgetter(1) returns the element at index 1 of each (word, count) pair, i.e., the count. Counting measures how often each word occurs in the text, so the vocabulary ends up sorted by frequency.
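
To make the sorting step concrete, here is a tiny standalone example (toy data, not PTB) showing how Counter and itemgetter(1) order words by descending frequency:

import collections
from operator import itemgetter

counter = collections.Counter()
for word in "the cat sat on the mat the cat".split():
    counter[word] += 1

# Sort the (word, count) pairs by the count, which sits at index 1 of each pair.
pairs = sorted(counter.items(), key=itemgetter(1), reverse=True)
print(pairs)  # e.g. [('the', 3), ('cat', 2), ('sat', 1), ('on', 1), ('mat', 1)]
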

  • Mapping words to IDs

To make the text data processable, every word has to be converted into a number. The vocabulary built above provides this mapping: a word's ID is simply its line number in ptb.vocab. The following generate_id.py implements the conversion:

import codecs
import sys

RAW_DATA = "../ptb.test.txt"
VOCAB = "ptb.vocab"
OUTPUT_DATA = "ptb.test"

with codecs.open(VOCAB,"r","utf-8") as f_vocab:
	vocab = [w.strip() for w in f_vocab.readlines()]

word_to_id = {k:v for (k,v) in zip(vocab,range(len(vocab)))}

def get_id(word):
	return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]

fin = codecs.open(RAW_DATA,"r","utf-8")
fout = codecs.open(OUTPUT_DATA,"w","utf-8")
for line in fin:
	words = line.strip().split() + ["<eos>"]
	out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
	fout.write(out_line)

fin.close()
fout.close()

Note that all of the ptb.xxx.txt text files need to be converted into ptb.xxx ID files, which gives us the three processed files ptb.train, ptb.valid, and ptb.test.
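
Rather than editing RAW_DATA and OUTPUT_DATA by hand three times, the same logic can be wrapped in a loop; this is a small variation of the script above (the paths assume the same directory layout):

import codecs

VOCAB = "ptb.vocab"

with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}

def get_id(word):
    return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]

# Convert every split: ptb.xxx.txt (text) -> ptb.xxx (word IDs).
for split in ["train", "valid", "test"]:
    with codecs.open("../ptb.%s.txt" % split, "r", "utf-8") as fin, \
         codecs.open("ptb.%s" % split, "w", "utf-8") as fout:
        for line in fin:
            words = line.strip().split() + ["<eos>"]
            fout.write(' '.join(str(get_id(w)) for w in words) + '\n')
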

  • Batching

To make use of context that crosses sentence boundaries, the whole PTB text is treated as a single long sequence, which is then partitioned according to batch_size. Concretely, the first num_batches*batch_size*num_steps word IDs (num_steps is the fixed length of every training sequence) are reshaped into a [batch_size, num_batches*num_steps] matrix and then split along the time axis into num_batches batches of shape [batch_size, num_steps]. The code is part of the complete example below; a small worked example follows.
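
Here is a toy illustration of that reshape-and-split (made-up numbers; the real make_batches appears in the full code below). With batch_size=2 and num_steps=3, a list of 13 IDs gives num_batches = (13-1)//(2*3) = 2:

import numpy as np

id_list = list(range(13))                                      # 13 toy word IDs: 0..12
batch_size, num_steps = 2, 3
num_batches = (len(id_list) - 1) // (batch_size * num_steps)   # = 2

data = np.reshape(np.array(id_list[:num_batches * batch_size * num_steps]),
                  [batch_size, num_batches * num_steps])
# data = [[ 0  1  2  3  4  5]
#         [ 6  7  8  9 10 11]]
data_batches = np.split(data, num_batches, axis=1)
# batch 0: [[0 1 2], [6 7 8]]    batch 1: [[3 4 5], [9 10 11]]

The label batches are built the same way from id_list[1:], so each target is simply the next word in the sequence.
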

  • Building and running the model
     #coding: utf-8
    import numpy as np
    import tensorflow as tf
    import os
    os.environ["CUDA_VISIBLE_DEVICES"]="1"
    TRAIN_DATA = "./ptb.train"
    EVAL_DATA = './ptb.valid'
    TEST_DATA = "./ptb.test"
    HIDDEN_SIZE = 300
    
    NUM_LAYERS = 2
    VOCAB_SIZE = 10000
    TRAIN_BATCH_SIZE = 20
    TRAIN_NUM_STEP = 35
    
    EVAL_BATCH_SIZE = 1
    EVAL_NUM_STEP = 1
    NUM_EPOCH = 5
    LSTM_KEEP_PROB = 0.9
    EMBEDDING_KEEP_PROB = 0.9
    MAX_GRAD_NORM = 5
    SHARE_EMB_AND_SOFTMAX = True
    
    # The language model: word embeddings -> NUM_LAYERS LSTM layers with dropout -> softmax over the vocabulary.
    class PTBModel(object):
    	def __init__(self,is_training,batch_size,num_steps):
    		self.batch_size = batch_size
    		self.num_steps = num_steps
    
    		self.input_data = tf.placeholder(tf.int32,[batch_size,num_steps])
    		self.targets = tf.placeholder(tf.int32,[batch_size,num_steps])
    
    		dropout_keep_prob = LSTM_KEEP_PROB if is_training else 1.0
    		lstm_cells = [
    			tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE),
    				output_keep_prob=dropout_keep_prob) for _ in range(NUM_LAYERS)]
    		cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
    
    		self.initial_state = cell.zero_state(batch_size,tf.float32)
    		embedding = tf.get_variable("embedding",[VOCAB_SIZE,HIDDEN_SIZE])
    		inputs = tf.nn.embedding_lookup(embedding,self.input_data)
    
    		if is_training:
    			inputs = tf.nn.dropout(inputs,EMBEDDING_KEEP_PROB)
    
    		outputs = []
    		state = self.initial_state
    		with tf.variable_scope("RNN"):
    			for time_step in range(num_steps):
    				if time_step > 0: tf.get_variable_scope().reuse_variables()
    				cell_output,state = cell(inputs[:,time_step,:],state)
    				outputs.append(cell_output)
    
    		output = tf.reshape(tf.concat(outputs,1),[-1,HIDDEN_SIZE])
    		if SHARE_EMB_AND_SOFTMAX:
    			weight = tf.transpose(embedding)
    		else:
    			weight = tf.get_variable("weight",[HIDDEN_SIZE,VOCAB_SIZE])
    
    		bias = tf.get_variable("bias",[VOCAB_SIZE])
    		logits = tf.matmul(output,weight) + bias
    		loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    			labels=tf.reshape(self.targets,[-1]),
    			logits=logits)
    
    		self.cost = tf.reduce_sum(loss)/batch_size
    		self.final_state = state
    
    		if not is_training: return
    
    		trainable_variables = tf.trainable_variables()
    		grads,_ = tf.clip_by_global_norm(tf.gradients(self.cost,trainable_variables),MAX_GRAD_NORM)
    		optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
    		self.train_op = optimizer.apply_gradients(zip(grads,trainable_variables))
    
    
    # Runs one pass over the given batches, carrying the LSTM state across batches,
    # and returns the updated global step together with the perplexity.
    def run_epoch(session,model,batches,train_op,output_log,step):
    	total_costs = 0.0
    	iters = 0
    	state = session.run(model.initial_state)
    	for x,y in batches:
    		cost,state,_ = session.run([model.cost,model.final_state,train_op],
    			{model.input_data:x,model.targets:y,model.initial_state:state})
    		total_costs += cost
    		iters += model.num_steps
    
    		if output_log and step%100==0:
    			print("After %d steps, perplexity is %.3f"%(step,np.exp(total_costs/iters)))
    
    		step += 1
    
    	return step,np.exp(total_costs/iters)
    
    def read_data(file_path):
    	with open(file_path,"r") as fin:
    		id_string = ' '.join([line.strip() for line in fin.readlines()])
    	id_list = [int(w) for w in id_string.split()]
    	return id_list
    
    # Treat the whole ID list as one long sequence: reshape it to [batch_size, num_batches*num_step]
    # and split along the time axis into num_batches batches; labels are the same sequence shifted by one word.
    def make_batches(id_list,batch_size,num_step):
    	num_batches = (len(id_list)-1)//(batch_size*num_step)
    	data = np.array(id_list[:num_batches*batch_size*num_step])
    	data = np.reshape(data,[batch_size,num_batches*num_step])
    	data_batches = np.split(data,num_batches,axis=1)
    
    	label = np.array(id_list[1:num_batches*batch_size*num_step+1])
    	label = np.reshape(label,[batch_size,num_batches*num_step])
    	label_batches = np.split(label,num_batches,axis=1)
    
    	return list(zip(data_batches,label_batches))
    
    # Build the training and evaluation graphs (sharing variables), then train for NUM_EPOCH epochs and evaluate.
    def main():
    	initializer = tf.random_uniform_initializer(-0.05,0.05)
    	with tf.variable_scope("language_model",reuse=None,initializer=initializer):
    		train_model = PTBModel(True,TRAIN_BATCH_SIZE,TRAIN_NUM_STEP)
    
    	with tf.variable_scope("language_model",reuse=True,initializer=initializer):
    		eval_model = PTBModel(False,EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    
    	gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7,allow_growth=True)
    	sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    	with sess as session:
    		tf.global_variables_initializer().run()
    		train_batches = make_batches(
    			read_data(TRAIN_DATA),TRAIN_BATCH_SIZE,TRAIN_NUM_STEP)
    		eval_batches = make_batches(
    			read_data(EVAL_DATA),EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    		test_batches = make_batches(
    			read_data(TEST_DATA),EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    
    		step = 0
    
    		for i in range(NUM_EPOCH):
    			print("In iteration: %d"%(i+1))
    			step,train_pplx = run_epoch(session,train_model,train_batches,train_model.train_op,True,step)
    			print("Epoch: %d train perplexity: %.3f"%(i+1,train_pplx))
    
    			_,eval_pplx = run_epoch(session,eval_model,eval_batches,tf.no_op(),False,0)
    			print("Epoch: %d eval perplexity: %.3f"%(i+1,eval_pplx))
    
    		_,test_pplx = run_epoch(session,eval_model,test_batches,tf.no_op(),False,0)
    		print("test perplexity: %.3f"%(test_pplx))
    			
    if __name__ == '__main__':
    	main()
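
A note on the numbers being printed: sparse_softmax_cross_entropy_with_logits returns the cross-entropy of each target word, and run_epoch accumulates the per-batch summed cost divided by batch_size in total_costs and the number of time steps in iters, so total_costs/iters is the average cross-entropy per word and np.exp of that average is the perplexity. A minimal sketch of the same calculation on made-up loss values:

import numpy as np

# Hypothetical per-word cross-entropy values (natural log), as produced by the softmax loss.
word_losses = np.array([4.8, 5.1, 4.9, 5.0])
perplexity = np.exp(word_losses.mean())   # exp(4.95) ≈ 141
print(perplexity)
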
    			

     

I ran this with TensorFlow 1.12; here is part of the output:

After 2100 steps, perplexity is 148.493
After 2200 steps, perplexity is 145.569
After 2300 steps, perplexity is 144.454
After 2400 steps, perplexity is 142.137
After 2500 steps, perplexity is 139.358
After 2600 steps, perplexity is 136.011
Epoch: 2 train perplexity: 135.442
Epoch: 2 eval perplexity: 132.210
In iteration: 3
After 2700 steps, perplexity is 118.641
After 2800 steps, perplexity is 104.725
After 2900 steps, perplexity is 111.162
After 3000 steps, perplexity is 109.157
After 3100 steps, perplexity is 108.178
After 3200 steps, perplexity is 108.274
After 3300 steps, perplexity is 107.810
After 3400 steps, perplexity is 105.861
After 3500 steps, perplexity is 103.960
After 3600 steps, perplexity is 103.621
After 3700 steps, perplexity is 103.509
After 3800 steps, perplexity is 101.538
After 3900 steps, perplexity is 99.686
Epoch: 3 train perplexity: 99.343

 
