A Neural Language Model Based on Recurrent Neural Networks (NLP Practice 1)

This write-up uses the PTB dataset and follows the experiment from the book TensorFlow实战Google深度学习框架 (2nd edition). The code is taken directly from the book; I have organized it into a blog post to better consolidate what I learned.

First, download the PTB dataset from Tomas Mikolov's website, then extract it and change into the data folder:

wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
tar xzvf simple-examples.tgz
cd simple-examples/
cd data/

We only care about the data itself. The data folder contains the following files, but we will only use ptb.train.txt, ptb.valid.txt, and ptb.test.txt. They have already been preprocessed: adjacent words are separated by a single space (i.e., the text is tokenized), and rare/special words have been replaced with <unk>.

ptb.char.test.txt  ptb.char.train.txt  ptb.char.valid.txt  ptb.test.txt  ptb.train.txt  ptb.valid.txt  README
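
As a quick sanity check, you can peek at the raw text to confirm the tokenized, space-separated format described above (a minimal sketch; the relative path assumes you are inside simple-examples/data/):

import codecs

# Print the first line of the training text and its token count.
# Words are separated by single spaces, and rare words appear as <unk>.
with codecs.open("ptb.train.txt", "r", "utf-8") as f:
    first_line = f.readline().strip()

print(first_line)
print(len(first_line.split()))
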
  • Building the vocabulary

Given a text corpus, we first need to build a vocabulary, i.e., a duplicate-free set of all the words that appear in it. The following generate_vocab.py builds this vocabulary:

import codecs
import collections
from operator import itemgetter

RAW_DATA = "../ptb.train.txt"
VOCAB_OUTPUT = "ptb.vocab"

counter = collections.Counter()
with codecs.open(RAW_DATA,"r","utf-8") as f:
	for line in f:
		for word in line.strip().split():
			counter[word] += 1

sorted_word_to_cnt = sorted(counter.items(),key=itemgetter(1),reverse=True)
sorted_words = [x[0] for x in sorted_word_to_cnt]
sorted_words = ["<eos>"] + sorted_words

with codecs.open(VOCAB_OUTPUT,'w','utf-8') as file_output:
	for word in sorted_words:
		file_output.write(word+"\n")

Here, codecs.open works essentially like open but guards against problems caused by inconsistent character encodings in the text. collections.Counter() is a counter class that behaves much like a dict. itemgetter is a function that extracts an element from an item; itemgetter(1) returns the element at index 1 of each (word, count) pair, i.e., the count. Counting measures how often each word occurs in the text, so the vocabulary ends up sorted by frequency.
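
To make the sorting step concrete, here is a tiny standalone example (toy data, not PTB) showing how Counter and itemgetter(1) order words by descending frequency:

import collections
from operator import itemgetter

counter = collections.Counter()
for word in "the cat sat on the mat the cat".split():
    counter[word] += 1

# Sort the (word, count) pairs by the count, which sits at index 1 of each pair.
pairs = sorted(counter.items(), key=itemgetter(1), reverse=True)
print(pairs)  # e.g. [('the', 3), ('cat', 2), ('sat', 1), ('on', 1), ('mat', 1)]
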

  • Mapping words to IDs

To make the text data processable, every word has to be converted into a number. The vocabulary built above provides this mapping: a word's ID is simply its line number in ptb.vocab. The following generate_id.py implements the conversion:

import codecs
import sys

RAW_DATA = "../ptb.test.txt"
VOCAB = "ptb.vocab"
OUTPUT_DATA = "ptb.test"

with codecs.open(VOCAB,"r","utf-8") as f_vocab:
	vocab = [w.strip() for w in f_vocab.readlines()]

word_to_id = {k:v for (k,v) in zip(vocab,range(len(vocab)))}

def get_id(word):
	return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]

fin = codecs.open(RAW_DATA,"r","utf-8")
fout = codecs.open(OUTPUT_DATA,"w","utf-8")
for line in fin:
	words = line.strip().split() + ["<eos>"]
	out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
	fout.write(out_line)

fin.close()
fout.close()

Note that all of the ptb.xxx.txt text files need to be converted into ptb.xxx ID files, which gives us the three processed files ptb.train, ptb.valid, and ptb.test.
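
Rather than editing RAW_DATA and OUTPUT_DATA by hand three times, the same logic can be wrapped in a loop; this is a small variation of the script above (the paths assume the same directory layout):

import codecs

VOCAB = "ptb.vocab"

with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {k: v for (k, v) in zip(vocab, range(len(vocab)))}

def get_id(word):
    return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]

# Convert every split: ptb.xxx.txt (text) -> ptb.xxx (word IDs).
for split in ["train", "valid", "test"]:
    with codecs.open("../ptb.%s.txt" % split, "r", "utf-8") as fin, \
         codecs.open("ptb.%s" % split, "w", "utf-8") as fout:
        for line in fin:
            words = line.strip().split() + ["<eos>"]
            fout.write(' '.join(str(get_id(w)) for w in words) + '\n')
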

  • Batching

To make use of context that crosses sentence boundaries, the whole PTB text is treated as a single long sequence, which is then partitioned according to batch_size. Concretely, the first num_batches*batch_size*num_steps word IDs (num_steps is the fixed length of every training sequence) are reshaped into a [batch_size, num_batches*num_steps] matrix and then split along the time axis into num_batches batches of shape [batch_size, num_steps]. The code is part of the complete example below; a small worked example follows.
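
Here is a toy illustration of that reshape-and-split (made-up numbers; the real make_batches appears in the full code below). With batch_size=2 and num_steps=3, a list of 13 IDs gives num_batches = (13-1)//(2*3) = 2:

import numpy as np

id_list = list(range(13))                                      # 13 toy word IDs: 0..12
batch_size, num_steps = 2, 3
num_batches = (len(id_list) - 1) // (batch_size * num_steps)   # = 2

data = np.reshape(np.array(id_list[:num_batches * batch_size * num_steps]),
                  [batch_size, num_batches * num_steps])
# data = [[ 0  1  2  3  4  5]
#         [ 6  7  8  9 10 11]]
data_batches = np.split(data, num_batches, axis=1)
# batch 0: [[0 1 2], [6 7 8]]    batch 1: [[3 4 5], [9 10 11]]

The label batches are built the same way from id_list[1:], so each target is simply the next word in the sequence.
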

  • Building and running the model
     #coding: utf-8
    import numpy as np
    import tensorflow as tf
    import os
    os.environ["CUDA_VISIBLE_DEVICES"]="1"
    TRAIN_DATA = "./ptb.train"
    EVAL_DATA = './ptb.valid'
    TEST_DATA = "./ptb.test"
    HIDDEN_SIZE = 300
    
    NUM_LAYERS = 2
    VOCAB_SIZE = 10000
    TRAIN_BATCH_SIZE = 20
    TRAIN_NUM_STEP = 35
    
    EVAL_BATCH_SIZE = 1
    EVAL_NUM_STEP = 1
    NUM_EPOCH = 5
    LSTM_KEEP_PROB = 0.9
    EMBEDDING_KEEP_PROB = 0.9
    MAX_GRAD_NORM = 5
    SHARE_EMB_AND_SOFTMAX = True
    
    # The language model: word embeddings -> NUM_LAYERS LSTM layers with dropout -> softmax over the vocabulary.
    class PTBModel(object):
    	def __init__(self,is_training,batch_size,num_steps):
    		self.batch_size = batch_size
    		self.num_steps = num_steps
    
    		self.input_data = tf.placeholder(tf.int32,[batch_size,num_steps])
    		self.targets = tf.placeholder(tf.int32,[batch_size,num_steps])
    
    		dropout_keep_prob = LSTM_KEEP_PROB if is_training else 1.0
    		lstm_cells = [
    			tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE),
    				output_keep_prob=dropout_keep_prob) for _ in range(NUM_LAYERS)]
    		cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)
    
    		self.initial_state = cell.zero_state(batch_size,tf.float32)
    		embedding = tf.get_variable("embedding",[VOCAB_SIZE,HIDDEN_SIZE])
    		inputs = tf.nn.embedding_lookup(embedding,self.input_data)
    
    		if is_training:
    			inputs = tf.nn.dropout(inputs,EMBEDDING_KEEP_PROB)
    
    		outputs = []
    		state = self.initial_state
    		with tf.variable_scope("RNN"):
    			for time_step in range(num_steps):
    				if time_step > 0: tf.get_variable_scope().reuse_variables()
    				cell_output,state = cell(inputs[:,time_step,:],state)
    				outputs.append(cell_output)
    
    		output = tf.reshape(tf.concat(outputs,1),[-1,HIDDEN_SIZE])
    		if SHARE_EMB_AND_SOFTMAX:
    			weight = tf.transpose(embedding)
    		else:
    			weight = tf.get_variable("weight",[HIDDEN_SIZE,VOCAB_SIZE])
    
    		bias = tf.get_variable("bias",[VOCAB_SIZE])
    		logits = tf.matmul(output,weight) + bias
    		loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    			labels=tf.reshape(self.targets,[-1]),
    			logits=logits)
    
    		self.cost = tf.reduce_sum(loss)/batch_size
    		self.final_state = state
    
    		if not is_training: return
    
    		trainable_variables = tf.trainable_variables()
    		grads,_ = tf.clip_by_global_norm(tf.gradients(self.cost,trainable_variables),MAX_GRAD_NORM)
    		optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
    		self.train_op = optimizer.apply_gradients(zip(grads,trainable_variables))
    
    
    # Runs one pass over the given batches, carrying the LSTM state across batches,
    # and returns the updated global step together with the perplexity.
    def run_epoch(session,model,batches,train_op,output_log,step):
    	total_costs = 0.0
    	iters = 0
    	state = session.run(model.initial_state)
    	for x,y in batches:
    		cost,state,_ = session.run([model.cost,model.final_state,train_op],
    			{model.input_data:x,model.targets:y,model.initial_state:state})
    		total_costs += cost
    		iters += model.num_steps
    
    		if output_log and step%100==0:
    			print("After %d steps, perplexity is %.3f"%(step,np.exp(total_costs/iters)))
    
    		step += 1
    
    	return step,np.exp(total_costs/iters)
    
    def read_data(file_path):
    	with open(file_path,"r") as fin:
    		id_string = ' '.join([line.strip() for line in fin.readlines()])
    	id_list = [int(w) for w in id_string.split()]
    	return id_list
    
    # Treat the whole ID list as one long sequence: reshape it to [batch_size, num_batches*num_step]
    # and split along the time axis into num_batches batches; labels are the same sequence shifted by one word.
    def make_batches(id_list,batch_size,num_step):
    	num_batches = (len(id_list)-1)//(batch_size*num_step)
    	data = np.array(id_list[:num_batches*batch_size*num_step])
    	data = np.reshape(data,[batch_size,num_batches*num_step])
    	data_batches = np.split(data,num_batches,axis=1)
    
    	label = np.array(id_list[1:num_batches*batch_size*num_step+1])
    	label = np.reshape(label,[batch_size,num_batches*num_step])
    	label_batches = np.split(label,num_batches,axis=1)
    
    	return list(zip(data_batches,label_batches))
    
    # Build the training and evaluation graphs (sharing variables), then train for NUM_EPOCH epochs and evaluate.
    def main():
    	initializer = tf.random_uniform_initializer(-0.05,0.05)
    	with tf.variable_scope("language_model",reuse=None,initializer=initializer):
    		train_model = PTBModel(True,TRAIN_BATCH_SIZE,TRAIN_NUM_STEP)
    
    	with tf.variable_scope("language_model",reuse=True,initializer=initializer):
    		eval_model = PTBModel(False,EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    
    	gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7,allow_growth=True)
    	sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    	with sess as session:
    		tf.global_variables_initializer().run()
    		train_batches = make_batches(
    			read_data(TRAIN_DATA),TRAIN_BATCH_SIZE,TRAIN_NUM_STEP)
    		eval_batches = make_batches(
    			read_data(EVAL_DATA),EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    		test_batches = make_batches(
    			read_data(TEST_DATA),EVAL_BATCH_SIZE,EVAL_NUM_STEP)
    
    		step = 0
    
    		for i in range(NUM_EPOCH):
    			print("In iteration: %d"%(i+1))
    			step,train_pplx = run_epoch(session,train_model,train_batches,train_model.train_op,True,step)
    			print("Epoch: %d train perplexity: %.3f"%(i+1,train_pplx))
    
    			_,eval_pplx = run_epoch(session,eval_model,eval_batches,tf.no_op(),False,0)
    			print("Epoch: %d eval perplexity: %.3f"%(i+1,eval_pplx))
    
    		_,test_pplx = run_epoch(session,eval_model,test_batches,tf.no_op(),False,0)
    		print("test perplexity: %.3f"%(test_pplx))
    			
    if __name__ == '__main__':
    	main()
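
A note on the numbers being printed: sparse_softmax_cross_entropy_with_logits returns the cross-entropy of each target word, and run_epoch accumulates the per-batch summed cost divided by batch_size in total_costs and the number of time steps in iters, so total_costs/iters is the average cross-entropy per word and np.exp of that average is the perplexity. A minimal sketch of the same calculation on made-up loss values:

import numpy as np

# Hypothetical per-word cross-entropy values (natural log), as produced by the softmax loss.
word_losses = np.array([4.8, 5.1, 4.9, 5.0])
perplexity = np.exp(word_losses.mean())   # exp(4.95) ≈ 141
print(perplexity)
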
    			

     

I ran this with TensorFlow 1.12; here is part of the output:

After 2100 steps, perplexity is 148.493
After 2200 steps, perplexity is 145.569
After 2300 steps, perplexity is 144.454
After 2400 steps, perplexity is 142.137
After 2500 steps, perplexity is 139.358
After 2600 steps, perplexity is 136.011
Epoch: 2 train perplexity: 135.442
Epoch: 2 eval perplexity: 132.210
In iteration: 3
After 2700 steps, perplexity is 118.641
After 2800 steps, perplexity is 104.725
After 2900 steps, perplexity is 111.162
After 3000 steps, perplexity is 109.157
After 3100 steps, perplexity is 108.178
After 3200 steps, perplexity is 108.274
After 3300 steps, perplexity is 107.810
After 3400 steps, perplexity is 105.861
After 3500 steps, perplexity is 103.960
After 3600 steps, perplexity is 103.621
After 3700 steps, perplexity is 103.509
After 3800 steps, perplexity is 101.538
After 3900 steps, perplexity is 99.686
Epoch: 3 train perplexity: 99.343

 
