Generating Tang and Song Poetry Automatically with an RNN

RNNs (Recurrent Neural Networks) have a strong advantage when processing long sequences, and together with recent advances in training them, this has led to very successful applications on long text.

Simply put, an RNN can remember features of a long sequence, which makes it good at handling sequential data. Of its many sequence applications, text processing is the most widespread, including sentiment analysis and automatic text generation. Automatic generation of English poetry has been studied fairly widely abroad; generation of Chinese has received relatively little attention. Classical Chinese poetry, especially that of the Tang and Song dynasties, is vast, and it follows internal regularities of its own. If a neural network can discover and represent those regularities, we can make a machine write poems.

First you need training samples. I collected more than 40,000 Tang poems from the web; they look roughly like this.
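Each line of the file holds one poem in the form title:content, which is what the parsing code below expects. A hypothetical excerpt (the exact file I used may differ in detail):

静夜思:床前明月光,疑是地上霜。举头望明月,低头思故乡。
春晓:春眠不觉晓,处处闻啼鸟。夜来风雨声,花落知多少。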


Next the characters need an embedding. Embedding research has come a long way, but here we keep things simple: count how often every character occurs, sort from most to least frequent, and use each character's position in that sorted list as its ID.

import collections
import tensorflow as tf

poetry_file = 'poetry.txt'

# collected poems
poetrys = []
with open(poetry_file, "r", encoding='utf-8') as f:
	for line in f:
		try:
			title, content = line.strip().split(':')
			content = content.replace(' ', '')
			# skip poems whose body contains annotations or other noise
			if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content:
				continue
			# skip poems that are too short or too long
			if len(content) < 5 or len(content) > 79:
				continue
			# '[' and ']' mark the start and end of a poem
			content = '[' + content + ']'
			poetrys.append(content)
		except ValueError:
			# lines that don't split cleanly into title:content are skipped
			continue

# sort poems by length
poetrys = sorted(poetrys, key=lambda line: len(line))
print('Total Tang poems: ', len(poetrys))

# count how often each character occurs
all_words = []
for poetry in poetrys:
	all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)

# keep the most common characters (here: all of them) and append
# a space to serve as the padding/unknown token
words = words[:len(words)] + (' ',)
# map each character to a numeric ID
word_num_map = dict(zip(words, range(len(words))))
# convert each poem into a vector of IDs (cf. TensorFlow practice 1)
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]
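The training loop further down uses batch_size, n_chunk, x_batches and y_batches, which the post never defines. A minimal sketch of that batching step, assuming each batch is padded to its longest poem with the space character appended to words above, and that the target sequence is the input shifted left by one character:

import numpy as np

batch_size = 64
n_chunk = len(poetrys_vector) // batch_size
x_batches, y_batches = [], []
for i in range(n_chunk):
	batches = poetrys_vector[i * batch_size:(i + 1) * batch_size]
	length = max(map(len, batches))
	# pad every poem in the batch to the same length with the space ID
	xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
	for row in range(batch_size):
		xdata[row, :len(batches[row])] = batches[row]
	# target = input shifted one character to the left
	ydata = np.copy(xdata)
	ydata[:, :-1] = xdata[:, 1:]
	x_batches.append(xdata)
	y_batches.append(ydata)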

     

With this mapping, each poem is converted into a vector of integers whose length equals the number of characters in the poem.
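A quick sanity check of the conversion (the actual IDs depend on the corpus):

print(poetrys[0])         # the shortest poem, wrapped in '[' and ']'
print(poetrys_vector[0])  # the same poem as a list of integer IDs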

We then train an RNN on the poems. Building such a network is fairly standard by now; for details see Google's official TensorFlow documentation.

 

def neural_network(model='lstm', rnn_size=128, num_layers=2):
	# choose the recurrent cell type
	if model == 'rnn':
		cell_fun = tf.nn.rnn_cell.BasicRNNCell
	elif model == 'gru':
		cell_fun = tf.nn.rnn_cell.GRUCell
	elif model == 'lstm':
		cell_fun = tf.nn.rnn_cell.BasicLSTMCell

	cell = cell_fun(rnn_size, state_is_tuple=True)
	cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
	initial_state = cell.zero_state(batch_size, tf.float32)

	with tf.variable_scope('rnnlm'):
		# output projection onto the vocabulary (+1 for the padding token)
		softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words) + 1])
		softmax_b = tf.get_variable("softmax_b", [len(words) + 1])
		with tf.device('/gpu:0'):
			# character embedding, looked up by integer ID
			embedding = tf.get_variable("embedding", [len(words) + 1, rnn_size])
			inputs = tf.nn.embedding_lookup(embedding, input_data)

	outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
	output = tf.reshape(outputs, [-1, rnn_size])

	logits = tf.matmul(output, softmax_w) + softmax_b
	probs = tf.nn.softmax(logits)
	return logits, last_state, probs, cell, initial_state
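neural_network() also reads input_data and batch_size from module scope, and the training function below uses output_targets. A sketch of the missing placeholder definitions (note that for generation, batch_size must be 1, because gen_poetry() below feeds one character at a time):

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])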

 

Once the network is built we can train it. We train in mini-batches, 64 poems at a time.

 

def train_neural_network():
	logits, last_state, _, _, _ = neural_network()
	targets = tf.reshape(output_targets, [-1])
	# per-character cross-entropy, averaged over the batch
	loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
		[logits], [targets], [tf.ones_like(targets, dtype=tf.float32)])
	cost = tf.reduce_mean(loss)
	learning_rate = tf.Variable(0.0, trainable=False)
	tvars = tf.trainable_variables()
	# clip gradients to keep training stable
	grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
	optimizer = tf.train.AdamOptimizer(learning_rate)
	train_op = optimizer.apply_gradients(zip(grads, tvars))

	with tf.Session() as sess:
		sess.run(tf.global_variables_initializer())
		saver = tf.train.Saver(tf.global_variables())

		for epoch in range(50):
			# exponentially decay the learning rate each epoch
			sess.run(tf.assign(learning_rate, 0.002 * (0.97 ** epoch)))
			for batch in range(n_chunk):
				train_loss, _, _ = sess.run([cost, last_state, train_op],
					feed_dict={input_data: x_batches[batch], output_targets: y_batches[batch]})
				print(epoch, batch, train_loss)
			if epoch % 7 == 0:
				saver.save(sess, './train_dir/poetry.ckpt', global_step=epoch)

 

After training finishes we save the model.

Next time we load the saved model directly and start generation from a random sample, so each run produces a different poem. That raises the question of when to stop: every poem was wrapped in '[' and ']' during preprocessing, so the network learns this pattern and we can cut generation off when it emits the closing bracket.

 

def gen_poetry():
	def to_word(weights):
		# sample a character index in proportion to its predicted probability
		t = np.cumsum(weights)
		s = np.sum(weights)
		sample = int(np.searchsorted(t, np.random.rand(1) * s))
		# guard against the sampled index landing past the last entry
		sample = min(sample, len(words) - 1)
		return words[sample]

	_, last_state, probs, cell, initial_state = neural_network()

	with tf.Session() as sess:
		sess.run(tf.global_variables_initializer())
		saver = tf.train.Saver(tf.global_variables())

		# restore the most recent checkpoint
		module_file = tf.train.latest_checkpoint('./train_dir')
		print(module_file)
		saver.restore(sess, module_file)

		state_ = sess.run(cell.zero_state(1, tf.float32))

		# feed the start-of-poem marker '[' to prime the network
		x = np.array([list(map(word_num_map.get, '['))])
		[probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
		word = to_word(probs_)
		# (use words[np.argmax(probs_)] instead for greedy decoding)
		poem = ''
		# keep sampling until the network emits the end-of-poem marker ']'
		while word != ']':
			poem += word
			x = np.zeros((1, 1))
			x[0, 0] = word_num_map[word]
			[probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
			word = to_word(probs_)
		return poem
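Generating a poem is then a one-liner (run it in a fresh process, since neural_network() builds new graph variables each time):

print(gen_poetry())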

The results look like this. Every run produces a different poem; a few samples:

Poem 1: 东远春生梦,浮波奔浩氛。光繁空井碧,池辈正无尘。茗牖藏田畔,云霞有瑞香。烟波阻此去,风景向秦关。枕外无多迹,临朝半镜明。谁怜竹洞里,终可遣忘衡。

Poem 2: 行深复何路,异客动郊山。又失天涯外,孤舟行处稀。共知缘卫渡,又上故乡情。月有妆斋满,野心迎夕天。塞风冈自入,谷口和踪息。修菊倍傍人,结人难相慰,还是若云栖。

Poem 3: 莫讶翼憧鞬事,至杨初驻袖中筵。轻竿留戴黄蓑楫,惨淡时将六队声。晴落彩云依郭处,恶云移以赋行人。那堪数曲回车职,更见纤尘亦恐眠。

GitHub: https://github.com/danzhewuju

Finally, here is a much simpler, self-contained sketch of the same idea using the Keras API. Two fixes relative to the version that circulated with this post: the sliding window must be shorter than the (mostly five-character) lines, otherwise the training set comes out empty and model.fit() fails, and the vocabulary is sorted so the character IDs are reproducible across runs.

import tensorflow as tf
import numpy as np

# toy dataset: a handful of famous Tang poem lines
data = ['白日依山尽', '黄河入海流', '欲窮千里目', '更上一層樓',
        '靜夜思', '床前明月光', '疑是地上霜', '舉頭望明月', '低頭思故鄉']

# build the vocabulary and the char <-> id mappings
text = ''.join(data)
vocab = sorted(set(text))
vocab_size = len(vocab)
char_to_num = {char: i for i, char in enumerate(vocab)}
num_to_char = np.array(vocab)

# training pairs: seq_length characters -> the next character,
# sliding over the concatenated text (crossing line boundaries
# is acceptable for a toy example)
seq_length = 3
X_train, y_train = [], []
for i in range(len(text) - seq_length):
    X_train.append([char_to_num[c] for c in text[i:i + seq_length]])
    y_train.append(char_to_num[text[i + seq_length]])
X_train = np.array(X_train)
y_train = np.array(y_train)

# embedding -> LSTM -> softmax over the vocabulary
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50)

# generate: repeatedly predict the next character from the last seq_length
seed = '床前明月光'
for _ in range(10):
    x = np.array([[char_to_num[c] for c in seed[-seq_length:]]])
    y_pred = model.predict(x)[0]
    seed += num_to_char[np.argmax(y_pred)]
print(seed)

The output varies from run to run and is not very meaningful, but it does mimic the surface form of the training lines. For better poems, try adjusting the model's parameters and the training setup.