This was part of an AI course project. For this kind of text generation an RNN-style model would be easier and more effective, but the assignment required implementing the same functionality with two distinct approaches. After much searching I found exactly one paper outlining how a genetic algorithm could generate Song Ci, and to keep the two approaches clearly separated (and given my limited skill) I could only produce a rough, stripped-down imitation of it. I'm leaving it as-is for now; it has been a while since submission and I've forgotten most of the details, so this is just a brief record. I may improve it once I've learned more, and pointers from more experienced readers are very welcome.
The code, data, and result files have also been uploaded here, but since the code is essentially all pasted below and the data files only need light preprocessing, downloading them isn't really recommended.
Reference paper:
《一种宋词自动生成的遗传算法及其机器实现》 (a genetic algorithm for the automatic generation of Song Ci and its machine implementation)
A brief introduction to genetic algorithms:
(excerpted from an encyclopedia entry)
A genetic algorithm is modeled on how organisms evolve in nature: it is a computational model of Darwinian natural selection and genetic mechanisms, searching for an optimal solution by simulating the process of evolution. Expressed mathematically and run as a computer simulation, it maps the solution of a problem onto operations analogous to chromosome crossover and mutation.
In short, the algorithm borrows the way genes are inherited to filter and transform the program's candidate solutions, iterates for several rounds, and outputs the best result according to a fitness criterion. It breaks down into six steps:
initialization/encoding -> fitness evaluation -> selection -> crossover -> mutation (return to fitness evaluation until the target generation count is reached) -> output the individual with the highest fitness as the final result
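The six steps above can be sketched as a minimal, generic GA on a toy problem; this is a hypothetical "OneMax" example (evolve a bit string toward all ones), unrelated to ci generation, with all names purely illustrative:

```python
import random

def fitness(ind):                        # step 2: fitness = number of 1 bits
    return sum(ind)

def evolve(length=20, pop_size=30, generations=40, p_mut=0.03):
    # step 1: random initialization of the population
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # step 3: roulette-wheel selection (weight +1 so all-zero individuals still qualify)
        weights = [fitness(ind) + 1 for ind in pop]
        pop = [list(ind) for ind in random.choices(pop, weights, k=pop_size)]
        # step 4: one-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            cut = random.randint(1, length - 1)
            pop[i][cut:], pop[i + 1][cut:] = pop[i + 1][cut:], pop[i][cut:]
        # step 5: bit-flip mutation
        for ind in pop:
            for j in range(length):
                if random.random() < p_mut:
                    ind[j] ^= 1
    # step 6: return the fittest individual of the final generation
    return max(pop, key=fitness)

best = evolve()
```

The same skeleton carries over below, except that an "individual" becomes a stanza and fitness becomes keyword relevance.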
Overview of the approach:
Ci are more rigidly structured than shi: for a given cipai (tune pattern), not only the number of characters but also the tonal pattern (level 平 and oblique 仄 tones) is prescribed. That makes them hard to write, but it makes encoding and filtering well defined. So random encoding follows the tonal pattern; a stanza is one individual, mutation replaces a whole sentence, crossover swaps sentences between individuals, then selection, then the loop repeats. Fitness is the relevance to the given theme (the paper also addresses coherence between adjacent sentences, but with the deadline looming and no good idea I dropped it, which is why 99% of the output reads as nonsense). The fitness function is the sum of relevance scores, computed with a Word2Vec model and the given keywords.
This experiment uses the meter of Bu Suan Zi:
《卜算子》
(仄)仄仄平平,(仄)仄平平仄。(仄)仄平平仄仄平,(仄)仄平平仄。 → 00011 00110 0011001 00110
(仄)仄仄平平,(仄)仄平平仄。(仄)仄平平仄仄平,(仄)仄平平仄。
(Level tone 平 is encoded as 1, oblique tone 仄 as 0.) A ci splits into an upper and a lower stanza with identical tonal patterns, so one stanza is treated as one individual, and the final result combines the two highest-scoring stanzas.
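The tonal filter implied by this encoding can be sketched as follows; this is a minimal illustration, where `rhythm` stands in for the tone table that the real code loads from 平仄.json, and the sample entries (from a well-known Bu Suan Zi line) are illustrative:

```python
def matches(words, pattern, rhythm):
    """Check a candidate sentence against a tonal pattern (level=1, oblique=0)."""
    tones = []
    for word in words:
        for ch in word:
            if ch not in rhythm:   # a character missing from the tone table: reject
                return False
            tones.append(rhythm[ch])
    return tones == pattern        # every position must match, and length must fit

# Illustrative tone entries for the line 寂寞沙洲冷 against pattern (仄)仄平平仄
rhythm = {'寂': 0, '寞': 0, '沙': 1, '洲': 1, '冷': 0}
print(matches(['寂寞', '沙洲', '冷'], [0, 0, 1, 1, 0], rhythm))  # True
```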
Implementation
Data files used in this experiment: 平仄.json (the tone table) and 宋词.txt (a Song Ci dataset providing words usable for encoding); both can be found on GitHub and only need light preprocessing. Originally whole ci were used as data and split at runtime, but that was slow and generated poor results, so the corpus was pre-segmented into individual words instead.
Keyword input & similarity computation
Read and segment the keyword, and load the tone-table dictionary for later use.
seed_word=input('请输入关键词!\n')
seed_word_list=[]
for x in jieba.posseg.cut(seed_word):  # segment the seed phrase so relevance can be computed per word
    seed_word_list.append(x.word)
rhythm_dict={}
time_rhythm_begin=time.time()
print("正在加载平仄表字典……")  # load the tone-table dictionary
with open("平仄.json","r",encoding='utf-8') as file:
    rhythm_dict=json.load(file)
time_load_rhythm=time.time()
print("加载成功!用时",time_load_rhythm-time_rhythm_begin,"s")
Similarity is computed with Word2Vec.
word_list=[]
print("正在加载相似词……")
time_similarity_begin=time.time()
print("正在分词……")
with open('宋词.txt','r',encoding='utf-8') as file:  # relevance to the seed words is computed over this corpus
    for line in file.readlines():
        words=line.split()
        if words:
            word_list.append(words)
print("分词已完成!计算相似值……")
model=Word2Vec(word_list, vector_size=100, window=5, min_count=1, workers=4)  # train the word2vec model
model.save("word2vec.model")
model.wv.save_word2vec_format("word.vector")
similar_word=[]  # words similar to the seed
similarity=[]    # their corresponding similarity scores
time_load_similarity=time.time()
print("加载成功!用时",time_load_similarity-time_similarity_begin,"s")
for word,score in model.wv.most_similar(positive=seed_word_list,topn=50000):  # take the 50000 words most similar to the seed; topn is adjustable
    similar_word.append(word)
    similarity.append(score)
Implementing the genetic algorithm
With the similarity scores in hand, computing fitness is straightforward:
def f(ci,similar_word,similarity):  # relevance score of one stanza
    total=0
    for sentence in ci:
        for word in sentence:
            total+=similarity[similar_word.index(word)]
    return total
Encoding comes first:
def encode(similar_word,rhythm_dict,count):  # encoding
    father_gen=[]
    rhythm_standard=[[0,0,0,1,1],[0,0,1,1,0],[0,0,1,1,0,0,1],[0,0,1,1,0]]  # tonal pattern of Bu Suan Zi; candidates are filtered strictly against it
    for i in range(count):
        ci=[]
        for sentence_standard in rhythm_standard:
            sentence=[]
            flag=False
            word_sum=0
            while flag==False:
                word=random.choice(similar_word)  # pick a word at random
                if word_sum+len(word)>len(sentence_standard):
                    continue
                check=0
                for idx,character in enumerate(word):
                    if character not in rhythm_dict:  # a character missing from the tone table: skip and pick again
                        break
                    if rhythm_dict[character]==sentence_standard[word_sum+idx]:  # tone check at the word's position in the line
                        check+=1
                if check==len(word):
                    sentence.append(word)
                    word_sum+=len(word)
                if word_sum==len(sentence_standard):
                    flag=True
            ci.append(sentence)
        father_gen.append(ci)
    return father_gen  # the parent population of count stanzas
Crossover:
def cross(father_gen):  # crossover strategy: cut at a sentence boundary and swap tails pairwise, which keeps the meter intact
    n=random.randint(1,int(len(father_gen)*0.3))  # randint needs integer bounds
    indices=[x for x in range(len(father_gen))]
    for i in range(n):
        a=random.choice(indices)
        indices.remove(a)
        b=random.choice(indices)
        indices.remove(b)  # an individual is crossed at most once per call
        cut=random.randint(1,3)
        new1=father_gen[a][0:cut]+father_gen[b][cut:]
        new2=father_gen[b][0:cut]+father_gen[a][cut:]
        father_gen[a]=new1
        father_gen[b]=new2
    return father_gen
Mutation:
def vary(father_gen,similar_word,rhythm_dict):
    possibility=[0.97,0.03]  # mutation probability set to 3%
    rhythm_standard=[[0,0,0,1,1],[0,0,1,1,0],[0,0,1,1,0,0,1],[0,0,1,1,0]]
    vary_or_not=random.choices([0,1],possibility,k=len(father_gen))  # decide per individual whether it mutates
    for i in range(len(father_gen)):
        if vary_or_not[i]:
            vary_index=random.randint(0,3)
            del father_gen[i][vary_index]
            sentence=[]
            flag=False
            word_sum=0
            while flag==False:  # mutation strategy: regenerate one whole sentence
                word=random.choice(similar_word)
                if word_sum+len(word)>len(rhythm_standard[vary_index]):
                    continue
                check=0
                for idx,character in enumerate(word):
                    if character not in rhythm_dict:
                        break
                    if rhythm_dict[character]==rhythm_standard[vary_index][word_sum+idx]:
                        check+=1
                if check==len(word):
                    sentence.append(word)
                    word_sum+=len(word)
                if word_sum==len(rhythm_standard[vary_index]):
                    flag=True
            father_gen[i].insert(vary_index,sentence)
    return father_gen
Selection:
def choose(father_gen,similar_word,similarity):  # roulette-wheel selection on the relevance sum
    suitable_f=[]
    total=0
    for i in range(len(father_gen)):  # fitness of each individual and their sum
        suitable_f.append(f(father_gen[i],similar_word,similarity))
        total+=suitable_f[i]
    p=[suitable_f[i]/total for i in range(len(father_gen))]  # selection probability of each individual
    son_gen=random.choices(father_gen,p,k=len(father_gen))
    return son_gen  # the offspring generation
Finally, a little post-processing extracts the best solution (making sure the two chosen stanzas share no sentences), collects each run's result, and writes everything to a file:
def heridity(similar_word,similarity,rhythm_dict,generation=10):
    t=0
    print("正在编码……")
    time_load_encode=time.time()
    father_gen=encode(similar_word,rhythm_dict,count=50)
    time_encode=time.time()
    print("初代个体编码完成!用时",time_encode-time_load_encode,"s")
    print("正在进行遗传操作……")
    time_heridity=[time_encode]
    while t<generation:
        vary(father_gen,similar_word,rhythm_dict)
        cross(father_gen)
        father_gen=choose(father_gen,similar_word,similarity)
        t+=1
        time_heridity.append(time.time())
        print("第",t,"代加载完成!用时",time_heridity[t]-time_heridity[t-1],"s")
    print("全部遗传操作已完成!总用时",time_heridity[-1]-time_heridity[0],"s")
    suitable_f=[f(father_gen[i],similar_word,similarity) for i in range(len(father_gen))]
    index_1=suitable_f.index(max(suitable_f))  # best stanza of the final generation
    del suitable_f[index_1]
    results=[father_gen.pop(index_1)]
    index_2=suitable_f.index(max(suitable_f))
    flag=False
    for sentence in father_gen[index_2]:  # do the two stanzas share a sentence?
        if sentence in results[0]:
            flag=True
            break
    while flag:  # keep searching until the runner-up shares no sentence with the winner
        del suitable_f[index_2]
        del father_gen[index_2]
        index_2=suitable_f.index(max(suitable_f))
        flag=False
        for sentence in father_gen[index_2]:
            if sentence in results[0]:
                flag=True
                break
    results.append(father_gen.pop(index_2))
    sep=''
    out_result=''
    print('结果已生成!')
    for result in results:  # assemble the output text
        for i in range(4):
            if i==0 or i==2:
                out_result=out_result+sep.join(result[i])+','
            else:
                out_result=out_result+sep.join(result[i])+'。'
        out_result=out_result+'\n'
    return out_result
The number of ci to generate in one run can be configured; each is appended to the output file:
t=0  # number of ci generated so far
with open("卜算子.txt",'a',encoding='utf-8') as file:
    while t<5:  # batch size; adjustable
        out_put=Heridity.heridity(similar_word,similarity,rhythm_dict)  # generate via the genetic algorithm
        file.write("卜算子·"+seed_word+"\n"+out_put+"\n")
        t+=1
        print("第",t,"首已生成!")
print("结果文件已生成!")
time_end=time.time()
print("共",t,"首,总共用时",time_end-time_rhythm_begin,"s")
Both complete files follow.
The genetic-algorithm module, Heridity.py:
#Heridity.py
import time
import random
def f(ci,similar_word,similarity):  # relevance score of one stanza
    total=0
    for sentence in ci:
        for word in sentence:
            total+=similarity[similar_word.index(word)]
    return total
def encode(similar_word,rhythm_dict,count):  # encoding
    father_gen=[]
    rhythm_standard=[[0,0,0,1,1],[0,0,1,1,0],[0,0,1,1,0,0,1],[0,0,1,1,0]]  # tonal pattern of Bu Suan Zi; candidates are filtered strictly against it
    for i in range(count):
        ci=[]
        for sentence_standard in rhythm_standard:
            sentence=[]
            flag=False
            word_sum=0
            while flag==False:
                word=random.choice(similar_word)  # pick a word at random
                if word_sum+len(word)>len(sentence_standard):
                    continue
                check=0
                for idx,character in enumerate(word):
                    if character not in rhythm_dict:  # a character missing from the tone table: skip and pick again
                        break
                    if rhythm_dict[character]==sentence_standard[word_sum+idx]:  # tone check at the word's position in the line
                        check+=1
                if check==len(word):
                    sentence.append(word)
                    word_sum+=len(word)
                if word_sum==len(sentence_standard):
                    flag=True
            ci.append(sentence)
        father_gen.append(ci)
    return father_gen  # the parent population of count stanzas
def vary(father_gen,similar_word,rhythm_dict):
    possibility=[0.97,0.03]  # mutation probability set to 3%
    rhythm_standard=[[0,0,0,1,1],[0,0,1,1,0],[0,0,1,1,0,0,1],[0,0,1,1,0]]
    vary_or_not=random.choices([0,1],possibility,k=len(father_gen))  # decide per individual whether it mutates
    for i in range(len(father_gen)):
        if vary_or_not[i]:
            vary_index=random.randint(0,3)
            del father_gen[i][vary_index]
            sentence=[]
            flag=False
            word_sum=0
            while flag==False:  # mutation strategy: regenerate one whole sentence
                word=random.choice(similar_word)
                if word_sum+len(word)>len(rhythm_standard[vary_index]):
                    continue
                check=0
                for idx,character in enumerate(word):
                    if character not in rhythm_dict:
                        break
                    if rhythm_dict[character]==rhythm_standard[vary_index][word_sum+idx]:
                        check+=1
                if check==len(word):
                    sentence.append(word)
                    word_sum+=len(word)
                if word_sum==len(rhythm_standard[vary_index]):
                    flag=True
            father_gen[i].insert(vary_index,sentence)
    return father_gen
def cross(father_gen):  # crossover strategy: cut at a sentence boundary and swap tails pairwise, which keeps the meter intact
    n=random.randint(1,int(len(father_gen)*0.3))  # randint needs integer bounds
    indices=[x for x in range(len(father_gen))]
    for i in range(n):
        a=random.choice(indices)
        indices.remove(a)
        b=random.choice(indices)
        indices.remove(b)  # an individual is crossed at most once per call
        cut=random.randint(1,3)
        new1=father_gen[a][0:cut]+father_gen[b][cut:]
        new2=father_gen[b][0:cut]+father_gen[a][cut:]
        father_gen[a]=new1
        father_gen[b]=new2
    return father_gen
def choose(father_gen,similar_word,similarity):  # roulette-wheel selection on the relevance sum
    suitable_f=[]
    total=0
    for i in range(len(father_gen)):  # fitness of each individual and their sum
        suitable_f.append(f(father_gen[i],similar_word,similarity))
        total+=suitable_f[i]
    p=[suitable_f[i]/total for i in range(len(father_gen))]  # selection probability of each individual
    son_gen=random.choices(father_gen,p,k=len(father_gen))
    return son_gen  # the offspring generation
def heridity(similar_word,similarity,rhythm_dict,generation=10):
    t=0
    print("正在编码……")
    time_load_encode=time.time()
    father_gen=encode(similar_word,rhythm_dict,count=50)
    time_encode=time.time()
    print("初代个体编码完成!用时",time_encode-time_load_encode,"s")
    print("正在进行遗传操作……")
    time_heridity=[time_encode]
    while t<generation:
        vary(father_gen,similar_word,rhythm_dict)
        cross(father_gen)
        father_gen=choose(father_gen,similar_word,similarity)
        t+=1
        time_heridity.append(time.time())
        print("第",t,"代加载完成!用时",time_heridity[t]-time_heridity[t-1],"s")
    print("全部遗传操作已完成!总用时",time_heridity[-1]-time_heridity[0],"s")
    suitable_f=[f(father_gen[i],similar_word,similarity) for i in range(len(father_gen))]
    index_1=suitable_f.index(max(suitable_f))  # best stanza of the final generation
    del suitable_f[index_1]
    results=[father_gen.pop(index_1)]
    index_2=suitable_f.index(max(suitable_f))
    flag=False
    for sentence in father_gen[index_2]:  # do the two stanzas share a sentence?
        if sentence in results[0]:
            flag=True
            break
    while flag:  # keep searching until the runner-up shares no sentence with the winner
        del suitable_f[index_2]
        del father_gen[index_2]
        index_2=suitable_f.index(max(suitable_f))
        flag=False
        for sentence in father_gen[index_2]:
            if sentence in results[0]:
                flag=True
                break
    results.append(father_gen.pop(index_2))
    sep=''
    out_result=''
    print('结果已生成!')
    for result in results:  # assemble the output text
        for i in range(4):
            if i==0 or i==2:
                out_result=out_result+sep.join(result[i])+','
            else:
                out_result=out_result+sep.join(result[i])+'。'
        out_result=out_result+'\n'
    return out_result
The entry script, main.py:
import Heridity
import jieba.posseg
import time
import json
from gensim.models import Word2Vec
if __name__ == '__main__':
    t=0
    seed_word=input('请输入关键词!\n')
    seed_word_list=[]
    for x in jieba.posseg.cut(seed_word):  # segment the seed phrase so relevance can be computed per word
        seed_word_list.append(x.word)
    rhythm_dict={}
    time_rhythm_begin=time.time()
    print("正在加载平仄表字典……")  # load the tone-table dictionary
    with open("平仄.json","r",encoding='utf-8') as file:
        rhythm_dict=json.load(file)
    time_load_rhythm=time.time()
    print("加载成功!用时",time_load_rhythm-time_rhythm_begin,"s")
    word_list=[]
    print("正在加载相似词……")
    time_similarity_begin=time.time()
    print("正在分词……")
    with open('宋词.txt','r',encoding='utf-8') as file:  # relevance to the seed words is computed over this corpus
        for line in file.readlines():
            words=line.split()
            if words:
                word_list.append(words)
    print("分词已完成!计算相似值……")
    model=Word2Vec(word_list, vector_size=100, window=5, min_count=1, workers=4)  # train the word2vec model
    model.save("word2vec.model")
    model.wv.save_word2vec_format("word.vector")
    similar_word=[]  # words similar to the seed
    similarity=[]    # their corresponding similarity scores
    time_load_similarity=time.time()
    print("加载成功!用时",time_load_similarity-time_similarity_begin,"s")
    for word,score in model.wv.most_similar(positive=seed_word_list,topn=50000):  # take the 50000 words most similar to the seed; topn is adjustable
        similar_word.append(word)
        similarity.append(score)
    with open("卜算子.txt",'a',encoding='utf-8') as file:
        while t<5:  # batch size; adjustable
            out_put=Heridity.heridity(similar_word,similarity,rhythm_dict)  # generate via the genetic algorithm
            file.write("卜算子·"+seed_word+"\n"+out_put+"\n")
            t+=1
            print("第",t,"首已生成!")
    print("结果文件已生成!")
    time_end=time.time()
    print("共",t,"首,总共用时",time_end-time_rhythm_begin,"s")
Results
Mostly nonsense, and that's after filtering out most rare and variant characters. It just goes to show how much a sentence-coherence check matters, haha.
Summary
The pipeline works as described, but the final output is clearly poor. The obvious improvement is a coherence function that scores how reasonably adjacent sentences connect, so that the final fitness combines a relevance term and a coherence term. Deep learning isn't really an option here (the other required approach already uses an RNN, and the two must not overlap), but perhaps the coherence score could be pinned down with part-of-speech rules or similar; the result might end up too rigid for that to be worthwhile. There is still plenty of room for improvement.
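The shape of that improvement can be sketched as a two-part fitness. This is only a sketch under stated assumptions: `relevance_of` and `pair_coherence` are hypothetical scorer callables (a real `pair_coherence` might rate part-of-speech transitions between adjacent words), and the weight `w` is arbitrary:

```python
def combined_fitness(ci, relevance_of, pair_coherence, w=0.5):
    # Relevance term: sum of per-word keyword-relevance scores (as in f() above).
    relevance = sum(relevance_of(word) for sentence in ci for word in sentence)
    # Coherence term: sum over adjacent word pairs across the whole stanza.
    flat = [word for sentence in ci for word in sentence]
    coherence = sum(pair_coherence(a, b) for a, b in zip(flat, flat[1:]))
    return relevance + w * coherence

# Toy check with constant scorers: 3 words -> relevance 3, 2 pairs -> coherence 2.
value = combined_fitness([['孤', '帆'], ['远影']], lambda word: 1, lambda a, b: 1)
```

With constant scorers the toy call yields 3 + 0.5 * 2 = 4.0; swapping in real scorers would leave the selection and crossover code above unchanged, since `choose` only sees the fitness values.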