Natural Language Processing Series: chat-bot (1)

Latest recommended article published on 2022-12-01 15:30:25

googler_offer

Category: Natural Language Processing

Step 1: Prepare the dialogue data

A: I spent the whole night reviewing machine learning.

B: Did it sink in?

A: I've made my peace with it...

B: ...

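Your robot cannot make sense of raw text like this. But if you say to her

"[ 0.006 -0.054 -0.101 ]"

she receives the instruction and is ready to get to work.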

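Each word corresponds to a sequence of numbers, known technically as a word vector:

复习 ("review") → [ 0.006 -0.054 -0.101 ]

All of these vectors are stored together in one dictionary.

Open the dictionary and you will find the vector recorded for every word. With it, each Chinese word can be replaced by a sequence of numbers and fed to your robot.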
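As a minimal sketch of such a dictionary, the snippet below maps words to vectors with a plain Python dict. Only the vector for 复习 comes from the example above; the second entry and the names `word_vectors` and `lookup` are invented for illustration.

```python
# Toy word-vector dictionary. Only the entry for "复习" is taken from the text;
# the second entry is an invented placeholder.
word_vectors = {
    "复习": [0.006, -0.054, -0.101],
    "机器学习": [0.120, 0.033, -0.047],  # hypothetical values
}

def lookup(word):
    """Return the vector for a word, or None if the word is not in the dictionary."""
    return word_vectors.get(word)

print(lookup("复习"))  # [0.006, -0.054, -0.101]
print(lookup("看开"))  # None
```

Words that are missing from the dictionary have no vector, which is why the later conversion code skips them.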

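The corpus and the dictionary are therefore used as follows.

First split the corpus into individual words or characters, then look up the word vector for each one in the dictionary: segment → look up in the dictionary → word vectors → word-vector matrix.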
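The segment → look up → matrix pipeline can be sketched as follows. This is only an illustration under stated assumptions: the sentence is taken as already segmented (in the real flow jieba does this), and the dictionary `toy_dict`, all of its values except those for 复习, and the helper `sentence_to_matrix` are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional dictionary; a real model uses many more dimensions.
toy_dict = {
    "复习": np.array([0.006, -0.054, -0.101], dtype=np.float32),
    "了": np.array([0.010, 0.020, 0.030], dtype=np.float32),         # invented
    "机器学习": np.array([0.120, 0.033, -0.047], dtype=np.float32),  # invented
}

def sentence_to_matrix(words, dictionary):
    """Stack the vectors of all known words into a (num_words, dim) matrix."""
    vectors = [dictionary[w] for w in words if w in dictionary]
    return np.stack(vectors)

# Assume segmentation has already produced this word list.
words = ["复习", "了", "一", "晚上", "的", "机器学习"]
matrix = sentence_to_matrix(words, toy_dict)
print(matrix.shape)  # (3, 3): three known words, three dimensions each
```

Note that `np.stack` raises an error on an empty list, so a sentence with no known words would need separate handling.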

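Once the text has been converted to word vectors and each A-B dialogue pair has been placed into the variables X and Y respectively, the data preparation is complete.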

In short: segment the text first, then convert it to word vectors through the dictionary. The following packages are needed:

Python, jieba, gensim, numpy, re

The corpus and data can be downloaded here:

https://pan.baidu.com/s/1dE2xOJ3 (password: mqu9)

First, segment the corpus:

import jieba

# Segment the corpus into words
def word_segment():

	# The input is read in binary mode; jieba decodes the raw bytes itself
	with open('../data/chatterbot.txt','rb') as inputFile_NoSegment, \
			open('../data/chatterbot_segment.txt','w',encoding='utf-8') as outputFile_Segment:

		for line in inputFile_NoSegment:
			# Segment the line with jieba and join the tokens with spaces
			seg_list = jieba.cut(line.strip())
			segments = ' '.join(seg_list).strip()

			# Write one segmented line per non-empty input line
			if segments:
				outputFile_Segment.write(segments+'\n')

Next, put the segmented question and answer sentences into question and answer respectively.

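import re

# Split the question and answer sentences into the question and answer lists
def question_answer():

	with open('../data/chatterbot_segment.txt','r',encoding='utf-8') as file:
		subtitles = file.read()

	question = []
	answer = []

	# The code assumes that 'E' starts each dialogue and 'M' starts each utterance
	subtitles_list = subtitles.split('E')

	for q_a in subtitles_list:

		# Keep only dialogues that contain at least two 'M' utterances
		if re.search('.*M.*M.*',q_a,flags = re.DOTALL):

			q_a = q_a.strip()
			q_a_pair = q_a.split('M')

			# q_a_pair[0] is the text before the first 'M';
			# q_a_pair[1] is the question and q_a_pair[2] the answer
			question.append(q_a_pair[1].strip())
			answer.append(q_a_pair[2].strip())

	return question,answer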
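
To see what the split logic does, here is a small self-contained sketch that applies the same 'E'/'M' parsing to an in-memory string. The sample text and the function name `parse_dialogues` are assumptions made for illustration; the real corpus may differ in detail.

```python
import re

def parse_dialogues(text):
    """Split text on 'E' into dialogues and on 'M' into utterances."""
    questions, answers = [], []
    for block in text.split('E'):
        # Skip blocks that do not contain at least two 'M' markers
        if re.search('.*M.*M.*', block, flags=re.DOTALL):
            parts = block.strip().split('M')
            questions.append(parts[1].strip())
            answers.append(parts[2].strip())
    return questions, answers

# Invented sample in the assumed format: 'E' starts a dialogue, 'M' an utterance.
sample = "E\nM 复习 了 一 晚上 的 机器学习\nM 看懂 了 ？\nE\nM 看开 了\n"
q, a = parse_dialogues(sample)
print(q)  # ['复习 了 一 晚上 的 机器学习']
print(a)  # ['看懂 了 ？']
```

The second dialogue in the sample has only one utterance, so it is dropped. Because the split is done on the raw characters 'E' and 'M', an utterance that itself contains one of those Latin letters would be cut in the wrong place.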

Then convert the words in question and answer to word vectors, and pad the question and answer sentences to the same length.

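import numpy as np
from gensim.models import word2vec

# Convert the segmented sentences to word vectors and unify their length
def qa_vector(question,answer):

	model = word2vec.Word2Vec.load('../data/word_vector/Word60.model')

	# Look up each word's vector, skipping words that are not in the model
	question_vector = []

	for q_sentence in question:

		q_word = q_sentence.split(' ')
		q_sentvec = [model.wv[w] for w in q_word if w in model.wv]
		question_vector.append(q_sentvec)

	answer_vector = []

	for a_sentence in answer:

		a_word = a_sentence.split(' ')
		a_sentvec = [model.wv[w] for w in a_word if w in model.wv]
		answer_vector.append(a_sentvec)

	word_dim = model.wv.vector_size

	# An all-ones vector marks the end of a sentence and also serves as padding
	sentend = np.ones((word_dim,),dtype = np.float32)

	# Make every sentence exactly 15 vectors long: keep at most 14 word vectors
	# plus one end marker, or pad shorter sentences with the end marker.
	# The answers are processed too, so that both sides have the same length.
	for sentvec in question_vector+answer_vector:

		if len(sentvec)>14:
			sentvec[14:] = []
			sentvec.append(sentend)

		else:

			for i in range(15-len(sentvec)):
				sentvec.append(sentend)

	return question_vector,answer_vector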
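
The length-unification step can be checked in isolation. The sketch below applies the same rule (keep at most 14 word vectors, then fill with an all-ones end marker up to 15) to toy 3-dimensional vectors. The helper name `pad_sentence` and the toy inputs are hypothetical.

```python
import numpy as np

def pad_sentence(sentvec, word_dim, max_len=15):
    """Return a new list of exactly max_len vectors, ending with all-ones markers."""
    sentend = np.ones((word_dim,), dtype=np.float32)
    kept = list(sentvec[:max_len - 1])  # keep at most max_len - 1 word vectors
    return kept + [sentend] * (max_len - len(kept))

short_sent = [np.zeros(3, dtype=np.float32)] * 2   # a 2-word sentence
long_sent = [np.zeros(3, dtype=np.float32)] * 20   # a 20-word sentence

print(len(pad_sentence(short_sent, 3)))  # 15
print(len(pad_sentence(long_sent, 3)))   # 15
```

With every sentence at the same length, a list of sentences can be turned into a single array with `np.array`, here of shape (number of sentences, 15, 3).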

That is the whole process. Reposted from: http://www.aiportal.net/%E8%81%8A%E5%A4%A9%E6%9C%BA%E5%99%A8%E4%BA%BA/%E8%81%8A%E5%A4%A9%E6%9C%BA%E5%99%A8%E4%BA%BA-keras-%E8%AF%8D%E5%90%91%E9%87%8F
