3.3 Implementing a Simple Question Answering System with LSTM
3.3.1 Introduction to Question Answering Systems
3.3.2 Implementing a Simple Question Answering System with Keras
The model architecture is shown in the diagram below:
Dataset: Facebook's bAbI dataset
Training set:
1 Mary moved to the bathroom.
2 Sandra journeyed to the bedroom.
3 Mary got the football there.
4 John went to the kitchen.
5 Mary went back to the kitchen.
6 Mary went back to the garden.
7 Where is the football? garden 3 6
8 Sandra went back to the office.
9 John moved to the office.
10 Sandra journeyed to the hallway.
11 Daniel went back to the kitchen.
12 Mary dropped the football.
13 John got the milk there.
14 Where is the football? garden 12 6
15 Mary took the football there.
16 Sandra picked up the apple there.
17 Mary travelled to the hallway.
18 John journeyed to the kitchen.
19 Where is the football? hallway 15 17
The training set takes the form dialogue + question + answer: each question line uses tab characters to separate the question, the answer, and the indices of the sentences that contain the answer.
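As a quick illustration (not part of the pipeline code), splitting one question line from the training set on tabs recovers the three fields:

```python
# A question line from the bAbI training set: question \t answer \t supporting indices
line = "Where is the football?\tgarden\t3 6"
question, answer, support = line.split('\t')
# the supporting field lists the sentence ids that contain the answer
support_ids = [int(i) for i in support.split()]
print(question)     # Where is the football?
print(answer)       # garden
print(support_ids)  # [3, 6]
```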
Next, we use two recurrent neural networks to implement a simple question answering system.
(1) Obtaining the data
The data is hosted on Amazon AWS (S3). If the download fails when running the code, download the dataset manually and place it in the Keras datasets directory; the code below shows the exact steps.
# Download the data
from keras.utils.data_utils import get_file
import tarfile
try:
    path = get_file('babi-tasks-v1-2.tar.gz',
                    origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
(2) Data preprocessing
Vectorize the text data (map each word to a vector).
- Tokenize the text. Because this dataset is in English, words can be split on whitespace and punctuation directly; for a Chinese dataset, a segmenter such as jieba would be needed.
# Split the text into individual word tokens
def tokenize(data):
    import re
    # '\W' matches any character that is not a letter, digit, or underscore
    return [x.strip() for x in re.split(r"(\W+)", data) if x.strip()]
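To sanity-check the tokenizer, here is a short standalone run (the sentence is taken from the training set above; the regex keeps punctuation as separate tokens and the filter drops whitespace-only pieces):

```python
import re

def tokenize(data):
    # '\W' matches any character that is not a letter, digit, or underscore;
    # the capture group keeps the delimiters, and the filter drops blank pieces
    return [x.strip() for x in re.split(r"(\W+)", data) if x.strip()]

print(tokenize("Mary moved to the bathroom."))
# ['Mary', 'moved', 'to', 'the', 'bathroom', '.']
```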
- Parse the dialogue text.
# parse_dialog parses all the dialogues and returns tokenized (story, question, answer) triples
# if only_supporting is True, only the sentences that support the answer are returned
def parse_dialog(lines, only_supporting=False):
    data = []
    dialog = []
    for line in lines:
        line = line.strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        # an id of 1 marks the start of a new story, so reset the dialogue
        if nid == 1:
            dialog = []
        # a line containing a tab is a question; split it into question, answer, and supporting indices
        if '\t' in line:
            ques, ans, data_idx = line.split('\t')
ques