A Simple Question-Answering System with LSTM: Keras on bAbI

Latest recommended article published 2024-08-08 11:14:54

Amy_mm

Category: LSTM. Tags: babi, LSTM, reading comprehension

Article link: https://blog.csdn.net/amy_mm/article/details/81084729

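This article shows how to use Keras to build a simple LSTM-based question-answering system, specifically for Facebook's bAbI dataset. First, data preprocessing covers text vectorization and tokenization. Next, the neural network model is built: the dialog set and the question set each pass through embedding and dropout, LSTM and merge layers then fuse the information, and a softmax layer finally predicts the answer. Model training and prediction examples further illustrate the whole pipeline.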

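3.3 A Simple Question-Answering System with LSTM

3.3.1 Introduction to Question-Answering Systems

3.3.2 Implementing a Simple Question-Answering System with Keras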

The model architecture is shown below:
[Figure: model architecture diagram]

Dataset: Facebook's bAbI data
Training set:

1 Mary moved to the bathroom.
2 Sandra journeyed to the bedroom.
3 Mary got the football there.
4 John went to the kitchen.
5 Mary went back to the kitchen.
6 Mary went back to the garden.
7 Where is the football? 	garden	3 6
8 Sandra went back to the office.
9 John moved to the office.
10 Sandra journeyed to the hallway.
11 Daniel went back to the kitchen.
12 Mary dropped the football.
13 John got the milk there.
14 Where is the football? 	garden	12 6
15 Mary took the football there.
16 Sandra picked up the apple there.
17 Mary travelled to the hallway.
18 John journeyed to the kitchen.
19 Where is the football? 	hallway	15 17
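The training set takes the form dialog + question + answer. In each question line, tab characters separate the question, the answer, and the indices of the sentences that contain the answer.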
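
To make the line format concrete, here is a minimal sketch that splits one question line into its parts. The helper name `split_question_line` is made up for illustration and is not part of the original code.

```python
def split_question_line(line):
    """Split one bAbI question line into (line_id, question, answer, supporting_ids)."""
    # The line id is separated from the rest of the line by the first space.
    nid, rest = line.strip().split(' ', 1)
    # Tabs separate the question, the answer, and the supporting-sentence ids.
    question, answer, supporting = rest.split('\t')
    supporting_ids = [int(i) for i in supporting.split()]
    return int(nid), question.strip(), answer, supporting_ids


sample = "7 Where is the football? \tgarden\t3 6"
print(split_question_line(sample))
# (7, 'Where is the football?', 'garden', [3, 6])
```

Lines without a tab are ordinary dialog sentences and carry only the line id and the text.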

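Next, we implement a simple question-answering system using two recurrent neural networks.
(1) Getting the data
The data is hosted on Amazon AWS. If the download fails when you run the code, download the dataset manually first and place it in the Keras datasets directory. The error message in the code below gives the exact commands.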

# Download the dataset (get_file caches it under ~/.keras/datasets)
from keras.utils import get_file
import tarfile

try:
    path = get_file('babi-tasks-v1-2.tar.gz',
                    origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    # Tell the user how to fetch the file manually, then re-raise the error
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise
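
Once the archive is on disk, the standard-library `tarfile` module can read files straight out of it without unpacking. The sketch below builds a tiny in-memory archive so that it runs without any download. The member name `demo/qa_train.txt` and the helper `read_member_lines` are invented for illustration and are not real paths or functions from the bAbI archive or Keras.

```python
import io
import tarfile


def read_member_lines(tar, member_name):
    """Return the decoded lines of one text file stored inside an open tar archive."""
    with tar.extractfile(member_name) as f:
        return [raw.decode('utf-8') for raw in f.readlines()]


# Build a small gzip-compressed archive in memory (a stand-in for the real download).
content = b"1 Mary moved to the bathroom.\n2 Where is Mary? \tbathroom\t1\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    info = tarfile.TarInfo(name='demo/qa_train.txt')  # hypothetical member name
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

# Read it back; for a file on disk one would call tarfile.open(path) instead.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode='r:gz') as tar:
    lines = read_member_lines(tar, 'demo/qa_train.txt')

print(len(lines))
```

The lines come back still carrying their trailing newline and leading line id, which is the form a parser for this dataset would need to handle.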

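(2) Data preprocessing
Vectorize the text data, that is, convert words to vectors (word2vector).

Tokenize the text. Because this dataset is in English, words can be split directly on whitespace; for a Chinese dataset you would need jieba or another word segmenter.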

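import re

# Split a sentence into individual tokens
def tokenize(data):
    # '\W' matches any character that is not a letter, digit, or underscore;
    # the capturing group keeps punctuation as separate tokens
    return [x.strip() for x in re.split(r"(\W+)", data) if x.strip()]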
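
As a quick check of the tokenizer, and to show what vectorizing means in practice, the sketch below tokenizes one sentence and maps each token to an integer index. It splits on runs of non-word characters with `re.split(r'(\W+)', ...)`. The vocabulary construction and the helper `vectorize` are assumptions made for illustration, with index 0 reserved for padding; they are not necessarily how the original article builds its inputs.

```python
import re


def tokenize(data):
    """Split a sentence into word and punctuation tokens."""
    # The capturing group keeps the separators, so punctuation survives as tokens.
    return [x.strip() for x in re.split(r'(\W+)', data) if x.strip()]


def vectorize(tokens, word_idx, maxlen):
    """Map tokens to integer indices and left-pad with 0 up to maxlen."""
    ids = [word_idx[t] for t in tokens]
    return [0] * (maxlen - len(ids)) + ids


tokens = tokenize('Bob dropped the apple. Where is the apple?')
# Index 0 is reserved for padding, so real tokens start at 1.
word_idx = {word: i + 1 for i, word in enumerate(sorted(set(tokens)))}

print(tokens)
# ['Bob', 'dropped', 'the', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
print(vectorize(tokens, word_idx, maxlen=12))
# [0, 0, 3, 6, 8, 5, 1, 4, 7, 8, 5, 2]
```

Padding every sequence to one fixed length is what lets the dialogs and questions be stacked into rectangular arrays for the network.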

Parsing the dialog text

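# parse_dialog parses all the dialogs and returns tokenized (dialog, question, answer) tuples.
# If only_supporting is True, only the sentences that contain the answer are returned.
def parse_dialog(lines, only_supporting=False):
    data = []
    dialog = []
    for line in lines:
        line = line.strip()
        nid, line = line.split(' ', 1)
        nid = int(nid)
        # An id of 1 marks the start of a new dialog, so start recording afresh
        if nid == 1:
            dialog = []
        # A line containing a tab is a question; split it into the question,
        # the answer, and the indices of the sentences that contain the answer
        if '\t' in line:
            ques, ans, data_idx = line.split('\t')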
            ques