Innovation Practicum (7): Blog Summary Extraction Algorithms, Continued (Automatic Document Summarization with seq2seq and Attention)

Latest recommended article published on 2022-03-28 15:44:55

日暮途远.

Article link: https://blog.csdn.net/baidu_41871794/article/details/106675618

Understanding the algorithm

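Sequence-to-Sequence (seq2seq) is also known as the encoder-decoder architecture. Both the Encoder and the Decoder are built from several RNN/LSTM layers. The Encoder encodes the source text into a vector C; the Decoder extracts information from C, recovers its semantics, and generates the text summary. Both encoding and decoding are implemented as neural networks.
[Figure: the Encoder-Decoder architecture]

The Encoder encodes the input sentence into a fixed-length Context Vector, and this encoding is in effect a lossy compression of the information. The Context Vector is then passed to the Decoder, which generates the output (a translation, in the machine-translation setting). When the Decoder generates each word, it refers to the same Context Vector from the Encoder.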
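
To make the "fixed-length Context Vector" idea concrete, here is a minimal sketch of a toy recurrent encoder in plain Python. It is not the network from the repository used later: the function name `encode`, the hidden size, and the hand-picked weights are all assumptions chosen so the example is deterministic. It only shows that inputs of different lengths are folded into a vector of the same size.

```python
import math


def encode(sequence, hidden_size=3):
    """Toy RNN encoder: fold a list of input vectors into one fixed-size vector."""
    input_size = len(sequence[0])
    # Fixed, hand-picked weights (an assumption); a real model learns these.
    w_in = [[0.1 * (i + j + 1) for j in range(input_size)] for i in range(hidden_size)]
    w_rec = [[0.05 * (i - j) for j in range(hidden_size)] for i in range(hidden_size)]
    hidden = [0.0] * hidden_size
    for x in sequence:
        # New hidden state depends on the current input and the previous hidden state.
        hidden = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(input_size))
                + sum(w_rec[i][k] * hidden[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return hidden  # plays the role of the context vector C


short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 1.0]]
print(len(encode(short_seq)), len(encode(long_seq)))  # prints: 3 3
```

However long the input is, everything the Decoder sees must fit into those three numbers, which is why the compression is lossy.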

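This approach is rather inflexible. Concretely, when we translate the phrase “机器学习” (machine learning), we do not care much about the characters “我” (I) and “爱” (love) that precede it; and when we translate “我”, we do not care about “机器学习”. A better approach is therefore to introduce an Attention mechanism that gives more weight to the source words relevant to the word currently being translated, so that each output word focuses on a different part of the source sentence, as shown in the figure below:
[Figure: attention connections between the source words and the translated words]
In the figure above, different colors represent different Context Vectors: each translated word gets its own Context Vector. Take the Context Vector for “machine” as an example: the line connecting it to “机器” is thicker, meaning that more Attention is given to “机器” during translation, while translating “learning” gives more Attention to “学习”.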
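
The per-word Context Vector can be sketched as a weighted sum of encoder states. The example below uses dot-product scoring followed by a softmax, which is an assumption made for brevity (the post does not say which scoring function the model uses). The encoder states and the decoder state are made-up numbers, and `attend` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention: return (weights, context vector)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # The context vector is the attention-weighted sum of the encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(size)]
    return weights, context


# Made-up encoder states for the source tokens 我 / 爱 / 机器 / 学习.
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.5, 0.5]]
decoder_state = [0.0, 0.0, 1.0]  # made-up state of a decoder about to emit "machine"
weights, context = attend(decoder_state, encoder_states)
print(weights.index(max(weights)))  # prints: 2 (the position of 机器)
```

The largest weight corresponds to the thickest line in the figure: the source word that contributes most to the current Context Vector.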

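Emmm, after reading descriptions of the algorithm for quite a while I got a rough idea, but honestly I still did not understand how the neural network itself is built.
So I simply found an existing implementation on GitHub to run and see the final result.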

Link: https://github.com/ztz818/Automatic-generation-of-text-summaries

Data cleaning

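The repository only describes the algorithm and does not provide a dataset, so I used the Sogou news dataset.
Link: https://www.sogou.com/labs/resource/cs.php
[Screenshot: the Sogou news dataset page]
The dataset provides news titles and news content, which makes it suitable as a corpus for training summary generation.

Here I downloaded the reduced version, which contains one month of data.
[Screenshot: the downloaded data]
We process the data according to what this neural network expects of its dataset.
(1) First, use a regular expression to extract each article's title and content from the dataset and store them in separate files. File names and line numbers must correspond one to one, so that every title stays matched with its content.
After extraction, strip punctuation and special characters from the text, replace English words, dates, and numbers with placeholder tokens, and then segment the text into words to obtain the result.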
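
Before the full script, the extraction step can be tried on its own. The sketch below runs a title/content regular expression over a small inline sample; the sample text and the name `extract_pairs` are hypothetical, and the tag layout is assumed to be a `<contenttitle>` line immediately followed by a `<content>` line.

```python
import re

# Hypothetical sample in the assumed tag layout; the second record has empty content.
SAMPLE = (
    "<doc>\n<url>http://example.com/1</url>\n"
    "<contenttitle>标题一</contenttitle>\n<content>正文一</content>\n</doc>\n"
    "<doc>\n<url>http://example.com/2</url>\n"
    "<contenttitle>标题二</contenttitle>\n<content></content>\n</doc>\n"
    "<doc>\n<url>http://example.com/3</url>\n"
    "<contenttitle>标题三</contenttitle>\n<content>正文三</content>\n</doc>\n"
)

# '.' does not match a newline, so each group stays within its own line.
PAIR = re.compile(r"<contenttitle>(.*)</contenttitle>\n<content>(.*)</content>")


def extract_pairs(text):
    """Return (title, content) tuples, skipping records where either part is empty."""
    pairs = []
    for match in PAIR.finditer(text):
        title, content = match.group(1).strip(), match.group(2).strip()
        if title and content:
            pairs.append((title, content))
    return pairs


print(extract_pairs(SAMPLE))  # prints: [('标题一', '正文一'), ('标题三', '正文三')]
```

Skipping incomplete records at this stage is what keeps the title file and the content file aligned line by line.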

def strQ2B(ustring):
    """Convert full-width characters to half-width."""
    rstring = ""
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 12288:  # full-width space (U+3000) maps directly to an ASCII space
            inside_code = 32
        elif 65281 <= inside_code <= 65374:  # other full-width characters (U+FF01..U+FF5E) sit at a fixed offset from ASCII
            inside_code -= 65248

        rstring += chr(inside_code)
    return rstring


import re
import jieba


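def clean(text, vocab):
    text = strQ2B(text)
    text = re.sub(r'\ue40c', '', text)  # strip a private-use character that appears in the raw data
    # Remove URLs first: the next step deletes '.', ':' and '/', after which these patterns could never match.
    text = re.sub(r'(www\.(.*?)\.com)|(http://(.*?)\.com)', '', text.lower())
    # Remove special characters; the hyphen is escaped so it is matched literally, and \u2014 is the long dash.
    text = re.sub(r'[:「」￥…，,(【嘻嘻】)【哈哈】;"”“+/\u2014!. _\-%\[\]*◎《》、。]', '', text)
    text = re.sub(r'[a-zA-Z]+', ' EN ', text)  # replace English words with a placeholder
    text = re.sub(r'([\d]*年*[\d]*月*[\d]+日+)|([\d]+年+)|([\d]*年*[\d]+月+)', ' DATE ', text)  # replace dates
    text = re.sub(r'[\d]+', ' NUMBER ', text)  # replace the remaining numbers
    # Segment with jieba, dropping whitespace-only tokens so they do not enter the vocabulary.
    seg_list = [word for word in jieba.cut(text, cut_all=False) if word.strip()]
    for word in seg_list:
        vocab[word] = vocab.get(word, 0) + 1

    return ' '.join(seg_list)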


raw_path = "data/data_test"
form_path = 'data/train_test'
vocab_path = 'data/vocab/vocab.txt'

import os


def raw_form(path, title_path, content_path, vocab):
    contents = []
    titles = []
    with open(path, 'r', encoding='gb18030') as f:
        text = f.read()
    # Each record has a <contenttitle> line immediately followed by a <content> line.
    p = re.compile(r'<contenttitle>(.*)</contenttitle>\n<content>(.*)</content>')
    for temp in p.finditer(text):
        title = temp.group(1)
        content = temp.group(2)
        if title.strip() != '' and content.strip() != '':
            # Truncation counts characters of the space-joined string, not tokens,
            # so the last token of a long line may be cut short.
            s1 = clean(title, vocab)[:60]
            s2 = clean(content, vocab)[:240]
            titles.append(s1)
            contents.append(s2)
    # Write titles and contents in the same order so that line N of one file matches line N of the other.
    with open(title_path, 'w', encoding='utf-8') as wf:
        for temp in titles:
            wf.write(temp + '\n')
    with open(content_path, 'w', encoding='utf-8') as wf:
        for temp in contents:
            wf.write(temp + '\n')
    return contents, titles


def all_raw_fom(raw_path, form_path, vocab_path, max_vocabulary_size):
    vocab = {}
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
    # Create the output directories if they are missing, so the writes below do not fail.
    os.makedirs(form_path, exist_ok=True)
    vocab_dir = os.path.dirname(vocab_path)
    if vocab_dir:
        os.makedirs(vocab_dir, exist_ok=True)
    file_list = os.listdir(raw_path)
    print(file_list)
    for item in file_list:
        raw_form(os.path.join(raw_path, item), os.path.join(form_path, 'title_' + item),
                 os.path.join(form_path, 'content_' + item), vocab)

    # Special tokens come first, followed by words from most to least frequent.
    vocab_list = special_words + sorted(vocab, key=vocab.get, reverse=True)
    if len(vocab_list) > max_vocabulary_size:
        vocab_list = vocab_list[:max_vocabulary_size]
    with open(vocab_path, 'w', encoding='utf-8') as f:
        for w in vocab_list:
            f.write(w + '\n')


all_raw_fom(raw_path, form_path, vocab_path, 50000)
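
The vocabulary step at the end of the script can also be looked at in isolation. This is a minimal sketch that uses `collections.Counter` instead of a plain dict; `build_vocab` and the sample sentences are hypothetical, while the four special tokens are the ones used in the script.

```python
from collections import Counter

SPECIAL_WORDS = ["<PAD>", "<UNK>", "<GO>", "<EOS>"]


def build_vocab(segmented_lines, max_size):
    """Special tokens first, then words by descending frequency, capped at max_size."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())  # lines are already segmented, tokens separated by spaces
    ranked = [word for word, _ in counts.most_common()]
    return (SPECIAL_WORDS + ranked)[:max_size]


lines = ["我 爱 机器 学习", "机器 学习 很 有趣", "我 爱 学习"]
vocab = build_vocab(lines, max_size=7)
print(vocab[4])  # prints: 学习 (the most frequent word)
```

Because the special tokens sit at the front, the size cap only ever drops the rarest words; those words are later mapped to `<UNK>`.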

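The processed result is:
[Screenshot: the processed title and content files]

The title and content files have matching names, so they are easy to pair with each other.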
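
Since training relies on line N of a title file matching line N of its content file, a quick sanity check is cheap insurance. The sketch below is an assumption about how one might verify this; `read_pairs` and the demo file names are hypothetical, and the demo writes its own temporary files rather than touching real data.

```python
import os
import tempfile


def read_pairs(title_path, content_path):
    """Read two line-aligned files and return (title, content) pairs."""
    with open(title_path, encoding="utf-8") as tf:
        titles = [line.rstrip("\n") for line in tf]
    with open(content_path, encoding="utf-8") as cf:
        contents = [line.rstrip("\n") for line in cf]
    if len(titles) != len(contents):
        raise ValueError(
            f"line count mismatch: {len(titles)} titles vs {len(contents)} contents"
        )
    return list(zip(titles, contents))


# Demo on temporary files that follow the title_/content_ naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    title_path = os.path.join(tmp, "title_demo.txt")
    content_path = os.path.join(tmp, "content_demo.txt")
    with open(title_path, "w", encoding="utf-8") as f:
        f.write("标题 一\n标题 二\n")
    with open(content_path, "w", encoding="utf-8") as f:
        f.write("正文 一\n正文 二\n")
    pairs = read_pairs(title_path, content_path)

print(len(pairs))  # prints: 2
```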

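(2) Vectorize the segmented text
I found a BERT dataset on GitHub that contains a Chinese text corpus.
[Screenshot: the Chinese corpus file]

Using the index of each entry in the corpus, we convert the news content into sequences of numbers, which makes it convenient to train the neural network.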

# tokenize: map every word to its index in the vocabulary
def fom_tokenize(form_path, tokenize_path, vocab_path):
    vocab_list = []
    with open(vocab_path, 'r', encoding='utf-8') as f:
        for item in f.readlines():
            vocab_list.append(item.strip())
    int_to_vocab = {idx: word for idx, word in enumerate(vocab_list)}
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    unk = vocab_to_int['<UNK>']
    eos = vocab_to_int['<EOS>']
    os.makedirs(tokenize_path, exist_ok=True)  # make sure the output directory exists
    file_list = os.listdir(form_path)
    print(file_list)
    for item in file_list:
        is_title = 'title' in item
        with open(os.path.join(form_path, item), 'r', encoding='utf-8') as f:
            tokenize = []
            for sentence in f:
                # split() ignores repeated spaces, so no empty "words" are looked up
                ids = [vocab_to_int.get(word, unk) for word in sentence.split()]
                if is_title:
                    ids.append(eos)  # only titles get an end-of-sequence marker
                tokenize.append(ids)
        with open(os.path.join(tokenize_path, item), 'w', encoding='utf-8') as wf:
            for line in tokenize:
                wf.write(' '.join(str(w) for w in line) + '\n')


form_path = 'data/train_test'
tokenize_path = 'data/tokenize'
vocab_path = 'data/vocab/vocab.txt'
fom_tokenize(form_path, tokenize_path, vocab_path)

[Screenshot: the tokenized output files]
This maps each word to its token id, turning the text into sequences of numbers.
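
The word-to-id mapping is easiest to see on a tiny vocabulary. The sketch below is a simplified, in-memory version of the idea; `encode_sentence`, `decode_ids`, and the toy vocabulary are hypothetical, while the special tokens match the ones used in the scripts.

```python
SPECIAL_WORDS = ["<PAD>", "<UNK>", "<GO>", "<EOS>"]
vocab_list = SPECIAL_WORDS + ["我", "爱", "机器", "学习"]  # toy vocabulary

vocab_to_int = {word: idx for idx, word in enumerate(vocab_list)}
int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}


def encode_sentence(sentence, add_eos=False):
    """Map space-separated words to ids; unknown words fall back to <UNK>."""
    unk = vocab_to_int["<UNK>"]
    ids = [vocab_to_int.get(word, unk) for word in sentence.split()]
    if add_eos:
        ids.append(vocab_to_int["<EOS>"])
    return ids


def decode_ids(ids):
    """Map ids back to words, which is handy for inspecting model output."""
    return " ".join(int_to_vocab[i] for i in ids)


title_ids = encode_sentence("我 爱 深度 学习", add_eos=True)
print(title_ids)              # prints: [4, 5, 1, 7, 3]
print(decode_ids(title_ids))  # prints: 我 爱 <UNK> 学习 <EOS>
```

The word “深度” is not in the toy vocabulary, so it becomes id 1 (`<UNK>`), and the round trip shows exactly what information is lost when the vocabulary is capped.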

Running the model

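My computer has no NVIDIA GPU, and deep learning is very slow without GPU acceleration, so I used Google's Colab to run the code.
[Screenshot: Colab]
I uploaded the dataset to Google Drive and then created a new Jupyter notebook to run the code.
I ran into quite a few pitfalls along the way:
Colab's default environment is TensorFlow 2.0, whereas this implementation uses TensorFlow 1.0.
It took a lot of tutorials before I found out how to switch versions, and in the end the neural network ran successfully.
All that remains is to wait quietly for the results.
[Screenshot: the training run in progress]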