Automatic Text Summarization: Training Steps for Pointer-Generator Networks

Steven灬

Article link: https://blog.csdn.net/weixin_40547993/article/details/103354603

This walkthrough uses the code released with the ACL 2017 paper *[Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368)*.

GitHub link: https://github.com/abisee/pointer-generator

Step 1: Prepare the training data

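***************************************** New README ***********************************************
Abstractive summarization: you need text-summary pairs as the training corpus.
1) Word segmentation: segment each text-summary pair with jieba. Do not remove stop words, so that the context of each sentence is preserved (a custom dictionary can be added).
See merge_folder/cut_word.py.
2) Convert the segmented text into text-summary pairs. The layout is:
Cell A: text | Cell B: summary | Cell C: text-summary pair
The final file format is:
text
summary
text
summary
……
3) Split the preprocessed corpus 7:2:1 into training, validation, and test sets, each saved as a txt file.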
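
The pairing and the 7:2:1 split described above can be sketched as follows. This is a minimal sketch and not the author's script: the function names are hypothetical, and shuffling before the split is an assumption that the original text does not state. Each text and summary is assumed to be a single line that has already been segmented into space-separated words.

```python
import random


def split_pairs(pairs, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (text, summary) pairs into train/val/test by the given ratios.

    Shuffling with a fixed seed is an assumption made for this sketch.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * ratios[0])
    n_val = int(len(pairs) * ratios[1])
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


def to_lines(pairs):
    """Flatten pairs into the alternating layout: text, summary, text, summary, ..."""
    lines = []
    for text, summary in pairs:
        lines.append(text)
        lines.append(summary)
    return lines


def write_split(pairs, path):
    """Write one split as a UTF-8 txt file, one text or summary per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(to_lines(pairs)) + "\n")


# Usage with ten placeholder pairs
example_pairs = [("text %d" % i, "summary %d" % i) for i in range(10)]
train, val, test = split_pairs(example_pairs)
```

Because the text sits on even lines and the summary on odd lines (counting from 0), a text or summary that contains a line break would shift every later pair, so it is worth stripping newlines before writing.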

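Step 2: Convert the .txt files into the .bin files required for training

1) Place the training, validation, and test txt files from Step 1 at produce_data/train/train.txt, produce_data/val/val.txt, and produce_data/test/test.txt (these are the paths that make_datafiles.py reads).
2) Run produce_data/make_datafiles.py to convert the .txt files into .bin files.
Running it creates four files in the produce_data/finished_files folder: train.bin, val.bin, test.bin, and vocab.bin (despite its extension, vocab.bin is a plain-text list with one "word count" entry per line).
The dataset needed for training is then ready.

The code is as follows: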

# -*-coding:utf-8-*-
import os
import struct
import collections
from tensorflow.core.example import example_pb2

# We use these to separate the summary sentences in the .bin datafiles
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

train_file = "./train/train.txt"
val_file = "./val/val.txt"
test_file = "./test/test.txt"
finished_files_dir = "./finished_files"

VOCAB_SIZE = 200000


def read_text_file(text_file):
    lines = []
    with open(text_file, "r",encoding='utf-8') as f:
        for line in f:
            lines.append(line.strip())
    return lines


def write_to_bin(input_file, out_file, makevocab=False):
    if makevocab:
        vocab_counter = collections.Counter()

    with open(out_file, 'wb') as writer:
        # Read the input text file; even-numbered lines are articles and odd-numbered lines are abstracts (line numbers start at 0)
        lines = read_text_file(input_file)
        for i, new_line in enumerate(lines):
            if i % 2 == 0:
                article = lines[i]
            if i % 2 != 0:
                abstract = "%s %s %s" % (SENTENCE_START, lines[i], SENTENCE_END)

                # Write to tf.Example
                tf_example = example_pb2.Example()
                tf_example.features.feature['article'].bytes_list.value.extend([bytes(article, encoding='utf-8')])
                # tf_example.features.feature['article'].bytes_list.value.extend([article])
                # tf_example.features.feature['abstract'].bytes_list.value.extend([abstract])
                tf_example.features.feature['abstract'].bytes_list.value.extend([bytes(abstract, encoding='utf-8')])
                tf_example_str = tf_example.SerializeToString()
                str_len = len(tf_example_str)
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, tf_example_str))

                # Write the vocab to file, if applicable
                if makevocab:
                    art_tokens = article.split(' ')
                    abs_tokens = abstract.split(' ')
                    abs_tokens = [t for t in abs_tokens if
                                  t not in [SENTENCE_START, SENTENCE_END]]  # remove these tags from vocab
                    tokens = art_tokens + abs_tokens
                    tokens = [t.strip() for t in tokens]  # strip
                    tokens = [t for t in tokens if t != ""]  # remove empty
                    vocab_counter.update(tokens)

    print("Finished writing file %s\n" % out_file)


    # write vocab to file
    if makevocab:
        print("Writing vocab file...")
        with open(os.path.join(finished_files_dir, "vocab.bin"), 'w',encoding='utf-8') as writer:
            for word, count in vocab_counter.most_common(VOCAB_SIZE):
                writer.write(word + ' ' + str(count) + '\n')
        print("Finished writing vocab file")



if __name__ == '__main__':

    os.makedirs(finished_files_dir, exist_ok=True)

    # Read the text file, do a little postprocessing then write to bin files
    write_to_bin(test_file, os.path.join(finished_files_dir, "test.bin"))
    write_to_bin(val_file, os.path.join(finished_files_dir, "val.bin"))
    write_to_bin(train_file, os.path.join(finished_files_dir, "train.bin"), makevocab=True)
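
To make the .bin layout concrete: write_to_bin stores each example as an 8-byte length (struct format 'q') followed by that many bytes of serialized data. The sketch below writes and reads records in the same layout using only the standard library. The payloads here are plain byte strings standing in for the serialized tf.Example, and the function names are hypothetical.

```python
import io
import struct


def write_record(stream, payload):
    """Write one length-prefixed record, mirroring the layout used by write_to_bin."""
    stream.write(struct.pack('q', len(payload)))
    stream.write(struct.pack('%ds' % len(payload), payload))


def read_records(stream):
    """Yield the payload of every length-prefixed record in the stream."""
    header_size = struct.calcsize('q')
    while True:
        header = stream.read(header_size)
        if not header:
            break  # clean end of file
        if len(header) < header_size:
            raise ValueError("truncated length header")
        (length,) = struct.unpack('q', header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record body")
        yield payload


# Usage: round-trip two placeholder payloads through an in-memory buffer
buf = io.BytesIO()
for item in [b"first example", "second example".encode("utf-8")]:
    write_record(buf, item)
buf.seek(0)
records = list(read_records(buf))
```

Reading a generated .bin file back this way, and then parsing each payload as a tf.Example, is a quick check that the conversion worked before starting a long training run.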

Step 3: Train the model

In a Windows cmd (command prompt) window, change to the folder containing the program and enter the following commands (start mode=train first, then mode=eval; the two run at the same time):
python run_summarization.py --mode=train --data_path=E:\Pycharm\Githubbook\point_generator1\produce_data\finished_files\train.bin --vocab_path=E:\Pycharm\Githubbook\point_generator1\produce_data\finished_files\vocab.bin --log_root=E:\Pycharm\Githubbook\point_generator1\pointer_generator_master\log --exp_name=myexperiment
python run_summarization.py --mode=eval --data_path=E:\Pycharm\Githubbook\point_generator1\produce_data\finished_files\val.bin --vocab_path=E:\Pycharm\Githubbook\point_generator1\produce_data\finished_files\vocab.bin --log_root=E:\Pycharm\Githubbook\point_generator1\pointer_generator_master\log --exp_name=myexperiment
Training runs indefinitely; the program does not stop on its own. Stop it manually once the training loss is good enough. The trained model is saved in the E:\Pycharm\Githubbook\point_generator1\pointer_generator_master\log\myexperiment\train folder.

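Parameter settings:
data_path: path to the training or test data
vocab_path: path to the vocabulary file
For training: tf.app.flags.DEFINE_string('mode', 'train', 'must be one of train/eval/decode')
For decoding (generating output): tf.app.flags.DEFINE_string('mode', 'decode', 'must be one of train/eval/decode')
single_pass: decode mode only. If True, run eval on the full dataset using a fixed checkpoint: take the current checkpoint, use it to produce one summary for each example in the dataset, write the summaries to file, and then compute ROUGE scores for the whole dataset.
If False (the default), run concurrent decoding: repeatedly load the latest checkpoint, use it to produce summaries for randomly chosen examples, and log the results to the screen indefinitely.
log_root: root directory for logs
exp_name: name of the experiment subdirectory under log_root, where the trained model, decode results, and test output are stored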
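
As an illustration of how these flags fit together, here is a small stand-in written with argparse. It is not the repository's code, which defines its flags with tf.app.flags; the helper names are hypothetical. It only shows the allowed mode values, how log_root and exp_name combine into the experiment directory, and the rule that single_pass applies to decode mode only.

```python
import argparse
import os


def build_parser():
    """argparse stand-in for the flags described above (illustration only)."""
    parser = argparse.ArgumentParser(description="flag sketch")
    parser.add_argument("--mode", choices=["train", "eval", "decode"], default="train")
    parser.add_argument("--data_path", default="")
    parser.add_argument("--vocab_path", default="")
    parser.add_argument("--log_root", default="")
    parser.add_argument("--exp_name", default="")
    parser.add_argument("--single_pass", type=int, choices=[0, 1], default=0)
    return parser


def experiment_dir(args):
    """Checkpoints and outputs live under log_root/exp_name."""
    return os.path.join(args.log_root, args.exp_name)


def validate(args):
    """single_pass is documented as applying to decode mode only."""
    if args.single_pass and args.mode != "decode":
        raise ValueError("single_pass only applies in decode mode")


# Usage
args = build_parser().parse_args(
    ["--mode=decode", "--log_root=log", "--exp_name=myexperiment", "--single_pass=1"]
)
validate(args)
exp_dir = experiment_dir(args)
```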

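Hyperparameters:
See lines 52-64 of run_summarization.py.
hidden_dim: hidden layer size, 256
emb_dim: dimension of the word embeddings, 128
batch_size: minibatch size, 16
max_enc_steps: maximum number of encoder timesteps, 100 (max source text tokens)
max_dec_steps: maximum number of decoder timesteps, 50 (max summary tokens)
beam_size: beam size for beam search decoding, 4
min_dec_steps: minimum length of a generated summary, 15. Applies only in beam search decoding mode
vocab_size: vocabulary size, 50000. Words are read from the vocabulary file in order. If the file contains fewer words than this number, or if the number is set to 0, all words in the vocabulary file are used
lr: learning rate, 0.15
adagrad_init_acc: initial accumulator value for Adagrad, 0.1
rand_unif_init_mag: magnitude of the random uniform initialization for the LSTM cells, 0.02
trunc_norm_init_std: standard deviation of the truncated normal initialization, used for everything else
max_grad_norm: maximum gradient norm, used for gradient clipping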
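
One way to keep these values together, and to see what max_enc_steps and max_dec_steps do to an example, is sketched below. The dataclass and the truncate_example helper are hypothetical names for this sketch. The repository's own batching code also handles start and stop tokens, which this sketch leaves out. The values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HParams:
    """Hyperparameter values as listed in this article."""
    hidden_dim: int = 256
    emb_dim: int = 128
    batch_size: int = 16
    max_enc_steps: int = 100
    max_dec_steps: int = 50
    beam_size: int = 4
    min_dec_steps: int = 15
    vocab_size: int = 50000
    lr: float = 0.15
    adagrad_init_acc: float = 0.1
    rand_unif_init_mag: float = 0.02


def truncate_example(article_tokens, abstract_tokens, hps):
    """Cut the source and summary token lists to the encoder and decoder limits."""
    return article_tokens[:hps.max_enc_steps], abstract_tokens[:hps.max_dec_steps]


# Usage: a 120-token article and a 60-token summary are both cut down
hps = HParams()
article, abstract = truncate_example(["w"] * 120, ["s"] * 60, hps)
```

With max_enc_steps set to 100, anything past the first 100 segmented words of a text is never seen by the encoder, so it is worth checking the length distribution of your corpus before choosing these limits.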

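Pointer-generator or baseline model. pointer_gen: True or False. If True, use the pointer-generator model. If False, use the baseline model.
Coverage hyperparameters. coverage: True or False. Whether to use the coverage mechanism. Note that the experiments reported in the ACL paper train without coverage until convergence, and then train for a short phase with coverage afterwards. That is, to reproduce the results in the ACL paper, turn this off for most of training and turn it on for a short phase at the end.
cov_loss_wt: weight of the coverage loss (a number, not True or False).
restore_best_model: True or False. Restore the best model from the eval/ directory and save it in the train/ directory, ready for further training. Useful for early stopping, or if your training checkpoint has become corrupted with e.g. NaN values.
debug: True or False. Run in TensorFlow's debug mode (watches for NaN/inf values).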
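
To show what cov_loss_wt weights, here is a pure-Python sketch of the coverage loss as defined in the paper: the coverage vector at a decoder step is the sum of the attention distributions from all previous steps, and the coverage loss at that step is the sum over source positions of min(attention, coverage). This is an illustration on plain lists rather than the repository's TensorFlow code, and the function names are hypothetical.

```python
def coverage_losses(attn_dists):
    """Return the coverage loss for each decoder step.

    attn_dists: one attention distribution (list of floats) per decoder step,
    all over the same source positions.
    """
    if not attn_dists:
        return []
    coverage = [0.0] * len(attn_dists[0])  # nothing has been attended to yet
    losses = []
    for attn in attn_dists:
        # Penalize attention that lands on positions that are already covered
        losses.append(sum(min(a, c) for a, c in zip(attn, coverage)))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return losses


def step_loss(nll, cov_loss, cov_loss_wt):
    """Per-step loss: negative log-likelihood plus the weighted coverage loss."""
    return nll + cov_loss_wt * cov_loss


# Usage: two decoder steps over two source positions
losses = coverage_losses([[0.5, 0.5], [0.9, 0.1]])
```

The first step always has zero coverage loss because the coverage vector starts at zero. In the second step the model attends again to positions it has already covered, which is what the penalty discourages.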

Step 4: Merge the test result txt files

The test result txt files are under the E:\Pycharm\Githubbook\point_generator1\pointer_generator_master\log\myexperiment\…… folder.
Run merge_folder/merge_folders.py to merge all the txt files in one folder into a single txt file.
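
The contents of merge_folders.py are not shown in this article, so the following is only a sketch of what such a merge might look like, with a hypothetical function name and hypothetical file names. It concatenates every .txt file in a folder, in sorted name order, into one output file.

```python
import os
import tempfile


def merge_txt_files(folder, out_path):
    """Concatenate all .txt files in folder into out_path; return how many were merged."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
    out_abs = os.path.abspath(out_path)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in names:
            path = os.path.join(folder, name)
            if os.path.abspath(path) == out_abs:
                continue  # never merge the output file into itself
            with open(path, "r", encoding="utf-8") as f:
                text = f.read()
            # Make sure each file's content ends with a newline
            out.write(text if text.endswith("\n") else text + "\n")
            count += 1
    return count


# Usage with two placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    samples = [("000001_decoded.txt", "summary one"), ("000000_decoded.txt", "summary zero\n")]
    for name, text in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    merged_path = os.path.join(tmp, "merged.txt")
    merged_count = merge_txt_files(tmp, merged_path)
    with open(merged_path, encoding="utf-8") as f:
        merged_text = f.read()
```

Sorting by file name keeps the merged summaries in a stable order, which matters if you later compare them line by line against the reference summaries.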

For further explanation, see the translated version of the original README below:

***************************************** Previous (original) README ***********************************************

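To train the model, run:
python run_summarization.py --mode=train --data_path=/path/to/chunked/train_* --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment

This creates a subdirectory of your specified log_root called myexperiment, where all checkpoints and other data will be saved. The model then starts training using the train_*.bin files as training data.

Warning: with the default settings used in the command above, both initializing the model and running training iterations can be very slow. To speed things up, try setting the following flags (especially max_enc_steps and max_dec_steps) to values smaller than the defaults specified in run_summarization.py: hidden_dim, emb_dim, batch_size, max_enc_steps, max_dec_steps, vocab_size.

Increasing sequence length during training: note that to obtain the results described in the paper, we increased the values of max_enc_steps and max_dec_steps in stages throughout training (mostly so that we could run faster iterations during the early stages of training). If you wish to do the same, start with small values of max_enc_steps and max_dec_steps, then interrupt and restart the job with larger values when you want to increase them.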

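Run (concurrent) eval

You may want to run a concurrent evaluation job, which runs your model on the validation set and logs the loss. To do this, run:

python run_summarization.py --mode=eval --data_path=/path/to/chunked/val_* --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment

Note: run the above command with the same settings you used for the training job.

Restoring snapshots: the eval job saves a snapshot of the model that has achieved the lowest loss on the validation data so far. You may want to restore one of these "best models", for example if your training job has overfit, or if the training checkpoint has been corrupted by NaN values. To do this, run your train command with the added --restore_best_model=1 flag. This copies the best model from the eval directory to the train directory. Then run the usual train command again.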

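Run beam search decoding

To run beam search decoding:

python run_summarization.py --mode=decode --data_path=/path/to/chunked/val_* --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment

Note: run the above command with the same settings you used for the training job (plus any decode-mode-specific flags such as beam_size).

This repeatedly loads random examples from the specified data file and generates a summary for each using beam search. The results are printed to the screen.

Visualizing the output: the decode job also produces a file called attn_vis_data.json. This file provides the data needed by the in-browser visualization tool, which lets you view the attention distributions projected onto the text. To use the visualizer, follow the instructions linked from the original README.

If you want to run evaluation on the entire validation or test set and get ROUGE scores, set the flag single_pass=1. This goes through the entire dataset in order, writes the generated summaries to file, and then runs evaluation using pyrouge. (Note that this does not produce the attn_vis_data.json file for the attention visualizer.)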

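Evaluate with ROUGE

decode.py uses the Python package pyrouge to run ROUGE evaluation. pyrouge provides an easier-to-use interface to the official Perl ROUGE package, which you must install for pyrouge to work. The original README links to some useful instructions on how to do this:

How to set up Perl ROUGE
More details about plugins for Perl ROUGE
Note: as of 18 May 2017, the website for the official Perl package appears to be down. Unfortunately, you need to download a directory called ROUGE-1.5.5 from there. As an alternative, it seems you can get that directory from another repository linked in the original README (however, the version of pyrouge in that repo appears to be outdated, so it is best to install pyrouge from the official source).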

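Tensorboard

Run Tensorboard from the experiment directory (in the example above, myexperiment). You should be able to see data from the train and eval runs. If you select "embeddings", you should also see your word embeddings visualized.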

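Help, I've got NaNs!

For reasons that are difficult to diagnose, NaNs sometimes occur during training, making the loss NaN and sometimes also corrupting the model checkpoint with NaN values so that it becomes unusable. Here are some suggestions:

If training stopped with the "Loss is not finite. Stopping." exception, you can simply try restarting. The checkpoint may not be corrupted.
You can check whether your checkpoint is corrupted with the inspect_checkpoint.py script. If it says that all values are finite, your checkpoint is fine and you can try resuming training from it.
The training job is set to keep 3 checkpoints at any one time (see the max_to_keep variable in run_summarization.py). If your newest checkpoint is corrupted, one of the older ones may not be. You can switch to that checkpoint by editing the checkpoint file inside the train directory.
Alternatively, you can restore a "best model" from the eval directory. See the note on restoring snapshots above.
If you want to try to diagnose the cause of the NaNs, you can run with the --debug=1 flag turned on. This runs the TensorFlow Debugger, which checks for NaNs and diagnoses their causes during training.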
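
The suggestion about editing the checkpoint file can be sketched as follows. This assumes the text layout that TensorFlow 1.x normally writes to that file: a model_checkpoint_path line naming the checkpoint to load, followed by all_model_checkpoint_paths lines listing the ones kept. The function name and checkpoint names are hypothetical. Check the layout of your own file and back it up before changing it.

```python
import re


def point_to_checkpoint(state_text, checkpoint_name):
    """Return the checkpoint file text with model_checkpoint_path set to checkpoint_name.

    Only checkpoints already listed under all_model_checkpoint_paths are accepted.
    """
    listed = re.findall(
        r'^all_model_checkpoint_paths: "(.*)"$', state_text, flags=re.MULTILINE
    )
    if checkpoint_name not in listed:
        raise ValueError("checkpoint %r is not listed in the file" % checkpoint_name)
    new_text, replaced = re.subn(
        r'^model_checkpoint_path: ".*"$',
        # A function is used as the replacement so backslashes in paths stay literal
        lambda match: 'model_checkpoint_path: "%s"' % checkpoint_name,
        state_text,
        count=1,
        flags=re.MULTILINE,
    )
    if replaced != 1:
        raise ValueError("no model_checkpoint_path line found")
    return new_text


# Usage on an example file body with three kept checkpoints
state = (
    'model_checkpoint_path: "model.ckpt-300"\n'
    'all_model_checkpoint_paths: "model.ckpt-100"\n'
    'all_model_checkpoint_paths: "model.ckpt-200"\n'
    'all_model_checkpoint_paths: "model.ckpt-300"\n'
)
updated = point_to_checkpoint(state, "model.ckpt-200")
```

After writing the updated text back to the file in the train directory, restarting the train command should resume from the older checkpoint.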
