Generate summaries using Google's Pegasus library


PEGASUS stands for Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. It uses the self-supervised Gap Sentences Generation (GSG) objective to train a transformer encoder-decoder model. The paper can be found on arXiv. In this article, we will focus only on generating state-of-the-art abstractive summaries using Google's Pegasus library.


As of now, there is no easy way to generate summaries using the Pegasus library. However, Hugging Face is already working on implementing this, and they expect to release it around September 2020. In the meantime, we can try to follow the steps mentioned in the Pegasus GitHub repository and explore Pegasus. So let's get started.


This step will clone the library from GitHub, create the /content/pegasus folder, and install the requirements.

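A minimal sketch of this step as Colab cells, following the clone-and-install pattern in the Pegasus README (the repository URL is the official google-research one):

    !git clone https://github.com/google-research/pegasus
    %cd /content/pegasus
    !pip3 install -r requirements.txt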

Next, follow the instructions to install gsutil. The below steps worked well for me in Colab.

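A hedged sketch based on Google's standard Debian/Ubuntu instructions for installing the Cloud SDK (which ships gsutil); the exact cells used may differ:

    !echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee /etc/apt/sources.list.d/google-cloud-sdk.list
    !curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
    !apt-get update && apt-get install -y google-cloud-sdk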

This will create a folder named ckpt under /content/pegasus/ and then download all the necessary files (fine-tuned models, vocab, etc.) from Google Cloud to /content/pegasus/ckpt.

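A sketch of the download step, using the checkpoint bucket path given in the Pegasus README:

    %cd /content/pegasus
    !mkdir ckpt
    !gsutil cp -r gs://pegasus_ckpt/ ckpt/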

If all the above steps completed successfully, we will see the below folder structure in Google Colab. Under each downstream dataset's folder, we can see the fine-tuned models that we can use for generating extractive/abstractive summaries.


[Figure: Colab file browser showing the downloaded checkpoints under /content/pegasus/ckpt/, with one folder per downstream dataset]

Though it's not mentioned in the Pegasus GitHub repository README instructions, the pegasus installation step below is necessary; otherwise you will run into errors. Also, make sure you are in the root folder /content before executing this step.

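The exact cell is not documented in the README, so the following is an assumed reconstruction: installing the cloned repository as a package from the root folder.

    %cd /content
    !pip3 install ./pegasus  # assumed reconstruction of the missing installation step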

Now, let us try to understand the pre-training corpora and downstream datasets of Pegasus. Pegasus is pre-trained on the C4 and HugeNews corpora, and it is then fine-tuned on 12 downstream datasets. The evaluation results on the downstream datasets are mentioned on GitHub and also in the paper. Some of these datasets are extractive and some are abstractive, so the choice of dataset depends on whether we are looking for extractive or abstractive summaries.


Once all the above steps are taken care of, we could jump straight to the evaluate.py step mentioned below, but it would take long to complete, as it would try to make predictions on all the data in the evaluation set of the fine-tuned dataset being used. Since we are interested in summaries of custom or sample text, we need to make minor changes to the public_params.py file found under /content/pegasus/pegasus/params/public_params.py, as shown below.


Here I am making changes to reddit_tifu, as I am trying to use the reddit_tifu dataset for generating an abstractive summary. If you are experimenting with aeslc or other downstream datasets, make similar changes there.

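A hedged sketch of the kind of change involved: each dataset registered in public_params.py declares patterns for its train/dev/test splits, and pointing the test pattern at a local tfrecord makes evaluate.py read our own text instead of the dataset's test split. The local path below is a placeholder:

    # In /content/pegasus/pegasus/params/public_params.py, inside the
    # reddit_tifu registration, replace the tfds test split with a local
    # tfrecord that will hold our own text (the path is a placeholder):
    "test_pattern": "tfrecord:/content/pegasus/test_pattern.tfrecord",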

Here we are passing the text from this news article as inp, which is then copied to inputs. Note that an empty string is passed to targets, as this is what we are going to predict. Then both inputs and targets are used to create the tfrecord that Pegasus expects.


    inp = """replace this with the text from the above news article"""

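A sketch of the full cell, assuming Pegasus reads string features named "inputs" and "targets" from the tfrecord; the feature names and the output path (matching the pattern change above) are assumptions:

    import tensorflow as tf

    inputs = [inp]   # the article text to summarize
    targets = [""]   # empty string; this is what the model will predict

    # Write one tf.train.Example per (input, target) pair to the tfrecord
    # path that public_params.py now points at.
    with tf.io.TFRecordWriter("/content/pegasus/test_pattern.tfrecord") as writer:
        for text, summary in zip(inputs, targets):
            example = tf.train.Example(features=tf.train.Features(feature={
                "inputs": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
                "targets": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[summary.encode("utf-8")])),
            }))
            writer.write(example.SerializeToString())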

As the final step, when evaluate.py is run, the model makes a prediction, i.e., generates a summary of the above news article's text. This will generate 4 output files in the respective downstream dataset's folder; in this case, the input, output, prediction, and text_metric text files will be created under the reddit_tifu folder.

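A sketch of the command, following the shape documented in the Pegasus README; the exact params name for reddit_tifu and the beam-search overrides are assumptions:

    %cd /content/pegasus
    !python3 pegasus/bin/evaluate.py \
        --params=reddit_tifu_long_transformer \
        --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10M.pattern.pt.model,batch_size=1,beam_size=5 \
        --model_dir=ckpt/pegasus_ckpt/reddit_tifu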

[Figure: the four output text files generated under the reddit_tifu folder]

Abstractive summary (prediction): “India and Afghanistan on Monday discussed the evolving security situation in the region against the backdrop of a spike in terrorist violence in the country.”


This looks like a very well generated abstractive summary when we compare it with the news article we passed as input. By using different downstream datasets, we can generate extractive or abstractive summaries. We can also play around with different parameter values and see how they change the summaries.


Translated from: https://towardsdatascience.com/generate-summaries-using-googles-pegasus-library-772633a161c2
