Day 114 of #NLP365 NLP Papers Summary: A Summarization System for Scientific Documents


INSIDE AI NLP365

Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 257 days here. At the end of this article, you can find previous paper summaries grouped by NLP areas :)


Today’s NLP paper is A Summarization System for Scientific Documents. Below are the key takeaways of the research paper.


Objective and Contribution

The paper proposes IBM Science Summariser for summarising computer science research papers. The system can identify different scenarios such as discovery, exploration, and understanding of scientific documents. It summarises research papers in two ways: from a free-text query or by choosing categorised values such as scientific tasks, datasets, and more. The system has ingested 270,000 papers.


The IBM Science Summariser produces summaries that focus on the user’s queries (query-focused summarisation). It summarises the various sections of the paper independently, allowing users to focus only on the relevant sections. This allows for interaction between the user’s queries and the various entities in the paper.


The figure below showcases the user interface of the IBM Science Summariser. Users pose their queries (or use the filters on metadata fields). Relevant papers are then returned together with summarisation results. Each section is clearly displayed, with entities accurately highlighted.


Figure: The UI of the IBM Science Summariser [1]

Summarisation of Scientific Articles — What, Why, How?

WHAT DOES A SUMMARISATION SYSTEM FOR SCIENTIFIC PAPERS CONSIST OF?


  1. Extracting the structure
  2. Extracting tables and figures from the PDF
  3. Identifying important entities
  4. Generating a useful summary

WHY IS THIS NEEDED?

Below are the pain points of academic researchers:


  1. Keeping up to date with current work
  2. Preparing a research project / grant request
  3. Preparing related work when writing a paper
  4. Checking the novelty of an idea

The first pain point tends to occur daily or weekly, with information overload and a lot of time spent reading papers. Pain points 2–4 are important but less frequent.


HOW DO RESEARCHERS SEARCH AND READ RESEARCH PAPERS?

  1. Researchers search by keywords, entities (such as task name, dataset name, or model), or citations. For example, “state of the art results for SQuAD”.
  2. They read the title, then the abstract. However, researchers mentioned that the abstract is often not informative enough to determine relevancy.

System Overview

Figure: The overall framework of the IBM Science Summariser [1]

The system (figure above) has two components:


  1. Ingestion pipeline and search engine (Elasticsearch)
  2. Summarisation

Ingestion Pipeline

The system contains 270,000 papers from arXiv and ACL. The pipeline consists of 3 main steps:


  1. Extracting the paper’s text, tables, and figures
  2. Metadata enrichment with annotations and entities
  3. Entity extraction

The system uses Science-Parse to extract text, tables, and figures from the PDFs. Science-Parse supports extracting figures and tables into image files (along with their caption text), and figure and table references in the text paragraphs are detected. We also extract tasks, datasets, and metrics. The output is returned in JSON format. Elasticsearch is used to index the papers, indexing the title, abstract, section text, and some metadata.

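As a concrete illustration of the indexing step, below is a minimal sketch of how a parsed paper could be pushed into Elasticsearch with the official Python client (elasticsearch-py v8). The index name, the document layout, and the `parsed_paper` contents are illustrative assumptions, not the authors’ actual schema.

```python
from elasticsearch import Elasticsearch

# Connect to a local Elasticsearch instance (assumed deployment).
es = Elasticsearch("http://localhost:9200")

# Hypothetical output of the Science-Parse extraction + enrichment steps.
parsed_paper = {
    "title": "A Summarization System for Scientific Documents",
    "abstract": "We present a system that produces summaries of scientific papers ...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "System Overview", "text": "..."},
    ],
    "entities": {"tasks": ["summarization"], "datasets": [], "metrics": ["ROUGE"]},
}

# Index the title, abstract, section text, and metadata, as described above.
es.index(index="papers", id="arxiv-1908.11152", document=parsed_paper)
```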

The system handles three types of entities: task, dataset, and metric. Both dictionary-based and learning-based approaches are implemented. The dictionaries are manually created using the paperswithcode website. To cover evolving topics, the learning-based approach analyses the entire paper to extract the three entity types. This is treated as a textual entailment task, where the paper content is the text (premise) and the target Task-Dataset-Metric (TDM) triple is the hypothesis. This approach forces the model to learn similarity patterns between the text and the triples. Overall, the system has indexed 872 tasks, 345 datasets, and 62 metrics from the entire corpus.

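To make the entailment formulation concrete, here is a minimal sketch using an off-the-shelf NLI model via the Hugging Face zero-shot pipeline. The model choice, the hypothesis template, and the candidate triples are my assumptions; the paper does not specify its exact model or verbalisation.

```python
from transformers import pipeline

# An off-the-shelf NLI model stands in for the system's entailment model (assumption).
nli = pipeline("zero-shot-classification", model="roberta-large-mnli")

# Candidate Task-Dataset-Metric triples, verbalised as natural-language hypotheses (illustrative).
tdm_candidates = [
    "question answering on SQuAD evaluated with F1",
    "machine translation on WMT14 evaluated with BLEU",
    "summarization on CNN/DailyMail evaluated with ROUGE",
]

def rank_tdm_triples(paper_text: str):
    """Rank candidate TDM triples by how strongly the paper text entails them."""
    result = nli(
        paper_text[:2000],                 # truncate long papers for this demo
        candidate_labels=tdm_candidates,
        hypothesis_template="This paper reports results for {}.",
        multi_label=True,                  # a paper can match several triples
    )
    return list(zip(result["labels"], result["scores"]))
```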

Summarisation

The summary can be generic or query-focused. Because the language can differ considerably between sections, each section is summarised independently, and these section-based summaries are then composed into one summary. The inputs to the summarisation are the query (optional), the entities, and the relevant papers returned by the search engine. Summarisation is broken down into multiple steps:


  1. Query Handling
  2. Pre-processing
  3. Summarisation

If a query Q is given, it can be either very precise or verbose. If it is short and precise, we expand it using query expansion, which transforms Q into 100 unigram terms (obtained by analysing the top papers returned for Q). If Q is verbose, a fixed-point weighting scheme is used to rank the query terms. If there is no Q, key phrases of the paper are used as a proxy for the query.

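A minimal pseudo-relevance-feedback sketch of this query-expansion idea is shown below. The helper name, the stop-word list, and the scoring by raw term frequency are simplifying assumptions, not the system’s actual weighting scheme.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "with", "to", "is", "we"}

def expand_query(query: str, top_docs: list[str], n_terms: int = 100) -> list[str]:
    """Expand a short query with frequent unigrams from the top-ranked documents.

    `top_docs` would come from an initial search-engine pass for `query`
    (pseudo-relevance feedback); here it is simply a list of raw paper texts.
    """
    counts = Counter()
    for doc in top_docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    expansion = [term for term, _ in counts.most_common(n_terms)]
    # Keep the original query terms at the front, deduplicated, capped at n_terms.
    return list(dict.fromkeys(query.lower().split() + expansion))[:n_terms]
```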

For pre-processing, we perform sentence tokenisation, word tokenisation, lowercasing, and stop-word removal. Each sentence is then transformed into unigram and bigram bag-of-words (BoW) representations.

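Here is a minimal sketch of this pre-processing step using NLTK; the choice of library and function names are my assumptions, since the paper does not name its tooling.

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokeniser models and the stop-word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP = set(stopwords.words("english"))

def sentence_bows(section_text: str) -> list[Counter]:
    """Turn a section into per-sentence unigram + bigram bag-of-words vectors."""
    bows = []
    for sentence in sent_tokenize(section_text):
        tokens = [w.lower() for w in word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in STOP]
        bigrams = [" ".join(pair) for pair in nltk.bigrams(tokens)]
        bows.append(Counter(tokens + bigrams))
    return bows
```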

For summarisation, we use a state-of-the-art unsupervised, extractive, query-focused summarisation algorithm. The algorithm takes in a paper section, the query Q, the desired summary length (10 sentences), and a set of entities linked to the query. The generated summary is a subset of sentences from the paper section, selected through an unsupervised optimisation scheme. Sentence selection is posed as a multi-criteria optimisation problem in which several summary quality objectives are considered (a simplified scoring sketch follows the list below). These summary qualities are:


  1. Query saliency. Does the summary contain many query-related terms (measured by cosine similarity)?

  2. Entity coverage. Do the entities covered in the summary match our set of entities?

  3. Text coverage. How much of the paper section does the summary cover?

  4. Sentence length. We bias the summaries towards longer sentences, which are assumed to be more informative.
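
To make these four objectives concrete, here is a toy sentence-selection sketch that combines them as a weighted sum and picks sentences greedily. The real system solves an unsupervised multi-criteria optimisation problem; the greedy selection, the arbitrary weights, and the helper functions below are simplifying assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def greedy_summary(sentences, sent_bows, query_bow, entities,
                   length=10, weights=(0.4, 0.2, 0.3, 0.1)):
    """Pick `length` sentences balancing saliency, entity coverage, text coverage, and length.

    A simplified stand-in for the paper's unsupervised multi-criteria optimisation.
    """
    section_bow = sum(sent_bows, Counter())
    max_len = max((len(s.split()) for s in sentences), default=1)
    scored = []
    for sent, bow in zip(sentences, sent_bows):
        saliency = cosine(bow, query_bow)                              # 1. query saliency
        ent_cov = sum(e.lower() in sent.lower() for e in entities) / max(len(entities), 1)  # 2.
        text_cov = cosine(bow, section_bow)                            # 3. text coverage
        length_pref = len(sent.split()) / max_len                      # 4. prefer longer sentences
        score = sum(w * v for w, v in zip(weights, (saliency, ent_cov, text_cov, length_pref)))
        scored.append((score, sent))
    return [s for _, s in sorted(scored, key=lambda x: x[0], reverse=True)[:length]]
```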

Human Evaluation

EVALUATION SETUP

We approached 12 authors and asked them to evaluate summaries of two papers they had co-authored, giving us a total of 24 papers. For each paper, we produced two types of summaries: a section-based summary and a section-agnostic summary (treating the paper content as flat text). This allows us to assess the benefit of section-based summarisation, and gives us a total of 48 summaries to evaluate.


The authors are required to perform 3 tasks per summary:


  1. For each sentence, determine whether the sentence should be included in the summary (a binary measure of precision)
  2. Rate how well each section of the paper is covered in the summary (a measure of recall, on a 1–5 scale, with 3 being good)
  3. Evaluate the overall quality of the summary (1–5 scale, with 3 being good)

RESULTS

The results are shown in the figure below. For task 2, the section-based summary scored higher for 68% of the papers. The average score for section-based summaries is 3.32, which highlights their quality.


Figure: The summarisation results — section-agnostic vs section-based [1]

Conclusion and Future Work

As future work, the IBM Science Summariser team plans to add support for more entities and to ingest more papers. Further qualitative studies are being conducted to assess its usage and the quality of its summaries, including automatic evaluation of summaries.


Source:

[1] Erera, S., Shmueli-Scheuer, M., Feigenblat, G., Nakash, O.P., Boni, O., Roitman, H., Cohen, D., Weiner, B., Mass, Y., Rivlin, O. and Lev, G., 2019. A Summarization System for Scientific Documents. arXiv preprint arXiv:1908.11152.


Originally published at https://ryanong.co.uk on April 23, 2020.


Aspect Extraction / Aspect-based Sentiment Analysis

Summarisation

Others

Translated from: https://towardsdatascience.com/day-114-of-nlp365-nlp-papers-summary-a-summarization-system-for-scientific-documents-aebdc6e081f8

