Understanding Automatic Text Summarization - 1: Extraction Methods



Text summarization is commonly used by several websites and applications to create news feeds and article summaries. It has become essential for us due to our busy schedules. We prefer short summaries with all the important points over reading a whole report and summarizing it ourselves. So, several attempts have been made to automate the summarizing process. In this article, we will talk about some of them and see how they work.


What is summarization?

Summarization is a technique to shorten long texts such that the summary has all the important points of the actual document.


There are mainly four types of summaries:


  1. Single Document Summary: Summary of a single document
  2. Multi-Document Summary: Summary from multiple documents
  3. Query Focused Summary: Summary of a specific query
  4. Informative Summary: It includes a summary of the full information.

Approaches to Automatic Summarization

There are mainly two types of summarization:


Extraction-based Summarization: The extractive approach involves picking the most important phrases and sentences from the document and combining them to create the summary. So, in this case, every line and word of the summary actually belongs to the original document being summarized.

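The article itself contains no code; as a minimal sketch of the extractive idea, the snippet below scores sentences by word frequency and keeps the highest-scoring ones. The function name, sentence splitting, and scoring rule are illustrative simplifications, not a method taken from the article.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the top-scoring sentences by word frequency, keeping their
    original order (a naive frequency-based extractive sketch)."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence as the sum of the frequencies of its words.
    scored = [(i, sum(freq[w] for w in re.findall(r'\w+', s.lower())))
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, key=lambda x: -x[1])[:num_sentences])
    return ' '.join(sentences[i] for i, _ in top)

doc = ("Text summarization shortens long documents. "
       "It keeps the important points of the document. "
       "Many websites use it for news feeds.")
print(extractive_summary(doc, num_sentences=2))
```

Every sentence in the output is copied verbatim from the input, which is exactly what characterizes the extractive approach.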

Abstraction-based Summarization: The abstractive approach involves summarization based on deep learning. It uses new phrases and terms, different from the actual document, while keeping the points the same, just like how we actually summarize. So, it is much harder than the extractive approach.

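The article does not name a specific model or library; as one hedged illustration (an assumption on my part, not something the article uses), the Hugging Face transformers summarization pipeline runs a pretrained sequence-to-sequence model that rewrites the input in new words:

```python
# Requires the Hugging Face `transformers` library (plus a backend such as
# PyTorch); this is one common way to run an abstractive model, not the
# article's own method.
from transformers import pipeline

summarizer = pipeline("summarization")  # downloads a default seq2seq model

article = ("Text summarization is commonly used by several websites and "
           "applications to create news feed and article summaries. "
           "We prefer short summaries with all the important points over "
           "reading a whole report and summarizing it ourselves.")

result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])  # a paraphrased summary, not copied lines
```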

It has been observed that extractive summaries sometimes work better than abstractive ones, probably because extractive methods don't require natural language generation or semantic representations.


Evaluation Methods

There are two types of evaluations:


  • Human Evaluation
  • Automatic Evaluation

Human Evaluation: Scores are assigned by human experts based on how well the summary covers the points, answers the queries, and other factors such as grammaticality and non-redundancy.


Automatic Evaluation

ROUGE: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a method that determines the quality of a summary by comparing it to other summaries made by humans as a reference. To evaluate a model, there are a number of references created by humans and the candidate summary generated by the machine. The intuition behind this is that if a model creates a good summary, it must have common overlapping portions with the human references. It was proposed by Chin-Yew Lin of the University of Southern California.


Common versions of ROUGE are:


ROUGE-n: It is a measure of the comparison between the machine-generated output and the reference output based on n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech, i.e., it is simply a sequence of words. Bigrams mean two words, trigrams mean three words, and so on. We normally use bigrams.


ROUGE-n = p / q

Where p is "the number of common n-grams between candidate and reference summary", and q is "the number of n-grams extracted from the reference summary only". -Source

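Following the p / q definition above, here is a small sketch of ROUGE-n recall. Clipping repeated n-grams by their reference counts is the usual convention and an assumption here, since the article only gives the p / q description.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """ROUGE-n recall: p / q, where p is the count of n-grams shared with
    the reference (clipped by reference counts) and q is the total number
    of n-grams in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    p = sum(min(cand[g], ref[g]) for g in ref)  # overlapping n-grams
    q = sum(ref.values())                       # n-grams in the reference
    return p / q if q else 0.0

print(rouge_n("the cat sat on the mat",
              "the cat was sitting on the mat", n=2))  # 3 / 6 = 0.5
```

With bigrams, "the cat", "on the", and "the mat" overlap with the reference, so the score in this toy example is 3 / 6 = 0.5.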

ROUGE-L: It states that the longer the longest common subsequence in two texts, the more similar they are. So, it is more flexible than the n-gram measure. It assigns scores based on how long a sequence common to the machine-generated candidate and the human reference can be.

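A corresponding sketch for ROUGE-L using a standard dynamic-programming LCS. Only the recall side (LCS length over reference length) is shown; the full ROUGE-L score also combines this with precision into an F-measure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by the reference length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0

print(rouge_l_recall("the cat sat on the mat",
                     "the cat was sitting on the mat"))  # 5 / 7 ≈ 0.71
```

Here the longest common subsequence is "the cat on the mat" (5 tokens) against a 7-token reference, giving roughly 0.71.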

ROUGE-SU: It brings in the concept of skip bi-grams and unigrams. Basically, it allows or considers a bigram even if there are some other words in between, i.e., the two words of the bigram need not be consecutive in the text.
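
The article breaks off at this point; as commonly defined, a skip-bigram is an ordered pair of words that may have other words between them, usually limited to a maximum gap, and ROUGE-SU additionally counts unigrams. The rough sketch below covers only the skip-bigram part, and the exact gap convention is my own assumption.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_gap=4):
    """Ordered word pairs whose positions are at most max_gap apart
    (only the skip-bigram part of ROUGE-SU; unigrams are omitted here)."""
    return Counter((tokens[i], tokens[j])
                   for i, j in combinations(range(len(tokens)), 2)
                   if j - i <= max_gap)

def rouge_s_recall(candidate, reference, max_gap=4):
    cand = skip_bigrams(candidate.lower().split(), max_gap)
    ref = skip_bigrams(reference.lower().split(), max_gap)
    common = sum(min(cand[g], ref[g]) for g in ref)
    return common / sum(ref.values()) if ref else 0.0

print(rouge_s_recall("the cat sat on the mat",
                     "the cat was sitting on the mat"))
```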
