A Survey on Automatic Text Summarization

1. Definition of Automatic Text Summarization

Text summarization compresses the source text into a shorter version while preserving its information content and overall meaning.

1.1 Types of Automatic Summarization

Single-document and multi-document summarization

1.2 Categories of Summarization Methods

Extractive and abstractive summarization

2. Summarization Process and Extraction Features

Most current automated text summarization systems use extraction methods. The extractive summarization process can be divided into three phases.
The first phase is pre-processing and the second phase is processing.

2.1 Common Pre-processing Methods

(1) Part-of-Speech (POS) Tagging

(2) Stop Word Filtering
Words such as "a", "an", "in", and "by" can be treated as stop words and filtered out of the plain text.
(3) Stemming
Reducing words to their stems, e.g. removing -ed or -ing from verbs, using the singular instead of the plural form of nouns, etc.
(4) Feature Calculation
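
Stop word filtering and stemming can be illustrated with a minimal Python sketch. The stop-word list and the suffix rules below are assumptions made up for illustration; a real system would use a full stop-word list and a proper stemmer (for example the Porter algorithm) rather than this crude suffix stripper.

```python
import re

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "in", "by", "of", "and", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The cats walked in a garden"))
```

The order matters slightly: filtering before stemming avoids comparing stemmed tokens against an unstemmed stop-word list.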

2.1.1.Title Similarity:


2.1.2.Sentence Position:


2.1.3.Term Weight(Term frequency)

The total term weight is calculated by computing tf and idf for the document.
Here idf is the inverse document frequency, which indicates whether a term is common or rare across all documents.
The importance score wi of word i can be calculated by the traditional tf.idf method.
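
As a rough illustration of tf.idf weighting, the sketch below uses one common variant, wi = tf_i * log(N / df_i), where N is the number of documents and df_i is the number of documents containing term i. This exact form is an assumption; the survey's own formula may use a different normalization or smoothing.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    `documents` is a list of token lists. The unsmoothed log(N / df) form
    is an assumed variant chosen for illustration.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["cat", "sat", "cat"], ["dog", "sat"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document (here "sat") gets weight zero, which matches the idea that idf separates common terms from rare ones.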

2.1.4.Sentence Length

This feature is useful for eliminating sentences that are too short, such as datelines or author names.
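
A minimal sketch of a sentence length feature follows. It assumes a commonly used normalization (words in the sentence divided by words in the longest sentence) plus a hypothetical `min_words` cutoff for discarding very short sentences; neither detail is confirmed by these notes.

```python
def sentence_length_scores(sentences, min_words=4):
    """Score each sentence by its length relative to the longest sentence.

    Sentences with fewer than `min_words` words (an assumed threshold)
    score 0.0 so that datelines and bylines drop out.
    """
    lengths = [len(s.split()) for s in sentences]
    longest = max(lengths, default=0)
    if longest == 0:
        return [0.0] * len(sentences)
    return [0.0 if n < min_words else n / longest for n in lengths]


sentences = [
    "By John Smith",
    "The committee approved the new budget after a long debate",
]
print(sentence_length_scores(sentences))
```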

2.1.5. Thematic Word

This feature concerns domain-specific words: words that occur frequently in a document are probably related to its topic.
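
One way to turn this idea into a score is sketched below: take the `top_k` most frequent terms of the document as thematic words, count how many occur in each sentence, and normalize by the best sentence. The value of `top_k` and the normalization are assumptions for illustration only.

```python
from collections import Counter


def thematic_word_scores(tokenized_sentences, top_k=5):
    """Score sentences by how many of the document's top_k terms they contain."""
    counts = Counter(t for sent in tokenized_sentences for t in sent)
    thematic = {t for t, _ in counts.most_common(top_k)}
    hits = [sum(1 for t in sent if t in thematic) for sent in tokenized_sentences]
    best = max(hits, default=0)
    # Normalize to [0, 1]; all zeros if no sentence has a thematic word.
    return [h / best if best else 0.0 for h in hits]


sents = [
    ["summary", "model", "summary"],
    ["weather", "today"],
    ["model", "summary"],
]
print(thematic_word_scores(sents, top_k=2))
```

In practice the token lists would already have stop words removed, otherwise function words dominate the frequency counts.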

2.1.6. Proper Nouns

2.1.7.Sentence to Sentence Similarity


2.1.8. Numerical Data

3. SUMMARIZATION METHODS

3.1.Query Based and Generic Summarization

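In query-based text summarization, the sentences of a given document are scored based on frequency counts of words or phrases. Sentences containing the query phrase receive higher scores than sentences containing only single query words.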
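
The scoring idea can be sketched as follows. The weights `phrase_weight` and `word_weight` are hypothetical values chosen so that a full phrase match outranks isolated query words; the notes do not give the actual scoring function.

```python
def query_score(sentence, query, phrase_weight=2.0, word_weight=1.0):
    """Score a sentence by query-word counts plus a bonus for the full phrase."""
    sent_tokens = sentence.lower().split()
    query_tokens = query.lower().split()
    if not query_tokens:
        return 0.0
    n = len(query_tokens)
    # Count positions where the whole query appears as a contiguous run.
    phrase_hits = sum(
        1
        for i in range(len(sent_tokens) - n + 1)
        if sent_tokens[i:i + n] == query_tokens
    )
    query_set = set(query_tokens)
    word_hits = sum(1 for t in sent_tokens if t in query_set)
    return phrase_weight * phrase_hits + word_weight * word_hits


query = "text summarization"
for s in ["automatic text summarization is useful", "the text was long"]:
    print(s, "->", query_score(s, query))
```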

3.1.1. Bayesian Classifier


3.1.2. Hidden Markov Model

The HMM does not assume that the probability that sentence i is in the summary is independent of whether sentence i-1 is in the summary.
The main idea is to use a sequential model to account for local dependencies between sentences. In the HMM model, three features were used:
• position of the sentence in the document,
• number of terms in the sentence,
• likelihood of the sentence terms given the document terms.
The maximum-likelihood estimate of each transition probability is then obtained, forming the estimate of the transition matrix.
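
Maximum-likelihood estimation of a transition matrix amounts to counting observed transitions and normalizing each row. The sketch below assumes a simplified two-state labeling (0 = sentence not in summary, 1 = sentence in summary); the state space of the surveyed model is not given in these notes, so this is only an illustration of the counting step.

```python
def estimate_transition_matrix(state_sequences, n_states):
    """Estimate P(next state | current state) from labeled state sequences."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in state_sequences:
        # Each adjacent pair (previous, current) is one observed transition.
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observations are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical training labels for two documents.
sequences = [[0, 1, 1, 0], [0, 0, 1]]
for row in estimate_transition_matrix(sequences, n_states=2):
    print(row)
```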

3.1.3. Neural Networks Based Text Summarization

f1 = Paragraph follows title (Paragraph Position)
f2 = Paragraph location in document
f3 = Sentence location in paragraph
f4 = First sentence in paragraph
f5 = Sentence Length
f6 = Number of thematic words in sentence
f7 = Number of title words in sentence
The text summarization process consists of three phases: training, feature fusion, and sentence selection.

3.1.4. Fuzzy Logic Based Text Summarization

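The fuzzy logic method uses fuzzy rules and triangular membership functions. Fuzzy rules take the IF-THEN form. The triangular membership function fuzzifies each score into one of three values: LOW, MEDIUM, and HIGH.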
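
A triangular membership function and the LOW/MEDIUM/HIGH fuzzification can be sketched as below. The breakpoints of the three triangles are assumptions for a score normalized to [0, 1]; the actual parameters used in the surveyed work are not given here.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, rising linearly to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed (a, b, c) breakpoints for a feature score in [0, 1].
FUZZY_SETS = {
    "LOW": (-0.5, 0.0, 0.5),
    "MEDIUM": (0.0, 0.5, 1.0),
    "HIGH": (0.5, 1.0, 1.5),
}


def fuzzify(score):
    """Return the membership degree of `score` in each fuzzy set."""
    return {name: triangular(score, *abc) for name, abc in FUZZY_SETS.items()}


def fuzzy_label(score):
    """Pick the fuzzy set with the highest membership degree."""
    degrees = fuzzify(score)
    return max(degrees, key=degrees.get)


print(fuzzify(0.25), fuzzy_label(0.9))
```

An IF-THEN rule would then combine such labels across features, for example a hypothetical rule like "IF title similarity is HIGH AND sentence length is MEDIUM THEN the sentence is important".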

4.EVALUATION
