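This section covers only the most basic method. In practice, automatic summarization has many points that can be optimized; these are left to the sixth assignment.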
1. Sentence Extraction
- Represent each sentence as a feature vector
- Compute score based on features
- Select the n highest-ranking sentences
- Present them in the order in which they occur in the text
- Postprocessing to make the summary more readable/concise:
  - Eliminate redundant sentences
  - Replace anaphors/pronouns with the noun phrases they refer to (coreference resolution)
  - Delete subordinate clauses, parentheticals
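The selection and redundancy steps above can be sketched as follows. This is a minimal sketch, assuming per-sentence scores have already been computed; `select_summary`, `jaccard`, and the `max_overlap` threshold are hypothetical names chosen for illustration, and word-overlap (Jaccard) similarity is only one assumed way to detect redundant sentences.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed redundancy measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def select_summary(sentences, scores, n, max_overlap=0.5):
    """Pick up to n high-scoring, non-redundant sentences, in document order."""
    # Visit sentence indices from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == n:
            break
        # Skip a sentence that overlaps too much with one already chosen.
        if any(jaccard(sentences[i], sentences[j]) > max_overlap for j in chosen):
            continue
        chosen.append(i)
    # Restore the order in which the sentences occur in the text.
    return [sentences[i] for i in sorted(chosen)]
```

Sorting the chosen indices at the end is what keeps the summary in source order even though selection runs in score order.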
Sentence importance can be obtained from pairwise sentence similarity combined with a PageRank-like method.
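A minimal sketch of this idea, assuming Jaccard word overlap as the similarity measure (the notes do not fix one) and a plain power iteration for the PageRank-like step. The names `similarity` and `rank_sentences`, the damping factor 0.85, and the iteration count are assumptions made for illustration.

```python
def similarity(a, b):
    """Jaccard word overlap between two sentences (assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-like iteration on a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weight between sentences i and j; no self-loops.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(w[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

A sentence that shares no words with any other sentence receives only the baseline score `(1 - damping) / n`, so it ranks last.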
Sentence extraction often adds further features:
- Fixed-phrase feature: certain phrases indicate summary, e.g. “in summary”
- Paragraph feature: Paragraph initial/final more likely to be important.
- Thematic word feature: Repetition is an indicator of importance
- Uppercase word feature: Uppercase often indicates named entities. (Taylor)
- Sentence length cut-off: Summary sentence should be > 5 words.
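The features above could be combined into a sentence score roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the cue-phrase list, the equal default weights, the choice of the `top_k` most frequent longer words as thematic words, and the names `sentence_features` and `score_paragraph` are all hypothetical. The length cut-off is treated as a hard filter.

```python
import re
from collections import Counter

# Assumed cue phrases; the notes only give "in summary" as an example.
CUE_PHRASES = ("in summary", "in conclusion")
WORD_RE = re.compile(r"[A-Za-z]+")


def sentence_features(sentence, position, total, thematic_words):
    """Return the feature values for one sentence of a paragraph."""
    words = WORD_RE.findall(sentence)
    lowered = [w.lower() for w in words]
    return {
        "fixed_phrase": float(any(p in sentence.lower() for p in CUE_PHRASES)),
        # Paragraph-initial or paragraph-final sentence.
        "paragraph": float(position in (0, total - 1)),
        # Fraction of words that are frequent (thematic) words.
        "thematic": sum(w in thematic_words for w in lowered) / max(len(words), 1),
        # Capitalised word other than the first one (likely a named entity).
        "uppercase": float(any(w[0].isupper() for w in words[1:])),
    }


def score_paragraph(sentences, weights=None, top_k=3, min_words=5):
    """Weighted feature sum per sentence; short sentences score 0."""
    if weights is None:
        weights = {"fixed_phrase": 1.0, "paragraph": 1.0,
                   "thematic": 1.0, "uppercase": 1.0}
    # Thematic words: most frequent words longer than 3 letters (an assumption).
    counts = Counter(w.lower() for s in sentences
                     for w in WORD_RE.findall(s) if len(w) > 3)
    thematic_words = {w for w, _ in counts.most_common(top_k)}
    scores = []
    for i, s in enumerate(sentences):
        if len(WORD_RE.findall(s)) <= min_words:
            scores.append(0.0)  # sentence length cut-off
            continue
        feats = sentence_features(s, i, len(sentences), thematic_words)
        scores.append(sum(weights[name] * value for name, value in feats.items()))
    return scores
```

In practice the weights would be tuned or learned rather than left equal.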
2. TextRank: Bringing Order into Text
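Extract all the phrases in the document. If a word B falls within a window of size k centered on a word A, add an edge between A and B. The importance of each phrase is then computed with a PageRank-like method.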
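The graph construction and ranking can be sketched as below. This is a simplified illustration that ranks single words rather than multi-word phrases, and it assumes that "window of size k" means two words are linked when they are at most `window` positions apart. The stopword list and the names `build_graph` and `textrank_keywords` are hypothetical.

```python
import re
from collections import defaultdict

# Tiny assumed stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "for"}


def build_graph(words, window=2):
    """Link each word to the words at most `window` positions after it."""
    graph = defaultdict(set)
    for i, a in enumerate(words):
        for b in words[i + 1:i + 1 + window]:
            if a != b:
                graph[a].add(b)
                graph[b].add(a)  # undirected edge
    return graph


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Return the top_n words ranked by a PageRank-like score."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    graph = build_graph(words, window)
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        # Each neighbour v splits its score evenly among its own neighbours.
        scores = {
            w: (1 - damping) + damping * sum(scores[v] / len(graph[v]) for v in graph[w])
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A word that co-occurs with many different words collects score from all of them, which is why it rises to the top of the ranking.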
ROUGE evaluation
Coreference resolution
Detecting text structure: lexical chains
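Of the topics listed above, ROUGE is the simplest to illustrate: it scores a candidate summary by its n-gram overlap with a reference summary. The sketch below computes a recall-style ROUGE-N for a single reference, assuming whitespace tokenization and lowercasing; the function names are hypothetical, and real evaluations normally rely on an established ROUGE toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    return overlap / total
```

For example, `rouge_n_recall("the cat sat on the mat", "the cat is on the mat")` matches 5 of the 6 reference unigrams.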