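Notes summarizing some of what was covered in the massive-data course.
I. Definition of paraphrase: different expressions of the same meaning
II. Classification of paraphrases
By granularity, paraphrases can be divided into surface paraphrases and structural paraphrases. Surface paraphrases have four levels: lexical, phrase, sentence, and discourse. Structural paraphrases have two levels: pattern and collocation. By style, paraphrases can be divided into trivial changes, phrase replacement, phrase reordering, sentence splitting and merging, and complex paraphrases.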
III. Applications of paraphrasing:
Machine translation:
Translate unknown terms (phrases)
Expand training data
Rewrite input sentences
Improve automatic evaluation
Tune parameters
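Other applications: question answering, information extraction, information retrieval, summarization, and natural language generation.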
IV. Paraphrase identification: classification-based methods and alignment-based methods
4.1 Typical classification-based methods:
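1. Brockett and Dolan, 2005
Features: string similarity features (sentence length, word overlap, edit distance)
Morphological variants
WordNet lexical mappings
Word relation pairs: synonyms
Classifier: SVM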
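The three string similarity features named above can be sketched as follows. This is only an illustration under my own assumptions, not the authors' feature set: tokenization is a plain whitespace split, word overlap is measured as Jaccard overlap of the word sets, and the function names are made up for these notes.

```python
def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return sentence.lower().split()


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def string_features(s1, s2):
    """Return the three string features for a sentence pair as a dict."""
    t1, t2 = tokenize(s1), tokenize(s2)
    shared = set(t1) & set(t2)
    union = set(t1) | set(t2)
    return {
        "length_diff": abs(len(t1) - len(t2)),
        "word_overlap": len(shared) / len(union) if union else 1.0,
        "edit_distance": edit_distance(t1, t2),
    }
```

A feature dictionary like this, computed for each labeled sentence pair, would then be turned into a vector and passed to an SVM implementation of your choice.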
2. Finch et al., 2005: uses machine translation evaluation metrics to compute sentence similarity
Feature vector vec(s1, s2):
vec1(s1, s2): s1 as reference, s2 as MT system output;
vec2(s1, s2): s2 as reference, s1 as MT system output;
vec(s1, s2): average of vec1(s1, s2) and vec2(s1, s2)
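The symmetric construction of vec(s1, s2) can be sketched like this. The notes do not list which MT evaluation metrics were used, so a clipped n-gram precision stands in for them here purely as an assumption; the function names and the `max_n` parameter are also made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, output, n):
    """Clipped n-gram precision of `output` measured against `reference`."""
    out = ngrams(output, n)
    ref = ngrams(reference, n)
    total = sum(out.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in out.items())
    return matched / total


def directed_vec(reference, output, max_n=2):
    """One score per n-gram order, treating `output` as the MT system output."""
    return [ngram_precision(reference, output, n) for n in range(1, max_n + 1)]


def symmetric_vec(s1, s2, max_n=2):
    t1, t2 = s1.lower().split(), s2.lower().split()
    vec1 = directed_vec(t1, t2, max_n)  # s1 as reference, s2 as output
    vec2 = directed_vec(t2, t1, max_n)  # s2 as reference, s1 as output
    return [(a + b) / 2 for a, b in zip(vec1, vec2)]
```

Averaging the two directions makes the feature vector independent of which sentence happens to come first, which suits a task where the pair has no natural order.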
3. Malakasiotis, 2009
Combines multiple kinds of similarity features in one classifier
String similarity (various levels)
Tokens, stems, POS tags, nouns only, verbs only, …
Different measures
Edit distance, Jaro-Winkler distance, Manhattan distance…
Synonym similarity
Treat synonyms in two sentences as identical words
Syntactic similarity
Parse both sentences with a dependency parser and compute the overlap of their dependencies
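The "treat synonyms as identical words" idea can be sketched by mapping every word to one representative of its synonym group before measuring overlap. The synonym groups below are hypothetical toy data rather than a real lexicon, and Jaccard overlap is my own choice of measure, not something the notes specify.

```python
# Hypothetical toy synonym groups; a real system would draw these from a lexicon.
SYNONYM_GROUPS = [{"buy", "purchase"}, {"car", "automobile"}]


def canonical_map(groups):
    """Map each word to one fixed representative of its synonym group."""
    mapping = {}
    for group in groups:
        representative = min(group)  # any deterministic choice works
        for word in group:
            mapping[word] = representative
    return mapping


def jaccard(a, b):
    """Jaccard overlap of two collections, compared as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def synonym_aware_similarity(s1, s2, groups=SYNONYM_GROUPS):
    mapping = canonical_map(groups)
    t1 = [mapping.get(w, w) for w in s1.lower().split()]
    t2 = [mapping.get(w, w) for w in s2.lower().split()]
    return jaccard(t1, t2)
```

For "she will buy a car" and "she will purchase an automobile", plain word overlap gives 0.25, while the synonym-aware version gives about 0.67. The same `jaccard` helper could be applied to sets of (head, relation, dependent) triples from a dependency parser to get a rough version of the syntactic overlap.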
4.2 Alignment-based methods:
1. Wu, 2005
Conduct alignment based on Inversion Transduction Grammars (ITG)
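Sensitive to sentence structure; handles lexical variation without using any thesaurus
Performance is comparable to the classification-based methods, and it also performs well on recognizing textual entailment
2. Das and Smith, 2009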
Conduct alignment based on Quasi-Synchronous Dependency Grammar (QG)
Alignment between two dependency trees
Assumption: the dependency trees of two paraphrase sentences should be aligned closely
Summary:
Classification-based methods are still the mainstream approach, since:
The binary classification problem is well defined;
Classification algorithms and tools are readily available;
It can combine various features in a simple way;
It achieves state-of-the-art performance.
V. Paraphrase extraction
1. Dictionaries
2. Monolingual parallel corpora
3. Monolingual comparable corpora
4. Bilingual parallel corpora
4.1 Takao et al., 2002
Basic idea:
Generating lexical paraphrases using 2-way dictionaries
An English word e1 can be translated to a Japanese word j with an E-J dictionary D1, and then j can be translated back to an English word e2 with a J-E dictionary D2. e1 and e2 are extracted as paraphrases.
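The round trip through the two dictionaries can be sketched as below. The dictionary entries are hypothetical toy data (romanized Japanese, a few words only), and the function name is made up for these notes; real bilingual dictionaries would of course need sense filtering that this sketch leaves out.

```python
# Hypothetical toy dictionaries: D1 is English-to-Japanese, D2 is Japanese-to-English.
E_J = {"big": ["ookii"], "large": ["ookii", "hiroi"], "wide": ["hiroi"]}
J_E = {"ookii": ["big", "large"], "hiroi": ["wide", "large"]}


def round_trip_paraphrases(e1, e_to_j, j_to_e):
    """Collect every e2 reachable as e1 -> j -> e2, excluding e1 itself."""
    candidates = set()
    for j in e_to_j.get(e1, []):
        for e2 in j_to_e.get(j, []):
            if e2 != e1:
                candidates.add(e2)
    return candidates
```

With these toy tables, "big" yields {"large"}, while "large" yields {"big", "wide"}, which shows how a polysemous pivot word can pull in candidates of varying quality.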
4.2 Bannard and Callison-Burch, 2005
Word alignment and phrase extraction
Basic assumption:
If two English phrases e1 and e2 can be aligned with the same foreign phrase f, e1 and e2 are likely to be paraphrases.
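The pivot assumption is commonly turned into a score by marginalizing over the shared foreign phrases: p(e2 | e1) = sum over f of p(f | e1) * p(e2 | f). The sketch below applies that formula to hypothetical phrase-table probabilities; the numbers and the function name are invented for illustration, and a real system would read both tables from word-aligned bilingual data.

```python
from collections import defaultdict

# Hypothetical toy phrase-table probabilities.
P_F_GIVEN_E = {
    "under control": {"unter kontrolle": 0.8, "im griff": 0.2},
}
P_E_GIVEN_F = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in check": 0.25, "in hand": 0.25},
}


def pivot_paraphrases(e1, p_f_given_e, p_e_given_f):
    """Score each candidate e2 by summing p(f | e1) * p(e2 | f) over pivots f."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                scores[e2] += p_f * p_e2
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

On the toy tables, "in check" is reachable through both pivots and ranks above "in hand", which is reachable through only the less likely one.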
4.3 Callison-Burch, 2008
Basic idea:
Two phrases that paraphrase each other should have the same syntactic type.
Syntactic constraints are also used when substituting paraphrases in sentences
4.4 Kok and Brockett, 2010
Basic idea:
Convert aligned phrases into a graph, then extract paraphrases based on random walks and hitting times
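To make "hitting time" concrete: it is the expected number of steps a random walk needs to first reach a target node, so phrases with a small hitting time to a query phrase are better paraphrase candidates. The sketch below computes a truncated hitting time (capped at `max_steps`) on a hypothetical toy graph whose edge weights are invented alignment counts. It only illustrates the quantity itself and does not reproduce the paper's graph construction or estimation procedure.

```python
# Hypothetical toy graph: English and German phrases are nodes, and edge
# weights are invented alignment counts. Every neighbor is also a key.
GRAPH = {
    "under control": {"unter kontrolle": 3, "im griff": 1},
    "in check": {"unter kontrolle": 2, "im griff": 1},
    "in hand": {"im griff": 1},
    "unter kontrolle": {"under control": 3, "in check": 2},
    "im griff": {"under control": 1, "in check": 1, "in hand": 1},
}


def truncated_hitting_times(graph, target, max_steps=10):
    """Expected steps, capped at max_steps, for a walk to first reach target."""
    h = {node: 0.0 for node in graph}
    for _ in range(max_steps):
        new_h = {}
        for node, neighbors in graph.items():
            if node == target:
                new_h[node] = 0.0
            else:
                total = sum(neighbors.values())
                # One step now, plus the expected remaining steps from
                # wherever the walk lands (transition prob = weight / total).
                new_h[node] = 1.0 + sum(
                    weight / total * h[nbr] for nbr, weight in neighbors.items()
                )
        h = new_h
    return h
```

On this graph, "in check" ends up with a smaller hitting time to "under control" than "in hand" does, because it shares the more heavily weighted pivot.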
5. Web corpora
6. Dictionary glosses