Thinking about how to implement an opinion mining system, using MPQA as a reference

MPQA is a corpus and opinion recognition system (Corpus and Opinion Recognition System). According to its web page, it consists of the following parts:

  • MPQA Opinion Corpus: a corpus annotated for opinions and sentiments.
  • OpinionFinder: a system that processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences, including agents who are sources of opinion, direct subjective expressions and speech events, and sentiment expressions.
  • Subjectivity Lexicon
  • Manual Subjectivity Sense Annotations

As the OpinionFinder README shows, OpinionFinder performs the following steps when mining opinions:

  1. Preprocessing: A set of documents in the docs directory is prepared for processing. XML and HTML meta information is removed.
  2. Sentence Splitting and POS Tagging: OpenNLP 1.3.0 is used to sentence split and part-of-speech tag the documents.
  3. Stemming: SCOL, version 1k, Steven Abney's stemmer program, is used to stem the documents.
  4. Feature Finder: Clues useful for identifying subjective sentences and sentiment expressions are found in the text document.
  5. Shallow Parsing: SUNDANCE (Sentence UNDerstanding ANd Concept Extraction), a partial parser from the NLP laboratory at the University of Utah, is used by Autoslog-TS to identify extraction patterns needed by the sentence classifiers and the SourceFinder.
  6. SourceFinder (identifying opinion holders): The SourceFinder uses the extraction patterns from Choi et al. (2005) to mark the sources of private states.
  7. Direct Subjective Expression and Speech Event Classifier: This classifier uses WordNet 1.6 and PyWordNet 1.6. For information about PyWordNet or to download a copy, go to http://osteele.com/projects/pywordnet/
  8. Subjectivity Classifier: The subjectivity classifier tags sentences in the document as subjective or objective.
  9. Polarity Classifier: The polarity classifier tags the words in the document with their contextual polarity.
  10. SGML markup: The MPQA files from the subjectivity, direct subjective expression and speech event, source, and polarity classifiers are written to the output_anns folder with inline SGML markup.
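
To make the flow of these steps more concrete, here is a minimal Python sketch of a few of them: sentence splitting, clue finding, subjectivity classification, and contextual polarity. It is not OpinionFinder's code. The lexicon `TOY_LEXICON`, the negator list, the "one strong clue or two weak clues" rule, and all function names are assumptions made up for illustration. The real system relies on OpenNLP, SUNDANCE, and trained classifiers instead.

```python
import re

# Hypothetical toy lexicon: word -> (strength, prior polarity).
# This is NOT the real MPQA Subjectivity Lexicon, only a stand-in.
TOY_LEXICON = {
    "great": ("strong", "positive"),
    "terrible": ("strong", "negative"),
    "good": ("weak", "positive"),
    "bad": ("weak", "negative"),
}
# Assumed negation words that flip the polarity of the next clue.
NEGATORS = {"not", "never", "no"}


def split_sentences(text):
    """Naive splitter standing in for the OpenNLP sentence-splitting step."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())


def find_clues(tokens):
    """Feature-finder stand-in: return (index, word, strength, polarity) per lexicon hit."""
    return [(i, t, *TOY_LEXICON[t]) for i, t in enumerate(tokens) if t in TOY_LEXICON]


def classify_subjectivity(clues):
    """Illustrative rule: subjective if one strong clue or at least two weak clues."""
    strong = sum(1 for c in clues if c[2] == "strong")
    weak = len(clues) - strong
    return "subjective" if strong >= 1 or weak >= 2 else "objective"


def contextual_polarity(tokens, clue):
    """Flip the prior polarity when the clue directly follows a negator."""
    i, _, _, polarity = clue
    if i > 0 and tokens[i - 1] in NEGATORS:
        return "negative" if polarity == "positive" else "positive"
    return polarity


def analyze(text):
    """Run the toy pipeline and return one result dict per sentence."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        clues = find_clues(tokens)
        results.append({
            "sentence": sentence,
            "label": classify_subjectivity(clues),
            "polarity": {c[1]: contextual_polarity(tokens, c) for c in clues},
        })
    return results


if __name__ == "__main__":
    sample = ("The report was released on Monday. "
              "The plan is not good and the result is terrible.")
    for row in analyze(sample):
        print(row["label"], row["polarity"], row["sentence"])
```

Keeping each step as a separate function mirrors the README's staged design, so any stage (for example the splitter or the lexicon) could be swapped for a real component without touching the others.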
