MPQA is a corpus and opinion recognition system. According to its web page, it has the following parts:
- MPQA Opinion Corpus: a corpus annotated for opinions and sentiments.
- OpinionFinder: a system that processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences, including agents who are sources of opinion, direct subjective expressions and speech events, and sentiment expressions.
- Subjectivity Lexicon
- Manual Subjectivity Sense Annotations
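
To make the Subjectivity Lexicon item more concrete, here is a minimal sketch of loading such a lexicon into a word-to-polarity map. It assumes each entry is a single line of space-separated `key=value` pairs with `word1` and `priorpolarity` fields; that layout, the sample entries, and the helper names `parse_lexicon_line` and `build_prior_polarity` are assumptions for illustration, so check them against the file you actually download.

```python
from typing import Dict, List


def parse_lexicon_line(line: str) -> Dict[str, str]:
    """Parse one 'key=value key=value ...' entry into a dict."""
    entry: Dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            entry[key] = value
    return entry


def build_prior_polarity(lines: List[str]) -> Dict[str, str]:
    """Map each word to its prior polarity; the first entry for a word wins."""
    polarity: Dict[str, str] = {}
    for line in lines:
        entry = parse_lexicon_line(line)
        word = entry.get("word1")
        if word and "priorpolarity" in entry:
            polarity.setdefault(word, entry["priorpolarity"])
    return polarity


# Hypothetical sample entries in the assumed format.
SAMPLE = [
    "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative",
    "type=strongsubj len=1 word1=admire pos1=verb stemmed1=y priorpolarity=positive",
]

lexicon = build_prior_polarity(SAMPLE)
print(lexicon)
```

Keeping only the first entry per word is a simplification: a word may appear with several parts of speech, and a fuller version would key the map on the word and its part of speech together.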
According to OpinionFinder's README file, OpinionFinder performs the following steps when mining opinions:
- Preprocessing: A set of documents in the docs directory are prepared for processing. XML and HTML meta information is removed.
- Sentence Splitting and POS Tagging: OpenNLP 1.3.0 is used to sentence split and part-of-speech tag the documents.
- Stemming: Steven Abney's stemmer program (SCOL, version 1k) is used to stem the documents.
- Feature Finder: Clues useful for identifying subjective sentences and sentiment expressions are found in the text document.
- Shallow Parsing: SUNDANCE (Sentence UNDerstanding ANd Concept Extraction), a partial parser from the NLP laboratory at the University of Utah, is used by Autoslog-TS to identify extraction patterns needed by the sentence classifiers and the SourceFinder.
- SourceFinder (identifies opinion holders): The SourceFinder uses the extraction patterns from Choi et al. (2005) to mark the sources of private states.
- Direct Subjective Expression and Speech Event Classifier: This classifier uses WordNet 1.6 and PyWordNet 1.6. For information about PyWordNet or to download a copy, go to http://osteele.com/projects/pywordnet/
- Subjectivity Classifier: The subjectivity classifier tags sentences in the document as subjective or objective.
- Polarity Classifier: The polarity classifier tags the words in the document with their contextual polarity.
- SGML markup: The MPQA files from the subjectivity, direct subjective expression and speech event, source, and polarity classifiers are written to the output_anns folder with inline SGML markup.
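
The flow of these steps can be illustrated with a toy pipeline. This is only a sketch under stated assumptions, not OpinionFinder's code: it strips markup with Python's standard `html.parser`, splits sentences with a naive regular expression instead of OpenNLP, replaces the trained subjectivity classifier with a simple clue-word count, and writes a made-up inline `<sentence class="...">` tag. Stemming, shallow parsing, source finding, and polarity classification are left out, and every function name, the tag name, and the clue set are hypothetical.

```python
import re
from html import escape
from html.parser import HTMLParser
from typing import List, Set


class _TextExtractor(HTMLParser):
    """Collects character data and discards all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_markup(raw: str) -> str:
    """Preprocessing: remove XML/HTML tags and normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def find_clues(sentence: str, clues: Set[str]) -> List[str]:
    """Feature finding: return the clue words that occur in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t in clues]


def classify(sentence: str, clues: Set[str], threshold: int = 1) -> str:
    """Toy subjectivity rule: enough clue words means 'subjective'."""
    hits = len(find_clues(sentence, clues))
    return "subjective" if hits >= threshold else "objective"


def annotate(raw: str, clues: Set[str]) -> List[str]:
    """Run the toy pipeline and wrap each sentence in inline markup."""
    out = []
    for sentence in split_sentences(strip_markup(raw)):
        label = classify(sentence, clues)
        body = escape(sentence, quote=False)
        out.append(f'<sentence class="{label}">{body}</sentence>')
    return out


# Hypothetical input document and clue set.
RAW = "<html><body><p>I love this plan. The meeting is on Monday.</p></body></html>"
CLUES = {"love", "hate", "terrible"}

annotated = annotate(RAW, CLUES)
for line in annotated:
    print(line)
```

With the sample input, the first sentence contains the clue word "love" and is tagged subjective, while the second contains no clue and is tagged objective. The real system makes this decision with trained classifiers over much richer features, so the keyword rule here only shows where each stage sits in the sequence.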