【PMI-IR】Semantic Orientation Applied to Unsupervised Classification of Reviews


Thumbs up or thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews.[PMI-IR–2002.7]

In this paper, the author proposes PMI-IR, a simple unsupervised learning algorithm for classifying a review as recommended or not recommended. PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. It has three steps. First, a part-of-speech tagger identifies and extracts phrases containing adjectives or adverbs. Second, the semantic orientation (SO) of each phrase is estimated, with PMI computed from IR hit counts. Finally, the review is classified based on the average semantic orientation of its phrases.
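The three steps can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the review is already tagged with Penn Treebank style tags, it uses a simplified extraction rule (any two-word phrase whose first word is an adjective or adverb), and `so_lookup` is a hypothetical table of precomputed SO values standing in for the search-engine step.

```python
from statistics import mean

# Penn Treebank tags for adjectives and adverbs.
ADJ_ADV_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}


def extract_phrases(tagged_tokens):
    """Step 1 (simplified assumption): keep two-word phrases whose
    first word is tagged as an adjective or adverb."""
    phrases = []
    for (word, tag), (next_word, _next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag in ADJ_ADV_TAGS:
            phrases.append(f"{word} {next_word}")
    return phrases


def classify_review(tagged_tokens, so_lookup):
    """Steps 2 and 3: look up the SO of each phrase (hypothetical table)
    and classify the review by the sign of the average SO."""
    phrases = extract_phrases(tagged_tokens)
    scores = [so_lookup[p] for p in phrases if p in so_lookup]
    if not scores:
        return None  # no scored phrase, so no decision
    return "recommended" if mean(scores) > 0 else "not recommended"


# Hypothetical tagged review and made-up SO values, for illustration only.
tagged = [("very", "RB"), ("nice", "JJ"), ("car", "NN"),
          ("with", "IN"), ("poor", "JJ"), ("brakes", "NNS")]
so_lookup = {"very nice": 1.5, "nice car": 0.8, "poor brakes": -1.2}
label = classify_review(tagged, so_lookup)
```

Here the three extracted phrases average to a positive SO, so the sketch labels the review as recommended. The paper's actual extraction patterns are more restrictive than the rule assumed here.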

The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four domains (automobiles, banks, travel destinations, and movies). Of these four domains, automobiles and banks perform best, with accuracies of 84% and 80% respectively, while movies perform worst. The explanation is that positive movie reviews sometimes mention unpleasant things and negative reviews often mention pleasant things when they describe the plot.

The advantage of this algorithm is that it reaches this accuracy without labeled training data, but it also has some limitations. First, it is time-consuming, since it needs to send queries to AltaVista (which has since shut down). Second, it does not perform as well in some specific domains; in the future it may be a good idea to combine semantic orientation with other features in a supervised classification algorithm.

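Terms and methods in the paper:

LSA (Latent Semantic Analysis): uses statistical computation to analyze a large collection of text, extracting the latent semantic structure between words and using that structure to represent words and documents. This removes correlations between words and simplifies the text vectors through dimensionality reduction; that is, documents in the high-dimensional vector space model (VSM) representation are mapped to a low-dimensional latent semantic space. The mapping is obtained by singular value decomposition (SVD) of the term-document matrix.
Like the traditional vector space model, this method represents terms and documents as vectors and judges the relations between terms and documents through relations between the vectors (such as cosine similarity). The difference is that LSA maps terms and documents into a latent semantic space, which removes some of the "noise" in the original vector space and improves the precision of information retrieval.
Steps:
(1) Analyze the document collection and build the term-document matrix
(2) Apply singular value decomposition to the matrix
(3) Reduce the dimensionality of the decomposed matrices
(4) Use the reduced matrices to build the latent semantic space, or to reconstruct the term-document matrix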
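
The four steps above can be sketched with NumPy. This is only an illustration under stated assumptions: the term-document matrix `X` and the term list are a made-up toy example, and `k = 2` is an arbitrary choice for the number of latent dimensions.

```python
import numpy as np

# Step 1: hypothetical toy term-document matrix
# (rows are terms, columns are documents, entries are counts).
terms = ["car", "engine", "bank", "loan"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Step 2: singular value decomposition, X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Step 3: keep only the k largest singular values (assumed k = 2).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Step 4: term and document vectors in the latent space,
# plus the rank-k reconstruction of the original matrix.
term_vectors = U_k * s_k      # one row per term
doc_vectors = Vt_k.T * s_k    # one row per document
X_k = U_k @ np.diag(s_k) @ Vt_k


def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_car_engine = cosine(term_vectors[0], term_vectors[1])
sim_car_bank = cosine(term_vectors[0], term_vectors[2])
```

In this toy matrix "car" and "engine" share documents while "car" and "bank" never co-occur, so the first similarity comes out close to 1 and the second close to 0.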
PMI-IR

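The PMI part is the same as ordinary PMI (described in detail in the SO-PMI section), but SO is determined through IR. Earlier SO methods considered the co-occurrence distance between sentiment words (with positive and negative seed words specified in advance), whereas this paper uses the results returned by a search engine: the semantic orientation of a phrase is measured as the difference between its association with a positive reference word and its association with a negative reference word (both seed words specified in advance).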
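
For reference, the formula from Turney's paper (the reference words "excellent" and "poor" and the NEAR operator come from the paper, not from the notes above) is:

$$\mathrm{SO}(\text{phrase}) = \mathrm{PMI}(\text{phrase}, \text{"excellent"}) - \mathrm{PMI}(\text{phrase}, \text{"poor"}) = \log_2 \frac{\mathrm{hits}(\text{phrase NEAR "excellent"}) \cdot \mathrm{hits}(\text{"poor"})}{\mathrm{hits}(\text{phrase NEAR "poor"}) \cdot \mathrm{hits}(\text{"excellent"})}$$

A minimal Python sketch of this calculation follows. The function name and parameters are hypothetical, the hit counts are supplied by the caller rather than fetched from a search engine, and the 0.01 smoothing term follows the paper's way of avoiding division by zero.

```python
import math


def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg,
                         smoothing=0.01):
    """SO of a phrase from search-engine hit counts.

    hits_near_pos: hits for the phrase NEAR the positive reference word
    hits_near_neg: hits for the phrase NEAR the negative reference word
    hits_pos, hits_neg: hits for the reference words on their own
    smoothing: small constant added to the NEAR counts to avoid
               division by zero
    """
    numerator = (hits_near_pos + smoothing) * hits_neg
    denominator = (hits_near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Made-up hit counts, for illustration only.
so_positive = semantic_orientation(200, 100, 1000, 1000)
so_negative = semantic_orientation(50, 400, 1000, 1000)
```

A phrase that co-occurs more often with the positive reference word gets a positive SO, and one that co-occurs more often with the negative reference word gets a negative SO.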