【PMI-IR】Semantic Orientation Applied to Unsupervised Classification of Reviews

**Thumbs up or thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews. [PMI-IR–2002.7]**

In this paper, the author proposes PMI-IR, a simple unsupervised learning algorithm for classifying a review as recommended or not recommended. PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases. The algorithm has three steps. First, a part-of-speech tagger identifies and extracts phrases containing adjectives or adverbs. Second, the semantic orientation (SO) of each phrase is estimated, with the PMI values computed from IR query results. Third, the review is classified by the average semantic orientation of its phrases.
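The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the extraction rule here (any two-word phrase whose first word is tagged as an adjective or adverb) is a simplification of the paper's tag patterns, the input is assumed to be already tagged with Penn Treebank tags, and `so_lookup` is a hypothetical table of precomputed SO scores with made-up values.

```python
from statistics import mean


def extract_phrases(tagged_tokens):
    """Step 1 (simplified): two-word phrases starting with an adjective or adverb.

    tagged_tokens is a list of (word, tag) pairs; JJ* = adjective, RB* = adverb.
    """
    phrases = []
    for (w1, t1), (w2, _t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith(("JJ", "RB")):
            phrases.append(f"{w1} {w2}")
    return phrases


def classify_review(tagged_tokens, so_lookup):
    """Steps 2-3: look up the SO of each phrase, average, and classify."""
    scores = [so_lookup[p] for p in extract_phrases(tagged_tokens) if p in so_lookup]
    avg = mean(scores) if scores else 0.0
    label = "recommended" if avg > 0 else "not recommended"
    return label, avg


# Hypothetical tagged review fragment and hypothetical SO scores.
tagged = [("the", "DT"), ("service", "NN"), ("was", "VBD"),
          ("very", "RB"), ("friendly", "JJ"), ("staff", "NN")]
so_lookup = {"very friendly": 1.5, "friendly staff": 0.5}

phrases = extract_phrases(tagged)
label, avg = classify_review(tagged, so_lookup)
print(phrases, label, avg)
```

A review with no scored phrases gets an average of 0.0 here and is therefore labeled "not recommended"; that tie-breaking choice is an assumption of this sketch.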

The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four domains (automobiles, banks, travel destinations, and movies). Of these four domains, automobiles and banks perform best, with accuracies of 84% and 80% respectively, while movies perform worst. The explanation is that positive movie reviews sometimes mention unpleasant things, and negative reviews often mention pleasant things, because of the plots they describe.

The advantage of this algorithm is that it reaches this accuracy without any labeled training data, but it also has some limitations. First, it is time-consuming, since it needs to send queries to AltaVista (which has since shut down). Second, it does not perform as well in some specific domains; in the future it may be a good idea to combine semantic orientation with other features in a supervised classification algorithm.

Terms/methods in the paper:

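  • LSA (Latent Semantic Analysis): LSA applies statistical computation to a large text collection to extract the latent semantic structure between words, and uses that structure to represent words and texts. This removes correlations between words and reduces the dimensionality of the text vectors; that is, documents represented in the high-dimensional vector space model (VSM) are mapped into a low-dimensional latent semantic space. The mapping is obtained by singular value decomposition (SVD) of the term/document matrix.
    Like the traditional vector space model, LSA represents terms and documents as vectors and judges the relationships between terms and documents through relationships between vectors (such as cosine similarity). The difference is that LSA maps terms and documents into a latent semantic space, which removes some of the "noise" in the original vector space and improves retrieval precision.
    Steps:
    (1) Analyze the document collection and build the term-document matrix
    (2) Apply singular value decomposition to the matrix
    (3) Reduce the dimensionality of the decomposed matrices
    (4) Use the reduced matrices to build the latent semantic space, or to reconstruct the term-document matrix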
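The four LSA steps can be sketched with NumPy's SVD. This is a toy illustration under stated assumptions: the 5x4 term-document matrix `X` and its term labels are invented for the example, and `k = 2` is an arbitrary choice of latent dimensions.

```python
import numpy as np


def lsa(term_doc, k):
    """Truncated SVD of a term-document matrix (rows = terms, columns = docs)."""
    # Step 2: singular value decomposition.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Step 3: keep only the k largest singular values.
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Step 4: document coordinates in the latent space, and the rank-k
    # reconstruction of the term-document matrix.
    doc_vectors = (np.diag(s_k) @ Vt_k).T  # shape (n_docs, k)
    reconstructed = U_k @ np.diag(s_k) @ Vt_k
    return doc_vectors, reconstructed


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Step 1: toy term-document counts (hypothetical data).
# Rows: car, automobile, engine, bank, loan. Columns: four documents.
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

doc_vectors, X_k = lsa(X, k=2)
raw_sim = cosine(X[:, 0], X[:, 1])                   # docs 0 and 1 share only "engine"
lsa_sim = cosine(doc_vectors[0], doc_vectors[1])     # same topic in latent space
cross_sim = cosine(doc_vectors[0], doc_vectors[2])   # different topics
print(raw_sim, lsa_sim, cross_sim)
```

In this toy data, documents 0 and 1 use different words ("car" vs. "automobile") and overlap only on "engine", so their raw cosine similarity is moderate; after projection to two latent dimensions they land on the same axis, while the banking documents land on the other.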

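  • PMI-IR
    The PMI part is the same as standard PMI (described in detail in the SO-PMI section), but SO is determined through IR. Earlier SO approaches considered the co-occurrence distance between sentiment words (the positive and negative words being pre-specified seed words), whereas this paper uses the results returned by a search engine: it computes the difference between a sentiment-bearing word's association with a positive reference word and its association with a negative reference word (both pre-specified seed words), and uses that difference to measure the semantic orientation of the word.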
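For reference, the two formulas this entry relies on can be written out as follows; the paper uses "excellent" and "poor" as the positive and negative reference words and a NEAR query operator to count co-occurrences:

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1)\,p(w_2)}$$

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{excellent}) - \mathrm{PMI}(phrase, \text{poor}) = \log_2 \frac{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{excellent}) \cdot \mathrm{hits}(\text{poor})}{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{poor}) \cdot \mathrm{hits}(\text{excellent})}$$

A minimal Python sketch of the second formula is below. The function and parameter names are illustrative, the hit counts in the example are made up rather than real search results, and the small `smoothing` constant added to the NEAR counts (to avoid division by zero) is set to 0.01 here as an assumption of this sketch.

```python
import math


def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg,
                         smoothing=0.01):
    """SO of a phrase from search-engine hit counts.

    hits_near_pos: hits for the phrase NEAR the positive reference word
    hits_near_neg: hits for the phrase NEAR the negative reference word
    hits_pos / hits_neg: hits for each reference word on its own
    """
    numerator = (hits_near_pos + smoothing) * hits_neg
    denominator = (hits_near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)


# Made-up counts: the phrase co-occurs four times as often with the
# positive reference word, and both reference words are equally common.
so_positive = semantic_orientation(8, 2, 1000, 1000, smoothing=0.0)
so_negative = semantic_orientation(2, 8, 1000, 1000, smoothing=0.0)
print(so_positive, so_negative)
```

A positive value means the phrase is more strongly associated with the positive reference word, a negative value the opposite; the review-level decision then averages these values over all extracted phrases.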
