TEXT CLASSIFICATION FOR SENTIMENT ANALYSIS – NAIVE BAYES CLASSIFIER

Reposted from: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining that focuses on identifying opinion polarity, and while it’s often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I’ll focus on two possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True. Here’s the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
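
As a quick sanity check, here’s what the extractor returns for a short, made-up list of tokens (run in a Python interpreter):

>>> word_feats(['the', 'plot', 'was', 'great'])
{'the': True, 'plot': True, 'was': True, 'great': True}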

The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects a list of instances of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
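
Concretely, a single training instance is a pair like the one below (the tokens here are invented for illustration):

({'plot': True, 'was': True, 'ludicrous': True}, 'neg')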

Training and Testing the Naive Bayes Classifier

Here’s the complete Python code for training and testing a Naive Bayes classifier on the movie reviews corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# use 3/4 of each class for training, the rest for testing
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only two words that seem a bit odd are “vulnerable” and “avoids”. Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we’re able to get almost 73% accuracy. This is somewhat near human accuracy: apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
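
Once trained, the classifier can label new text by calling its classify method on a feature dictionary. Here’s a minimal sketch (the review text is invented, and plain str.split stands in for a real tokenizer):

review = 'an astounding and magnificent film'
print(classifier.classify(word_feats(review.split())))  # most likely prints 'pos'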

Naive Bayes with Spark MLlib (RDD-based API)

Here is a simple Scala example using the RDD-based API, showing how to train a Naive Bayes classifier with Apache Spark’s MLlib library and print the results:

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

// create the SparkConf and SparkContext
val conf = new SparkConf().setAppName("NaiveBayesExample")
val sc = new SparkContext(conf)

// load the dataset
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// split the data into training and test sets
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

// train a Naive Bayes classifier
val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

// make predictions on the test set
val predictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

// compute model accuracy
val accuracy = 1.0 * predictions.filter(r => r._1 == r._2).count() / testData.count()
println(s"Test set accuracy = $accuracy")

// print a few predictions
predictions.take(10).foreach(println)

// save the model
model.save(sc, "myModelPath")

// load the model back
val sameModel = NaiveBayesModel.load(sc, "myModelPath")

You can replace the dataset path in the code with your own data, then run it to get a trained model and its predictions. Note that in the RDD-based API, you call the `predict` method explicitly to make predictions.