Spark-TFIDF
Sentiment classification
Negative text: neg.txt
plot : two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what's the deal ?
watch the movie and " sorta " find out . . .
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package .
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly .
they seem to have taken this pretty neat concept , but executed it terribly .
so what are the problems with the movie ?
well , its main problem is that it's simply too jumbled .
Positive text: pos.txt
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
in other words , don't dismiss this film because of its source .
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
the ghetto in question is , of course , whitechapel in 1888 london's east end .
Read the data and label each text with 0 (negative) or 1 (positive)
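import scala.util.Random
import spark.implicits._ // assumes an existing SparkSession named spark
val random = new Random()
// Read a text file, tokenize each line on spaces, and attach the label plus a random sort key.
val function = (fileDir: String, label: Int) => {
  spark.read.textFile(fileDir).map(
    line => {
      // Drop the empty tokens that consecutive spaces produce.
      (line.split(" ").filter(_.nonEmpty), label, random.nextDouble())
    }
  ).toDF("words", "values", "random")
}
val neg = function("src/main/scala1/coding-271/ch11/sentiment_analysis/neg.txt", 0)
val pos = function("src/main/scala1/coding-271/ch11/sentiment_analysis/pos.txt", 1)
// Sorting on the random column shuffles the two classes together.
val data = neg.union(pos).sort("random")
data.limit(10).show(false)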
Text feature extraction
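import org.apache.spark.ml.feature.HashingTF
// Note: hashingTF holds the transformed DataFrame (with a "tf" column), not the transformer itself.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .transform(data)
import org.apache.spark.ml.feature.IDF
// Fit the IDF weights on the term-frequency vectors.
val idfModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(hashingTF)
val transformedData = idfModel.transform(hashingTF)
// Hold out 20% of the rows for testing.
val Array(training, test) = transformedData.randomSplit(Array(0.8, 0.2))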
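To make the two stages above concrete, the following is a minimal plain-Scala sketch of what a hashing term-frequency step followed by IDF weighting computes, with no Spark dependency. It rests on several assumptions: `String.hashCode` stands in for the MurmurHash3 function Spark uses, `numFeatures` is set to 16 for readability (Spark's default is 262144), and the names `TfIdfSketch`, `bucket`, `termFrequencies`, `inverseDocFrequencies`, and `tfIdf` are hypothetical. The IDF formula is the smoothed form documented for Spark ML, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```scala
object TfIdfSketch {
  // Small bucket count for illustration only.
  val numFeatures = 16

  // Map a term to a bucket index; floorMod keeps the result non-negative.
  def bucket(term: String): Int = Math.floorMod(term.hashCode, numFeatures)

  // Term frequency: bucket index -> raw count within one document.
  def termFrequencies(doc: Seq[String]): Map[Int, Double] =
    doc.groupBy(bucket).map { case (idx, terms) => idx -> terms.size.toDouble }

  // Smoothed inverse document frequency: log((m + 1) / (df + 1)).
  def inverseDocFrequencies(tfs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = tfs.size.toDouble
    tfs.flatMap(_.keys).groupBy(identity).map { case (idx, hits) =>
      idx -> math.log((m + 1.0) / (hits.size + 1.0))
    }
  }

  // TF-IDF: scale each document's term counts by the corpus-wide IDF weight.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[Int, Double]] = {
    val tfs = docs.map(termFrequencies)
    val idf = inverseDocFrequencies(tfs)
    tfs.map(_.map { case (idx, tf) => idx -> tf * idf(idx) })
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("good", "movie"), Seq("bad", "movie"))
    tfIdf(docs).foreach(println)
  }
}
```

Under this formula a term that appears in every document gets weight log(1) = 0, so very common tokens contribute nothing to the feature vector. Two different terms can also land in the same bucket, which is the trade-off hashing makes for a fixed-size vector.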
Classify using the extracted text features. This is a binary classification problem, and the classifier can be swapped for another.
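import org.apache.spark.ml.classification.NaiveBayes
val bayes = new NaiveBayes()
  .setFeaturesCol("tfidf") // x: feature vectors
  .setLabelCol("values") // y: 0/1 labels
  .fit(training)
val result = bayes.transform(test) // predict on the held-out test set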
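As one illustration of swapping the classifier, the sketch below trains Spark ML's `LogisticRegression` on the same `tfidf` and `values` columns. It is an assumption-laden example, not the original author's code: the four-row corpus is made up, the `SwapClassifierSketch` object and `trainAndPredict` method are hypothetical names, and `setMaxIter(20)` is an arbitrary choice. It needs Spark on the classpath and runs against a local session.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.sql.{DataFrame, SparkSession}

object SwapClassifierSketch {
  def trainAndPredict(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Tiny made-up corpus, for illustration only.
    val data = Seq(
      (Array("good", "great", "fun"), 1),
      (Array("great", "cast", "good"), 1),
      (Array("bad", "boring", "mess"), 0),
      (Array("boring", "bad", "plot"), 0)
    ).toDF("words", "values")

    // Same feature pipeline as in the text: hashed term counts, then IDF weighting.
    val tf = new HashingTF().setInputCol("words").setOutputCol("tf").transform(data)
    val tfidf = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf).transform(tf)

    // Only the estimator changes; the column names stay the same.
    val model = new LogisticRegression()
      .setFeaturesCol("tfidf")
      .setLabelCol("values")
      .setMaxIter(20)
      .fit(tfidf)

    // For brevity this predicts on the training rows; use a held-out split in practice.
    model.transform(tfidf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("swap-classifier-sketch")
      .getOrCreate()
    trainAndPredict(spark).select("values", "prediction").show(false)
    spark.stop()
  }
}
```

Because the output still carries a `prediction` column, the same evaluator configuration applies to either classifier.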
Evaluate the model's accuracy
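import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("values") // must match the label column created when loading the data
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val d = evaluator.evaluate(result)
println(d) // accuracy on the test set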
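For reference, the "accuracy" metric is the fraction of rows whose prediction equals the label. A minimal plain-Scala sketch of that arithmetic follows; `AccuracySketch` and `accuracy` are hypothetical names and the four (label, prediction) pairs are made-up values, not results from the data above.

```scala
object AccuracySketch {
  // Accuracy = number of correct predictions / total number of predictions.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one prediction")
    pairs.count { case (label, prediction) => label == prediction }.toDouble / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // (label, prediction) pairs: three of the four predictions are correct.
    val pairs = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
    println(accuracy(pairs)) // prints 0.75
  }
}
```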