Spark Machine Learning: Spark-TFIDF 11

Spark-TFIDF

Classifying sentiment

Negative text: neg.txt

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 

Positive text: pos.txt

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? 
the ghetto in question is , of course , whitechapel in 1888 london's east end . 

Read the data and label each text 0 (negative) or 1 (positive)

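import scala.util.Random
import spark.implicits._ // encoders for map and toDF; assumes a SparkSession named spark

val random = new Random()
// Read one file, tokenize each line, and attach the label plus a random sort key
val function = (Filedir: String, tr: Int) => {
  spark.read.textFile(Filedir).map(
    line => {
      // Split on spaces and drop the empty tokens left by repeated spaces
      (line.split(" ").filter(_.nonEmpty), tr, random.nextDouble())
    }
  ).toDF("words", "values", "random")
}
val neg = function("src/main/scala1/coding-271/ch11/sentiment_analysis/neg.txt", 0) // label 0: negative
val pos = function("src/main/scala1/coding-271/ch11/sentiment_analysis/pos.txt", 1) // label 1: positive
val data = neg.union(pos).sort("random") // shuffle the rows via the random key
data.limit(10).show(false)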

Text feature extraction

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.feature.IDF

// Term frequencies: hash each word into a fixed-size sparse vector
val tfData = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .transform(data)
// Inverse document frequencies are learned from the whole corpus
val idfModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(tfData)
val transformedData = idfModel.transform(tfData)
// Hold out 20% of the rows for testing
val Array(training, test) = transformedData.randomSplit(Array(0.8, 0.2))
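
To make the two stages concrete, here is a minimal sketch in plain Scala (no Spark) of what the TF and IDF steps compute on a toy corpus. It uses exact term counts instead of feature hashing, so it leaves out HashingTF's mapping of words into a fixed number of buckets. The IDF weight log((m + 1) / (df + 1)) is the smoothed form described in the Spark ML documentation. The object name `TfIdfSketch` and its method names are illustrative assumptions, not part of the original code.

```scala
object TfIdfSketch {
  // Term frequency: raw count of each term within one document.
  def termFrequencies(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (term, occ) => term -> occ.size.toDouble }

  // Smoothed inverse document frequency: log((m + 1) / (df(t) + 1)),
  // where m is the number of documents and df(t) the number containing t.
  def inverseDocFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val m = docs.size.toDouble
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, occ) => term -> math.log((m + 1.0) / (occ.size + 1.0)) }
  }

  // TF-IDF: each term's frequency multiplied by its IDF weight.
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idf = inverseDocFrequencies(docs)
    docs.map(doc => termFrequencies(doc).map { case (t, tf) => t -> tf * idf(t) })
  }
}
```

With this weighting, a word that occurs in every document gets weight 0, because log((m + 1) / (m + 1)) = 0, so only words that distinguish documents contribute to the feature vector.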

Use a classifier on the extracted text features. This is a binary classification problem, and the classifier is replaceable.

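import org.apache.spark.ml.classification.NaiveBayes

val bayes = new NaiveBayes()
  .setFeaturesCol("tfidf") // x: the feature vector
  .setLabelCol("values") // y: the 0/1 label
  .fit(training)
val result = bayes.transform(test) // predict on the held-out test set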
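
As one possible illustration of swapping the classifier, the sketch below trains Spark ML's LogisticRegression on the same two columns. It assumes a DataFrame with a "tfidf" vector column and a numeric "values" label column, as produced above. The names `ClassifierSwapSketch` and `fitLogistic` and the iteration limit of 50 are hypothetical choices for this example, not something the original specifies.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

object ClassifierSwapSketch {
  // Hypothetical helper: fits logistic regression on the same
  // feature and label columns that the Naive Bayes step uses.
  def fitLogistic(training: DataFrame): LogisticRegressionModel =
    new LogisticRegression()
      .setFeaturesCol("tfidf")  // x: the TF-IDF feature vector
      .setLabelCol("values")    // y: the 0/1 label
      .setMaxIter(50)           // assumed value, tune as needed
      .fit(training)
}
```

Calling `ClassifierSwapSketch.fitLogistic(training).transform(test)` would yield a "prediction" column just as the Naive Bayes model does, so the evaluation step needs no changes.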

Evaluate the model's accuracy

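import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("values") // must match the label column name used for training
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val d = evaluator.evaluate(result)
println(d) // accuracy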
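
For reference, the accuracy metric is the fraction of test rows whose prediction equals the label. A minimal plain-Scala sketch of that calculation follows. The object name `AccuracySketch` and the pair-based input are assumptions made for illustration, and this is not how the Spark evaluator is implemented internally.

```scala
object AccuracySketch {
  // Fraction of rows whose predicted label equals the true label.
  def accuracy(labelsAndPredictions: Seq[(Double, Double)]): Double = {
    require(labelsAndPredictions.nonEmpty, "need at least one row")
    val correct = labelsAndPredictions.count { case (label, pred) => label == pred }
    correct.toDouble / labelsAndPredictions.size
  }
}
```

For example, three correct predictions out of four rows give an accuracy of 0.75.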