Spark Text Processing: Article Classification

Here I again use Spark for the data processing, together with the algorithms that ship with Spark.
Spark provides term frequency-inverse document frequency (TF-IDF).
TF-IDF assigns each term in a document a weight based on how often the term appears in that document, then normalizes it globally with the inverse document frequency. For the details of the algorithm, see the official documentation.
For classification I use
NaiveBayes.
Let's look at some sample data (leave me a message if you need the data or code):
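To make the weighting concrete, here is a minimal plain-Scala sketch of the TF-IDF arithmetic, using the smoothed IDF formula that Spark MLlib documents, log((m + 1) / (df + 1)). The tiny corpus and the `TfIdfSketch` name are mine, for illustration only:

```scala
// Plain-Scala sketch of TF-IDF weighting (Spark MLlib's smoothed IDF formula).
object TfIdfSketch {
  // idf(t) = log((m + 1) / (df(t) + 1)), m = total docs, df = docs containing t
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // Raw term count in a document, scaled by the term's global IDF.
  def tfidf(termCount: Int, numDocs: Int, docFreq: Int): Double =
    termCount * idf(numDocs, docFreq)

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("nation", "war", "nation"),
      Seq("nation", "peace"),
      Seq("war", "field"))
    val m = docs.size
    // Document frequency: in how many docs each distinct term appears.
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, v) => t -> v.size }
    // "nation" occurs twice in doc 0 and appears in 2 of the 3 docs.
    val w = tfidf(termCount = 2, numDocs = m, docFreq = df("nation"))
    println(f"tf-idf(nation, doc0) = $w%.4f")
  }
}
```

A term that appears in every document gets a near-zero IDF, so frequent stopword-like terms are automatically down-weighted.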
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
After tokenization and stemming, the data becomes:
ArrayBuffer(four, score, seven, year, ago, our, father, brought, forth, contin, new, nation, conceiv, liberti, dedic, proposit, all, men, creat, equal)

ArrayBuffer(now, we, engag, great, civil, war, test, whether, nation, ani, nation, so, conceiv, so, dedic, can, long, endur, we, met, great, battl, field, war, we, have, come, dedic, portion, field, final, rest, place, those, who, here, gave, live, nation, might, live, altogeth, fit, proper, we, should, do)

(1000, [17,63,94,197,234,335,412,437,445,521,530,556,588,673,799,893,937,960,990], [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])

(1000, [17,21,22,37,63,92,167,211,240,256,270,272,393,395,445,449,460,472,480,498,535,612,676,688,694,706,724,732,790,909,916,939,960,965,996], [1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0])
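The sparse vectors above come from the hashing trick: each term is hashed to an index modulo the feature dimension, and counts accumulate at that index (hash collisions simply add up). A plain-Scala sketch of the idea (`HashingSketch` is my name; Spark's `HashingTF` differs in hashing details but follows the same scheme):

```scala
// Sketch of the hashing trick behind HashingTF: map each term to an index
// in [0, numFeatures) via its hash code, accumulating counts per index.
object HashingSketch {
  // Like x % mod, but always non-negative (hash codes can be negative).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Returns (index -> count) pairs: the sparse representation shown above.
  def transform(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.foldLeft(Map.empty[Int, Double]) { (acc, t) =>
      val i = nonNegativeMod(t.hashCode, numFeatures)
      acc.updated(i, acc.getOrElse(i, 0.0) + 1.0)
    }
}
```

No vocabulary dictionary is needed, which is why the feature dimension (1000 here, 10000 in the code below) is chosen up front rather than derived from the data.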
Now straight to the code.
Load the data and apply HashingTF:
  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  val mock = sc.textFile("/zhouxiaoke/mock.tokens")
  val watch = sc.textFile("/zhouxiaoke/watch.tokens")
  val tf = new HashingTF(10000)
  val mockData = mock.map { line =>
    val target = "1"
    LabeledPoint(target.toDouble, tf.transform(line.split(",")))
  }
  val watchData = watch.map { line =>
    val target = "0"
    LabeledPoint(target.toDouble, tf.transform(line.split(",")))
  }
Next, weight the data with IDF and split it into training and test sets:
    // Split ratio assumed at 0.7/0.3; the original split definition was not shown.
    val splits = mockData.union(watchData).randomSplit(Array(0.7, 0.3))
    val trainDocs = splits(0).map(_.features)
    val idfModel = new IDF(minDocFreq = 3).fit(trainDocs)
    val train = splits(0).map { point =>
      LabeledPoint(point.label, idfModel.transform(point.features))
    }
    val test = splits(1).map { point =>
      LabeledPoint(point.label, idfModel.transform(point.features))
    }


    val nbmodel = NaiveBayes.train(train, lambda = 1.0)
    val bayesTrain = train.map(p => (nbmodel.predict(p.features), p.label))
    val bayesTest = test.map(p => (nbmodel.predict(p.features), p.label))
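For intuition, `NaiveBayes.train` with `lambda = 1.0` fits a multinomial model with Laplace smoothing, and `predict` picks the class maximizing the log-prior plus the feature-weighted conditional log-probabilities. A plain-Scala sketch of that scoring rule (an illustration under simplified assumptions, not Spark's implementation; `NBSketch` and its shapes are mine):

```scala
// Minimal multinomial Naive Bayes: argmax over classes of
// log P(c) + sum_t count(t) * log P(t | c), with Laplace smoothing lambda.
object NBSketch {
  // counts(c)(t): total count of term t across training docs of class c.
  def train(counts: Map[Int, Array[Double]], docsPerClass: Map[Int, Int],
            lambda: Double): (Map[Int, Double], Map[Int, Array[Double]]) = {
    val totalDocs = docsPerClass.values.sum.toDouble
    val priors = docsPerClass.map { case (c, n) => c -> math.log(n / totalDocs) }
    val condLogProb = counts.map { case (c, v) =>
      val denom = v.sum + lambda * v.length  // smoothed normalizer
      c -> v.map(x => math.log((x + lambda) / denom))
    }
    (priors, condLogProb)
  }

  // Score each class and return the argmax.
  def predict(features: Array[Double], priors: Map[Int, Double],
              cond: Map[Int, Array[Double]]): Int =
    priors.keys.maxBy { c =>
      priors(c) + features.zip(cond(c)).map { case (f, lp) => f * lp }.sum
    }
}
```

Multinomial Naive Bayes expects non-negative feature values, which is why raw term counts or TF-IDF weights both work as inputs here.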


Finally, check the accuracy and the ROC metrics:
    val metricsTrain = new BinaryClassificationMetrics(bayesTrain, 100)
    val metricsTest = new BinaryClassificationMetrics(bayesTest, 100)
    println("NB Training AuROC: " + metricsTrain.areaUnderROC())

    println("NB Test AuROC: " + metricsTest.areaUnderROC())
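areaUnderROC has a handy interpretation: it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (the Mann-Whitney view). A plain-Scala sketch that computes it directly from (score, label) pairs, ignoring the binning that `BinaryClassificationMetrics` applies (`AucSketch` is my name, for illustration):

```scala
// AUC as the probability that a random positive outranks a random negative.
object AucSketch {
  // scored: (score, label) pairs with labels 0.0 / 1.0, as in the RDDs above.
  def auc(scored: Seq[(Double, Double)]): Double = {
    val pos = scored.collect { case (s, 1.0) => s }
    val neg = scored.collect { case (s, 0.0) => s }
    // Ties between a positive and a negative score count as half a win.
    val pairs = for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    pairs.sum / (pos.size * neg.size)
  }
}
```

An AUC of 0.5 means the classifier ranks no better than chance; 1.0 means every positive outranks every negative.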

If you need the code and data, leave your email in the comments.

