Here I use Spark for the data processing, together with the algorithms Spark ships with.
Spark provides term frequency-inverse document frequency (TF-IDF).
It assigns a weight to every term in a document based on the term's frequency within that document, globally normalized by the inverse document frequency. For the detailed derivation, see the official Spark documentation.
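As a concrete anchor, Spark MLlib documents its IDF as a smoothed logarithm; a minimal sketch of that weight in plain Scala (the helper name is mine, not Spark's API):

```scala
// Smoothed IDF as documented for Spark MLlib:
//   idf(t) = log((m + 1) / (df(t) + 1))
// where m is the total number of documents and
// df(t) is the number of documents containing term t.
def idf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))
```

A term that appears in every document gets weight log(1) = 0, so ubiquitous terms are suppressed while rare terms are boosted.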
Classification is done with NaiveBayes.
Let's look at a sample of the data (leave a comment if you need the data or code):
Four score and seven years ago our fathers brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition that all men are created equal.
• Now we are engaged in a great civil war, testing whether that nation, or any nation so
conceived and so dedicated, can long endure. We are met on a great battle-field of that
war. We have come to dedicate a portion of that field, as a final resting place for those who
here gave their lives that that nation might live. It is altogether fitting and proper that we
should do this.
After processing, the text becomes:
ArrayBuffer(four, score, seven, year, ago, our, father, brought, forth, contin, new, nation,
conceiv, liberti, dedic, proposit, all, men, creat, equal)
• ArrayBuffer(now, we, engag, great, civil, war, test, whether, nation, ani, nation, so, conceiv,
so, dedic, can, long, endur, we, met, great, battl, field, war, we, have, come, dedic, portion,
field, final, rest, place, those, who, here, gave, live, nation, might, live, altogeth, fit, proper,
we, should, do)
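The stems above (contin, conceiv, liberti) indicate the raw text was lowercased, split into words, and run through a Porter-style stemmer. The stemming code is not shown in the original; a minimal tokenization pass without the stemming step might look like:

```scala
// Lowercase, split on runs of non-letter characters, drop empty tokens.
// Actual stemming (e.g. a Porter stemmer) would be applied afterwards.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq
```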
(1000, [17,63,94,197,234,335,412,437,445,521,530,556,588,673,799,893,937,960,990],
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
• (1000, [17,21,22,37,63,92,167,211,240,256,270,272,393,395,445,449,460,472,480,498,535,612,676,688,694,706,724,732,790,909,916,939,960,965,996],
[1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0])
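Each pair above is a sparse vector: (number of hash buckets, sorted term indices, term counts). HashingTF maps every token to a bucket by hashing modulo the vector size; a plain-Scala sketch of that hashing trick (Spark's actual hash function differs, so the indices will not match the output above):

```scala
// Map each token to a bucket via hash modulo numBuckets, then count
// occurrences per bucket -- the "hashing trick".
def hashTF(tokens: Seq[String], numBuckets: Int): (Int, Seq[Int], Seq[Double]) = {
  val counts = tokens
    .map(t => ((t.hashCode % numBuckets) + numBuckets) % numBuckets)
    .groupBy(identity)
    .map { case (idx, hits) => (idx, hits.size.toDouble) }
  val sorted = counts.toSeq.sortBy(_._1)
  (numBuckets, sorted.map(_._1), sorted.map(_._2))
}
```

Collisions are possible (two terms can share a bucket), which is the price paid for a fixed-size vector without a vocabulary dictionary.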
Now straight to the code.
First, load the data and hash it with HashingTF:
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

val mock = sc.textFile("/zhouxiaoke/mock.tokens")
val watch = sc.textFile("/zhouxiaoke/watch.tokens")
val tf = new HashingTF(10000)
// label the positive class 1.0 and the negative class 0.0
val mockData = mock.map { line =>
  LabeledPoint(1.0, tf.transform(line.split(",")))
}
val watchData = watch.map { line =>
  LabeledPoint(0.0, tf.transform(line.split(",")))
}
Next, weight the term vectors with IDF, then split the data into training and test sets:

import org.apache.spark.mllib.feature.IDF

// combine the two classes and split into train and test
// (the split was not shown in the original; 70/30 is an assumption)
val splits = (mockData ++ watchData).randomSplit(Array(0.7, 0.3))

// fit IDF on the training documents; terms appearing in fewer
// than 3 documents get zero weight
val idfModel = new IDF(minDocFreq = 3).fit(splits(0).map(_.features))

// optional: inspect a few IDF-weighted vectors
// splits(0).map(p => idfModel.transform(p.features).toArray).take(100)

val train = splits(0).map { point =>
  LabeledPoint(point.label, idfModel.transform(point.features))
}
val test = splits(1).map { point =>
  LabeledPoint(point.label, idfModel.transform(point.features))
}
import org.apache.spark.mllib.classification.NaiveBayes

val nbModel = NaiveBayes.train(train, lambda = 1.0)
// (prediction, label) pairs for the train and test sets
val bayesTrain = train.map(p => (nbModel.predict(p.features), p.label))
val bayesTest = test.map(p => (nbModel.predict(p.features), p.label))
Then check the model's accuracy and ROC:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metricsTrain = new BinaryClassificationMetrics(bayesTrain, 100)
val metricsTest = new BinaryClassificationMetrics(bayesTest, 100)
println(s"NB Training AuROC: ${metricsTrain.areaUnderROC()}")
println(s"NB Test AuROC: ${metricsTest.areaUnderROC()}")
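The text promises accuracy as well as ROC, but only AuROC is printed above; accuracy falls out of the same (prediction, label) pairs. A plain-Scala sketch (the helper name is mine):

```scala
// Accuracy = fraction of (prediction, label) pairs that agree.
def accuracy(pairs: Seq[(Double, Double)]): Double =
  pairs.count { case (pred, label) => pred == label }.toDouble / pairs.size
```

On the RDDs above the equivalent is `bayesTest.filter { case (p, l) => p == l }.count.toDouble / bayesTest.count`.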
Leave your email address if you need the code and data.