Spam Email Classification with Spark Machine Learning

路人张的鱼生

Categories: Spark, Machine Learning. Tags: Spark

Article link: https://blog.csdn.net/zhangdy12307/article/details/90768235

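Overview of Steps

HashingTF builds feature vectors from the email text, and logistic regression trained with stochastic gradient descent (SGD) then classifies each email as spam or normal.

Spam Classification Code

Import the required packages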

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

Load the files

val spam = sc.textFile("file:///media/hadoop/Ubuntu/spam.txt")
val normal = sc.textFile("file:///media/hadoop/Ubuntu/normal.txt")

The contents of spam.txt and normal.txt are as follows:

[Image: contents of spam.txt and normal.txt]

Create a HashingTF instance that maps email text to vectors of 10,000 features

val tf = new HashingTF(numFeatures = 10000)
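
To make the idea behind this step concrete, here is a minimal sketch of the hashing trick in plain Scala. It is an illustration only, not MLlib's implementation: the object name HashingSketch and its helpers are hypothetical, and String.hashCode stands in for whatever hash function HashingTF uses internally. Each term is hashed into one of numFeatures buckets and the bucket counts become the feature values, so two different words can land in the same bucket.

```scala
// Sketch of the hashing trick (assumption: String.hashCode as the hash function).
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures). floorMod keeps the index
  // non-negative even when hashCode is negative.
  def indexOf(term: String, numFeatures: Int): Int =
    java.lang.Math.floorMod(term.hashCode, numFeatures)

  // Count how often each bucket is hit; the result is a sparse
  // representation: bucket index -> term frequency.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms
      .groupBy(term => indexOf(term, numFeatures))
      .map { case (index, hits) => index -> hits.size.toDouble }
}
```

For example, HashingSketch.termFrequencies(Seq("cheap", "stuff", "cheap"), 10000) returns a map whose values sum to 3.0. A larger numFeatures lowers the chance of collisions at the cost of longer vectors.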

Split each email into words and map each word to a feature

val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))
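
Splitting on a single space keeps punctuation attached to words and treats "Hi" and "hi" as different terms. The sketch below compares that behaviour with a hypothetical alternative tokenizer (the name TokenizeSketch and the regex are assumptions, not part of the original post) that lowercases the text and splits on runs of non-word characters. Note that \W only recognises ASCII letters, digits and underscore as word characters.

```scala
// Sketch comparing the article's split with a hypothetical normalising tokenizer.
object TokenizeSketch {
  // What the article's code does: split on single spaces only.
  def splitOnSpace(email: String): Array[String] = email.split(" ")

  // Hypothetical alternative: lowercase, split on runs of non-word
  // characters, and drop any empty tokens left at the start.
  def tokenize(email: String): Array[String] =
    email.toLowerCase.split("\\W+").filter(_.nonEmpty)
}
```

Either function could be passed to tf.transform in place of email.split(" "); whether the extra normalisation helps depends on the data.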

Create LabeledPoint datasets for the positive (spam) and negative (normal) examples

val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()

Run logistic regression using the SGD algorithm

val model = new LogisticRegressionWithSGD().run(trainingData)
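
As a rough illustration of what training does, the following plain-Scala sketch fits a logistic regression by per-example stochastic gradient descent on dense arrays. It is a simplified teaching version under stated assumptions, not MLlib's implementation: the object LogisticSgdSketch, its parameters (stepSize, epochs), the fixed step size, and the absence of an intercept and regularisation are all choices made here for brevity.

```scala
// Sketch of logistic regression trained with per-example SGD (no intercept).
object LogisticSgdSketch {
  // Logistic function: maps a raw score to a probability in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Dot product of a weight vector and a feature vector of equal length.
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.indices.map(i => w(i) * x(i)).sum

  // data holds (label, features) pairs with label 1.0 (spam) or 0.0 (normal).
  def train(data: Seq[(Double, Array[Double])], numFeatures: Int,
            stepSize: Double, epochs: Int): Array[Double] = {
    val w = Array.fill(numFeatures)(0.0)
    for (_ <- 1 to epochs; (label, x) <- data) {
      // Gradient of the log loss with respect to the score is (prediction - label).
      val error = sigmoid(dot(w, x)) - label
      for (i <- w.indices) w(i) -= stepSize * error * x(i)
    }
    w
  }

  // Turn the predicted probability into a 0.0 / 1.0 class label.
  def predict(w: Array[Double], x: Array[Double], threshold: Double = 0.5): Double =
    if (sigmoid(dot(w, x)) > threshold) 1.0 else 0.0
}
```

With two toy examples, Seq((1.0, Array(1.0, 0.0)), (0.0, Array(0.0, 1.0))), the learned weight for the first feature becomes positive and the second negative, so the first example is predicted as 1.0 and the second as 0.0.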

Test the model on a positive (spam) and a negative (normal) example

val posTest = tf.transform(
"O M G GET cheap stuff by sending money to ...".split(" "))
val negTest = tf.transform(
"Hi Dad, I started studying Spark the other ...".split(" "))
println("Prediction for positive test example: " + model.predict(posTest))
println("Prediction for negative test example: " + model.predict(negTest))

The results are as follows:

[Image: prediction output]
