PySpark机器学习案例--分类与聚类

最新推荐文章于 2024-07-31 15:08:57 发布

Hadoop_Liang

最新推荐文章于 2024-07-31 15:08:57 发布

阅读量4.4k

点赞数 2

分类专栏： spark 算法文章标签：机器学习 pyspark MLlib ML

本文链接：https://blog.csdn.net/qq_42881421/article/details/90350068

版权

spark 同时被 2 个专栏收录

10 篇文章 2 订阅

订阅专栏

算法

2 篇文章 0 订阅

订阅专栏

案例一:基于逻辑回归算法的机器学习（分类）

要求：text含有“spark”的 lable 标记为1，否则标记为0

训练数据集：

# 训练数据
id	text	 		label
0	"a b c d e spark"	1.0
1	"b d"			0.0
2	"spark f g h"		1.0
3	"hadoop mapreduce"	0.0

测试数据集：

# 测试数据
id	text	
4	"spark i j k"
5	"l m n"	
6	"spark hadoop spark"
7	"apache hadoop"

可以看到：

训练数据集包含有id，text，label三个字段；

测试数据集只有id，text两个字段，没有的label字段就是要我们用机器学习方法预测出来的值，具体方式如下：

1.用训练数据通过逻辑回归算法，训练出模型（经验）

2. 测试数据经过模型处理得到预测值

完整代码：

example1.py

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# 程序实现识别文本行中是否包含spark，如果包含标记列为1，否则为0

# 训练数据集
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# 配置一个pipeline管道，包括转换器和评估器
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# 用管道训练出模型
# Fit the pipeline to training documents.
model = pipeline.fit(training)

# 测试数据，测试数据没有label标签
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# 用测试数据来预测，
# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
# 打印出感兴趣的列
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

注意：运行该项目需要先安装numpy

运行方法：

至少有3种运行方式

1. 如果是linux终端下运行，则输入以下命令运行（推荐使用）

$ spark-submit example1.py

或者

$ python3 example1.py

2. 如果是在pycharm里编程，配置好spark相关文件后，则直接右键运行

3. 如果是在pyspark命令行下，那么就逐行输入，依次运行，需要点耐心和细心。

运行结果：

(4, spark i j k) --> prob=[0.1596407738787475,0.8403592261212525], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000

从运行结果来看，符合我们的预期，即包含spark的行预测值为1，否者为0。

案例二：基于K-Means算法的机器学习案例（聚类）

要求：对空间中的6个点进行K-Means聚类分析，K=2，得到2个中心点坐标

数据集的数据格式为支持向量机数据（libsvm），如下：

id	Feature
0 	1:0.0 2:0.0 3:0.0
1 	1:0.1 2:0.1 3:0.1
2 	1:0.2 2:0.2 3:0.2
3 	1:9.0 2:9.0 3:9.0
4 	1:9.1 2:9.1 3:9.1
5 	1:9.2 2:9.2 3:9.2

从数据集可以画出三维坐标轴中的6个点

预想的结果是通过数据集训练出一个模型，这个模型可以自动找出2个簇的中心点坐标，如下：

完整代码：

example2.py

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# 程序主要完成对空间点的聚类，6个空间点x,y,z坐标如下
# 0 1:0.0 2:0.0 3:0.0
# 1 1:0.1 2:0.1 3:0.1
# 2 1:0.2 2:0.2 3:0.2
# 3 1:9.0 2:9.0 3:9.0
# 4 1:9.1 2:9.1 3:9.1
# 5 1:9.2 2:9.2 3:9.2

# Loads data. 注意修改txt文件路径
dataset = spark.read.format("libsvm").load("data.txt")

# 训练一个kmeans模型，将K设置为2，将分为两个簇
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# 预测 Make predictions
predictions = model.transform(dataset)

# 评估：创建评估器对象及评估 Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# 打印中心点Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

准备数据

$ nano data.txt

内容如下：

0 1:0.0 2:0.0 3:0.0
1 1:0.1 2:0.1 3:0.1
2 1:0.2 2:0.2 3:0.2
3 1:9.0 2:9.0 3:9.0
4 1:9.1 2:9.1 3:9.1
5 1:9.2 2:9.2 3:9.2

运行

$ spark-submit example2.py

结果：

Silhouette with squared euclidean distance = 0.9997530305375207
Cluster Centers:
[0.1 0.1 0.1]
[9.1 9.1 9.1]

完成！enjoy it！

Hadoop_Liang

关注

2
点赞
踩
37

收藏

觉得还不错? 一键收藏
1
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录