[Spark ML] (II) Spark ML Classification Algorithms

2401_84181626

Published 2024-05-03 02:00:17


Category: Programmer. Tags: spark-ml, classification, spark

Article link: https://blog.csdn.net/2401_84181626/article/details/138405333


falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

spark.stop()


The output is as follows:

Coefficients:
3 X 4 CSRMatrix
(0,3) 0.3176
(1,2) -0.7804
(1,3) -0.377
Intercept: [0.05165231659832854,-0.12391224990853622,0.07225993331020768]
objectiveHistory:
1.09861228867
…
False positive rate by label:
label 0: 0.22
label 1: 0.05
label 2: 0.0
True positive rate by label:
label 0: 1.0
label 1: 1.0
label 2: 0.46
Precision by label:
label 0: 0.694444444444
label 1: 0.909090909091
label 2: 1.0
Recall by label:
label 0: 1.0
label 1: 1.0
label 2: 0.46
F-measure by label:
label 0: 0.819672131148
label 1: 0.952380952381
label 2: 0.630136986301
Accuracy: 0.82
FPR: 0.09
TPR: 0.82
F-measure: 0.800730023277
Precision: 0.867845117845
Recall: 0.82
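
The weighted metrics at the end of this output can be cross-checked against the per-label values. The sketch below is plain Python, not Spark code. It assumes that each weighted metric is the per-label metric averaged with the label's share of the data as its weight, and that the three labels are equally frequent. The equal-frequency part is an assumption inferred from the numbers, since the class counts are not stated here.

```python
# Per-label metrics copied from the output above.
precision_by_label = [0.694444444444, 0.909090909091, 1.0]
recall_by_label = [1.0, 1.0, 0.46]
f_measure_by_label = [0.819672131148, 0.952380952381, 0.630136986301]
fpr_by_label = [0.22, 0.05, 0.0]

# Assumption: the three labels are equally frequent, so each weight is 1/3.
label_weights = [1.0 / 3, 1.0 / 3, 1.0 / 3]


def weighted_average(values, weights):
    """Average per-label values, weighting each label by its share of the data."""
    return sum(v * w for v, w in zip(values, weights))


weighted_precision = weighted_average(precision_by_label, label_weights)
weighted_recall = weighted_average(recall_by_label, label_weights)
weighted_f_measure = weighted_average(f_measure_by_label, label_weights)
weighted_fpr = weighted_average(fpr_by_label, label_weights)

print("Precision: %.6f" % weighted_precision)
print("Recall:    %.6f" % weighted_recall)
print("F-measure: %.6f" % weighted_f_measure)
print("FPR:       %.6f" % weighted_fpr)
```

Under that assumption the averages agree with the reported Precision, Recall, F-measure, and FPR lines to the printed precision.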


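### 2. Decision Tree Classifier


Example


Load a dataset in LibSVM format, split it into training and test sets, train on the training set, and then evaluate on the held-out test set. Two feature transformers prepare the data: they index the label categories and the categorical features, adding metadata to the DataFrame that the decision tree algorithm can recognize.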
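
To make the role of the two transformers concrete, here is a minimal plain-Python sketch of the idea behind them. It is not the PySpark implementation: `index_labels` and `is_categorical` are hypothetical helpers, and the sample values are made up. The sketch assumes labels are indexed by descending frequency (the default ordering of `StringIndexer`) and that a feature counts as categorical when it has at most `max_categories` distinct values.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties by label value for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def is_categorical(column, max_categories):
    """Treat a feature as categorical if it has few enough distinct values."""
    return len(set(column)) <= max_categories


# Made-up labels: "b" occurs most often, so it receives index 0.0.
labels = ["b", "a", "b", "c", "b", "a"]
mapping = index_labels(labels)
print(mapping)

# Made-up feature columns: one with 2 distinct values, one with 6.
binary_feature = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
continuous_feature = [0.1, 2.5, 3.7, 4.2, 5.9, 6.3]
print(is_categorical(binary_feature, max_categories=4))
print(is_categorical(continuous_feature, max_categories=4))
```

With `max_categories=4`, the two-valued feature is treated as categorical and the six-valued one as continuous, which mirrors the `maxCategories=4` setting used in this section.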

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession
             .builder
             .appName("DecisionTreeClassificationExample")
             .getOrCreate())

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    # Index the labels, fitting on the whole dataset so that every label is included
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    # Automatically identify categorical features and index them. With maxCategories=4,
    # features with more than 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    # Split the data into training and test sets
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Define a decision tree classifier
    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

    # Chain the indexers and the decision tree in a Pipeline
    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

    # Fit the pipeline; this runs the indexers and trains the tree
    model = pipeline.fit(trainingData)

    # Make predictions
    predictions = model.transform(testData)
    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Compute the test error
    evaluator = MulticlassClassificationEvaluator(
        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g " % (1.0 - accuracy))

    # The trained tree is the third pipeline stage
    treeModel = model.stages[2]
    print(treeModel)

    spark.stop()


The output is as follows:

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[98,99,100,1...|
|       1.0|         1.0|(692,[121,122,123...|
|       1.0|         1.0|(692,[122,123,148...|
|       1.0|         1.0|(692,[124,125,126...|
|       1.0|         1.0|(692,[126,127,128...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0.0357143
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4f508c37c4be93461970) of depth 1 with 3 nodes


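### 3. Random Forest Classifier


This is similar to the decision tree example, except that a random forest (RF) is trained instead. The example code is as follows: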

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession
             .builder
             .appName("RandomForestClassifierExample")
             .getOrCreate())

    # Prepare the data in the same way as in the decision tree example
    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Define a random forest with 10 trees
    rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

    # Convert predicted label indices back to the original labels
    labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

    # Run the whole training workflow in a Pipeline
    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

    model = pipeline.fit(trainingData)

    # Make predictions
    predictions = model.transform(testData)

    predictions.select("predictedLabel", "label", "features").show(5)

    # Compute the test error
    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction",
                                                  metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g" % (1.0 - accuracy))

    # The trained forest is the third pipeline stage
    rfModel = model.stages[2]
    print(rfModel)

    spark.stop()


The output is as follows:

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(692,[98,99,100,1...|
|           0.0|  0.0|(692,[122,123,148...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[124,125,126...|
|           0.0|  0.0|(692,[124,125,126...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.0294118
RandomForestClassificationModel (uid=RandomForestClassifier_421b9fdfb8d0ee9acde3) with 10 trees
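
In the output above, `predictedLabel` comes from the `IndexToString` stage, which turns the numeric prediction index back into the original label using the `labels` learned by the `StringIndexer`. The plain-Python sketch below illustrates that round trip. It does not call PySpark; the label order and the helper names `string_to_index` and `index_to_string` are made up for illustration.

```python
def string_to_index(labels):
    """Build the label -> index mapping; position i in `labels` gets index i."""
    return {label: float(i) for i, label in enumerate(labels)}


def index_to_string(index, labels):
    """Look up the original label for a predicted index."""
    return labels[int(index)]


# Hypothetical label list in the order a fitted indexer might store it:
# index 0.0 stands for "1.0" and index 1.0 stands for "0.0".
labels = ["1.0", "0.0"]
to_index = string_to_index(labels)
print(to_index)

# Hypothetical model outputs (label indices, not original labels).
predictions = [0.0, 1.0, 1.0]
predicted_labels = [index_to_string(p, labels) for p in predictions]
print(predicted_labels)
```

The point of the extra stage is that a prediction index and an original label can differ, so the evaluator compares `prediction` with `indexedLabel`, while the table for human readers shows `predictedLabel` next to `label`.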


### 4. Gradient-Boosted Tree Classifier


As in the previous sections, this example trains gradient-boosted trees (GBTs). The example code is as follows:

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession
             .builder
             .appName("GradientBoostedTreeClassifierExample")
             .getOrCreate())

    # Prepare the data in the same way as in the decision tree example
    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
    labelIndex = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Define a GBT classifier; maxIter keeps its default of 20, so 20 trees are trained
    gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=10)

    # Run the whole training workflow in a Pipeline
    pipeline = Pipeline(stages=[labelIndex, featureIndexer, gbt])
    model = pipeline.fit(trainingData)

    # Make predictions
    predictions = model.transform(testData)
    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Compute the test error
    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction",
                                                  metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g" % (1.0 - accuracy))

    # The trained GBT model is the third pipeline stage
    gbtModel = model.stages[2]
    print(gbtModel)

    spark.stop()


The output is as follows:

+----------+------------+--------------------+
|prediction|indexedLabel|            features|
+----------+------------+--------------------+
|       1.0|         1.0|(692,[95,96,97,12...|
|       1.0|         1.0|(692,[100,101,102...|
|       1.0|         1.0|(692,[122,123,148...|
|       1.0|         1.0|(692,[123,124,125...|
|       1.0|         1.0|(692,[124,125,126...|
+----------+------------+--------------------+
only showing top 5 rows

Test Error = 0
GBTClassificationModel (uid=GBTClassifier_4a1fa549ada75fa70795) with 20 trees
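
The printed model contains 20 trees because `maxIter` was left at its default of 20. As a rough illustration of how a boosted ensemble turns many trees into one binary prediction, the sketch below sums weighted per-tree outputs and thresholds the total at zero. It is a simplified plain-Python sketch under assumptions: the stumps, thresholds, and weights are invented, the helper names are hypothetical, and the training step (fitting each new tree to the errors of the ensemble so far) is not shown.

```python
def make_stump(feature_index, threshold, left_value, right_value):
    """Return a depth-1 tree: one threshold test on a single feature."""
    def stump(features):
        if features[feature_index] <= threshold:
            return left_value
        return right_value
    return stump


def ensemble_predict(trees, tree_weights, features):
    """Sum the weighted tree outputs and map the sign of the total to a 0/1 label."""
    margin = sum(w * tree(features) for tree, w in zip(trees, tree_weights))
    return 1.0 if margin > 0.0 else 0.0


# Hypothetical ensemble of three stumps; later trees get a smaller weight.
trees = [
    make_stump(0, 0.5, -1.0, 1.0),
    make_stump(1, 2.0, -0.4, 0.6),
    make_stump(0, 1.5, -0.2, 0.3),
]
tree_weights = [1.0, 0.1, 0.1]

prediction_a = ensemble_predict(trees, tree_weights, [0.2, 3.0])
prediction_b = ensemble_predict(trees, tree_weights, [0.9, 1.0])
print(prediction_a, prediction_b)
```

Each later tree only nudges the running total, which is why boosted ensembles are usually built from many small corrections rather than a few strong trees.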


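### 5. Multilayer Perceptron Classifier


The multilayer perceptron classifier (MLPC) is a classifier based on a feedforward artificial neural network. An MLPC consists of multiple layers of nodes, and each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. Every other node maps its inputs to an output by taking a linear combination of the inputs with the node's weights w and bias b and then applying an activation function. For an MLPC with K+1 layers, this can be written in matrix form as follows:  
 ![MLPC in matrix form](https://img-blog.csdnimg.cn/20200314153637680.png#pic_center)  
 Nodes in the intermediate layers use the sigmoid function:


![Sigmoid function](https://img-blog.csdnimg.cn/20200314153650701.png#pic_center)  
 Nodes in the output layer use the softmax function:  
 ![Softmax function](https://img-blog.csdnimg.cn/20200314153705547.png#pic_center)  
 The number of nodes N in the output layer corresponds to the number of classes.


MLPC learns the model with backpropagation. The logistic loss function is used for optimization, with L-BFGS as the optimization routine.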
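
To make the layer-by-layer mapping concrete, here is a minimal forward pass in plain Python for a network shaped like the `layers = [4, 5, 4, 3]` example used in this section. It is a sketch, not Spark's implementation: the weights are random placeholders rather than trained values, and the helper names (`dense`, `init_params`, `forward`) are hypothetical. Hidden layers apply the sigmoid and the output layer applies softmax, as described above.

```python
import math
import random


def sigmoid(z):
    """Element-wise logistic function used by the hidden layers."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    """Turn output-layer scores into probabilities that sum to 1."""
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Linear combination of the inputs with weights and bias; weights[j] feeds node j."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_params(layers, seed=2019):
    """Create random placeholder weights and biases for each pair of adjacent layers."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        bias = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
        params.append((weights, bias))
    return params


def forward(x, params):
    """Apply sigmoid on every hidden layer and softmax on the output layer."""
    activation = x
    for i, (weights, bias) in enumerate(params):
        z = dense(activation, weights, bias)
        is_output_layer = (i == len(params) - 1)
        activation = softmax(z) if is_output_layer else sigmoid(z)
    return activation


layers = [4, 5, 4, 3]  # 4 input features, hidden layers of 5 and 4 nodes, 3 classes
params = init_params(layers)
probabilities = forward([0.1, -0.3, 0.5, 0.7], params)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The output has one probability per class, and the predicted class is the index of the largest one. Training would adjust the weights and biases by backpropagation, which this sketch leaves out.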


The example code is as follows:

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("multilayer_perceptron_classification_example").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_multiclass_classification_data.txt")
    (train_data, test_data) = data.randomSplit([0.6, 0.4], seed=2019)

    # The input layer has the size of the features (4) and the output layer the
    # number of labels (3); the two hidden layers have 5 and 4 nodes
    layers = [4, 5, 4, 3]

    # Train the network
    trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=2019)
    model = trainer.fit(train_data)

    # Compute the accuracy on the test set
    predictions = model.transform(test_data)
    predictions.select("prediction", "label", "features").show(5)
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                                  metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g" % (1.0 - accuracy))
    print(model)

    spark.stop()


The output is as follows:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(4,[0,1,2,3],[-0....|
|       0.0|  0.0|(4,[0,1,2,3],[-0....|
|       0.0|  0.0|(4,[0,1,2,3],[0.0...|
|       0.0|  0.0|(4,[0,1,2,3],[0.0...|
|       0.0|  0.0|(4,[0,1,2,3],[0.1...|
+----------+-----+--------------------+
only showing top 5 rows

Test Error = 0.0172414
MultilayerPerceptronClassifier_4f01847fd0f3f4531e41


### 6. Linear Support Vector Machine


The example code is as follows:

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession
             .builder
             .appName("linearSVC Example")
             .getOrCreate())

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    lsvc = LinearSVC(maxIter=10, regParam=0.1)
    lsvcModel = lsvc.fit(trainingData)

    # print("Coefficients: " + str(lsvcModel.coefficients))
    # print("Intercept: " + str(lsvcModel.intercept))

    # Evaluate on the test set. The metric is areaUnderROC computed from the hard
    # 0/1 predictions, so the value printed below is 1 - AUC rather than 1 - accuracy.
    predictions = lsvcModel.transform(testData)
    predictions.select("prediction", "label", "features").show(5)
    evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction",
                                              metricName="areaUnderROC")
    auc = evaluator.evaluate(predictions)
    print("Test Error = %g" % (1.0 - auc))
    print(lsvcModel)

    spark.stop()


The output is as follows:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[121,122,123...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows

Test Error = 0
LinearSVC_409bb95a7222b3ec2faa
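
Note that this example uses `BinaryClassificationEvaluator` with `metricName="areaUnderROC"`, so the value printed as "Test Error" is 1 minus the area under the ROC curve computed from the hard 0/1 predictions, not 1 minus accuracy. The plain-Python sketch below shows that the two quantities can differ. It uses the pairwise-ranking definition of AUC on made-up labels and predictions; it is an illustration, not Spark's implementation, and the helper names are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up test set: 2 positives and 8 negatives; one positive is misclassified.
labels = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predictions = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

acc = accuracy(labels, predictions)
auc = area_under_roc(labels, predictions)
print("1 - accuracy = %g" % (1.0 - acc))
print("1 - AUC      = %g" % (1.0 - auc))
```

When every test row is classified correctly and both classes are present, both quantities are 0, which is consistent with the `Test Error = 0` shown above. On imbalanced data with some errors, they diverge, so the printed label should be read with the metric in mind.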


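### 7. One-vs-Rest Classifier


OneVsRest is implemented as an Estimator. As its base classifier it takes an instance of a classifier and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes. Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.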
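
The reduction described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the PySpark implementation: `relabel_for_class` and `predict_one_vs_rest` are hypothetical helpers, and the three "binary classifiers" are hand-written scoring functions standing in for trained models.

```python
def relabel_for_class(labels, target_class):
    """Build the binary problem for one class: 1.0 if the label is the target, else 0.0."""
    return [1.0 if y == target_class else 0.0 for y in labels]


def predict_one_vs_rest(binary_scorers, features):
    """Evaluate every binary scorer and return the index of the most confident one."""
    scores = [scorer(features) for scorer in binary_scorers]
    return float(scores.index(max(scores)))


# Made-up multiclass labels; each class k gets its own binary relabeling.
labels = [0.0, 1.0, 2.0, 1.0, 0.0]
binary_problems = [relabel_for_class(labels, float(k)) for k in range(3)]
for k, binary_labels in enumerate(binary_problems):
    print(k, binary_labels)

# Hypothetical stand-ins for three trained binary models. Each returns a confidence
# that the input belongs to its own class (here: closeness of x[0] to the class id).
binary_scorers = [
    lambda x: 1.0 - abs(x[0] - 0.0),
    lambda x: 1.0 - abs(x[0] - 1.0),
    lambda x: 1.0 - abs(x[0] - 2.0),
]

prediction_a = predict_one_vs_rest(binary_scorers, [0.2])
prediction_b = predict_one_vs_rest(binary_scorers, [1.9])
print(prediction_a, prediction_b)
```

In the real estimator, each binary model would be a copy of the base classifier fitted on the relabeled data; the sketch only shows the relabeling and the final "most confident wins" step.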


The example code is as follows:

# -*- coding: utf-8 -*-

from __future__ import print_function
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("OneVsRestExample").getOrCreate()

    inputData = spark.read.format("libsvm").load("../data/mllib/sample_multiclass_classification_data.txt")

    (train, test) = inputData.randomSplit([0.8, 0.2], seed=2019)

    # Create the base binary classifier
    lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
    # Instantiate the One-vs-Rest classifier around it
    ovr = OneVsRest(classifier=lr)
    ovrModel = ovr.fit(train)

    # Score the test set and compute the test error
    predictions = ovrModel.transform(test)
    evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test Error = %g" % (1.0 - accuracy))

    spark.stop()


