OneVsRest
class pyspark.ml.classification.OneVsRest(featuresCol='features', labelCol='label', predictionCol='prediction', classifier=None, weightCol=None, parallelism=1)
Reduces multiclass classification to binary classification, performing the reduction with the one-vs-rest strategy. For a multiclass problem with k classes, k models are trained (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms (>= 1).')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')
weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, all instance weights are treated as 1.0.')
classifier = Param(parent='undefined', name='classifier', doc='base binary classifier')
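As a minimal construction sketch (the maxIter and parallelism values here are illustrative, not taken from the walkthrough below): any binary classifier estimator can serve as the base, and parallelism > 1 trains the k per-class models on that many threads concurrently.
from pyspark.ml.classification import LogisticRegression, OneVsRest
# Hypothetical settings; swap in any binary classifier estimator.
ovr = OneVsRest(classifier=LogisticRegression(maxIter=10),
                labelCol="label", featuresCol="features",
                predictionCol="prediction", parallelism=2)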
01. Locate the dataset. In an interactive pyspark session you can read data/mllib/sample_multiclass_classification_data.txt directly from the following directory (see the sketch after the listing):
(base) [root@localhost mllib]# pwd
/opt/spark-2.4.5-bin-hadoop2.7/data/mllib
(base) [root@localhost mllib]# ls
als kmeans_data.txt sample_binary_classification_data.txt sample_lda_data.txt sample_movielens_data.txt
gmm_data.txt pagerank_data.txt sample_fpgrowth.txt sample_lda_libsvm_data.txt sample_multiclass_classification_data.txt
images pic_data.txt sample_isotonic_regression_libsvm_data.txt sample_libsvm_data.txt sample_svm_data.txt
iris_libsvm.txt ridge-data sample_kmeans_data.txt sample_linear_regression_data.txt streaming_kmeans_data_test.txt
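In the pyspark shell (where spark is predefined), a relative path works when the shell is launched from the Spark home directory; a minimal sketch:
# Read the bundled LIBSVM file with a path relative to the Spark home.
df_local = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")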
02. The file has already been uploaded to HDFS; read it over HDFS and inspect its structure:
from pyspark.sql import SparkSession

# Local SparkSession; the driver host matches the HDFS namenode below.
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.10")\
    .config("spark.ui.showConsoleProgress", "false").appName("OneVsRest")\
    .master("local[*]").getOrCreate()

hdfs_path = "hdfs://192.168.1.10:9000//HadoopFileS/builtInSpark/sample_multiclass_classification_data.txt"

# LIBSVM files load as a DataFrame with "label" and "features" columns.
df = spark.read.format("libsvm").load(hdfs_path)
df.show()
df.printSchema()
print(df.head(2))
Output:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,1,2,3],[0.1...|
| 1.0|(4,[0,2,3],[-0.83...|
| 2.0|(4,[0,1,2,3],[-1....|
| 2.0|(4,[0,1,2,3],[-1....|
| 1.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,2,3],[0.611...|
| 0.0|(4,[0,1,2,3],[0.2...|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,2,3],[-0.94...|
| 2.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,1,2,3],[0.1...|
| 2.0|(4,[0,1,2,3],[-0....|
+-----+--------------------+
only showing top 20 rows
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
[Row(label=1.0, features=SparseVector(4, {0: -0.2222, 1: 0.5, 2: -0.7627, 3: -0.8333})),
Row(label=1.0, features=SparseVector(4, {0: -0.5556, 1: 0.25, 2: -0.8644, 3: -0.9167}))]
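Before fitting, it is worth confirming how many classes the label column holds; the labels 0.0, 1.0, and 2.0 are visible in the sample above, so OneVsRest will fit k = 3 binary models. A quick check:
# Count rows per label; three distinct labels -> three per-class models.
df.groupBy("label").count().orderBy("label").show()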
03. Use logistic regression as the base binary classifier, set its regularization parameter to 0.01, and fit the model:
from pyspark.ml.classification import LogisticRegression, OneVsRest

# Base binary classifier with regularization strength 0.01.
lr = LogisticRegression(regParam=0.01)
ovr = OneVsRest(classifier=lr)
model = ovr.fit(df)  # fits one binary model per class
04. Inspect all of the fitted models:
print(model.models)
Output:
[LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4, LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4, LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4]
05. Inspect the per-model coefficients:
print([m.coefficients for m in model.models])
Output:
[DenseVector([0.5152, -1.09, 3.4683, 4.246]),
DenseVector([-2.1282, 3.1284, -2.6819, -2.3445]),
DenseVector([0.3064, -3.4213, 1.0461, -1.1383])]
06. Inspect the per-model intercepts:
print([m.intercept for m in model.models])
Output:
[-2.738074251493597, -2.564091464028484, -1.3244853130179222]
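To see how the one-vs-rest reduction picks a label, the per-class scores can be recomputed by hand from the coefficients and intercepts printed above. Each binary model scores an example with sigmoid(w·x + b), and OneVsRest assigns the index of the highest-scoring model; since sigmoid is monotonic, this is just the argmax of the margins. A plain-Python sketch using the test point from step 07:
import math

# Per-class coefficients and intercepts as printed in steps 05 and 06.
coefs = [
    [0.5152, -1.09, 3.4683, 4.246],
    [-2.1282, 3.1284, -2.6819, -2.3445],
    [0.3064, -3.4213, 1.0461, -1.1383],
]
intercepts = [-2.738074251493597, -2.564091464028484, -1.3244853130179222]

x = [-1.0, 0.0, 1.0, 1.0]  # the test point used in step 07

# Margin w.x + b for each binary model, then its sigmoid score.
margins = [sum(w * xi for w, xi in zip(ws, x)) + b
           for ws, b in zip(coefs, intercepts)]
scores = [1.0 / (1.0 + math.exp(-m)) for m in margins]
print(scores.index(max(scores)))  # 0 -> matches the prediction in step 07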
07. Generate a test point and run the prediction:
sc = spark.sparkContext
from pyspark.sql.types import Row
from pyspark.ml.linalg import Vectors

# Score a single dense point; transform() appends the prediction column.
test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0, 1.0, 1.0))]).toDF()
model.transform(test0).show()
Output:
+------------------+----------+
| features|prediction|
+------------------+----------+
|[-1.0,0.0,1.0,1.0]| 0.0|
+------------------+----------+
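A single hand-made point says little about model quality; the usual next step is a train/test split scored with MulticlassClassificationEvaluator. A minimal sketch (the 80/20 split and the seed are arbitrary choices, not from the walkthrough above):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hold out 20% of the data, refit on the rest, and measure accuracy.
train, test = df.randomSplit([0.8, 0.2], seed=42)
predictions = ovr.fit(train).transform(test)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("test accuracy:", evaluator.evaluate(predictions))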