PySpark Classification -- OneVsRest

OneVsRest

class pyspark.ml.classification.OneVsRest(featuresCol='features', labelCol='label', predictionCol='prediction', classifier=None, weightCol=None, parallelism=1)

Reduces multiclass classification to binary classification, performing the reduction with the one-vs-rest strategy. For a multiclass problem with k classes, k models are trained (one per class). Each example is scored against all k models, and the model with the highest score is used to label the example.

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')

parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms (>= 1).')

predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')

classifier = Param(parent='undefined', name='classifier', doc='base binary classifier')
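
A minimal sketch of how these parameters map onto the constructor; the maxIter and parallelism values below are arbitrary illustrations, not taken from the run in this post:

from pyspark.ml.classification import LogisticRegression, OneVsRest

# Wire up the documented params explicitly; maxIter=10 and parallelism=2
# are illustrative assumptions, not values used later in this post.
ovr_example = OneVsRest(featuresCol="features",
                        labelCol="label",
                        predictionCol="prediction",
                        classifier=LogisticRegression(maxIter=10),
                        parallelism=2)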

01. Connect to the dataset. In an interactive shell, data/mllib/sample_multiclass_classification_data.txt can be read directly from the following directory:

(base) [root@localhost mllib]# pwd
/opt/spark-2.4.5-bin-hadoop2.7/data/mllib
(base) [root@localhost mllib]# ls
als              kmeans_data.txt    sample_binary_classification_data.txt       sample_lda_data.txt                sample_movielens_data.txt
gmm_data.txt     pagerank_data.txt  sample_fpgrowth.txt                         sample_lda_libsvm_data.txt         sample_multiclass_classification_data.txt
images           pic_data.txt       sample_isotonic_regression_libsvm_data.txt  sample_libsvm_data.txt             sample_svm_data.txt
iris_libsvm.txt  ridge-data         sample_kmeans_data.txt                      sample_linear_regression_data.txt  streaming_kmeans_data_test.txt
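
In the interactive pyspark shell (where spark is already defined), the file can be loaded with a relative path. A minimal sketch, assuming the working directory is the Spark home shown above:

# assumes pyspark was launched from /opt/spark-2.4.5-bin-hadoop2.7
df_local = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")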

02. The file has also been uploaded to HDFS; read it from HDFS and inspect its structure:

from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.10")\
    .config("spark.ui.showConsoleProgress","false").appName("OneVsRest")\
    .master("local[*]").getOrCreate()
hdfs_path = "hdfs://192.168.1.10:9000//HadoopFileS/builtInSpark/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(hdfs_path)
df.show()
df.printSchema()
print(df.head(2))

Output:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(4,[0,1,2,3],[-0....|
|  1.0|(4,[0,1,2,3],[-0....|
|  1.0|(4,[0,1,2,3],[-0....|
|  1.0|(4,[0,1,2,3],[-0....|
|  0.0|(4,[0,1,2,3],[0.1...|
|  1.0|(4,[0,2,3],[-0.83...|
|  2.0|(4,[0,1,2,3],[-1....|
|  2.0|(4,[0,1,2,3],[-1....|
|  1.0|(4,[0,1,2,3],[-0....|
|  0.0|(4,[0,2,3],[0.611...|
|  0.0|(4,[0,1,2,3],[0.2...|
|  1.0|(4,[0,1,2,3],[-0....|
|  1.0|(4,[0,1,2,3],[-0....|
|  2.0|(4,[0,1,2,3],[-0....|
|  2.0|(4,[0,1,2,3],[-0....|
|  2.0|(4,[0,1,2,3],[-0....|
|  1.0|(4,[0,2,3],[-0.94...|
|  2.0|(4,[0,1,2,3],[-0....|
|  0.0|(4,[0,1,2,3],[0.1...|
|  2.0|(4,[0,1,2,3],[-0....|
+-----+--------------------+
only showing top 20 rows

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

[Row(label=1.0, features=SparseVector(4, {0: -0.2222, 1: 0.5, 2: -0.7627, 3: -0.8333})), 
 Row(label=1.0, features=SparseVector(4, {0: -0.5556, 1: 0.25, 2: -0.8644, 3: -0.9167}))]

03. Choose logistic regression as the base binary classifier, with its regularization parameter set to 0.01, and fit the model:

from pyspark.ml.classification import LogisticRegression, OneVsRest

lr = LogisticRegression(regParam=0.01)
ovr = OneVsRest(classifier=lr)
model = ovr.fit(df)

04. Inspect all of the fitted models:

print(model.models)

Output:

[LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4,
 LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4,
 LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4]

05. Inspect the model coefficients:

print([m.coefficients for m in model.models])

Output:

[DenseVector([0.5152, -1.09, 3.4683, 4.246]), 
 DenseVector([-2.1282, 3.1284, -2.6819, -2.3445]), 
 DenseVector([0.3064, -3.4213, 1.0461, -1.1383])]

06. Inspect the model intercepts:

print([m.intercept for m in model.models])

Output:

[-2.738074251493597, -2.564091464028484, -1.3244853130179222]

07. Generate a test row and run prediction:

sc = spark.sparkContext
from pyspark.sql.types import Row
from pyspark.ml.linalg import Vectors
test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0, 1.0, 1.0))]).toDF()
model.transform(test0).show()

Output:

+------------------+----------+
|          features|prediction|
+------------------+----------+
|[-1.0,0.0,1.0,1.0]|       0.0|
+------------------+----------+
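
This prediction can be sanity-checked by hand. OneVsRest labels an example with the class whose binary model yields the largest raw score, so recomputing the logistic regression margins w·x + b from the coefficients and intercepts printed above should agree with transform(). A minimal sketch (the margin arithmetic is our addition, not code from the original post):

# per-class raw margins for the test point; the argmax should match transform()
x = Vectors.dense(-1.0, 0.0, 1.0, 1.0)
margins = [m.coefficients.dot(x) + m.intercept for m in model.models]
print(margins)                              # the class-0 model scores highest
print(float(margins.index(max(margins))))   # 0.0, matching the prediction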
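Going beyond a single spot check, the model can be evaluated on a held-out split, following the pattern of the official Spark OneVsRest example; the 0.8/0.2 split and seed below are arbitrary choices, not from the original post:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# refit on a training split and measure accuracy on the held-out rows
train, test = df.randomSplit([0.8, 0.2], seed=42)
predictions = ovr.fit(train).transform(test)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("test accuracy = %g" % evaluator.evaluate(predictions))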