OneVsRest
class pyspark.ml.classification.OneVsRest(featuresCol='features', labelCol='label', predictionCol='prediction', classifier=None, weightCol=None, parallelism=1)
Reduces multiclass classification to binary classification, performing the reduction with the one-vs-rest strategy. For a multiclass problem with k classes, k models are trained (one per class). Each example is scored against all k models, and the model with the highest score is picked to label the example.
featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')
labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')
parallelism = Param(parent='undefined', name='parallelism', doc='the number of threads to use when running parallel algorithms (>= 1).')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')
weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, all instance weights are treated as 1.0.')
classifier = Param(parent='undefined', name='classifier', doc='base binary classifier')
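As a minimal construction sketch (the maxIter and parallelism values here are illustrative, not taken from the walkthrough below): any binary classifier estimator can serve as the base, and parallelism > 1 trains the k per-class models on that many threads concurrently.
from pyspark.ml.classification import LogisticRegression, OneVsRest
# Hypothetical settings; swap in any binary classifier estimator.
ovr = OneVsRest(classifier=LogisticRegression(maxIter=10),
                labelCol="label", featuresCol="features",
                predictionCol="prediction", parallelism=2)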
01. Locate the dataset. In an interactive pyspark session you can read data/mllib/sample_multiclass_classification_data.txt directly from the following directory (see the sketch after the listing):
(base) [root@localhost mllib]# pwd
/opt/spark-2.4.5-bin-hadoop2.7/data/mllib
(base) [root@localhost mllib]# ls
als kmeans_data.txt sample_binary_classification_data.txt sample_lda_data.txt sample_movielens_data.txt
gmm_data.txt pagerank_data.txt sample_fpgrowth.txt sample_lda_libsvm_data.txt sample_multiclass_classification_data.txt
images pic_data.txt sample_isotonic_regression_libsvm_data.txt sample_libsvm_data.txt sample_svm_data.txt
iris_libsvm.txt ridge-data sample_kmeans_data.txt sample_linear_regression_data.txt streaming_kmeans_data_test.txt
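In the pyspark shell (where spark is predefined), a relative path works when the shell is launched from the Spark home directory; a minimal sketch:
# Read the bundled LIBSVM file with a path relative to the Spark home.
df_local = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")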
02. The file has already been uploaded to HDFS; read it over HDFS and inspect its structure:
from pyspark.sql import SparkSession

# Local SparkSession; the driver host matches the HDFS namenode below.
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.10")\
    .config("spark.ui.showConsoleProgress", "false").appName("OneVsRest")\
    .master("local[*]").getOrCreate()

hdfs_path = "hdfs://192.168.1.10:9000//HadoopFileS/builtInSpark/sample_multiclass_classification_data.txt"

# LIBSVM files load as a DataFrame with "label" and "features" columns.
df = spark.read.format("libsvm").load(hdfs_path)
df.show()
df.printSchema()
print(df.head(2))
Output:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,1,2,3],[0.1...|
| 1.0|(4,[0,2,3],[-0.83...|
| 2.0|(4,[0,1,2,3],[-1....|
| 2.0|(4,[0,1,2,3],[-1....|
| 1.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,2,3],[0.611...|
| 0.0|(4,[0,1,2,3],[0.2...|
| 1.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 2.0|(4,[0,1,2,3],[-0....|
| 1.0|(4,[0,2,3],[-0.94...|
| 2.0|(4,[0,1,2,3],[-0....|
| 0.0|(4,[0,1,2,3],[0.1...|
| 2.0|(4,[0,1,2,3],[-0....|
+-----+--------------------+
only showing top 20 rows
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
[Row(label=1.0, features=SparseVector(4, {0: -0.2222, 1: 0.5, 2: -0.7627, 3: -0.8333})),
Row(label=1.0, features=SparseVector(4, {0: -0.5556, 1: 0.25, 2: -0.8644, 3: -0.9167}))]
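Before fitting, it is worth confirming how many classes the label column holds; the labels 0.0, 1.0, and 2.0 are visible in the sample above, so OneVsRest will fit k = 3 binary models. A quick check:
# Count rows per label; three distinct labels -> three per-class models.
df.groupBy("label").count().orderBy("label").show()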
03. Use logistic regression as the base binary classifier, set its regularization parameter to 0.01, and fit the model:
from pyspark.ml.classification import LogisticRegression, OneVsRest

# Base binary classifier with regularization strength 0.01.
lr = LogisticRegression(regParam=0.01)
ovr = OneVsRest(classifier=lr)
model = ovr.fit(df)  # fits one binary model per class
04. Inspect all of the fitted models:
print(model.models)
Output:
[LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4, LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4, LogisticRegressionModel: uid = LogisticRegression_31e28b0a5122, numClasses = 2, numFeatures = 4]
05. Inspect the per-model coefficients:
print([m.coefficients for m in model.models])
Output:
[DenseVector([0.5152, -1.09, 3.4683, 4.246]),
DenseVector([-2.1282, 3.1284, -2.6819, -2.3445]),
DenseVector([0.3064, -3.4213, 1.0461, -1.1383])]
06. Inspect the per-model intercepts:
print([m.intercept for m in model.models])
Output:
[-2.738074251493597, -2.564091464028484, -1.3244853130179222]
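To see how the one-vs-rest reduction picks a label, the per-class scores can be recomputed by hand from the coefficients and intercepts printed above. Each binary model scores an example with sigmoid(w·x + b), and OneVsRest assigns the index of the highest-scoring model; since sigmoid is monotonic, this is just the argmax of the margins. A plain-Python sketch using the test point from step 07:
import math

# Per-class coefficients and intercepts as printed in steps 05 and 06.
coefs = [
    [0.5152, -1.09, 3.4683, 4.246],
    [-2.1282, 3.1284, -2.6819, -2.3445],
    [0.3064, -3.4213, 1.0461, -1.1383],
]
intercepts = [-2.738074251493597, -2.564091464028484, -1.3244853130179222]

x = [-1.0, 0.0, 1.0, 1.0]  # the test point used in step 07

# Margin w.x + b for each binary model, then its sigmoid score.
margins = [sum(w * xi for w, xi in zip(ws, x)) + b
           for ws, b in zip(coefs, intercepts)]
scores = [1.0 / (1.0 + math.exp(-m)) for m in margins]
print(scores.index(max(scores)))  # 0 -> matches the prediction in step 07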
07. Generate a test point and run the prediction:
sc = spark.sparkContext
from pyspark.sql.types import Row
from pyspark.ml.linalg import Vectors

# Score a single dense point; transform() appends the prediction column.
test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, 0.0, 1.0, 1.0))]).toDF()
model.transform(test0).show()
Output:
+------------------+----------+
| features|prediction|
+------------------+----------+
|[-1.0,0.0,1.0,1.0]| 0.0|
+------------------+----------+
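A single hand-made point says little about model quality; the usual next step is a train/test split scored with MulticlassClassificationEvaluator. A minimal sketch (the 80/20 split and the seed are arbitrary choices, not from the walkthrough above):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hold out 20% of the data, refit on the rest, and measure accuracy.
train, test = df.randomSplit([0.8, 0.2], seed=42)
predictions = ovr.fit(train).transform(test)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("test accuracy:", evaluator.evaluate(predictions))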