PySpark Classification: LinearSVC


Gadaite



Article link: https://blog.csdn.net/weixin_46408961/article/details/123415615


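This post shows how to use PySpark's LinearSVC class for binary classification. It builds a small dataset, trains a model with parameters such as the maximum number of iterations and the regularization parameter, and inspects the model's coefficients, number of classes, and number of features. The results show the model's predictions and its raw prediction values.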


LinearSVC: linear support vector machine (SVM) classifier

class pyspark.ml.classification.LinearSVC(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, tol=1e-06, rawPredictionCol='rawPrediction', fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2)

This binary classifier optimizes the hinge loss using the OWLQN optimizer. Only L2 regularization is currently supported (as of version 2.4.5).
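To make "hinge loss with L2 regularization" concrete, here is a minimal pure-Python sketch of that kind of objective. The function names `hinge_loss` and `objective` are hypothetical, and the exact scaling of the penalty (0.5 * regParam * ||w||^2, intercept not penalized) is an assumption for illustration; Spark's internal objective also works on standardized features, which this sketch ignores.

```python
def hinge_loss(margin, label):
    """Hinge loss for one example; label is 0.0 or 1.0, as in the label column."""
    y = 2.0 * label - 1.0              # map {0, 1} to {-1, +1}
    return max(0.0, 1.0 - y * margin)


def objective(weights, intercept, rows, reg_param):
    """Mean hinge loss plus an (assumed) L2 penalty on the weights only."""
    total = 0.0
    for features, label in rows:
        margin = sum(w * x for w, x in zip(weights, features)) + intercept
        total += hinge_loss(margin, label)
    l2_penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return total / len(rows) + l2_penalty


# The two training rows used later in this post, as (features, label) pairs.
rows = [([1.0, 1.0, 1.0], 1.0), ([1.0, 2.0, 3.0], 0.0)]

# An all-zero model has margin 0 on every row, so each row contributes a loss of 1.
print(objective([0.0, 0.0, 0.0], 0.0, rows, reg_param=0.01))
```

An example with a margin on the correct side and larger than 1 contributes no loss at all, which is why only points near or across the decision boundary influence the fit.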

maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')

predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

regParam = Param(parent='undefined', name='regParam', doc='regularization parameter (>= 0).')

tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')

fitIntercept = Param(parent='undefined', name='fitIntercept', doc='whether to fit an intercept term.')

standardization = Param(parent='undefined', name='standardization', doc='whether to standardize the training features before fitting the model.')

threshold = Param(parent='undefined', name='threshold', doc='The threshold in binary classification applied to the linear model prediction. This threshold can be any real number, where Inf will make all predictions 0.0 and -Inf will make all predictions 1.0.')

weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')

aggregationDepth = Param(parent='undefined', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).')
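The threshold description can be illustrated with a small pure-Python sketch. The helper `predict` is hypothetical and only mirrors the documented rule (the linear model's output is compared against the threshold); it is not the library's code.

```python
import math


def predict(margin, threshold=0.0):
    """Return 1.0 when the linear model's output exceeds the threshold, else 0.0."""
    return 1.0 if margin > threshold else 0.0


# A fixed margin evaluated under several thresholds.
for t in (0.0, 1.0, math.inf, -math.inf):
    print(t, predict(0.5582, t))
```

With `math.inf` no finite margin can exceed the threshold, so every prediction is 0.0; with `-math.inf` every finite margin exceeds it, so every prediction is 1.0.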

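model.coefficients: the model coefficients of the linear SVM classifier

model.numClasses: the number of classes (the values the label can take)

model.numFeatures: the number of features the model was trained on; returns -1 if unknown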

01. Create the dataset

from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from pyspark.ml.linalg import Vectors
# Start a local Spark session that uses all available cores
spark = SparkSession.builder.appName("LinearSVC")\
    .master("local[*]").getOrCreate()
sc = spark.sparkContext
# Two labeled rows, each with a dense 3-dimensional feature vector
df = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0, 3.0))
]).toDF()
df.show()

Output:

+-------------+-----+
|     features|label|
+-------------+-----+
|[1.0,1.0,1.0]|  1.0|
|[1.0,2.0,3.0]|  0.0|
+-------------+-----+

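02. Fit LinearSVC and transform the data

from pyspark.ml.classification import LinearSVC
# Train with at most 5 iterations and a small regularization parameter
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(df)
# transform() appends the rawPrediction and prediction columns
model.transform(df).show()
model.transform(df).head(3)

Output: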

+-------------+-----+--------------------+----------+
|     features|label|       rawPrediction|prediction|
+-------------+-----+--------------------+----------+
|[1.0,1.0,1.0]|  1.0|[-0.5581623056159...|       1.0|
|[1.0,2.0,3.0]|  0.0|[0.08756571302736...|       0.0|
+-------------+-----+--------------------+----------+
[Row(features=DenseVector([1.0, 1.0, 1.0]), label=1.0, rawPrediction=DenseVector([-0.5582, 0.5582]), prediction=1.0),
 Row(features=DenseVector([1.0, 2.0, 3.0]), label=0.0, rawPrediction=DenseVector([0.0876, -0.0876]), prediction=0.0)]

03. Model coefficients of the linear SVM classifier

model.coefficients

Output: DenseVector([0.0, -0.2792, -0.1833])
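The first coefficient is exactly 0.0. A likely explanation, which the original post does not state, is that the first feature equals 1.0 in both training rows: a constant feature has zero standard deviation and carries no information the intercept does not already capture. The sketch below only checks the per-feature sample standard deviation of the two training rows; how Spark treats zero-variance features internally is an assumption here.

```python
from statistics import stdev

# Feature vectors of the two training rows from step 01.
rows = [
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 3.0],
]

# Transpose rows into per-feature columns and compute each sample standard deviation.
columns = list(zip(*rows))
feature_std = [stdev(col) for col in columns]

for index, value in enumerate(feature_std):
    note = "constant feature" if value == 0.0 else "varies across rows"
    print(index, round(value, 4), note)
```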

04. Number of classes (the values the label can take)

model.numClasses

Output: 2

05. Number of features the model was trained on

model.numFeatures

Output: 3

06. Generate data and predict

test0 = sc.parallelize([Row(features=Vectors.dense(-1.0, -1.0, -1.0))]).toDF()
result = model.transform(test0).head()
result.prediction

Output: 1.0

07. Inspect the raw prediction

result.rawPrediction

Output: DenseVector([-1.4831, 1.4831])
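The printed numbers are mutually consistent, which can be checked by hand. In the outputs above, rawPrediction has the form [-m, m], and the prediction is 1.0 when m is positive. The sketch below assumes m = w · x + b, takes w from the rounded coefficients printed in step 03, and infers the intercept b from the first training row, since the post never prints it (in PySpark, `model.intercept` would give the exact value). All values are rounded to four decimals, so agreement is only approximate.

```python
# Rounded coefficients as printed by model.coefficients in step 03.
coefficients = [0.0, -0.2792, -0.1833]


def dot(w, x):
    """Plain dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Inferred, not printed in the post: the first training row [1, 1, 1] has m = 0.5582,
# so b = m - w . x under the assumed form m = w . x + b.
intercept = 0.5582 - dot(coefficients, [1.0, 1.0, 1.0])


def margin(x):
    """Linear decision value m = w . x + b."""
    return dot(coefficients, x) + intercept


# Second training row and the test point from step 06.
for x in ([1.0, 2.0, 3.0], [-1.0, -1.0, -1.0]):
    m = margin(x)
    print(x, [round(-m, 4), round(m, 4)], 1.0 if m > 0.0 else 0.0)
```

The reconstructed values land within rounding error of the printed rawPrediction vectors for both rows, and the signs match the printed predictions (0.0 for the second training row, 1.0 for the test point).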
