MultilayerPerceptronClassifier
class pyspark.ml.classification.MultilayerPerceptronClassifier(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, tol=1e-06, seed=None, layers=None, blockSize=128, stepSize=0.03, solver='l-bfgs', initialWeights=None, probabilityCol='probability', rawPredictionCol='rawPrediction')
Classifier trainer based on the multilayer perceptron. Each hidden layer uses the sigmoid activation function; the output layer uses softmax. The number of inputs must equal the size of the feature vector, and the number of outputs must equal the total number of labels.
blockSize = Param(parent='undefined', name='blockSize', doc='Block size for stacking input data in matrices. Data is stacked within partitions. If the block size is larger than the remaining data in a partition, it is adjusted to the size of that data. Recommended size is between 10 and 1000; default is 128.')
initialWeights = Param(parent='undefined', name='initialWeights', doc='The initial weights of the model.')
layers = Param(parent='undefined', name='layers', doc='Sizes of layers from input layer to output layer. E.g., Array(780, 100, 10) means 780 inputs, one hidden layer with 100 neurons, and an output layer of 10 neurons.')
maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')
predictionCol = Param(parent='undefined', name='predictionCol', doc='prediction column name.')
probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')
solver = Param(parent='undefined', name='solver', doc='The solver algorithm for optimization. Supported options: l-bfgs, gd.')
stepSize = Param(parent='undefined', name='stepSize', doc='Step size to be used for each iteration of optimization (>= 0).')
tol = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')
model.layers: array of layer sizes of the fitted model (including the input and output layers)
model.weights: the weights of the layers
01. Create the data and inspect its structure:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.10")\
.config("spark.ui.showConsoleProgress","false").appName("MultilayerPerceptronClassifier")\
.master("local[*]").getOrCreate()
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
(0.0, Vectors.dense([0.0, 0.0])),
(1.0, Vectors.dense([0.0, 1.0])),
(1.0, Vectors.dense([1.0, 0.0])),
(0.0, Vectors.dense([1.0, 1.0]))], ["label", "features"])
df.show()
df.printSchema()
Output:
+-----+---------+
|label| features|
+-----+---------+
| 0.0|[0.0,0.0]|
| 1.0|[0.0,1.0]|
| 1.0|[1.0,0.0]|
| 0.0|[1.0,1.0]|
+-----+---------+
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
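Note that these four rows encode the classic XOR problem (label = x0 XOR x1). XOR is not linearly separable, which is why the model trained in the next step needs a hidden layer. A minimal check of this reading of the data:

```python
# The training rows above encode XOR: label = x0 XOR x1.
rows = [(0.0, (0.0, 0.0)), (1.0, (0.0, 1.0)),
        (1.0, (1.0, 0.0)), (0.0, (1.0, 1.0))]
for label, (a, b) in rows:
    assert label == float(int(a) ^ int(b))
print("labels match XOR")
```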
02. Train a classification model with the multilayer perceptron, then inspect the model's array of layer sizes (including the input and output layers) and the layer weights:
from pyspark.ml.classification import MultilayerPerceptronClassifier
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[2, 2, 2], blockSize=1, seed=123)
model = mlp.fit(df)
print(model.layers)
print(model.weights)
Output:
[2, 2, 2]
[77.48393444396959,15.467598423079066,101.61312397521955,18.83896540095585,
-1.085572283116047,-22.201263602605078,-33.77774366026508,32.92565501563212,
22.252859675116593,-22.036303290856857,20.349302930001777,-20.256728061124935]
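The length of this weight vector follows directly from the layer sizes: each consecutive layer pair contributes an n_in × n_out weight matrix plus an n_out bias vector, i.e. (n_in + 1) * n_out values. A quick sanity check for layers = [2, 2, 2]:

```python
# Why model.weights has 12 entries for layers = [2, 2, 2]:
# each layer pair contributes (n_in + 1) * n_out values
# (an n_in x n_out weight matrix plus an n_out bias vector).
layers = [2, 2, 2]
n_weights = sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))
print(n_weights)  # 12
```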
03. Construct test data, transform it with the classification model above, and inspect the results:
testDF = spark.createDataFrame([
(Vectors.dense([1.0, 0.0]),),
(Vectors.dense([0.0, 0.0]),)], ["features"])
model.transform(testDF).show()
print(model.transform(testDF).head(2))
Output:
+---------+--------------------+--------------------+----------+
| features| rawPrediction| probability|prediction|
+---------+--------------------+--------------------+----------+
|[1.0,0.0]|[-13.401987688017...|[4.88564876715289...| 1.0|
|[0.0,0.0]|[11.8220114663338...|[0.99999999995232...| 0.0|
+---------+--------------------+--------------------+----------+
[Row(features=DenseVector([1.0, 0.0]), rawPrediction=DenseVector([-13.402, 12.6427]), probability=DenseVector([0.0, 1.0]), prediction=1.0),
Row(features=DenseVector([0.0, 0.0]), rawPrediction=DenseVector([11.822, -11.9445]), probability=DenseVector([1.0, 0.0]), prediction=0.0)]
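The rawPrediction values can be reproduced by hand from model.weights, which makes the network structure concrete. The packing order below (layer by layer, each layer's weight matrix in column-major order followed by its bias vector) is an internal Spark detail, not a documented contract; it is inferred here and checked against the printed output above:

```python
# A sketch of the [2, 2, 2] network's forward pass using the weights
# printed in step 02. Assumption: weights are packed layer by layer,
# each as a column-major n_out x n_in matrix followed by an n_out bias.
import math

weights = [77.48393444396959, 15.467598423079066, 101.61312397521955,
           18.83896540095585, -1.085572283116047, -22.201263602605078,
           -33.77774366026508, 32.92565501563212, 22.252859675116593,
           -22.036303290856857, 20.349302930001777, -20.256728061124935]

def forward(x, weights, layers):
    """Manual forward pass: sigmoid on hidden layers, raw scores at the top."""
    pos, h = 0, list(x)
    for i in range(len(layers) - 1):
        n_in, n_out = layers[i], layers[i + 1]
        # column-major matrix: W[row][col] = weights[pos + col * n_out + row]
        W = [[weights[pos + c * n_out + r] for c in range(n_in)]
             for r in range(n_out)]
        pos += n_in * n_out
        b = weights[pos:pos + n_out]
        pos += n_out
        z = [sum(W[r][c] * h[c] for c in range(n_in)) + b[r]
             for r in range(n_out)]
        last = (i == len(layers) - 2)
        h = z if last else [1.0 / (1.0 + math.exp(-v)) for v in z]
    return h

raw = forward([1.0, 0.0], weights, [2, 2, 2])
print(raw)  # ≈ [-13.402, 12.6427], matching rawPrediction for [1.0, 0.0]
```

Applying softmax to these raw scores gives the probability column, and the prediction is the index of the largest score (1.0 here).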