PySpark Classification: RandomForestClassifier

Gadaite

Article link: https://blog.csdn.net/weixin_46408961/article/details/123415624

RandomForestClassifier

class pyspark.ml.classification.RandomForestClassifier(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='gini', numTrees=20, featureSubsetStrategy='auto', seed=None, subsamplingRate=1.0)

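A random forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.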

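cacheNodeIds = Param(parent='undefined', name='cacheNodeIds', doc='If false, the algorithm passes trees to executors to match instances with nodes. If true, the algorithm caches node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often the cache is checkpointed, or disable checkpointing, by setting checkpointInterval.')

checkpointInterval = Param(parent='undefined', name='checkpointInterval', doc='Set checkpoint interval (>= 1) or disable checkpointing (-1). For example, 10 means the cache is checkpointed every 10 iterations. Note: this setting is ignored if the checkpoint directory is not set in the SparkContext.')

featureSubsetStrategy = Param(parent='undefined', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (chosen automatically for the task: if numTrees == 1, set to 'all'; if numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features; when n is in the range (1, number of features), use n features). Default = 'auto'")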
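The featureSubsetStrategy options above are easier to compare with concrete numbers. The following is a minimal plain-Python sketch of how each option could map to a feature count. The function name num_features_per_node and its arguments are hypothetical and are not part of the PySpark API, and rounding up to a whole number of features is an assumption on my part rather than something the parameter doc states.

```python
import math


def num_features_per_node(strategy, num_features, num_trees=20, classification=True):
    """Illustrative resolver for featureSubsetStrategy (hypothetical helper, not Spark's API)."""
    s = strategy.lower()
    if s == "auto":
        # A single tree uses every feature; a forest uses sqrt (classification)
        # or one third (regression), as described in the parameter doc.
        if num_trees == 1:
            s = "all"
        else:
            s = "sqrt" if classification else "onethird"
    if s == "all":
        return num_features
    if s == "sqrt":
        return math.ceil(math.sqrt(num_features))       # rounding up is an assumption
    if s == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if s == "onethird":
        return math.ceil(num_features / 3.0)
    # Numeric form 'n': an integer string means "n features",
    # a fraction in (0, 1.0] means "n * number of features".
    try:
        n = int(s)
    except ValueError:
        frac = float(s)  # raises ValueError for unsupported strings
        if not 0.0 < frac <= 1.0:
            raise ValueError("fraction must be in (0, 1.0]: " + strategy)
        return math.ceil(frac * num_features)
    if n < 1:
        raise ValueError("feature count must be >= 1: " + strategy)
    return min(n, num_features)


if __name__ == "__main__":
    for option in ["auto", "all", "onethird", "sqrt", "log2", "0.5", "3"]:
        print(option, num_features_per_node(option, num_features=100))
```

Note that in this sketch the string "1" is read as one feature while "1.0" is read as the fraction 1.0 (all features), which follows from the two ranges given in the doc.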

impurity = Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini')
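To make the two impurity options concrete, here is a small plain-Python sketch of the textbook Gini and entropy formulas applied to the labels that reach a node. The helper names are hypothetical, and the base-2 logarithm for entropy is an assumption; this shows the standard definitions, not Spark's internal code.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the node's instances that belong to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy: sum(p_k * log2(1 / p_k)), using base 2 (an assumption)."""
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))


if __name__ == "__main__":
    balanced = [0, 0, 1, 1]
    pure = [1, 1, 1]
    print(gini(balanced), entropy(balanced))
    print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so either can drive the information gain calculation.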

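maxBins = Param(parent='undefined', name='maxBins', doc='Maximum number of bins for discretizing continuous features. Must be >= 2 and >= the number of categories of any categorical feature.')

maxDepth = Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree (>= 0). For example, depth 0 means 1 leaf node; depth 1 means 1 internal node and 2 leaf nodes.')

maxMemoryInMB = Param(parent='undefined', name='maxMemoryInMB', doc='Maximum memory in MB allocated to histogram aggregation. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.')

minInfoGain = Param(parent='undefined', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.')

minInstancesPerNode = Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after a split. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split is discarded as invalid. Should be >= 1.')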
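minInfoGain and minInstancesPerNode both act as filters on candidate splits. The sketch below shows one way the two checks could combine, using Gini impurity for the gain. The function split_is_valid is hypothetical, and treating a gain exactly equal to minInfoGain as acceptable is an assumption, not confirmed behaviour.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of the labels at a node: 1 - sum(p_k^2)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def split_is_valid(left, right, min_instances_per_node=1, min_info_gain=0.0):
    """Return True if a candidate split passes both thresholds (illustrative only)."""
    # Each child must keep at least min_instances_per_node instances.
    if len(left) < min_instances_per_node or len(right) < min_instances_per_node:
        return False
    parent = left + right
    n = len(parent)
    # Information gain = parent impurity minus the size-weighted child impurities.
    gain = gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
    # Assumption: a gain equal to the threshold is still accepted.
    return gain >= min_info_gain


if __name__ == "__main__":
    print(split_is_valid([0, 0], [1, 1]))                               # clean split
    print(split_is_valid([0], [1, 1, 1], min_instances_per_node=2))     # left child too small
    print(split_is_valid([0, 1], [0, 1], min_info_gain=0.1))            # no gain at all
```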

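numTrees = Param(parent='undefined', name='numTrees', doc='Number of trees to train (>= 1).')

probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='Raw prediction (a.k.a. confidence) column name.')

subsamplingRate = Param(parent='undefined', name='subsamplingRate', doc='Fraction of the training data used for learning each decision tree, in the range (0, 1].')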

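model.treeWeights: returns the weight of each tree

model.totalNumNodes: total number of nodes, summed over all trees in the ensemble

model.toDebugString: full description of the model

model.trees: the trees in this forest. Warning: these have null parent Estimators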

01. Build the data and index the labels

from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

# spark.driver.host is the author's local address; change or drop it for your machine.
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.10")\
    .config("spark.ui.showConsoleProgress", "false").appName("RandomForestClassifier")\
    .master("local[*]").getOrCreate()

# Two training rows: a dense vector [1.0] and an empty sparse vector (equivalent to [0.0]).
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
df.show()

# Index the label column; the result is written to the "indexed" column.
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
td.show()

Output:

+-----+---------+
|label| features|
+-----+---------+
|  1.0|    [1.0]|
|  0.0|(1,[],[])|
+-----+---------+

+-----+---------+-------+
|label| features|indexed|
+-----+---------+-------+
|  1.0|    [1.0]|    1.0|
|  0.0|(1,[],[])|    0.0|
+-----+---------+-------+

02. Train a classification model with the random forest classifier

from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
model = rf.fit(td)

03. Transform the original data with the model and inspect the result

model.transform(td).show()
print(model.transform(td).head(2))

Output:

+-----+---------+-------+-------------+--------------------+----------+
|label| features|indexed|rawPrediction|         probability|prediction|
+-----+---------+-------+-------------+--------------------+----------+
|  1.0|    [1.0]|    1.0|    [1.0,2.0]|[0.33333333333333...|       1.0|
|  0.0|(1,[],[])|    0.0|    [3.0,0.0]|           [1.0,0.0]|       0.0|
+-----+---------+-------+-------------+--------------------+----------+

[Row(label=1.0, features=DenseVector([1.0]), indexed=1.0, rawPrediction=DenseVector([1.0, 2.0]), probability=DenseVector([0.3333, 0.6667]), prediction=1.0),
 Row(label=0.0, features=SparseVector(1, {}), indexed=0.0, rawPrediction=DenseVector([3.0, 0.0]), probability=DenseVector([1.0, 0.0]), prediction=0.0)]
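The numbers above suggest how the three output columns relate: probability looks like rawPrediction divided by its sum, and prediction looks like the index of the largest entry. The sketch below checks that reading against the two rows shown. The helper names are hypothetical, and the relationship is inferred from this output rather than taken from the API documentation.

```python
def normalize(raw):
    """Scale a raw prediction vector so that its entries sum to 1."""
    total = sum(raw)
    return [v / total for v in raw]


def predict(raw):
    """Return the index of the largest raw score as a float label."""
    return float(max(range(len(raw)), key=lambda i: raw[i]))


if __name__ == "__main__":
    for raw in ([1.0, 2.0], [3.0, 0.0]):
        print(raw, normalize(raw), predict(raw))
```

With numTrees=3, a raw prediction of [1.0, 2.0] reads as one tree favouring class 0 and two trees favouring class 1.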

04. Inspect the feature importances

print(model.featureImportances)

Output:

(1,[0],[1.0])

05. Return the array of per-tree weights and use numpy's allclose to check that they match the expected values

print(model.treeWeights)
import numpy
print(numpy.allclose(model.treeWeights, [1.0, 1.0, 1.0]))

Output:

[1.0, 1.0, 1.0]
True

06. Create prediction data, transform it with the model above, and inspect the result

First test sample:

test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
test0.show()
model.transform(test0).show()
print(model.transform(test0).head(1))

Output:

+--------+
|features|
+--------+
|  [-1.0]|
+--------+

+--------+-------------+-----------+----------+
|features|rawPrediction|probability|prediction|
+--------+-------------+-----------+----------+
|  [-1.0]|    [3.0,0.0]|  [1.0,0.0]|       0.0|
+--------+-------------+-----------+----------+

[Row(features=DenseVector([-1.0]), rawPrediction=DenseVector([3.0, 0.0]), probability=DenseVector([1.0, 0.0]), prediction=0.0)]

Second test sample:

test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
model.transform(test1).show()
print(model.transform(test1).head(1))

Output:

+-------------+-------------+--------------------+----------+
|     features|rawPrediction|         probability|prediction|
+-------------+-------------+--------------------+----------+
|(1,[0],[1.0])|    [1.0,2.0]|[0.33333333333333...|       1.0|
+-------------+-------------+--------------------+----------+

[Row(features=SparseVector(1, {0: 1.0}), rawPrediction=DenseVector([1.0, 2.0]), probability=DenseVector([0.3333, 0.6667]), prediction=1.0)]

07. Return information about the model

print(model.totalNumNodes)
print(model.trees)
print(model.toDebugString)

Output:

7

[DecisionTreeClassificationModel (uid=dtc_9c2661dd0c4d) of depth 0 with 1 nodes, DecisionTreeClassificationModel (uid=dtc_f2a3aefdf16e) of depth 1 with 3 nodes, DecisionTreeClassificationModel (uid=dtc_d6c8f31954a3) of depth 1 with 3 nodes]

RandomForestClassificationModel (uid=RandomForestClassifier_9a4b8e792988) with 3 trees
  Tree 0 (weight 1.0):
    Predict: 0.0
  Tree 1 (weight 1.0):
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 1.0
  Tree 2 (weight 1.0):
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 1.0
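The debug string above is small enough to re-implement by hand, which is a handy way to check the earlier outputs. The sketch below hard-codes the three printed trees in plain Python and counts one vote per tree. The data layout and helper names are hypothetical, and one whole vote per tree is a simplification that I assume matches here only because every leaf in this tiny model predicts a single class.

```python
# Each tree is either a leaf (a float prediction) or a tuple
# (feature_index, threshold, left_subtree, right_subtree),
# transcribed from the toDebugString output above.
FOREST = [
    0.0,                   # Tree 0: a single leaf predicting 0.0
    (0, 0.5, 0.0, 1.0),    # Tree 1: feature 0 <= 0.5 -> 0.0, else 1.0
    (0, 0.5, 0.0, 1.0),    # Tree 2: same split as Tree 1
]


def predict_tree(tree, features):
    """Walk from the root to a leaf and return the leaf's prediction."""
    while isinstance(tree, tuple):
        idx, threshold, left, right = tree
        tree = left if features[idx] <= threshold else right
    return tree


def count_nodes(tree):
    """Count internal nodes plus leaves in one tree."""
    if isinstance(tree, tuple):
        return 1 + count_nodes(tree[2]) + count_nodes(tree[3])
    return 1


def raw_prediction(features, num_classes=2):
    """Sum one vote per tree for the class that tree predicts."""
    votes = [0.0] * num_classes
    for tree in FOREST:
        votes[int(predict_tree(tree, features))] += 1.0
    return votes


if __name__ == "__main__":
    print(raw_prediction([-1.0]))                 # the first test sample
    print(raw_prediction([1.0]))                  # the second test sample
    print(sum(count_nodes(t) for t in FOREST))    # compare with totalNumNodes
```

The vote counts agree with the rawPrediction columns in step 06, and the node count agrees with the totalNumNodes value of 7.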
