Let's take regression as the example; in some scenarios regression can be more accurate.
It supports both continuous and categorical features. A categorical feature is an attribute that takes a small set of discrete values, say a, b, c; it needs to be processed with the VectorIndexer from the Feature Transformers.
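As a sketch of that preprocessing step: VectorIndexer scans a vector column, treats any feature with at most maxCategories distinct values as categorical, re-encodes it as category indices, and attaches the category metadata to the output column. The column names, the threshold, and the toy data below are illustrative choices, not anything fixed by the API:

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("vi-sketch").getOrCreate()

// Toy data: feature 0 is continuous, feature 1 takes only the values 0.0 / 1.0 / 2.0
val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.17, 0.0)),
  Tuple1(Vectors.dense(2.35, 1.0)),
  Tuple1(Vectors.dense(1.08, 2.0))
)).toDF("features")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with <= 4 distinct values are treated as categorical

// fit() decides which features are categorical; transform() writes the
// indexed column plus the per-feature metadata the tree algorithms read
val indexedData = indexer.fit(data).transform(data)
```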
The class starts with a pile of parameters:
setMaxDepth: maximum tree depth
setMaxBins: maximum number of bins, used to approximate the statistics over a variable; e.g. if a variable has 100 distinct values, split it into just 10 bins and compute the statistics per bin
setMinInstancesPerNode: minimum number of instances per node
setMinInfoGain: minimum information gain
setMaxMemoryInMB: maximum memory in MB; the larger this value, the more node splits are processed in one pass
setCacheNodeIds: whether to cache node IDs; caching can speed up training of deeper trees
setCheckpointInterval: checkpoint interval, i.e. how many iterations between persisting the cached state
setImpurity: random forests support three impurity measures, entropy, gini, and variance; for regression it has to be variance
setSubsamplingRate: the sampling rate, i.e. what fraction of the samples is drawn to build each new tree
setSeed: the sampling seed; with the same seed, the sampling result is reproducible
setNumTrees: the number of trees in the forest
setFeatureSubsetStrategy: the strategy for picking the feature subset. A random forest is random in two ways: the samples that build each tree are random, and the features each tree considers when splitting are random; beyond that it differs little from a single decision tree. The source comment reads:
* The number of features to consider for splits at each tree node.
* Supported options:
* - "auto": Choose automatically for task: // the default strategy
* If numTrees == 1, set to "all." // a single decision tree uses all features
* If numTrees > 1 (forest), set to "sqrt" for classification and // a forest: classification takes sqrt of the feature count,
* to "onethird" for regression. // regression takes one third of the features
* - "all": use all features
* - "onethird": use 1/3 of the features
* - "sqrt": use sqrt(number of features)
* - "log2": use log2(number of features) // there is also a log-based option
* (default = "auto")
*
* These various settings are based on the following references:
* - log2: tested in Breiman (2001)
* - sqrt: recommended by Breiman manual for random forests
* - The defaults of sqrt (classification) and onethird (regression) match the R randomForest
* package.
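Putting those setters together, configuring the regressor might look like the sketch below. Every numeric value and column name here is an illustrative choice, not a recommendation from the source:

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("label")                 // illustrative column names
  .setFeaturesCol("indexedFeatures")
  .setMaxDepth(5)
  .setMaxBins(32)
  .setMinInstancesPerNode(1)
  .setMinInfoGain(0.0)
  .setMaxMemoryInMB(256)
  .setCacheNodeIds(true)
  .setCheckpointInterval(10)
  .setImpurity("variance")              // regression only supports variance
  .setSubsamplingRate(0.8)
  .setSeed(42L)
  .setNumTrees(20)
  .setFeatureSubsetStrategy("onethird") // or leave "auto" and let numTrees decide

// trainingData: a DataFrame with "label" and "indexedFeatures" columns
// val model = rf.fit(trainingData)
```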
That's it for the parameters. The next important part is this piece of code:
val categoricalFeatures: Map[Int, Int] =
MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
The annoying bit here is dataset.schema($(featuresCol)). The $ is not string interpolation but a method inherited from Params, whose doc comment starts:
/** An alias for [[getOrDefault
so $(featuresCol) simply resolves to the configured name of the features column, and dataset.schema(...) looks up that column's StructField.
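MetadataUtils.getCategoricalFeatures then reads the ML attribute metadata sitting on that StructField (the metadata VectorIndexer wrote) and builds a Map from feature index to category count. MetadataUtils itself is internal to spark.ml, but a rough user-land equivalent can be sketched with the public attribute API, assuming the column carries per-feature metadata:

```scala
import org.apache.spark.ml.attribute.{AttributeGroup, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Rough equivalent of getCategoricalFeatures: collect
// (feature index -> number of categories) for every nominal feature.
def categoricalFeatures(df: DataFrame, featuresCol: String): Map[Int, Int] = {
  val group = AttributeGroup.fromStructField(df.schema(featuresCol))
  group.attributes match {
    case Some(attrs) =>
      attrs.zipWithIndex.collect {
        case (nominal: NominalAttribute, idx) if nominal.getNumValues.isDefined =>
          idx -> nominal.getNumValues.get
      }.toMap
    case None => Map.empty // no per-feature metadata on this column
  }
}
```

For a continuous feature the attribute is numeric, so it is skipped; the resulting map is exactly what the tree code needs to know which features to treat as unordered splits.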