The Spark tree model source code is fairly complex; here I'll try to sketch out its overall structure.
First, a quick review of tree models. For regression, the rule for splitting a node is to minimize the sum of squared errors (SSE) of the two resulting groups. For each feature, find the split point with the smallest SSE; then, across all features, pick the split point whose SSE is smallest overall. Spark's implementation does not sort the values of each feature; instead it uses histograms (binned candidate split points).
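The split rule above can be sketched as follows. This is a minimal illustration, not Spark's code: for one numeric feature, the candidate thresholds are fixed histogram bin boundaries rather than every sorted value, and we keep the threshold that minimizes the combined SSE of the two groups. The function names and the tiny dataset are made up for the example.

```python
def sse(ys):
    """Sum of squared errors of a group around its mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys, num_bins=4):
    """Return (threshold, total_sse) of the best binary split on one feature.

    Candidate thresholds are histogram bin boundaries, mirroring the idea
    that Spark bins feature values instead of sorting them.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins
    best_t, best_err = None, float("inf")
    for i in range(1, num_bins):
        t = lo + i * width
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if total < best_err:
            best_t, best_err = t, total
    return best_t, best_err

# Two clearly separated clusters of targets: the best threshold should
# fall between x=3 and x=10, giving a very small combined SSE.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
t, err = best_split(xs, ys)
```

Repeating this per feature and taking the feature with the smallest SSE gives the node's split, exactly as described above.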
Since Spark's decision tree implementation is simply a random forest with a single tree, we can go straight to the random forest code.
The train method of RandomForestRegressor calls run in org.apache.spark.ml.tree.impl.RandomForest, and run calls runBagged; as the name suggests, this step involves bootstrap sampling (bagging).
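The bagging step can be sketched like this. This is an illustration rather than Spark's implementation (Spark's BaggedPoint keeps per-row sample counts and, as I understand it, approximates with-replacement sampling via Poisson draws); here we draw indices with replacement directly and record how often each row appears in each tree's sample. The function name and parameters are invented for the example.

```python
import random

def bootstrap_counts(n, num_trees, seed=42):
    """For each tree, count how many times each of the n rows is drawn
    in a bootstrap sample (n draws with replacement)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_trees):
        c = [0] * n
        for _ in range(n):
            c[rng.randrange(n)] += 1
        counts.append(c)
    return counts

counts = bootstrap_counts(n=5, num_trees=3)
```

Each tree's counts sum to n, since every bootstrap sample consists of n draws; storing counts per row is more compact than materializing a resampled dataset per tree.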
runBagged iterates over a nodeStack, where each tree is identified by a treeIndex. topNodes is an array holding the root of each tree; at the end it is used to construct the random forest model.
/*
Stack of nodes to train: (treeIndex, node)
The reason this is a stack is that we train many trees at once, but we want to focus on
completing trees, rather than training all simultaneously. If we are splitting nodes from
1 tree, then the new nodes to split will be put at the top of this stack, so we will continue
training the same tree in the next iteration. This focus allows us to send fewer trees to
workers on each iteration; see topNodesForGroup below.
*/
val nodeStack = new mutable.ListBuffer[(Int, LearningNode)]
val topNodes = Array.fill[LearningNode](numTrees)(LearningNode.emptyNode(nodeIndex = 1))
for (treeIndex <- 0 until numTrees) {
  nodeStack.prepend((treeIndex, topNodes(treeIndex)))
}
while (nodeStack.nonEmpty) {
// Colle