Spark 1.2.0 Source Code, MLlib --- Decision Tree-03

Yobadman

Published 2015-01-27 21:19:10


Article link: https://blog.csdn.net/Yobadman/article/details/43202887


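This article takes a close look at the decision tree implementation in the MLlib module of Spark 1.2.0, focusing on how data is aggregated during node splitting so that impurity and information gain can be computed. By analyzing the `DecisionTree.findBestSplits()` method, in particular the `points.foreach(binSeqOp(nodeStatsAggregators, _))` step, it shows the details of per-partition aggregation and the role of the EntropyAggregator `update()` method in computing entropy.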


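This chapter focuses on how, while the nodes of the tree are being split, the data belonging to each node is aggregated so that node impurity and information gain can be computed afterward and the split finally decided.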
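Before reading the real code, it may help to see the idea in miniature. The following is only a sketch under stated assumptions, not MLlib's implementation: `LabelCounts` and `AggregationSketch` are hypothetical names, plain Scala collections stand in for RDD partitions, and only class-label counts are tracked (the real `DTStatsAggregator` keeps statistics per feature and bin). It shows the pattern of aggregating within each partition, merging the partial results, and then deriving entropy and information gain from the merged counts.

```scala
// Hypothetical, simplified stand-in for per-node statistics.
// This is NOT MLlib's DTStatsAggregator; it only counts class labels.
final class LabelCounts(numClasses: Int) {
  val counts: Array[Double] = Array.fill(numClasses)(0.0)

  // Record one training instance with the given class label.
  def update(label: Int): Unit = counts(label) += 1.0

  // Fold another partition's partial counts into this one.
  def merge(other: LabelCounts): LabelCounts = {
    var i = 0
    while (i < counts.length) { counts(i) += other.counts(i); i += 1 }
    this
  }
}

object AggregationSketch {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // Entropy (in bits) of a vector of class counts.
  def entropy(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * log2(p)
    }.sum
  }

  // Information gain of splitting `parent` into `left` and `right`.
  def gain(parent: Array[Double], left: Array[Double], right: Array[Double]): Double = {
    val n = parent.sum
    entropy(parent) - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right)
  }

  // Aggregate labels inside each (non-empty list of) partition(s),
  // then merge the per-partition partial results into one aggregate.
  def aggregate(partitions: Seq[Seq[Int]], numClasses: Int): LabelCounts =
    partitions.map { part =>
      val agg = new LabelCounts(numClasses)
      part.foreach(agg.update)
      agg
    }.reduce(_ merge _)
}
```

For instance, `AggregationSketch.aggregate(Seq(Seq(0, 0, 1), Seq(1, 1, 0)), 2)` builds one partial count per inner sequence and merges them; passing the resulting `counts` to `entropy` gives the impurity of the node those instances fall into. Only the small per-partition aggregates need to be combined, rather than the raw instances.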

We start from DecisionTree.findBestSplits(). This method is quite long; following its execution order, the code is as follows:

    val partitionAggregates : RDD[(Int, DTStatsAggregator)] = if (nodeIdCache.nonEmpty) {  // case: node IDs are cached
      input.zip(nodeIdCache.get.nodeIdsForInstances).mapPartitions { points =>
        // Construct a nodeStatsAggregators array to hold node aggregate stats,
        // each node will have a nodeStatsAggregator
        val nodeStatsAggregators = Array.tabulate(numNodes) { nodeIndex =>
          val featuresForNode = nodeToFeaturesBc.value.flatMap { nodeToFeatures =>
            Some(nodeToFeatures(nodeIndex))
          }
          new DTStatsAggregator(metadata, featuresForNode)
        }

        // iterate over all instances in the current partition and update aggregate stats
        points.foreach(binSeqOpWithNodeIdCache(nodeStatsAggregators, _))

        // transform nodeStatsAggregators array to (nodeIndex, nodeAggregateStats) pairs,
        // which can be combined with other partition using `reduceByKey`
        nodeStatsAggregators.view.zipWithIndex.map(_.swap).iterator
      }
    } else {  // case: node IDs are not cached
      in