Learning Spark's MLlib, Part 37: Classification with Random Forest (Gini)

More code: https://github.com/xubo245/SparkLearning
Spark MLlib learning: classification series
1. Explanation
Random forest: RandomForest
The basic idea is to train many decision trees independently. To predict for a new data point, run it through every tree: for discrete (classification) outputs, take the class predicted by the most trees (majority vote); for continuous (regression) outputs, average the trees' predictions.
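The aggregation step described above can be sketched in a few lines of plain Scala (hypothetical per-tree predictions, not the Spark API):

```scala
// One prediction per tree (made-up values for illustration).
val treePredictions = Seq(1.0, 0.0, 1.0)

// Classification: the most frequently predicted class wins.
def majorityVote(preds: Seq[Double]): Double =
  preds.groupBy(identity).maxBy(_._2.size)._1

// Regression: the mean of the per-tree predictions.
def average(preds: Seq[Double]): Double = preds.sum / preds.size

println(majorityVote(treePredictions)) // prints 1.0
println(average(treePredictions))      // prints 0.6666666666666666
```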

2. Code:

/**
  * @author xubo
  *         ref: Spark MLlib Machine Learning in Action
  *         more code:https://github.com/xubo245/SparkLearning
  *         more blog:http://blog.csdn.net/xubo245
  */
package org.apache.spark.mllib.learning.classification

import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by xubo on 2016/5/23.
  */
object RandomForest2Spark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$')))
    val sc = new SparkContext(conf)

    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt")

    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a RandomForest model.
    //  Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification forest model:\n" + model.toDebugString)


    //    println("Learned classification tree model:\n" + model.toDebugString)
    println("data.count:" + data.count())
    println("trainingData.count:" + trainingData.count())
    println("testData.count:" + testData.count())
    println("model.algo:" + model.algo)
    println("model.trees:" + model.trees)

    println("labelAndPreds")
    labelAndPreds.take(10).foreach(println)

    //     Save and load model
    //    val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
    //    val path = "file/data/mllib/output/classification/RandomForestModel" + iString + "/result"
    //    model.save(sc, path)
    //    val sameModel = RandomForestModel.load(sc, path)
    //    println(sameModel.algo)
    sc.stop()
  }
}

3. Results:

Test Error = 0.04
Learned classification forest model:
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0

data.count:100
trainingData.count:75
testData.count:25
model.algo:Classification
model.trees:[Lorg.apache.spark.mllib.tree.model.DecisionTreeModel;@753c93d5
labelAndPreds
(1.0,1.0)
(1.0,0.0)
(0.0,0.0)
(0.0,0.0)
(1.0,1.0)
(0.0,0.0)
(1.0,1.0)
(1.0,1.0)
(1.0,1.0)
(0.0,0.0)
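The `Test Error` line is just the fraction of mismatched (label, prediction) pairs. Recomputing it locally over only the ten pairs printed above (the program's 0.04 comes from all 25 test points, so this partial figure differs):

```scala
// The ten (label, prediction) pairs shown above.
val labelAndPreds = Seq(
  (1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0),
  (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0))

// Error = mismatches / total; here 1 mismatch out of 10.
val testErr = labelAndPreds.count { case (l, p) => l != p }.toDouble / labelAndPreds.size
println(s"Test Error = $testErr") // prints Test Error = 0.1
```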

References
【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html
【2】http://spark.apache.org/docs/1.5.2/programming-guide.html
【3】https://github.com/xubo245/SparkLearning

Random forest is an ensemble learning method for classification and regression, and the Gini index is the impurity measure its trees use to choose splits (and, by extension, to estimate feature importance).

A random forest improves accuracy and generalization by combining many decision trees. During training, each tree is built from a bootstrap sample drawn with replacement from the original data, and at each node the tree selects the feature split that best increases the purity of the resulting child nodes.

The Gini index quantifies a node's impurity: the purer the node, the smaller the Gini value. When splitting a node, the algorithm evaluates the candidate features and picks the split with the lowest Gini impurity. For classification, a node's Gini impurity is computed from the class probabilities at that node: it equals 1 minus the sum of the squared class probabilities.

For regression, MLlib does not use Gini; it uses variance as the impurity measure (the mean squared deviation of the labels in a node), which is why `trainClassifier` takes `impurity = "gini"` while the regression counterpart `trainRegressor` uses `"variance"`. In short, the random forest uses the Gini index to pick good splits for classification, and variance to pick good splits for regression.
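The classification formula (1 minus the sum of squared class probabilities) is easy to verify with a minimal plain-Scala sketch; the label sequences below are made up for illustration:

```scala
// Gini impurity of a node: 1 - sum over classes of p(class)^2.
def gini(labels: Seq[Double]): Double = {
  val n = labels.size.toDouble
  1.0 - labels.groupBy(identity).values.map { group =>
    val p = group.size / n // probability of this class at the node
    p * p
  }.sum
}

println(gini(Seq(0.0, 0.0, 0.0, 0.0))) // pure node, prints 0.0
println(gini(Seq(0.0, 0.0, 1.0, 1.0))) // 50/50 split, prints 0.5
```

A pure node scores 0, and a perfectly mixed two-class node scores 0.5, the maximum for binary classification.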