Official Spark Examples: Two Ways to Implement a Random Forest Model (ML/MLlib)

Latest recommended article published 2024-07-29 02:25:38

O白马非马O

Categories: Data Mining, spark. Tags: spark, machine learning algorithms

Article link: https://blog.csdn.net/dahunbi/article/details/72821915

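This article shows how to implement a random forest model in Spark 2.0 and later with each of the two libraries, MLlib and ML. MLlib is RDD-based, while ML is built on ML Pipelines and the DataFrame data structure. The article includes the source code with comments to help readers understand how the two differ, and points to where the official example code can be found.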

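In Spark 2.0 and later there are two libraries that implement machine learning algorithms, MLlib and ML. For random forests, for example:
org.apache.spark.mllib.tree.RandomForest
and
org.apache.spark.ml.classification.RandomForestClassificationModel

The two libraries are used differently. MLlib is the RDD-based API,
while ML is the API built on ML Pipelines and the DataFrame data structure.
For details, see http://spark.apache.org/docs/latest/ml-guide.html
As a result, the official examples for the two also differ considerably. The source code with comments for each is given below: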
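
As a rough orientation before the full listings, the difference in data structures can be sketched by loading the same LibSVM file through each API. This is only an illustrative sketch, not part of the official examples: the object name `ApiContrastSketch`, the helper names `loadAsRdd` and `loadAsDataFrame`, and the `local[*]` master setting are assumptions made here for illustration. The data path is the one used in the official example and is relative to the root of a Spark distribution.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, introduced only for this sketch.
object ApiContrastSketch {

  // MLlib (RDD-based API): each record is a LabeledPoint held in an RDD.
  def loadAsRdd(spark: SparkSession, path: String): RDD[LabeledPoint] =
    MLUtils.loadLibSVMFile(spark.sparkContext, path)

  // ML (DataFrame-based API): the same file becomes a DataFrame
  // with a "label" column and a "features" column.
  def loadAsDataFrame(spark: SparkSession, path: String): DataFrame =
    spark.read.format("libsvm").load(path)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, chosen so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("ApiContrastSketch")
      .master("local[*]")
      .getOrCreate()

    val path = "data/mllib/sample_libsvm_data.txt"

    val rdd = loadAsRdd(spark, path)
    val df = loadAsDataFrame(spark, path)

    println(s"RDD record count: ${rdd.count()}")
    df.printSchema()

    spark.stop()
  }
}
```

The RDD of `LabeledPoint` is what `org.apache.spark.mllib.tree.RandomForest` trains on, whereas the DataFrame is the input expected by the estimators and Pipeline stages in `org.apache.spark.ml`.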

MLlib model implementation

// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object RandomForestClassificationExample {
   
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RandomForestClassificationExample")
    val sc = new SparkContext(conf)
    // $example on$
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit