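In Spark 2.0 and later there are two libraries that implement machine learning algorithms, MLlib and ML. Random forest, for example, appears as both:
org.apache.spark.mllib.tree.RandomForest
and
org.apache.spark.ml.classification.RandomForestClassificationModel
The two libraries are also used differently: MLlib is the RDD-based API,
while ML is an API built on ML Pipelines and the DataFrame data structure.
For details, see http://spark.apache.org/docs/latest/ml-guide.html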
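To make the difference in data structures concrete, here is a minimal sketch that loads the same LibSVM file through both APIs. It is an illustration rather than the official example: the object name ApiContrastSketch and the choice of 10 trees are my own assumptions, and it assumes a Spark 2.x build where the tree classifier can infer the number of classes from the label column.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this illustration.
object ApiContrastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApiContrastSketch").getOrCreate()
    val path = "data/mllib/sample_libsvm_data.txt"

    // RDD-based API (spark.mllib): the data set is an RDD of LabeledPoint.
    val rdd: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark.sparkContext, path)

    // DataFrame-based API (spark.ml): the data set is a DataFrame
    // with a "label" column and a "features" column.
    val df: DataFrame = spark.read.format("libsvm").load(path)

    // A spark.ml estimator is configured through setters and fitted on a DataFrame.
    // numTrees = 10 is an arbitrary value chosen for the sketch.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(10)
    val model = rf.fit(df)

    println(s"RDD rows: ${rdd.count()}, DataFrame rows: ${df.count()}")
    println(s"Trained ${model.trees.length} trees")

    spark.stop()
  }
}
```

The sketch sets no master URL, so it is meant to be launched with spark-submit from the Spark home directory, where the relative data path resolves.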
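The official examples therefore differ considerably as well. The source code for each is given below, with comments:
MLlib model implementation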
// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object RandomForestClassificationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RandomForestClassificationExample")
    val sc = new SparkContext(conf)
    // $example on$
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit