Spark MLlib Study (Part 2): Classification and Regression

MLlib supports a variety of classification and regression methods, covering binary classification, multiclass classification, and regression.

Problem type                Supported methods
Binary classification       linear SVMs, logistic regression, decision trees, random forests, GBTs, naive Bayes
Multiclass classification   decision trees, random forests, naive Bayes
Regression                  linear least squares, Lasso, ridge regression, decision trees, random forests, GBTs, isotonic regression

1. Linear Models

  • Classification (SVMs, logistic regression)
  • Linear regression (least squares, Lasso, ridge regression)

(1) Classification

MLlib provides two linear classification methods: logistic regression and linear support vector machines (SVMs). SVMs support only binary classification, while logistic regression supports both binary and multiclass classification. Training data is represented as an RDD[LabeledPoint], where the label is the class index, starting from 0.
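To make the RDD[LabeledPoint] representation concrete, here is a minimal sketch (the feature values are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Labels are class indices starting from 0: binary labels are 0.0 or 1.0.
val pos = LabeledPoint(1.0, Vectors.dense(2.0, 0.5, 1.0))
// A sparse vector takes the size plus (index, value) pairs for nonzero entries.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

// sc.parallelize(Seq(pos, neg)) then yields an RDD[LabeledPoint] for training.
```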

  • Linear support vector machines (SVMs)

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.6,0.4),seed = 11L)
val training = splits(0).cache()
val test = splits(1)

val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()    // clear the default threshold so predict returns raw scores

val scoreAndLabels = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
}
scoreAndLabels.take(5)    // inspect a few (score, label) pairs

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

model.save(sc, "myModelPath")    //save and load model
val sameModel = SVMModel.load(sc, "myModelPath")
  • Logistic regression

The L-BFGS version supports both binomial and multinomial logistic regression, while the SGD version supports only binomial logistic regression. The L-BFGS version does not support L1 regularization, whereas the SGD version does. When L1 regularization is not required, the L-BFGS version is recommended: by approximating the Hessian with a quasi-Newton method, it converges faster and more accurately.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}  

val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("precision = " + precision)

model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

(2) Regression

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("file:///home/hdfs/data_mllib/lpsa.data")
val parsedData = data.map { line =>
    val parts = line.split(',')
    // each line: label,space-separated features
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

val valuesAndPreds = parsedData.map{point =>
    val prediction = model.predict(point.features)
    (point.label,prediction)
}

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)

model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")

2. Decision Trees

A decision tree is a greedy algorithm that recursively partitions the feature space. MLlib supports decision trees for binary classification, multiclass classification, and regression, and they can handle both continuous and categorical features. MLlib provides two impurity measures for classification (Gini impurity and entropy) and one for regression (variance).

Note: ID3 and C4.5 are the information-gain-based algorithms. ID3 uses information gain, which tends to favor attributes with many values when splitting; C4.5, an improved version of ID3, uses the information gain ratio instead, and additionally discretizes continuous features based on information gain. CART uses Gini impurity as its measure.

Ensemble tree algorithms such as random forests and gradient-boosted trees are also widely used for classification and regression.
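The two classification impurity measures can be made concrete with a small self-contained sketch (plain Scala, not an MLlib API):

```scala
object Impurity {
  // Gini impurity: 1 - sum over classes of p_i^2, where p_i is the class frequency.
  def gini(counts: Seq[Double]): Double = {
    val total = counts.sum
    1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Entropy: -sum over classes of p_i * log2(p_i), with 0 * log 0 taken as 0.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    -counts.filter(_ > 0).map { c =>
      val p = c / total
      p * math.log(p) / math.log(2)
    }.sum
  }
}

// A pure node has zero impurity under both measures; a 50/50 node is maximally impure.
println(Impurity.gini(Seq(5.0, 5.0)))     // 0.5
println(Impurity.entropy(Seq(5.0, 5.0)))  // 1.0
println(Impurity.gini(Seq(10.0, 0.0)))    // 0.0
```

A split is chosen to maximize the impurity decrease between the parent node and the weighted average of its children.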

(1) Classification

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc,"file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7,0.3))
val (trainData,testData) = (splits(0),splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
    val prediction = model.predict(point.features)
    (point.label,prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)

model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

(2) Regression

Only the impurity measure differs: regression trees use variance instead of Gini impurity or entropy.

val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)

3. Tree Ensembles

Random forests vs. GBTs:
- GBTs train one tree at a time, so training takes longer than for a random forest, which can train many trees in parallel. On the other hand, GBTs typically use shallower trees, so each tree trains faster.
- Random forests are less prone to overfitting: training more trees reduces the chance of overfitting, whereas training too many trees with GBTs increases it. (Random forests reduce variance by combining many trees; GBTs reduce bias.)
- Random forests are easier to tune, since performance improves monotonically with the number of trees.

(1) Random Forests

Two main tunable parameters:

- numTrees: the number of trees in the forest. Increasing it reduces the variance of predictions and improves test accuracy.
- maxDepth: the maximum depth of each tree in the forest. Increasing it makes the model more expressive, but values that are too large lead to overfitting. In general, random forests tolerate greater depth than a single decision tree.

Two parameters that usually need no tuning, but can be adjusted to speed up training:

- subsamplingRate: the fraction of the full dataset used to train each tree in the forest. The default of 1.0 is recommended; decreasing it speeds up training.
- featureSubsetStrategy: the number of features to consider at each split in each tree, given either as a fraction or as a function of the total number of features. Decreasing it speeds up training, but too small a value hurts accuracy.
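These parameters come together in a short sketch, following the same pattern as the decision tree example (the file path and parameter values are illustrative, and sc is an existing SparkContext):

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10                   // more trees lower the prediction variance
val featureSubsetStrategy = "auto"  // let the algorithm pick the per-split feature count
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)
```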
(2) GBTs

MLlib supports GBTs for binary classification and regression, handling both continuous and categorical features. Multiclass classification is not yet supported; use decision trees or random forests for multiclass problems.

  • Classification
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc,"file:///home/hdfs/data_mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7,0.3))
val (trainData,testData) = (splits(0),splits(1))

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int,Int]()

val model = GradientBoostedTrees.train(trainData,boostingStrategy)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification GBT model:\n" + model.toDebugString)
  • Regression (nearly identical to classification)
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int,Int]()

val model = GradientBoostedTrees.train(trainData,boostingStrategy)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression GBT model:\n" + model.toDebugString)