XGBoost on Spark

ukakasu

Categories: spark, machine learning

Article link: https://blog.csdn.net/ukakasu/article/details/80052859

Background

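The project needs the predicted probability of every class. The built-in algorithms in Spark ML and MLlib could only give me the predicted class, which did not meet that need, so I turned to XGBoost.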

Versions

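Spark 1.6 only works with XGBoost versions before 0.7. Those versions accept only RDDs, not DataFrames, for training and prediction, which is inconvenient: the prediction output contains nothing but the probability values, so you have to join it back to the original data yourself to get complete records, and you also have to compute the most probable class yourself. I therefore chose Spark 2.0 with XGBoost 0.7.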
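The manual post-processing described above for the older RDD-based versions can be sketched in plain Scala. This is only an illustration: `ClassFromProbabilities`, `predictedClass`, and `attach` are hypothetical names, not part of XGBoost or Spark, and the sketch assumes the predictions come back in the same order as the input records.

```scala
// Hypothetical helpers for illustration; not an XGBoost or Spark API.
object ClassFromProbabilities {

  /** Index of the largest probability, used as the predicted class label. */
  def predictedClass(probs: Array[Float]): Int = {
    require(probs.nonEmpty, "probability vector must not be empty")
    probs.indices.maxBy(i => probs(i))
  }

  /**
   * Pair each record key with its probability vector and its most likely class.
   * Assumes keys and predictions are aligned one to one, in the same order.
   */
  def attach[K](keys: Seq[K], probs: Seq[Array[Float]]): Seq[(K, Array[Float], Int)] = {
    require(keys.length == probs.length, "keys and predictions must align one to one")
    keys.zip(probs).map { case (k, p) => (k, p, predictedClass(p)) }
  }
}
```

On an RDD the same idea would be a join or zip between the original records and the prediction output, followed by a map that applies `predictedClass`.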

Scala code

/**
 * Train an XGBoost model with DataFrame-represented data.
 *
 * @param trainingData the training set represented as a DataFrame
 * @param params Map containing the parameters to configure XGBoost
 * @param round the number of iterations
 * @param nWorkers the number of xgboost workers, 0 by default, which means that the number of
 *                 workers equals the partition number of the trainingData RDD
 * @param obj the user-defined objective function, null by default
 * @param eval the user-defined evaluation function, null by default
 * @param useExternalMemory indicates whether to use the external memory cache; by setting this
 *                          flag to true, the user may save the RAM cost of running XGBoost within Spark
 * @param missing the value that represents a missing value in the dataset
 * @param featureCol the name of the input column, "features" by default
 * @param labelCol the name of the output column, "label" by default
 */

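import ml.dmlc.xgboost4j.scala.spark.{XGBoost, XGBoostModel}
import org.apache.spark.sql.DataFrame

// Command-line arguments: tree depth, number of boosting rounds, number of XGBoost workers
val maxDepth = args(0).toInt
val numRound = args(1).toInt
val nworker = args(2).toInt
val paramMap = List(
  "eta" -> 0.01, // learning rate
  "gamma" -> 0.1, // controls pruning: larger values are more conservative; typically around 0.1 or 0.2
  "lambda" -> 2, // L2 regularization on the weights; larger values make overfitting less likely
  "subsample" -> 0.8, // fraction of training rows sampled at random
  "colsample_bytree" -> 0.8, // fraction of columns sampled when building each tree
  "max_depth" -> maxDepth, // tree depth; deeper trees overfit more easily
  "min_child_weight" -> 5,
  "objective" -> "multi:softprob", // learning task and objective: multiclass, one probability per class
  "eval_metric" -> "merror", // multiclass classification error rate
  "num_class" -> 21 // number of classes
).toMap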

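// vecDF is a DataFrame with a "features" vector column and a "label" column, built elsewhere.
// missing = 0.0f means that 0.0 is the value treated as missing in the dataset.
val model: XGBoostModel = XGBoost.trainWithDataFrame(vecDF, paramMap, numRound, nworker,
  useExternalMemory = true,
  featureCol = "features",
  labelCol = "label",
  missing = 0.0f)

// Predict. This example scores the same DataFrame it was trained on;
// pass a held-out test DataFrame here to evaluate the model.
val predict: DataFrame = model.transform(vecDF)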

Parameter reference: http://blog.csdn.net/zc02051126/article/details/46711047
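To make the "multi:softprob" objective with "num_class" more concrete: it yields one probability per class, obtained by applying a softmax to the per-class raw scores, so the values for a row sum to 1. The following is a minimal plain-Scala sketch of that transformation, not XGBoost's own code; `SoftprobSketch` and `softmax` are names chosen here for illustration.

```scala
// Illustrative sketch only; XGBoost computes this internally.
object SoftprobSketch {

  /** Turn per-class raw scores into probabilities that sum to 1. */
  def softmax(scores: Array[Double]): Array[Double] = {
    require(scores.nonEmpty, "need at least one class score")
    // Subtract the maximum before exponentiating to avoid overflow.
    val maxScore = scores.max
    val exps = scores.map(s => math.exp(s - maxScore))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

With "num_class" -> 21 as in the parameter map, each row would get 21 such probabilities, and the class with the largest one is the predicted label.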

Pay attention to how partitions, workers, and executors correspond to one another.
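One way to read this note, as a sketch under stated assumptions: each XGBoost worker runs as one Spark task over one partition (the doc comment above says nWorkers = 0 falls back to the partition count of the training data), and I am assuming all workers have to run at the same time, so the executors must offer at least nWorkers concurrent task slots. `WorkerSlots` and its methods are hypothetical helpers; the inputs correspond to the number of executors, `spark.executor.cores`, and `spark.task.cpus`.

```scala
// Hypothetical sanity check; not part of Spark or XGBoost.
object WorkerSlots {

  /** Concurrent task slots: executors * (cores per executor / CPUs per task). */
  def taskSlots(numExecutors: Int, executorCores: Int, taskCpus: Int = 1): Int = {
    require(numExecutors > 0 && executorCores > 0 && taskCpus > 0,
      "all resource counts must be positive")
    numExecutors * (executorCores / taskCpus)
  }

  /** True when every XGBoost worker can be scheduled at the same time. */
  def canRunAllWorkers(nWorkers: Int, numExecutors: Int,
                       executorCores: Int, taskCpus: Int = 1): Boolean =
    nWorkers <= taskSlots(numExecutors, executorCores, taskCpus)
}
```

For example, with 4 executors of 2 cores each and one CPU per task there are 8 slots, so under these assumptions an nworker value of 8 fits and 9 does not.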
