Background:
Since Microsoft open-sourced LightGBM in 2017, it has shown up regularly among the top teams of major algorithm competitions. Most entries, however, are single-machine Python scripts. To leverage Spark for large datasets, I started my LightGBM-on-Spark journey.
Problems encountered:
Problem 1: Dataset create call failed in LightGBM with error: bad allocation
With a small dataset (under 200 rows), this error appeared intermittently whenever there were more than four feature columns (a mix of categorical and numerical). A GitHub discussion suggested that enlarging the dataset avoids it; after growing the data to tens of thousands of rows the error indeed disappeared. My guess is that the sample count was smaller than parameters such as the number of trees or leaves.
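If shrinking the model instead of growing the data is preferable, a minimal sketch would reduce tree capacity so it fits a tiny sample. This is an assumption on my part, not something verified in the post; `setMinDataInLeaf` is a setter I believe mmlspark's `LightGBMClassifier` exposes, but treat it as hypothetical.

```scala
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

// Hypothetical sketch: shrink model capacity so LightGBM can build its
// internal Dataset from very few rows (~200). Values are illustrative.
val smallDataClassifier = new LightGBMClassifier()
  .setLabelCol("play")
  .setFeaturesCol("gbdtFeature")
  .setNumLeaves(4)        // far fewer leaves than the default 31
  .setMinDataInLeaf(5)    // assumed setter: allow leaves with only a few rows
  .setNumIterations(50)   // far fewer trees than a production run
```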
Problem 2: BarrierJobUnsupportedRDDChainException
Other users said the Scala API needs setUseBarrierExecutionMode(true). I tried it, but it made no difference either way, and it also seemed to make Spark on YARN very slow. I worked around the exception by applying limit(N) directly; the root cause is still unclear.
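As a sketch of the workaround actually used here (capping the input with limit(N) rather than enabling barrier mode), assuming `lgbmTrain` and `classifier` as defined in the appendix code below; `N` is a placeholder, not a tuned value:

```scala
// Workaround: materialize a bounded input with limit(N) instead of
// calling setUseBarrierExecutionMode(true), which was very slow on YARN.
val N = 90000000                       // placeholder row cap
val boundedTrain = lgbmTrain.limit(N)  // truncates the RDD lineage seen by the fit
val model = classifier.fit(boundedTrain)
```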
Problem 3: slow training in cluster mode: when parallelism is increased, CPU utilization on the slaves stays low. My suggestion is to instead request a group of big cores on a single high-spec server (note: Linux automatically balances load across cores; the prediction stage does not have this problem).
After narrowing down where the time went, I removed setValidationIndicatorCol("validate") (judging by AUC it had no effect, and it was slow because it triggers a collect in LightGBMBase.scala, which is costly). I also removed setUseBarrierExecutionMode(true). The cluster could then cope: training took 4.9 hours with AUC = 0.85, but scoring 30 million records still took 24 minutes. The low CPU utilization remains to be analyzed.
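To narrow down where time goes, a simple sketch is to force each lazy stage and time it; this is a generic technique, not code from the post (`scoreDF` is a hypothetical scoring DataFrame, and `lgbmModel` is the fitted model from the appendix):

```scala
// Rough stage timing: count() forces Spark to actually execute the
// lazy transform, so the elapsed time covers the scoring work.
val t0 = System.nanoTime()
val scored = lgbmModel.transform(scoreDF)
scored.count()
println(f"predict took ${(System.nanoTime() - t0) / 1e9}%.1f s")
```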
Problem 4: unlike a Python script, you cannot directly get leaf indices via lgbm.predict(val_df, pred_leaf=True).
On GitHub the author said this feature would be added later; when, exactly, is unknown.
Appendix: part of the feature-processing code (I'll publish the full version on GitHub once I'm no longer working overtime):
// Note: train_index must be declared as a var, since it is reassigned in the loop.
val catalog_features = Array("countrycode", "itemID", "id", "sex")
for (catalog_feature <- catalog_features) {
  val indexer = new StringIndexer()
    .setInputCol(catalog_feature)
    .setOutputCol(catalog_feature.concat("_index"))
  val train_index_model: StringIndexerModel = indexer.fit(train_index)
  // Persist each fitted indexer so the same mapping can be reused at prediction time.
  train_index_model.write.overwrite().save(savePathMl + "model/" + catalog_feature + "SongModel")
  train_index = train_index_model.transform(train_index)
}
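Since each StringIndexerModel is saved above, a sketch of reusing them at prediction time might look like the following. This is my assumption, not code from the post; `predictDF` is a hypothetical raw prediction DataFrame, and setHandleInvalid("keep") (available in Spark 2.2+) is one way to bucket labels unseen during training instead of failing.

```scala
import org.apache.spark.ml.feature.StringIndexerModel

// Reload each saved indexer (same path convention as the training loop)
// and apply it to the prediction-time data.
var predict_index = predictDF
for (catalog_feature <- catalog_features) {
  val m = StringIndexerModel.load(savePathMl + "model/" + catalog_feature + "SongModel")
  predict_index = m.setHandleInvalid("keep").transform(predict_index)
}
```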
println("String index labels:")
val vecCols=Array("countrycode_index","itemID_index","id_index","sex_index","category2vec","userFeature","itemFeature","cosScore","dotScore")
val lgbmAssembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("gbdtFeature")
val classifier = new LightGBMClassifier()
  .setLabelCol("play")
  .setObjective("binary")
  .setCategoricalSlotNames(Array("countrycode_index", "itemID_index", "id_index", "sex_index"))
  // .setUseBarrierExecutionMode(true) // extremely slow
  .setFeaturesCol("gbdtFeature")
  .setPredictionCol("predictPlay")
  .setNumIterations(8000) // LightGBM constructs num_class * num_iterations trees
  .setNumLeaves(32)
  .setLearningRate(0.002)
  .setProbabilityCol("probabilitys")
  .setEarlyStoppingRound(200)
  .setBoostingType("gbdt")
  .setLambdaL1(0.01)
  .setLambdaL2(0.01)
  .setMaxDepth(12)
val lgbmTrain = lgbmAssembler.transform(train_index.selectExpr(...)) // column list elided
val lgbmModel = classifier.fit(lgbmTrain.limit(90000000))
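For completeness, a sketch of the scoring stage that the post's timing numbers refer to. `predictIndexed` is a hypothetical DataFrame assumed to have gone through the same StringIndexer steps as training; the column names match those configured on the classifier above.

```scala
// Assemble the same feature vector, then score with the fitted model.
val lgbmPredict = lgbmAssembler.transform(predictIndexed)
val scored = lgbmModel.transform(lgbmPredict)
// "probabilitys" and "predictPlay" are the output columns set on the classifier.
scored.select("probabilitys", "predictPlay").show(5, truncate = false)
```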
References:
LGBM + LR:
http://www.manongjc.com/detail/13-uogsrviojdreomg.html
Spark MLlib feature transforms:
http://spark.apache.org/docs/2.4.6/ml-features.html#onehotencoder-deprecated-since-230
http://spark.apache.org/docs/2.4.6/ml-features.html#vectorassembler
LightGBM parameter tuning:
https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
https://lightgbm.readthedocs.io/en/latest/Features.html#data-parallel