This post is purely for my own practice and learning; experts, please go easy on me.
A few days ago, while testing org.apache.spark.ml.regression.GBTRegressor, I found that a StackOverflowError occurs once the number of trees reaches 200. Spark's GBT does not seem easy to tune, and is therefore hard to use at scale in production. Perhaps the Spark version of XGBoost can replace org.apache.spark.ml.regression.GBTRegressor.
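As an aside, stack overflows with deep ensembles are often caused by the ever-growing lineage of the intermediate datasets. A commonly suggested mitigation, which I have not verified for this particular error, is to enable checkpointing so the lineage is truncated periodically (the checkpoint path below is illustrative):

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Illustrative sketch: checkpoint every 10 iterations to cut the lineage.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val gbt = new GBTRegressor()
  .setMaxIter(200)
  .setCheckpointInterval(10)
```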
Implementing a machine-learning model in SQL is not a good idea. This implementation is crude and rather slow. My toy code borrows from numpy-ml, whose code is clean and elegant and very well suited for learning.
About the DataFrame schema: column x is the label, and column features holds the features as an org.apache.spark.ml.linalg.Vector.
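As a minimal sketch of that schema (the data here is made up), such a DataFrame could be built like this:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("toy-gbt").getOrCreate()
import spark.implicits._

// Column "x" is the label, column "features" is an ml Vector.
val df = Seq(
  (1.0, Vectors.dense(0.1, 2.0)),
  (2.0, Vectors.dense(0.3, 1.5)),
  (3.0, Vectors.dense(0.7, 0.9))
).toDF("x", "features")
```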
First we need to implement a regression tree. The basic data structure is a Node, which stores the left and right child Nodes, the split feature, and the split value.
// For a leaf, `feature` is None and `split` holds the predicted value.
case class Node(left: Option[Node], right: Option[Node], feature: Option[Int], split: Double)
The tree is built by the grow method. Splitting stops based on the number of remaining samples and the depth; otherwise splitOneNode is called to find the next split.
def grow(df: DataFrame, cur_depth: Int, spark: SparkSession): Node = {
  // Stop splitting when at most one sample remains or the maximum depth is
  // reached; the leaf prediction is the mean of the remaining labels.
  if (df.count() <= 1 || cur_depth >= 3)
    return Node(None, None, None, df.select(avg("x").as("y")).first().getAs[Double]("y"))
  val (f, s) = splitOneNode(df, spark)
  val left_df = df.filter { row =>
    val vec = row.getAs[Vector]("features")
    vec.apply(f) <= s
  }
  val right_df = df.filter { row =>
    val vec = row.getAs[Vector]("features")
    vec.apply(f) > s
  }
  val left = grow(left_df, cur_depth + 1, spark)
  val right = grow(right_df, cur_depth + 1, spark)
  Node(Option(left), Option(right), Option(f), s)
}
splitOneNode is implemented mostly in SQL. The split search uses a histogram method, with 32 bins by default. To compute the per-feature histograms in SQL, the Vector features are flattened into (feature index, feature value, label) tuples via flatMap. The bin edges are generated with a brute-force Cartesian product, which is one reason this is slow. For each candidate edge we compute the sum of squared errors of the induced split and keep the edge with the smallest error; then, across all features, we select the feature with the smallest error. All of this is done in SQL.
def splitOneNode(df: DataFrame, spark: SparkSession): (Int, Double) = {
  import spark.implicits._
  // Explode the feature Vector into (feature index, feature value, label) rows.
  df.flatMap { row =>
    val vec = row.getAs[Vector]("features")
    val x = row.getAs[Double]("x")
    vec.toArray.toSeq.zipWithIndex.map { case (v, i) => (i, v, x) }
  }.toDF("feature_name", "feature_val", "y").createOrReplaceTempView("table")
  val num_bins = 32
  (0 to num_bins).toDF("i").createOrReplaceTempView("idx_table")
  // Per-feature bin edges: min_val + i * bin_width, via a Cartesian product.
  spark.sql(
    s"""
       |select feature_name, min_val + i * bin_width edge
       |from
       |(select feature_name, min(feature_val) min_val, max(feature_val) max_val,
       |        (max(feature_val) - min(feature_val)) / $num_bins bin_width
       | from table
       | group by feature_name
       |) cross join idx_table
       |""".stripMargin).createOrReplaceTempView("edge_table")
  // For every (feature, edge) pair, compute the squared error of the induced
  // split, keep the best edge per feature, then the best feature overall.
  val result = spark.sql(
    s"""
       |select min_by(feature_name, err) feature,
       |       min_by(split_point, err) split_point,
       |       min(err) err
       |from
       |(select feature_name,
       |        min_by(edge, err) split_point,
       |        min(err) err
       | from
       | (select a.feature_name,
       |         edge,
       |         sum(case when feature_val <= edge then (y - left) * (y - left)
       |                  else (y - right) * (y - right) end) err
       |  from table a
       |  join
       |  (select table.feature_name,
       |          edge,
       |          avg(if(feature_val <= edge, y, null)) left,
       |          avg(if(feature_val > edge, y, null)) right
       |   from table join edge_table on table.feature_name = edge_table.feature_name
       |   group by 1, 2
       |  ) b on a.feature_name = b.feature_name
       |  group by 1, 2
       | )
       | group by 1
       |)
       |""".stripMargin).collect()
  (result(0).getAs[Int]("feature"), result(0).getAs[Double]("split_point"))
}
Next, a method to predict with a single tree.
@tailrec
def treePredict(row: Row, node: Node): Double = {
  if (node.feature.isEmpty) node.split  // a leaf stores the prediction in `split`
  else {
    val f = node.feature.get
    val vec = row.getAs[Vector]("features")
    val value = vec.apply(f)
    if (value <= node.split) treePredict(row, node.left.get)
    else treePredict(row, node.right.get)
  }
}
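A quick sanity check on a hand-built tree (toy values, and assuming a SparkSession named spark is in scope): a root that splits on feature 0 at 0.5, with two leaves. A row whose feature 0 is 0.7 descends to the right leaf:

```scala
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Leaves carry the prediction in `split`; the root carries feature 0 and edge 0.5.
val leafLo = Node(None, None, None, 1.0)
val leafHi = Node(None, None, None, 2.0)
val root   = Node(Some(leafLo), Some(leafHi), Some(0), 0.5)

// Take a Row from a DataFrame so getAs[Vector]("features") resolves by name.
val row = Seq((0.0, Vectors.dense(0.7, 1.0))).toDF("x", "features").first()
treePredict(row, root)  // 0.7 > 0.5, so this descends to the right leaf
```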
With the tree built, we can boost.
The key is to compute the residuals and substitute them for the original labels.
def treeEval(df: DataFrame, node: Node, spark: SparkSession): DataFrame = {
  import spark.implicits._
  // Replace the label with the residual of the current tree's prediction.
  val new_df = df.map { row =>
    val x = row.getAs[Double]("x")
    val vec = row.getAs[Vector]("features")
    (vec, x - treePredict(row, node))
  }.toDF("features", "x")
  //new_df.printSchema()
  val err =
    new_df.select((col("x") * col("x")).as("x_square"))
      .agg(sum(col("x_square")).as("err")).first().getAs[Double]("err")
  println(s"err $err")
  new_df
}
Finally, a loop trains 100 trees.
// train the first tree
print("tree 0 ")
var node = grow(df, 0, spark)
//printTree(node)
//treePredict(df, node, spark).show(1000)
var new_df = treeEval(df, node, spark)
//new_df.printSchema()
// train the remaining 99 trees
(1 until 100).foreach { i =>
  print(s"tree $i ")
  node = grow(new_df, 0, spark)
  new_df = treeEval(new_df, node, spark)
}
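Note that the loop above only watches the training error shrink and discards each fitted tree. To actually predict on new data, the trees have to be kept, and the ensemble prediction is the sum of the individual trees' predictions. A possible sketch (the names trees and boostedPredict are mine):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

// Keep every fitted tree; each iteration trains on the previous residuals.
val trees = ArrayBuffer[Node]()
var residual_df = df
(0 until 100).foreach { i =>
  val tree = grow(residual_df, 0, spark)
  trees += tree
  residual_df = treeEval(residual_df, tree, spark)
}

// The boosted prediction is the sum of all the trees' predictions.
def boostedPredict(row: Row, trees: Seq[Node]): Double =
  trees.map(t => treePredict(row, t)).sum
```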
Well, this is still a long way from good code.