Data preparation: we will discretize the data based on the labels column.
The data looks like this:
scala> df.show(5)
+--------+------------------+------+
| R1| G2|labels|
+--------+------------------+------+
|148.6041|4.1254973506233155| 1.0|
|163.6788|2.8350005837741903| 1.0|
|153.9485|1.8033965176854478| 1.0|
|150.3755|1.5140336026654098| 1.0|
| 150.738|1.6580451019197278| 1.0|
+--------+------------------+------+
only showing top 5 rows
scala> df.groupBy("labels").count.show
+------+-----+
|labels|count|
+------+-----+
| 1.0| 51|
| 4.0| 24|
| 3.0| 78|
| 2.0| 44|
| 5.0| 6|
+------+-----+
Binary transformation: given a threshold, values at or below it map to 0.0 and values above it map to 1.0.
import org.apache.spark.ml.feature.Binarizer
val result = new Binarizer()           // binary discretization transformer
  .setInputCol("labels")               // column to discretize
  .setOutputCol("binarizer_feature")   // name of the discretized output column
  .setThreshold(3.0)                   // threshold
  .transform(df)                       // the DataFrame to discretize
scala> result.show
+--------+------------------+------+-----------------+
| R1| G2|labels|binarizer_feature|
+--------+------------------+------+-----------------+
|148.6041|4.1254973506233155| 1.0| 0.0|
|163.6788|2.8350005837741903| 1.0| 0.0|
|153.9485|1.8033965176854478| 1.0| 0.0|
|150.3755|1.5140336026654098| 1.0| 0.0|
| 150.738|1.6580451019197278|   1.0|              0.0|
+--------+------------------+------+-----------------+
scala> result.groupBy("binarizer_feature").count.show
+-----------------+-----+
|binarizer_feature|count|
+-----------------+-----+
| 0.0| 173|
| 1.0| 30|
+-----------------+-----+
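The counts above follow Binarizer's per-value rule: a value strictly greater than the threshold maps to 1.0, everything else to 0.0. A minimal plain-Scala sketch of that rule (the function name `binarize` is ours for illustration, not part of Spark):

```scala
// Per-value rule applied by Binarizer (sketch; `binarize` is an illustrative name)
def binarize(x: Double, threshold: Double): Double =
  if (x > threshold) 1.0 else 0.0

// With threshold 3.0, labels 1.0-3.0 map to 0.0 and 4.0/5.0 map to 1.0,
// which reproduces the 173 / 30 split above (51+44+78 vs. 24+6).
```

Note that the comparison is strict: a label exactly equal to the threshold (3.0 here) lands in the 0.0 bucket.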
Multi-way discretization: for example, with thresholds 1 and 3, the buckets are (-∞, 1), [1, 3), and [3, +∞).
import scala.collection.mutable.ArrayBuffer

val thresholdArray = Array[Double](100.0, 120.0, 130.0, 150.0) // the chosen thresholds
val buff = ArrayBuffer.empty[Double]  // holds the final splits
buff += Double.NegativeInfinity       // prepend negative infinity
for (a <- thresholdArray) buff += a   // add the chosen thresholds
buff += Double.PositiveInfinity       // append positive infinity
// inspect buff
scala> buff
res49: scala.collection.mutable.ArrayBuffer[Double] = ArrayBuffer(-Infinity, 100.0, 120.0, 130.0, 150.0, Infinity)
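The same splits array can also be built without a mutable buffer, by prepending and appending the infinities in one expression:

```scala
val thresholdArray = Array[Double](100.0, 120.0, 130.0, 150.0)
// +: prepends and :+ appends, yielding an immutable Array[Double] of splits
val splits = Double.NegativeInfinity +: thresholdArray :+ Double.PositiveInfinity
```

Either form produces the same six split points and can be passed directly to `setSplits`.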
import org.apache.spark.ml.feature.Bucketizer

val result = new Bucketizer()            // multi-way discretization transformer
  .setInputCol("R1")                     // column to discretize
  .setOutputCol("bucketizer_features")   // name of the output column
  .setSplits(buff.toArray)               // split points, as an Array[Double]
  .transform(df)                         // the DataFrame to discretize
scala> result.groupBy("bucketizer_features").count.show
+-------------------+-----+
|bucketizer_features|count|
+-------------------+-----+
| 0.0| 2|
| 1.0| 25|
| 4.0| 46|
| 3.0| 100|
| 2.0| 30|
+-------------------+-----+
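Each bucketizer_features value is simply the index of the half-open interval [splits(i), splits(i+1)) containing R1; Spark's last bucket additionally includes its upper bound. A plain-Scala sketch of that lookup (`bucketIndex` is our own illustrative name, not a Spark API):

```scala
val splits = Array(Double.NegativeInfinity, 100.0, 120.0, 130.0, 150.0,
                   Double.PositiveInfinity)

// Index of the bucket [splits(i), splits(i+1)) that contains x;
// the last bucket also contains its upper bound, matching Bucketizer.
def bucketIndex(x: Double, splits: Array[Double]): Int =
  splits.indexWhere(s => x < s) match {
    case -1 => splits.length - 2 // x equals the top split: last bucket
    case i  => i - 1
  }

// R1 = 148.6041 lands in bucket 3 ([130, 150)); R1 = 163.6788 in bucket 4
```

This matches the table above: the sample rows with R1 around 148–164 all fall into buckets 3 and 4, which together hold 146 of the 203 rows.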