Principle
MinMaxScaler in MLlib rescales each feature to a target range [min, max]:
Parameters:
E_max: the observed maximum of the feature
E_min: the observed minimum of the feature
e_i: a feature value
max: MLlib's target maximum, 1.0 by default
min: MLlib's target minimum, 0.0 by default
1) When E_max = E_min:
Rescaled(e_i) = 0.5 * (max + min)
              = 0.5 * (1.0 + 0.0) = 0.5
2) When E_max ≠ E_min:
Rescaled(e_i) = (e_i − E_min) / (E_max − E_min) * (max − min) + min
              = (e_i − E_min) / (E_max − E_min)
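The two cases above can be sketched in plain Scala (a minimal illustration of the formula, not MLlib's actual implementation; `minMaxScale` is a hypothetical helper):

```scala
object MinMaxDemo {
  // Rescale one feature value e from the observed range [eMin, eMax]
  // to the target range [min, max] (MLlib defaults: min = 0.0, max = 1.0).
  def minMaxScale(e: Double, eMin: Double, eMax: Double,
                  min: Double = 0.0, max: Double = 1.0): Double =
    if (eMax == eMin) 0.5 * (max + min)                 // degenerate case: constant feature
    else (e - eMin) / (eMax - eMin) * (max - min) + min // general case

  def main(args: Array[String]): Unit = {
    println(minMaxScale(51.0, 51.0, 145.0))   // observed minimum maps to 0.0
    println(minMaxScale(145.0, 51.0, 145.0))  // observed maximum maps to 1.0
    println(minMaxScale(253.0, 253.0, 253.0)) // constant feature maps to 0.5
  }
}
```

With the default target range [0.0, 1.0], the general case reduces to (e_i − E_min) / (E_max − E_min), as in the derivation above.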
In Practice
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

object MinMaxExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MinMaxScalerExample").setMaster("local[8]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Load data in libsvm format:
    //   label index:value index:value ...
    //   0 2:51 3:253 5:253
    //   1 2:124 3:253 4:255
    //   1 2:145 3:253 5:211
    // Each record has 5 features; indices in the file are 1-based (1-5),
    // and after loading the vector indices run 0-4.
    // For clarity, each record written out densely (1-based):
    //   0 1:0 2:51 3:253 4:0 5:253
    //   1 1:0 2:124 3:253 4:255 5:0
    //   1 1:0 2:145 3:253 4:0 5:211
    val dataFrame = sqlContext.read.format("libsvm").load("data/libsvm.txt")
    val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures")
    // fit: computes each feature's minimum and maximum
    //   max: Vector(0, 145, 253, 255, 253)
    //   min: Vector(0, 51, 253, 0, 0)
    val scalerModel = scaler.fit(dataFrame)
    // transform: applies the min-max rescaling
    //   1) if a feature's max == min  => 0.5
    //   2) if a feature's max != min  => (value - min) / (max - min)
    //   3) returns a DenseVector
    val scaledData = scalerModel.transform(dataFrame)
    scaledData.foreach(println)
    sc.stop()
    // Output:
    // [0.0,(5,[1,2,4],[51.0,253.0,253.0]),[0.5,0.0,0.5,0.0,1.0]]
    // [1.0,(5,[1,2,4],[145.0,253.0,211.0]),[0.5,1.0,0.5,0.0,0.83399209486166]]
    // [1.0,(5,[1,2,3],[124.0,253.0,255.0]),[0.5,0.776595744680851,0.5,1.0,0.0]]
  }
}
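As a cross-check, the last entry of the second output row can be reproduced by hand: feature 4 has E_min = 0 and E_max = 253, so the value 211 rescales to 211/253 (a quick sanity check of the formula, independent of Spark):

```scala
object VerifyScaling {
  def main(args: Array[String]): Unit = {
    val (e, eMin, eMax) = (211.0, 0.0, 253.0)
    // Target range is the default [0.0, 1.0], so no further shift or scale is needed.
    val scaled = (e - eMin) / (eMax - eMin)
    println(scaled) // ≈ 0.83399209486166, as in the output above
  }
}
```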