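This post covers five methods commonly used in feature engineering: MinMaxScaler, Normalization, OneHotEncoding, PCA, and QuantileDiscretizer (used for binning).
The original dataset is shown in the figure below: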
1. MinMaxScaler
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# MinMaxScaler expects a vector column, so first assemble c2 into a vector
vecAssembler = VectorAssembler(inputCols=["c2"], outputCol="c2_new")
# Min-max transform: rescale c2_new into the default [0, 1] range
mmScaler = MinMaxScaler(inputCol="c2_new", outputCol="mm_c2")
pipeline = Pipeline(stages=[vecAssembler, mmScaler])
pipeline_fit = pipeline.fit(df)
df = pipeline_fit.transform(df)
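With the transformation above, column c2 is first assembled into the vector column c2_new and then rescaled into mm_c2. The result is shown in the figure.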
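To make the rescaling concrete, here is a minimal pure-Python sketch of the arithmetic a min-max scaler applies to a single column. It is not Spark's implementation: the helper `min_max_scale` and the sample values are hypothetical and exist only for illustration. It uses the default output range [0, 1] and maps a constant column to the midpoint of the output range, which is how the Spark documentation describes that case.

```python
from typing import List


def min_max_scale(values: List[float],
                  out_min: float = 0.0,
                  out_max: float = 1.0) -> List[float]:
    """Linearly rescale values into [out_min, out_max] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to the midpoint of the output range.
        return [0.5 * (out_min + out_max)] * len(values)
    scale = (out_max - out_min) / (hi - lo)
    # (v - min) / (max - min) * (out_max - out_min) + out_min
    return [(v - lo) * scale + out_min for v in values]


# Hypothetical sample values standing in for column c2.
c2 = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(c2)
print(scaled)
```

The smallest value always maps to the lower bound and the largest to the upper bound, so one extreme outlier compresses every other value toward one end of the range.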
2. Normalization
from pyspark.ml.feature import Normalizer
vecAssembler = VectorAssembler(inputCols=['c2', 'c5'], outputCol=