Spark Feature Engineering

1. Input data: test.csv

id,checking,duration,history,purpose,amount,savings,employed,installp,marital,coapp,resident,property,age,other,housing,existcr,job,depends,telephon,foreign,good_bad
1,1,6,4,3,1169,5,5,4,3,1,4,1,67,3,2,2,3,1,2,1,0
2,2,48,2,3,5951,1,3,2,2,1,2,1,22,3,2,1,3,1,1,1,1
3,4,12,4,6,2096,1,4,2,3,1,3,1,49,3,2,1,2,2,1,1,0
4,1,42,2,2,7882,1,4,2,3,3,4,2,45,3,3,1,3,2,1,1,0
5,1,24,3,0,4870,1,3,3,3,1,4,4,53,3,3,2,3,2,1,1,1
6,4,36,2,6,9055,5,3,2,3,1,4,4,35,3,3,1,2,2,2,1,0
7,4,24,2,2,2835,3,5,3,3,1,4,2,53,3,2,1,3,1,1,1,0
8,2,36,2,1,6948,1,3,2,3,1,2,3,35,3,1,1,4,1,2,1,0
9,4,12,2,3,3059,4,4,2,1,1,4,1,61,3,2,1,2,1,1,1,0
10,2,30,4,0,5234,1,1,4,4,1,2,3,28,3,2,2,4,1,1,1,1
11,2,12,2,0,1295,1,2,3,2,1,1,3,25,3,1,1,3,1,1,1,1
12,1,48,2,9,4308,1,2,3,2,1,4,2,24,3,1,1,3,1,1,1,1
13,2,12,2,3,1567,1,3,1,2,1,1,3,22,3,2,1,3,1,2,1,0
14,1,24,4,0,1199,1,5,4,3,1,4,3,60,3,2,2,2,1,1,1,1
15,1,15,2,0,1403,1,3,2,2,1,4,3,28,3,1,1,3,1,1,1,0
16,1,24,2,3,1282,2,3,4,2,1,2,3,32,3,2,1,2,1,1,1,1
17,4,24,4,3,2424,5,5,4,3,1,4,2,53,3,2,2,3,1,1,1,0

2. MinMaxScaler (min-max scaling: rescales each feature to the [0, 1] range)

Code:


from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

df = spark.read.csv("/home/jerry/geoplatform/gai_platform/data/feature_filter/res.csv", header=True, inferSchema=True)
# MinMaxScaler expects a vector column, so wrap "duration" in a one-element vector
vecAssembler = VectorAssembler(inputCols=["duration"], outputCol="c2_new")
# rescale each value to [0, 1] via (x - min) / (max - min)
mmScaler = MinMaxScaler(inputCol="c2_new", outputCol="mm_c2")
pipeline = Pipeline(stages=[vecAssembler, mmScaler])
pipeline_fit = pipeline.fit(df)
df_min_max = pipeline_fit.transform(df)
df_min_max.show(10)

Output:

+---+--------+--------+-------+-------+------+-------+--------+--------+-------+-----+--------+--------+---+-----+-------+-------+---+-------+--------+-------+--------+------+--------------------+
| id|checking|duration|history|purpose|amount|savings|employed|installp|marital|coapp|resident|property|age|other|housing|existcr|job|depends|telephon|foreign|good_bad|c2_new|               mm_c2|
+---+--------+--------+-------+-------+------+-------+--------+--------+-------+-----+--------+--------+---+-----+-------+-------+---+-------+--------+-------+--------+------+--------------------+
|  1|       1|       6|      4|      3|  1169|      5|       5|       4|      3|    1|       4|       1| 67|    3|      2|      2|  3|      1|       2|      1|       0| [6.0]|[0.02941176470588...|
|  2|       2|      48|      2|      3|  5951|      1|       3|       2|      2|    1|       2|       1| 22|    3|      2|      1|  3|      1|       1|      1|       1|[48.0]|[0.6470588235294118]|
|  3|       4|      12|      4|      6|  2096|      1|       4|       2|      3|    1|       3|       1| 49|    3|      2|      1|  2|      2|       1|      1|       0|[12.0]|[0.11764705882352...|
|  4|       1|      42|      2|      2|  7882|      1|       4|       2|      3|    3|       4|       2| 45|    3|      3|      1|  3|      2|       1|      1|       0|[42.0]|[0.5588235294117647]|
|  5|       1|      24|      3|      0|  4870|      1|       3|       3|      3|    1|       4|       4| 53|    3|      3|      2|  3|      2|       1|      1|       1|[24.0]|[0.29411764705882...|
|  6|       4|      36|      2|      6|  9055|      5|       3|       2|      3|    1|       4|       4| 35|    3|      3|      1|  2|      2|       2|      1|       0|[36.0]|[0.47058823529411...|
|  7|       4|      24|      2|      2|  2835|      3|       5|       3|      3|    1|       4|       2| 53|    3|      2|      1|  3|      1|       1|      1|       0|[24.0]|[0.29411764705882...|
|  8|       2|      36|      2|      1|  6948|      1|       3|       2|      3|    1|       2|       3| 35|    3|      1|      1|  4|      1|       2|      1|       0|[36.0]|[0.47058823529411...|
|  9|       4|      12|      2|      3|  3059|      4|       4|       2|      1|    1|       4|       1| 61|    3|      2|      1|  2|      1|       1|      1|       0|[12.0]|[0.11764705882352...|
| 10|       2|      30|      4|      0|  5234|      1|       1|       4|      4|    1|       2|       3| 28|    3|      2|      2|  4|      1|       1|      1|       1|[30.0]|[0.38235294117647...|
+---+--------+--------+-------+-------+------+-------+--------+--------+-------+-----+--------+--------+---+-----+-------+-------+---+-------+--------+-------+--------+------+--------------------+
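A quick check against the min-max formula: MinMaxScaler computes x' = (x - min) / (max - min) per column. The scaled values above imply that duration ranges from 4 to 72 in the full dataset, so row 1 gives (6 - 4) / (72 - 4) = 2/68 ≈ 0.0294 and row 2 gives (48 - 4) / 68 = 44/68 ≈ 0.6471, matching the mm_c2 column.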

3. Normalization / Standardization (Normalizer)

Code:


from pyspark.ml import Pipeline
from pyspark.ml.feature import Normalizer, VectorAssembler

df = spark.read.csv("/home/jerry/geoplatform/gai_platform/data/feature_filter/res.csv", header=True, inferSchema=True)

vecAssembler = VectorAssembler(inputCols=["id", "duration"], outputCol="norm_features")
# p=2.0 rescales each row vector to unit L2 norm (per row, unlike MinMaxScaler's per-column scaling)
normalizer = Normalizer(p=2.0, inputCol="norm_features", outputCol="norm_test")
pipeline = Pipeline(stages=[vecAssembler, normalizer])
pipeline_fit = pipeline.fit(df)
df = pipeline_fit.transform(df)

# resulting schema: DataFrame[id: int, duration: int, norm_features: vector, norm_test: vector]
df.select(["id", "duration", "norm_features", "norm_test"]).show(10)

Output:


+---+--------+-------------+--------------------+
| id|duration|norm_features|           norm_test|
+---+--------+-------------+--------------------+
|  1|       6|    [1.0,6.0]|[0.16439898730535...|
|  2|      48|   [2.0,48.0]|[0.04163054471218...|
|  3|      12|   [3.0,12.0]|[0.24253562503633...|
|  4|      42|   [4.0,42.0]|[0.09480909262799...|
|  5|      24|   [5.0,24.0]|[0.20395425411200...|
|  6|      36|   [6.0,36.0]|[0.16439898730535...|
|  7|      24|   [7.0,24.0]|         [0.28,0.96]|
|  8|      36|   [8.0,36.0]|[0.21693045781865...|
|  9|      12|   [9.0,12.0]|           [0.6,0.8]|
| 10|      30|  [10.0,30.0]|[0.31622776601683...|
+---+--------+-------------+--------------------+
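Row 9 makes the L2 normalization easy to verify by hand: the norm of [9, 12] is sqrt(9^2 + 12^2) = sqrt(225) = 15, so the normalized vector is [9/15, 12/15] = [0.6, 0.8], exactly as shown (likewise [7, 24] has norm 25, giving [0.28, 0.96]).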

4. One-Hot Encoding (encodes features, numeric or non-numeric, for downstream model training)

Code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.read.csv("/home/jerry/geoplatform/gai_platform/data/feature_filter/res.csv", header=True, inferSchema=True)
# map each distinct value of "amount" to a category index (most frequent value -> 0)
stringindexer = StringIndexer(inputCol="amount", outputCol="one_hot_amount")
# dropLast=False keeps a vector slot for every category instead of dropping the last one
encoder = OneHotEncoder(dropLast=False, inputCol="one_hot_amount", outputCol="one_hot_test")
pipeline = Pipeline(stages=[stringindexer, encoder])
pipeline_fit = pipeline.fit(df)
df = pipeline_fit.transform(df)
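
To see what the StringIndexer → OneHotEncoder chain actually produces, here is a minimal self-contained sketch on made-up toy data (the rid/color names are illustrative only, not from the dataset):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

toy = spark.createDataFrame(
    [(1, "red"), (2, "blue"), (3, "red"), (4, "green")],
    ["rid", "color"])
si = StringIndexer(inputCol="color", outputCol="color_idx")
ohe = OneHotEncoder(dropLast=False, inputCol="color_idx", outputCol="color_vec")
Pipeline(stages=[si, ohe]).fit(toy).transform(toy).show()
# color_vec is a SparseVector with a single 1.0 at the category's index,
# e.g. "red" -> (3,[0],[1.0]) because "red" is the most frequent value (index 0)

Note that one-hot encoding a high-cardinality numeric column like amount yields a very wide sparse vector; in practice the encoder is usually applied to low-cardinality categorical columns.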

# Encoding multiple columns at once

String-index every column (typically done for non-numeric features; this is the analogue of sklearn's LabelEncoder):

for i in df.columns:
    # handleInvalid controls what happens when a column contains invalid values
    # (e.g. nulls): "error" throws an exception, "skip" drops the offending rows
    # si = StringIndexer(inputCol=i, outputCol=i + "-indexed", handleInvalid="skip")
    stringindexer = StringIndexer(inputCol=i, outputCol=i + "_new")
    model = stringindexer.fit(df)
    td = model.transform(df)
    # replace the original column with its indexed version
    df = td.drop(i).withColumnRenamed(i + "_new", i)
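
The same multi-column indexing can also be written as a single Pipeline instead of refitting against the DataFrame column by column; a minimal sketch over the same df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# one StringIndexer stage per column, all fitted in one pipeline pass
indexers = [StringIndexer(inputCol=c, outputCol=c + "_new") for c in df.columns]
df_indexed = Pipeline(stages=indexers).fit(df).transform(df)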

5. PCA (Principal Component Analysis): extracting key components for dimensionality reduction

Code:


from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA, VectorAssembler

df = spark.read.csv("/home/jerry/geoplatform/gai_platform/data/feature_filter/res.csv", header=True, inferSchema=True)
input_col = ["id", "amount", "checking", "duration", "history"]
vecAssembler = VectorAssembler(inputCols=input_col, outputCol="features")
# k is the number of principal components to keep; k=5 equals the number of input
# columns here, so nothing is discarded -- use k < len(input_col) to actually reduce
pca = PCA(k=5, inputCol="features", outputCol="pca_test")
pipeline = Pipeline(stages=[vecAssembler, pca])
pipeline_fit = pipeline.fit(df)
df = pipeline_fit.transform(df)
df.select(["id", "amount", "checking", "duration", "history", "features", "pca_test"]).show(2)

Output:


+---+------+--------+--------+-------+--------------------+--------------------+
| id|amount|checking|duration|history|            features|            pca_test|
+---+------+--------+--------+-------+--------------------+--------------------+
|947|  3349|       1|      24|      1|[947.0,3349.0,1.0...|[-3350.3696120054...|
| 17|  2424|       4|      24|      4|[17.0,2424.0,4.0,...|[-2424.0766224788...|
+---+------+--------+--------+-------+--------------------+--------------------+
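
When k is chosen to reduce dimensionality, it helps to inspect how much variance each component explains; the fitted PCAModel (the last pipeline stage) exposes this:

pca_model = pipeline_fit.stages[-1]   # the fitted PCAModel
print(pca_model.explainedVariance)    # proportion of variance per component
print(pca_model.pc)                   # the principal-components matrix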


6. Quantile binning (QuantileDiscretizer): discretizing a continuous feature by binning on its values (a complement to the scaling and normalization treatments of continuous features above)

Code:

from pyspark.ml.feature import QuantileDiscretizer

df = spark.read.csv("/home/jerry/geoplatform/gai_platform/data/feature_filter/res.csv", header=True, inferSchema=True)
# numBuckets=4 bins "amount" by quartiles; relativeError=0 requests exact quantiles;
# handleInvalid="error" raises if the input column contains nulls
quantileDescretizer = QuantileDiscretizer(numBuckets=4, inputCol="amount", outputCol="quantile_amount", relativeError=0, handleInvalid="error")
quantileDescretizer_model = quantileDescretizer.fit(df)
df = quantileDescretizer_model.transform(df)
df.show(20)

Output:


+---+------+---+--------+--------+-------+-------+-------+--------+--------+-------+-----+--------+--------+-----+-------+-------+---+-------+--------+-------+------------------+-----------+---------------+
| id|amount|age|checking|duration|history|purpose|savings|employed|installp|marital|coapp|resident|property|other|housing|existcr|job|depends|telephon|foreign|         sum_money|count_money|quantile_amount|
+---+------+---+--------+--------+-------+-------+-------+--------+--------+-------+-----+--------+--------+-----+-------+-------+---+-------+--------+-------+------------------+-----------+---------------+
|947|  3349| 30|       1|      24|      1|      2|      3|       2|       4|      3|    1|       4|       4|    3|      3|      1|  3|      2|       2|      1|          42055.55|         10|            2.0|
| 17|  2424| 53|       4|      24|      4|      3|      5|       5|       4|      3|    1|       4|       2|    3|      2|      2|  3|      1|       1|      1| 51659.56000000001|          8|            2.0|
|646|  7980| 27|       4|      36|      3|      9|      5|       2|       4|      3|    1|       4|       3|    3|      1|      2|  3|      1|       2|      1|          30752.23|          5|            3.0|
|674|  2080| 24|       4|       6|      4|      0|      3|       3|       1|      4|    1|       2|       3|    3|      2|      1|  3|      1|       1|      1|          49446.53|         10|            1.
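
The model returned by QuantileDiscretizer.fit is a Bucketizer, so the learned quartile boundaries can be inspected directly:

print(quantileDescretizer_model.getSplits())
# e.g. [-inf, q1, q2, q3, inf]; a value in [q1, q2) maps to bucket 1.0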

Rules for handling invalid values (i.e. nulls in the column) when binning:

Keep nulls (they are placed in an extra bucket of their own):
df = df.drop("quantile_amount")
df = quantileDescretizer.setHandleInvalid("keep").fit(df).transform(df)
df.select(["id","amount","age","checking","count_money","quantile_amount"]).show(20)


Skip nulls (rows with a null in the input column are dropped):
df = df.drop("quantile_amount")
df = quantileDescretizer.setHandleInvalid("skip").fit(df).transform(df)
df.select(["id","amount","age","checking","count_money","quantile_amount"]).show(20)
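
A minimal sketch of the keep/skip difference on made-up toy data containing a null (the column name v is illustrative only):

from pyspark.ml.feature import QuantileDiscretizer

toy = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,), (None,)], ["v"])
qd = QuantileDiscretizer(numBuckets=2, inputCol="v", outputCol="v_bin")
# "keep": the null row survives and lands in an extra bucket (index 2.0 here)
qd.setHandleInvalid("keep").fit(toy).transform(toy).show()
# "skip": the null row is dropped before binning
qd.setHandleInvalid("skip").fit(toy).transform(toy).show()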
