Bucketizer: maps a column of continuous features to a column of feature buckets
class pyspark.ml.feature.Bucketizer(splits=None, inputCol=None, outputCol=None, handleInvalid='error')
setHandleInvalid(value): sets the value of handleInvalid
handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket).")
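As a minimal sketch, the option can also be passed at construction time via the handleInvalid keyword (the split points and column names below are illustrative):
from pyspark.ml.feature import Bucketizer
b = Bucketizer(splits=[-float("inf"), 0.0, float("inf")],
               inputCol="x", outputCol="x_bucket",
               handleInvalid="keep")  # or "skip" / "error" (the default)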
01. Initialization
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.3")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("Bucketizer").master("local[*]").getOrCreate()
02. Create the data and the split points, and map the values to buckets
# two NaN rows are included to exercise handleInvalid
values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
df = spark.createDataFrame(values, ["values"])
# splits define three buckets: [-inf, 0.5), [0.5, 1.4), [1.4, +inf)
bucketizer = Bucketizer(splits=[-float("inf"), 0.5, 1.4, float("inf")],
                        inputCol="values", outputCol="buckets")
# "keep" routes invalid values into an extra bucket (index 3.0 here)
bucketed = bucketizer.setHandleInvalid("keep").transform(df)
bucketed.show()
Output:
+------+-------+
|values|buckets|
+------+-------+
| 0.1| 0.0|
| 0.4| 0.0|
| 1.2| 1.0|
| 1.5| 2.0|
| NaN| 3.0|
| NaN| 3.0|
+------+-------+
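For contrast, a minimal sketch of handleInvalid="skip", reusing df and bucketizer from above; a copy is used so the "keep" setting stays in effect for step 03. The two NaN rows are simply filtered out of the result:
# "skip" drops rows with invalid values instead of bucketing them
bucketizer.copy().setHandleInvalid("skip").transform(df).show()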
03. Reset the parameters and show the output
# setParams changes only outputCol; handleInvalid is still "keep" from step 02
bucketizer.setParams(outputCol="res").transform(df).show()
Output:
+------+---+
|values|res|
+------+---+
| 0.1|0.0|
| 0.4|0.0|
| 1.2|1.0|
| 1.5|2.0|
| NaN|3.0|
| NaN|3.0|
+------+---+
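Since Spark 3.0, a single Bucketizer can also bucket several columns at once via splitsArray/inputCols/outputCols. A minimal sketch, reusing the spark session from step 01 (the column names and split points are illustrative):
# each inner list of splitsArray defines the buckets for one input column
df2 = spark.createDataFrame([(0.1, 0.8), (1.5, 0.2)], ["v1", "v2"])
multi = Bucketizer(splitsArray=[[-float("inf"), 0.5, float("inf")],
                                [-float("inf"), 0.5, float("inf")]],
                   inputCols=["v1", "v2"], outputCols=["b1", "b2"])
multi.transform(df2).show()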