Binarizer: binarizes a column of continuous features against a given threshold
class pyspark.ml.feature.Binarizer(threshold=0.0, inputCol=None, outputCol=None) [[source]](https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer)
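threshold: used for a single column; thresholds: used for multiple columns (not supported in the current version, 2.4.5)
threshold is the cutoff value: features greater than it are binarized to 1.0, and features less than or equal to it are binarized to 0.0
inputCol: used for a single column; inputCols: used for multiple columns (not supported in the current version, 2.4.5)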
01. Create the SparkSession object
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
# Configuration keys are case-sensitive: the key is "spark.driver.host"
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("Binarize").master("local[*]").getOrCreate()
02. Create the data
data = spark.createDataFrame([
(0.1,),
(2.3,),
(1.1,),
(4.2,),
(2.5,),
(6.8,),
],["values"])
data.show()
Output:
+------+
|values|
+------+
|   0.1|
|   2.3|
|   1.1|
|   4.2|
|   2.5|
|   6.8|
+------+
03. Create a Binarizer object, specifying the input column, threshold, and output column
binarizer = Binarizer(threshold=2.4,inputCol="values",outputCol="features")
04. Transform the original data and view the result
res = binarizer.transform(data)
res.show()
Output:
+------+--------+
|values|features|
+------+--------+
|   0.1|     0.0|
|   2.3|     0.0|
|   1.1|     0.0|
|   4.2|     1.0|
|   2.5|     1.0|
|   6.8|     1.0|
+------+--------+
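The result above follows a simple rule, which can be sketched in plain Python without Spark. The helper name `binarize` is hypothetical and is not part of pyspark; it only mirrors the documented behavior that values strictly greater than the threshold map to 1.0 and all others map to 0.0.

```python
def binarize(values, threshold=0.0):
    """Hypothetical helper mirroring Binarizer's rule (not a pyspark API).

    Values strictly greater than `threshold` become 1.0; values less than
    or equal to it become 0.0.
    """
    return [1.0 if v > threshold else 0.0 for v in values]


# Same values and threshold as the Spark example above.
values = [0.1, 2.3, 1.1, 4.2, 2.5, 6.8]
print(binarize(values, threshold=2.4))

# Boundary case: a value equal to the threshold is mapped to 0.0.
print(binarize([2.4], threshold=2.4))
```

The boundary case is worth keeping in mind when choosing a threshold: a value exactly equal to it ends up in the 0.0 group.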
05. Inspect the schema
res.printSchema()
Output:
root
 |-- values: double (nullable = true)
 |-- features: double (nullable = true)