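ChiSqSelector: chi-squared feature selector
class pyspark.ml.feature.ChiSqSelector(numTopFeatures=50, featuresCol='features', outputCol=None, labelCol='label', selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)
Chi-squared feature selection, which selects categorical features to use for predicting a categorical label.
The selector supports several selection methods: numTopFeatures, percentile, fpr, fdr, and fwe.
numTopFeatures: selects a fixed number of top features according to a chi-squared test.
percentile: similar to numTopFeatures, but selects a fraction of all features instead of a fixed number.
fpr: selects all features whose p-values are below a threshold, which controls the false positive rate of the selection.
fdr: uses the Benjamini-Hochberg procedure to select all features whose false discovery rate is below a threshold.
fwe: selects all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, which controls the family-wise error rate of the selection.
By default, the selection method is numTopFeatures, and the default number of top features is 50.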
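The five selection rules can be illustrated in plain Python, given one p-value per feature. This is a minimal sketch and not Spark's implementation: the function name `select_features`, its snake_case parameters, and the sample p-values are hypothetical, and details such as strict versus non-strict comparisons and tie handling are assumptions.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by one selection rule."""
    n = len(p_values)
    # Rank feature indices by ascending p-value (most significant first).
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        kept = ranked[:num_top_features]
    elif selector_type == "percentile":
        kept = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        kept = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: keep the k smallest p-values, where k is the
        # largest rank whose p-value satisfies p <= fdr * rank / n.
        cutoff = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                cutoff = rank
        kept = ranked[:cutoff]
    elif selector_type == "fwe":
        # Threshold scaled by 1/numFeatures.
        kept = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError("unknown selector type: " + selector_type)
    return sorted(kept)


# Hypothetical p-values for five features.
p_values = [0.001, 0.045, 0.035, 0.5, 0.011]
print(select_features(p_values, "numTopFeatures", num_top_features=2))  # [0, 4]
print(select_features(p_values, "percentile", percentile=0.4))          # [0, 4]
print(select_features(p_values, "fpr"))                                 # [0, 1, 2, 4]
print(select_features(p_values, "fdr"))                                 # [0, 4]
print(select_features(p_values, "fwe"))                                 # [0]
```

With the same 0.05 threshold, fpr keeps the most features, fdr fewer, and fwe the fewest, which matches the increasing strictness of the error rate each rule controls.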
01. Initialization
from pyspark.sql import SparkSession
# Start a local Spark session; spark.driver.host sets the address the driver advertises.
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("ChiSqSelector").master("local[*]").getOrCreate()
02. Create the data
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["features", "label"])
df.show()
Output:
+------------------+-----+
|          features|label|
+------------------+-----+
|[0.0,0.0,18.0,1.0]|  1.0|
|[0.0,1.0,12.0,0.0]|  0.0|
|[1.0,0.0,15.0,0.1]|  0.0|
+------------------+-----+
03. Apply the chi-squared selector with numTopFeatures=1
from pyspark.ml.feature import ChiSqSelector
selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
model = selector.fit(df)
model.transform(df).show()
Output:
+------------------+-----+----------------+
|          features|label|selectedFeatures|
+------------------+-----+----------------+
|[0.0,0.0,18.0,1.0]|  1.0|          [18.0]|
|[0.0,1.0,12.0,0.0]|  0.0|          [12.0]|
|[1.0,0.0,15.0,0.1]|  0.0|          [15.0]|
+------------------+-----+----------------+
04. Apply the chi-squared selector with numTopFeatures=2
selector2 = ChiSqSelector(numTopFeatures=2, outputCol="selectedFeatures")
model2 = selector2.fit(df)
model2.transform(df).show()
Output:
+------------------+-----+----------------+
|          features|label|selectedFeatures|
+------------------+-----+----------------+
|[0.0,0.0,18.0,1.0]|  1.0|      [18.0,1.0]|
|[0.0,1.0,12.0,0.0]|  0.0|      [12.0,0.0]|
|[1.0,0.0,15.0,0.1]|  0.0|      [15.0,0.1]|
+------------------+-----+----------------+
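The selections above can be checked by hand. The sketch below recomputes a Pearson chi-squared statistic for each of the four feature columns against the label, treating every distinct feature value as its own category, and then ranks the features by p-value. It is plain Python rather than Spark code: the helper names `chi_square` and `p_value` are hypothetical, the p-value helper only covers the one and two degrees of freedom that occur in this data set, and breaking ties by feature index is an assumption that happens to agree with the output shown.

```python
import math
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-squared statistic and degrees of freedom for one feature."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_count in value_totals.items():
        for label, label_count in label_totals.items():
            # Expected cell count under independence of feature and label.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    dof = (len(value_totals) - 1) * (len(label_totals) - 1)
    return stat, dof


def p_value(stat, dof):
    """Upper-tail chi-squared probability, closed forms for dof 1 and 2 only."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof == 2:
        return math.exp(-stat / 2)
    raise NotImplementedError("only dof 1 and 2 are handled in this sketch")


# Same three rows and labels as the DataFrame in step 02.
rows = [[0.0, 0.0, 18.0, 1.0],
        [0.0, 1.0, 12.0, 0.0],
        [1.0, 0.0, 15.0, 0.1]]
labels = [1.0, 0.0, 0.0]

results = []
for j in range(4):
    column = [row[j] for row in rows]
    stat, dof = chi_square(column, labels)
    results.append((j, stat, dof, p_value(stat, dof)))

# sorted() is stable, so features with equal p-values keep their index order.
ranked = sorted(results, key=lambda r: r[3])
for index, stat, dof, p in ranked:
    print(f"feature {index}: statistic={stat:.3f} dof={dof} p={p:.4f}")
```

Features 2 and 3 each take three distinct values, so they get the same statistic (about 3.0 with 2 degrees of freedom) and the same, smallest p-value; features 0 and 1 both score about 0.75 with 1 degree of freedom. Under the tie-breaking assumption, the ranking starts with feature 2 and then feature 3, which is consistent with [18.0] being selected for numTopFeatures=1 and [18.0,1.0] for numTopFeatures=2.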