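Chi-square test: mainly used to compare two or more sample rates (proportions) and to analyze the association between two categorical variables (for example, binary ones). It does this by measuring how closely the observed frequencies agree with the expected (theoretical) frequencies, i.e. the goodness of fit.
Taking the iris dataset as an example, use the chi-square test in sklearn to select the features most related to the target variable: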
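To make the comparison of observed and expected frequencies concrete, here is a minimal sketch that computes the chi-square statistic by hand for a 2x2 contingency table and cross-checks it with SciPy. The counts in `observed` are hypothetical numbers chosen only for illustration, and using `scipy.stats.chi2_contingency` is an assumption of this sketch, not something the original text relies on.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (illustrative only):
# rows = group A / group B, columns = outcome yes / outcome no
observed = np.array([[30, 20],
                     [10, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected = row_totals @ col_totals / n

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check with SciPy; correction=False turns off Yates' continuity
# correction so the result matches the plain formula above
stat, p_value, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)

print("expected counts:\n", expected)
print("chi-square (manual):", chi2_stat, "dof:", dof)
print("chi-square (scipy): ", stat, "p-value:", p_value)
```

The larger the statistic relative to its degrees of freedom, the less plausible it is that the two variables are independent.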
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
# Import the chi-square scoring function
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print("Shapes of X, y: ", X.shape, y.shape)

# Compute the chi-square statistic and p-value of each feature in X against the target y
chi_values = chi2(X, y)
print("Chi-square statistics and p-values of each feature in X against y: ", chi_values)

# SelectKBest with chi2 keeps the 2 highest-scoring features.
# Read this as "the relatively better features among all features";
# it is not the same as the statistical conclusion of a chi-square test
# of whether a categorical variable is associated with the target.
sk = SelectKBest(chi2, k=2)
X_new = sk.fit_transform(X, y)
print(sk.scores_)   # chi-square statistic of each feature in X against y
print(sk.pvalues_)  # p-values

# The two selected features
print("The two selected features: ", X_new)
Output:
Shapes of X, y: (150, 4) (150,)
Chi-square statistics and p-values of each feature in X against y: (array([ 10.81782088, 3.7107283 , 116.31261309, 67.0483602 ]), array([4.47651499e-03, 1.56395980e-01, 5.53397228e-26, 2.75824965e-15]))
[ 10.81782088 3.7107283 116.31261309 67.0483602 ]
[4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
The two selected features: [[1.4 0.2]
[1.4 0.2]
[1.3 0.2]
[1.5 0.2]
[1.4 0.2]
[1.7 0.4]
[1.4 0.3]
[1.5 0.2]
[1.4 0.2]
[1.5 0.1]
[1.5 0.2]
[1.6 0.2]
[1.4 0.1]
[1.1 0.1]
[1.2 0.2]
[1.5 0.4]
[1.3 0.4]
[1.4 0.3]
[1.7 0.3]
[1.5 0.3]
[1.7 0.2]
[1.5 0.4]
[1. 0.2]
[1.7 0.5]
[1.9 0.2]
[1.6 0.2]
[1.6 0.4]
[1.5 0.2]
[1.4 0.2]
[1.6 0.2]
[1.6 0.2]
[1.5 0.4]
[1.5 0.1]
[1.4 0.2]
[1.5 0.2]
[1.2 0.2]
[1.3 0.2]
[1.4 0.1]
[1.3 0.2]
[1.5 0.2]
[1.3 0.3]
[1.3 0.3]
[1.3 0.2]
[1.6 0.6]
[1.9 0.4]
[1.4 0.3]
[1.6 0.2]
[1.4 0.2]
[1.5 0.2]
[1.4 0.2]
[4.7 1.4]
[4.5 1.5]
[4.9 1.5]
[4. 1.3]
[4.6 1.5]
[4.5 1.3]
[4.7 1.6]
[3.3 1. ]
[4.6 1.3]
[3.9 1.4]
[3.5 1. ]
[4.2 1.5]
[4. 1. ]
[4.7 1.4]
[3.6 1.3]
[4.4 1.4]
[4.5 1.5]
[4.1 1. ]
[4.5 1.5]
[3.9 1.1]
[4.8 1.8]
[4. 1.3]
[4.9 1.5]
[4.7 1.2]
[4.3 1.3]
[4.4 1.4]
[4.8 1.4]
[5. 1.7]
[4.5 1.5]
[3.5 1. ]
[3.8 1.1]
[3.7 1. ]
[3.9 1.2]
[5.1 1.6]
[4.5 1.5]
[4.5 1.6]
[4.7 1.5]
[4.4 1.3]
[4.1 1.3]
[4. 1.3]
[4.4 1.2]
[4.6 1.4]
[4. 1.2]
[3.3 1. ]
[4.2 1.3]
[4.2 1.2]
[4.2 1.3]
[4.3 1.3]
[3. 1.1]
[4.1 1.3]
[6. 2.5]
[5.1 1.9]
[5.9 2.1]
[5.6 1.8]
[5.8 2.2]
[6.6 2.1]
[4.5 1.7]
[6.3 1.8]
[5.8 1.8]
[6.1 2.5]
[5.1 2. ]
[5.3 1.9]
[5.5 2.1]
[5. 2. ]
[5.1 2.4]
[5.3 2.3]
[5.5 1.8]
[6.7 2.2]
[6.9 2.3]
[5. 1.5]
[5.7 2.3]
[4.9 2. ]
[6.7 2. ]
[4.9 1.8]
[5.7 2.1]
[6. 1.8]
[4.8 1.8]
[4.9 1.8]
[5.6 2.1]
[5.8 1.6]
[6.1 1.9]
[6.4 2. ]
[5.6 2.2]
[5.1 1.5]
[5.6 1.4]
[6.1 2.3]
[5.6 2.4]
[5.5 1.8]
[4.8 1.8]
[5.4 2.1]
[5.6 2.4]
[5.1 2.3]
[5.1 1.9]
[5.9 2.3]
[5.7 2.5]
[5.2 2.3]
[5. 1.9]
[5.2 2. ]
[5.4 2.3]
[5.1 1.8]]
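The printed array shows only values, not which columns were kept. The sketch below recovers the selected feature names with `get_support()` and, to illustrate why these scores differ from a classical contingency-table test, reproduces the scores by hand. The manual part reflects my understanding of how sklearn's `chi2` works (per-class sums of each non-negative feature treated as "observed" counts, with "expected" counts built from class proportions); treat it as an assumption that the final comparison merely checks, not as the library's documented contract. Names such as `manual_scores` and `selected` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Which columns did SelectKBest keep? get_support() returns a boolean mask.
selector = SelectKBest(chi2, k=2).fit(X, y)
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected features:", selected)

# Manual reproduction of the scores (assumed to mirror sklearn's chi2):
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot labels, shape (150, 3)

# "Observed": sum of each feature's values within each class, shape (3, 4)
observed = Y.T @ X
# "Expected": class proportion * total of each feature, shape (3, 4)
expected = Y.mean(axis=0)[:, None] * X.sum(axis=0)[None, :]

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
# p-values from the chi-square survival function with (n_classes - 1) degrees of freedom
manual_pvalues = chi2_dist.sf(manual_scores, df=len(classes) - 1)

sk_scores, sk_pvalues = chi2(X, y)
print("Scores match:  ", np.allclose(manual_scores, sk_scores))
print("P-values match:", np.allclose(manual_pvalues, sk_pvalues, rtol=1e-6, atol=0))
```

Because the "counts" here are sums of continuous measurements rather than true frequencies, the scores are best used for ranking features, which is consistent with the caveat in the code comments above.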