1. Feature Selection vs. Dimensionality Reduction
Similarity: the effect is the same; both aim to reduce the number of features in the dataset.
Difference: the methods differ.
Dimensionality reduction: maps the original features into a small set of new features, thereby reducing dimensionality.
Feature selection: picks from the original features a subset that is important to the model, achieving the same dimensionality-reduction goal. A minimal sketch contrasting the two follows below.
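Here is the sketch (a toy illustration assuming scikit-learn; the random data and the chosen column indices are invented for demonstration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # toy data: 100 samples, 10 original features

# Dimensionality reduction: map the 10 original features onto 3 new ones.
# The new columns are combinations of the originals, not original features.
X_reduced = PCA(n_components=3).fit_transform(X)  # shape (100, 3)

# Feature selection: keep a subset of the original columns unchanged.
selected_idx = [0, 4, 7]  # indices assumed chosen by some importance criterion
X_selected = X[:, selected_idx]  # shape (100, 3)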
1.1 Feature Selection
Why select features:
- improves prediction accuracy
- yields faster, cheaper-to-run prediction models
- makes the model easier to understand and interpret
Feature selection methods: Filter, Wrapper, Embedded (a sketch of all three follows this list).
Metric for evaluating variable importance: information value (a weight-based measure).
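As a hedged sketch of what the three families look like in practice (the scikit-learn estimators below are common choices, not the only ones):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently of any model (ANOVA F-test here).
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper: search feature subsets by repeatedly fitting a model
# (recursive feature elimination).
X_wrapper = RFE(LogisticRegression(max_iter=5000),
                n_features_to_select=10).fit_transform(X, y)

# Embedded: importance is a by-product of training the model itself.
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_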
$$\mathrm{InformationValue\ (IV)} = \sum_{i=1}^{n}\left(\mathrm{DistrGood}_i - \mathrm{DistrBad}_i\right) \times \ln\!\left(\frac{\mathrm{DistrGood}_i}{\mathrm{DistrBad}_i}\right)$$
If DistrGood_i > DistrBad_i, the weight (WoE) for that level is positive; otherwise it is negative. A worked example follows below.
Visualizing variable importance: trend analysis (drawing trend charts).
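To make the formula concrete, here is a tiny worked example with invented counts for a feature that has two levels, A and B:

import numpy as np

#            good  bad
# level A      80   20
# level B      20   80
good = np.array([80, 20])
bad = np.array([20, 80])

distr_good = good / good.sum()  # [0.8, 0.2]
distr_bad = bad / bad.sum()     # [0.2, 0.8]

woe = np.log(distr_good / distr_bad)         # [ 1.386, -1.386]
iv = np.sum((distr_good - distr_bad) * woe)  # 0.6*1.386 + 0.6*1.386 ≈ 1.664
print(woe, iv)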
2. Code (a walkthrough on the Titanic data)
import numpy as np
import pandas as pd
def information_value(target, feature):
    """Compute the information value (IV) of a discrete feature.

    :param target: ndarray, ground truth labels: 1 = positive, 0 = negative
    :param feature: ndarray, a discrete variable
    :return: (iv_value, iv_table)
    """
    iv_table = pd.DataFrame({'feature': feature, 'y': target})
    tot_good = np.sum(target)
    tot_bad = len(target) - tot_good
    # Count good (y=1) and bad (y=0) outcomes within each feature level.
    iv_table = iv_table.groupby('feature').agg(
        bad_count=('y', lambda x: len(x) - np.sum(x)),
        good_count=('y', 'sum'),
    )
    iv_table['bad_percent'] = iv_table['bad_count'] / tot_bad
    iv_table['good_percent'] = iv_table['good_count'] / tot_good
    # WoE = ln(DistrGood / DistrBad), matching the IV formula above
    iv_table['woe'] = np.log(iv_table['good_percent'] / iv_table['bad_percent'])
    iv_table['iv'] = (iv_table['good_percent'] - iv_table['bad_percent']) * iv_table['woe']
    iv_value = np.sum(iv_table['iv'])
    return iv_value, iv_table[['good_count', 'bad_count', 'good_percent', 'bad_percent', 'woe', 'iv']]
# Load the data and compute the IV of Pclass with respect to Survived.
titanic = pd.read_csv('./data/transaction.txt')
titanic.head()

feature = titanic.Pclass
target = titanic.Survived
iv_value, iv_table = information_value(target, feature)
print(iv_table)
print('information_value', iv_value)
# An information value above 0.4 is considered very good.
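Building on this, one way to use information_value for actual feature selection is to score several candidate columns and rank them. A hedged sketch (the column list assumes the standard Titanic schema):

# Rank candidate categorical features by information value.
candidates = ['Pclass', 'Sex', 'Embarked']  # assumed columns in the Titanic data
scores = {col: information_value(titanic['Survived'],
                                 titanic[col].fillna('missing'))[0]
          for col in candidates}
for col, iv in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{col}: IV = {iv:.3f}')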