[One-Week Algorithm Advancement] -- Task 2: Feature Engineering

roy_blue

Article link: https://blog.csdn.net/wxq_1993/article/details/86695650

Task 2: Feature Engineering

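Feature selection: select features with IV values and with a random forest, respectively. Then evaluate the 7 models from [Algorithm Practice] (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM).

 1. What is IV? Information Value

This is my first encounter with IV, so here is a brief introduction:

In short, IV measures the predictive power of a variable; similar measures include the Gini coefficient and information gain. When building a model we often need to screen the independent variables. For example, given 200 candidate variables, we normally do not feed all 200 into the model for fitting. Instead we use some method to pick a subset of them, and that subset becomes the list of variables that enter the model.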

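2. Computing IV

First we introduce WOE, Weight of Evidence. Start by grouping the variable (also called discretization or binning; these all mean the same thing). After grouping, the WOE of group i is computed as:

$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{\#y_i / \#y_T}{\#n_i / \#n_T}\right)$$

Here py_i is the share of all responders in the sample that fall in this group (in a risk model a responder is a defaulting customer; in general, it is an individual whose target variable is "yes", i.e. 1), pn_i is the share of all non-responders in the sample that fall in this group, #y_i is the number of responders in this group, #n_i is the number of non-responders in this group, #y_T is the number of responders in the whole sample, and #n_T is the number of non-responders in the whole sample.

This expression compares the ratio of responders to non-responders within the group against the same ratio over the whole sample. The difference is expressed as the ratio of these two ratios, followed by a logarithm. The larger the WOE, the more likely the samples in this group are to respond; the smaller (more negative) the WOE, the less likely they are to respond.

Computing IV:

$$IV_i = (py_i - pn_i) \times WOE_i, \qquad IV = \sum_{i} IV_i$$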
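The WOE and IV definitions can be checked by hand on a tiny example. The sketch below is only an illustration: the function name `woe_iv` and the bin counts are made up, and each bin is assumed to contain at least one responder and one non-responder.

```python
import math

def woe_iv(bins):
    """bins: list of (responders, non_responders) counts, one pair per group."""
    y_total = sum(y_i for y_i, _ in bins)   # #y_T
    n_total = sum(n_i for _, n_i in bins)   # #n_T
    rows = []
    for y_i, n_i in bins:
        py_i = y_i / y_total                # share of all responders in this group
        pn_i = n_i / n_total                # share of all non-responders in this group
        woe_i = math.log(py_i / pn_i)
        iv_i = (py_i - pn_i) * woe_i
        rows.append((woe_i, iv_i))
    iv = sum(iv_i for _, iv_i in rows)
    return rows, iv

# Hypothetical counts for three groups of one variable
bins = [(10, 90), (30, 70), (60, 40)]
rows, iv = woe_iv(bins)
for (woe_i, iv_i), counts in zip(rows, bins):
    print(counts, round(woe_i, 4), round(iv_i, 4))
print('IV =', round(iv, 4))
```

Each IV_i is non-negative, because (py_i - pn_i) and WOE_i always share the same sign, so IV only grows as groups become more distinct from the overall sample.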

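3. IV cannot automatically handle a group whose response rate is 0 or 100%. The two cases are:

(1) When a group of the variable has 0 responders:

$$WOE_i = \ln\left(\frac{0}{pn_i}\right) = -\infty, \qquad IV_i = (0 - pn_i) \times WOE_i = +\infty$$

(2) When a group of the variable has 0 non-responders:

$$WOE_i = \ln\left(\frac{py_i}{0}\right) = +\infty, \qquad IV_i = (py_i - 0) \times WOE_i = +\infty$$

So what should we do when the response rate is 0 or 100%? The suggestions are:

(1) If possible, turn the group directly into a rule and use it as a precondition or supplementary condition of the model.

(2) Re-discretize or regroup the variable so that no group has a response rate of 0 or 100%. This is strongly recommended when a group contains very few individuals (say fewer than 100), because making a group that small is not very reasonable in the first place.

(3) If neither approach is feasible, manually adjust the responder and non-responder counts of the group. If the responder count was 0, set it to 1; if the non-responder count was 0, set it to 1.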
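Suggestion (3) can be sketched in a few lines. This is a minimal illustration, not the author's code: the helper name `adjusted_woe` and the counts are hypothetical, and a zero count is simply replaced by 1 before the WOE is computed.

```python
import math

def adjusted_woe(y_i, n_i, y_total, n_total):
    """WOE of one group, with a zero responder or non-responder count set to 1."""
    y_adj = max(y_i, 1)   # responder count: 0 becomes 1
    n_adj = max(n_i, 1)   # non-responder count: 0 becomes 1
    return math.log((y_adj / y_total) / (n_adj / n_total))

# Hypothetical group with no responders: the unadjusted WOE would be ln(0)
woe_zero = adjusted_woe(0, 50, 100, 900)
# A group with both counts positive is left unchanged
woe_normal = adjusted_woe(20, 30, 100, 900)
print(woe_zero, woe_normal)
```

The adjustment keeps the logarithm finite at the cost of a small bias, which is why it is listed as the last resort.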

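4. Feature selection with a random forest

The idea behind using a random forest to assess feature importance is simple: measure how much each feature contributes in every tree of the forest, average those contributions, and then compare the features against each other. The Gini index or the out-of-bag (OOB) error rate is commonly used as the metric.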

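Random forests offer two feature-selection methods: mean decrease impurity and mean decrease accuracy.
Mean decrease impurity

Impurity is used to choose the (optimal) split at each node. For classification, Gini impurity or information gain is typically used; for regression, variance (a least-squares fit) is typical.

While training a decision tree, we can compute how much each feature reduces the impurity of the tree. For a forest of decision trees, we can compute how much impurity each feature removes on average and use that average as the feature-selection score.
[Drawbacks]
1) The method is biased: it favors variables with more categories.
2) When several features are correlated with each other (any one of them would be a good predictor of the label), once one of them is selected the importance of the others drops sharply. This is misleading: the feature selected first appears important while the rest appear unimportant.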
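In scikit-learn, the mean decrease in impurity is exposed as the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data (the names `X_demo`, `y_demo` and `forest` are made up for illustration), where only the first column carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)   # only column 0 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, averaged over the trees and normalized to sum to 1
mdi = forest.feature_importances_
print(mdi)
```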

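Mean decrease accuracy

This method directly measures the effect of each feature on the model's accuracy.

Shuffle the values of one feature at a time and measure how the shuffling changes the model's accuracy.
For an unimportant variable, shuffling has little effect on accuracy; for an important variable, shuffling lowers the accuracy.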
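The shuffling procedure is available in scikit-learn (0.22 and later) as `sklearn.inspection.permutation_importance`. The following is a sketch on synthetic data, not part of the original experiment; `X_demo`, `y_demo` and the split are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)   # only column 0 determines the label

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set 10 times and record the drop in accuracy
result = permutation_importance(forest, X_te, y_te, scoring='accuracy',
                                n_repeats=10, random_state=0)
mda = result.importances_mean
print(mda)
```

Measuring the drop on held-out data rather than on the training set keeps the score from rewarding features the forest merely memorized.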

1. Import packages & read the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
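from sklearn.preprocessing import LabelBinarizer,OneHotEncoder
from sklearn.impute import SimpleImputer as Imputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22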

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

data_original=pd.read_csv('data.csv',skipinitialspace=True)
data=data_original.copy()
data.head(5)

	Unnamed: 0	custid	trade_no	bank_card_no	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	student_feature	repayment_capability	number_of_trans_from_2011	first_transaction_time	historical_trans_amount	historical_trans_day	rank_trad_1_month	trans_amount_3_month	avg_consume_less_12_valid_month	abs	top_trans_count_last_1_month	avg_price_last_12_month	avg_price_top_last_12_valid_month	reg_preference_for_trad	trans_top_time_last_1_month	trans_top_time_last_6_month	consume_top_time_last_1_month	consume_top_time_last_6_month	cross_consume_count_last_1_month	trans_fail_top_count_enum_last_1_month	trans_fail_top_count_enum_last_6_month	trans_fail_top_count_enum_last_12_month	consume_mini_time_last_1_month	max_cumulative_consume_later_1_month	max_consume_count_later_6_month	pawns_auctions_trusts_consume_last_1_month	pawns_auctions_trusts_consume_last_6_month	status	source	first_transaction_day	trans_day_last_12_month	id_name	apply_score	apply_credibility	query_org_count	query_finance_count	query_cash_count	query_sum_count	latest_query_time	latest_one_month_apply	latest_three_month_apply	latest_six_month_apply	loans_score	loans_credibility_behavior	loans_count	loans_settle_count	loans_overdue_count	loans_org_count_behavior	consfin_org_count_behavior	loans_cash_count	latest_one_month_loan	latest_three_month_loan	latest_six_month_loan	history_suc_fee	history_fail_fee	latest_one_month_suc	latest_one_month_fail	loans_long_time	loans_latest_time	loans_credit_limit	loans_credibility_limit	loans_org_count_current	loans_product_count	loans_max_limit	loans_avg_limit	consfin_credit_limit	consfin_credibility	consfin_org_count_current	consfin_product_count	consfin_max_limit	consfin_avg_limit	latest_query_day	loans_latest_day
0	5	2791858	20180507115231274000000023057383	卡号1	0.01	0.99	0	0.90	0.55	0.313	17.0	27.0	26.0	3.0	NaN	19890	30.0	20130817.0	149050	151.0	0.40	34030	7.0	3920	0.15	1020	0.55	一线城市	4.0	19.0	4.0	19.0	1.0	1.0	2.0	2.0	5.0	2170	6.0	1970	18040	1	xs	1738.0	85.0	蒋红	583.0	79.0	8.0	2.0	6.0	10.0	2018-04-25	2.0	5.0	8.0	552.0	73.0	37.0	34.0	2.0	10.0	1.0	9.0	1.0	1.0	13.0	37.0	7.0	1.0	0.0	341.0	2018-04-19	2200.0	72.0	9.0	10.0	2900.0	1688.0	1200.0	75.0	1.0	2.0	1200.0	1200.0	12.0	18.0
1	10	534047	20180507121002192000000023073000	卡号1	0.02	0.94	2000	1.28	1.00	0.458	19.0	30.0	14.0	4.0	1.0	16970	23.0	20160402.0	302910	224.0	0.35	10590	5.0	6950	0.05	1210	0.50	一线城市	13.0	30.0	13.0	30.0	0.0	0.0	3.0	3.0	330.0	2100	9.0	1820	15680	0	xs	779.0	84.0	崔向朝	653.0	73.0	7.0	4.0	2.0	8.0	2018-05-03	2.0	6.0	8.0	635.0	76.0	37.0	36.0	0.0	17.0	5.0	12.0	2.0	2.0	8.0	49.0	4.0	2.0	1.0	353.0	2018-05-05	2000.0	74.0	12.0	12.0	3500.0	1758.0	15100.0	80.0	5.0	6.0	22800.0	9360.0	4.0	2.0
2	12	2849787	20180507125159718000000023114911	卡号1	0.04	0.96	0	1.00	1.00	0.114	13.0	68.0	22.0	1.0	NaN	9710	9.0	20170617.0	11520	31.0	1.00	5710	5.0	840	0.65	570	0.65	一线城市	0.0	68.0	0.0	68.0	0.0	3.0	6.0	6.0	0.0	0	3.0	0	0	1	xs	338.0	95.0	王中云	654.0	76.0	11.0	5.0	5.0	16.0	2018-05-05	5.0	5.0	14.0	633.0	83.0	4.0	2.0	0.0	3.0	1.0	2.0	2.0	2.0	4.0	2.0	2.0	1.0	1.0	157.0	2018-05-01	1500.0	77.0	2.0	2.0	1600.0	1250.0	4200.0	87.0	1.0	1.0	4200.0	4200.0	2.0	6.0
3	13	1809708	20180507121358683000000388283484	卡号1	0.00	0.96	2000	0.13	0.57	0.777	22.0	14.0	6.0	3.0	NaN	6210	33.0	20130516.0	491130	360.0	0.15	91690	7.0	46850	0.05	1290	0.45	三线城市	6.0	8.0	6.0	8.0	0.0	1.0	8.0	8.0	31700.0	8140	9.0	2700	27970	0	xs	1831.0	82.0	何洋洋	595.0	79.0	12.0	7.0	4.0	22.0	2018-05-05	3.0	16.0	17.0	542.0	75.0	85.0	81.0	4.0	22.0	5.0	17.0	2.0	4.0	34.0	91.0	26.0	2.0	0.0	355.0	2018-05-03	1800.0	74.0	17.0	18.0	3200.0	1541.0	16300.0	80.0	5.0	5.0	30000.0	12180.0	2.0	4.0
4	14	2499829	20180507115448545000000388205844	卡号1	0.01	0.99	0	0.46	1.00	0.175	13.0	66.0	42.0	1.0	NaN	11150	12.0	20170312.0	61470	63.0	0.65	9770	6.0	760	1.00	1110	0.50	一线城市	0.0	66.0	0.0	66.0	0.0	3.0	3.0	3.0	0.0	1000	3.0	0	6410	1	xs	435.0	88.0	赵洋	541.0	75.0	11.0	3.0	4.0	14.0	2018-04-15	6.0	8.0	9.0	479.0	73.0	37.0	32.0	6.0	12.0	2.0	10.0	0.0	0.0	10.0	36.0	25.0	0.0	0.0	360.0	2018-01-07	1800.0	72.0	10.0	10.0	2300.0	1630.0	8300.0	79.0	2.0	2.0	8400.0	8250.0	22.0	120.0

2. Drop irrelevant features

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB

data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source','id_name'], axis=1, inplace=True)
object_cols = [col for col in data.columns if data[col].dtypes == 'O']
data_obj=data[object_cols].copy()   # copy so the in-place edits below do not touch a view of data
data_num=data.drop(object_cols,axis=1)

# Fill missing values
imputer=Imputer(strategy='mean')
mean_num=imputer.fit_transform(data_num)
data_num=pd.DataFrame(mean_num,columns=data_num.columns)
data_obj.ffill(inplace=True)
# One-hot encoding
encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)

#['latest_query_time']  ['loans_latest_time']
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday

data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday

data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)

data=pd.concat([data_num,data_obj],axis=1)
data.shape

(4754, 90)
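`LabelBinarizer` is designed for target labels. As an alternative sketch (not the method used in this post), `pd.get_dummies` produces the same kind of indicator columns for a categorical feature. The tiny frame `demo` below is made up for illustration and only reuses the column name and a few of the category values from the data set.

```python
import pandas as pd

demo = pd.DataFrame({'reg_preference_for_trad': ['一线城市', '三线城市', '境外', '一线城市']})

# One indicator column per distinct category, named after the category
dummies = pd.get_dummies(demo['reg_preference_for_trad'])
demo_encoded = pd.concat([demo.drop('reg_preference_for_trad', axis=1), dummies], axis=1)
print(demo_encoded)
```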

from sklearn.model_selection import train_test_split
y=data['status']
X=data.drop(['status'],axis=1)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)

# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predict
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # Accuracy
    print('[Accuracy]', end = ' ')
    print('train:', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('test:', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # AUC: use roc_auc_score or auc
    print('[AUC]', end = ' ')
    print('train:', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('test:', '%.4f'%roc_auc_score(y_test, y_test_proba))

# Build the model and evaluate it
from sklearn.linear_model import LogisticRegressionCV

clf=LogisticRegressionCV(class_weight='balanced',max_iter=5000)
clf.fit(X_train,y_train)
model_metrics(clf, X_train, X_test, y_train, y_test)

[Accuracy] train: 0.4956 test: 0.5032
[AUC] train: 0.5861 test: 0.5765

3. Compute the IV values

import math
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute the WOE and IV of this feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv

    return iv_dict

def discrete(x):
    # Discretize the feature into 5 equal-frequency bins
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.isin(x, x1)    # np.in1d is deprecated in recent NumPy
        res[mask] = i + 1    # label values between percentiles 20*i and 20*(i+1) as i + 1
    return res

def woe_single_x(x, y, feature, event=1):
    # event is the label of the positive class
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total

    iv = 0
    woe_dict = {}
    for x1 in set(x):    # iterate over the bins
        # Select rows by position. The original y.reindex(...) looked up index
        # labels, which no longer match positions after train_test_split shuffles
        # the index; the IV values printed below came from that label-based version.
        y1 = y.iloc[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total

        # Floor a zero rate so the logarithm stays finite
        if rate_event == 0:
            rate_event = 0.0001
        elif rate_non_event == 0:
            rate_non_event = 0.0001
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv


iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv

[('historical_trans_amount', 2.6609646134512865),
 ('trans_amount_3_month', 2.5546436077538357),
 ('repayment_capability', 2.327229251967252),
 ('pawns_auctions_trusts_consume_last_6_month', 2.220777389641486),
 ('abs', 1.966985825643712),
 ('max_cumulative_consume_later_1_month', 1.4598660465564153),
 ('pawns_auctions_trusts_consume_last_1_month', 0.8530625616084101),
 ('avg_price_last_12_month', 0.7281431950917352),
 ('take_amount_in_later_12_month_highest', 0.4407207265219969),
 ('latest_query_time_month', 0.25139126628755865),
 ('loans_latest_time_weekday', 0.24326338644309412),
 ('history_fail_fee', 0.23601952893571299),
 ('loans_latest_time_month', 0.23316679232272933),
 ('latest_query_day', 0.23165030755336188),
 ('history_suc_fee', 0.23132587006862826),
 ('trans_days_interval', 0.23127346695672282),
 ('trans_activity_day', 0.23089021521474926),
 ('latest_six_month_apply', 0.23004076549705482),
 ('apply_score', 0.22999736959648898),
 ('loans_avg_limit', 0.22937233933022275),
 ('loans_credibility_limit', 0.22923404864220617),
 ('二线城市', 0.22835785178159998),
 ('low_volume_percent', 0.22831922306127952),
 ('consfin_credibility', 0.22804472290267083),
 ('avg_price_top_last_12_valid_month', 0.22804418697211443),
 ('latest_three_month_loan', 0.22786568449353656),
 ('historical_trans_day', 0.22785892580201067),
 ('latest_one_month_loan', 0.2259858987958161),
 ('trans_day_last_12_month', 0.2258295769673027),
 ('loans_cash_count', 0.22582167536745912),
 ('loans_org_count_current', 0.22582167536745912),
 ('first_transaction_day', 0.22577667440590374),
 ('first_transaction_time', 0.22558029316437583),
 ('trans_amount_increase_rate_lately', 0.22553777250765294),
 ('middle_volume_percent', 0.22535903805135094),
 ('consume_top_time_last_6_month', 0.2253530270376462),
 ('query_org_count', 0.22529059153249648),
 ('一线城市', 0.2250434530120855),
 ('trans_top_time_last_6_month', 0.22440575809597219),
 ('trans_fail_top_count_enum_last_1_month', 0.2242031888186113),
 ('loans_org_count_behavior', 0.22411751635509966),
 ('latest_six_month_loan', 0.22372060084022297),
 ('境外', 0.22366745673000382),
 ('loans_product_count', 0.223611713328623),
 ('consfin_avg_limit', 0.22347209785006059),
 ('trans_days_interval_filter', 0.22340299880606143),
 ('number_of_trans_from_2011', 0.22308989504593096),
 ('apply_credibility', 0.22274532659739935),
 ('loans_overdue_count', 0.22262741793816765),
 ('loans_score', 0.2225002169626543),
 ('loans_latest_day', 0.22241446845462567),
 ('consfin_credit_limit', 0.22228309104879887),
 ('loans_count', 0.22227945107950234),
 ('loans_credibility_behavior', 0.22203257178296298),
 ('loans_settle_count', 0.2219171432554008),
 ('rank_trad_1_month', 0.2218401640065109),
 ('query_cash_count', 0.2216362408399449),
 ('loans_long_time', 0.22161254075577275),
 ('regional_mobility', 0.22150812112017015),
 ('latest_query_time_weekday', 0.2215008902334139),
 ('query_sum_count', 0.2210085646317297),
 ('consume_top_time_last_1_month', 0.2206710654401162),
 ('consume_mini_time_last_1_month', 0.22038175378437908),
 ('trans_fail_top_count_enum_last_12_month', 0.22027549211946645),
 ('consfin_max_limit', 0.22016256897174105),
 ('trans_activity_month', 0.22015938020797166),
 ('top_trans_count_last_1_month', 0.22013392802621778),
 ('latest_one_month_apply', 0.2198386582706197),
 ('consfin_product_count', 0.21981612729230302),
 ('max_consume_count_later_6_month', 0.21980267330646783),
 ('latest_three_month_apply', 0.219745341760752),
 ('consfin_org_count_behavior', 0.21964389703494608),
 ('consfin_org_count_current', 0.21964389703494608),
 ('avg_consume_less_12_valid_month', 0.21883300876505346),
 ('trans_fail_top_count_enum_last_6_month', 0.21882948763455295),
 ('cross_consume_count_last_1_month', 0.21869363923411914),
 ('transd_mcc', 0.21865573796739254),
 ('其他城市', 0.2185500702073389),
 ('student_feature', 0.21833192508051125),
 ('loans_credit_limit', 0.21821927461429466),
 ('trans_top_time_last_1_month', 0.21803681758281673),
 ('query_finance_count', 0.21790525591920654),
 ('loans_max_limit', 0.21760772188869903),
 ('三线城市', 0.21755723341508837),
 ('latest_one_month_fail', 0.21753031430667408),
 ('is_high_user', 0.2175215044170788),
 ('latest_one_month_suc', 0.21715553601300325),
 ('railway_consume_count_last_12_month', 0.21687601425054276),
 ('jewelry_consume_count_last_6_month', 0.21687601425054276)]

threshold = 0.1
data_index = []
for i in range(len(iv)):
    if iv[i][1] < threshold:
        data_index.append(iv[i][0])  # keep only the feature name so drop() can use it
        print(iv[i])
#X_train.drop(data_index, axis=1, inplace=True)

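Features that are useless or have weak predictive power, i.e. IV < 0.1, should be deleted. In this run, however, the smallest value is ('jewelry_consume_count_last_6_month', 0.21687601425054276), so no feature needs to be removed at this step. Note that almost every feature scores close to 0.22, which is unusual for IV and suggests the values are dominated by how bins were matched to labels rather than by real signal, so this ranking should be treated with caution.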
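As a cross-check on the IV ranking, the same quantity can be computed more compactly with `pd.qcut` and `groupby`. This is a sketch under stated assumptions, not the code that produced the values above: the function name `iv_qcut`, the `eps` floor (mirroring the 0.0001 floor used earlier) and the synthetic data are all made up for illustration, and the target is assumed to be binary 0/1.

```python
import numpy as np
import pandas as pd

def iv_qcut(x, y, n_bins=5, eps=1e-4):
    """IV of a numeric feature x against a binary 0/1 target y, with equal-frequency bins."""
    x = pd.Series(np.asarray(x))
    y = pd.Series(np.asarray(y))
    bins = pd.qcut(x, q=n_bins, duplicates='drop')   # equal-frequency binning
    grouped = y.groupby(bins, observed=True)
    event = grouped.sum()                            # responders per bin
    non_event = grouped.count() - event              # non-responders per bin
    rate_event = (event / event.sum()).clip(lower=eps)
    rate_non_event = (non_event / non_event.sum()).clip(lower=eps)
    woe = np.log(rate_event / rate_non_event)
    return float(((rate_event - rate_non_event) * woe).sum())

rng = np.random.RandomState(0)
x_signal = rng.normal(size=2000)
x_noise = rng.normal(size=2000)
y_demo = (x_signal + rng.normal(scale=0.5, size=2000) > 0).astype(int)

iv_signal = iv_qcut(x_signal, y_demo)   # feature related to the target
iv_noise = iv_qcut(x_noise, y_demo)     # feature unrelated to the target
print(iv_signal, iv_noise)
```

On data like this, an unrelated feature should land well below the 0.1 threshold, which is a quick sanity test for any IV implementation.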

4. Select features with a random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('OOB score:', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)
feature_importance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)

OOB score: 0.7868951006913135
[Accuracy] train: 0.8200 test: 0.7800
[AUC] train: 0.9010 test: 0.7686

feature_importance1

[('0.1361', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0933', 'history_fail_fee'),
 ('0.0779', 'loans_score'),
 ('0.0513', 'loans_overdue_count'),
 ('0.0508', 'apply_score'),
 ('0.0379', 'latest_one_month_fail'),
 ('0.0365', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0284', 'trans_fail_top_count_enum_last_12_month'),
 ('0.0199', 'trans_day_last_12_month'),
 ('0.0191', 'latest_one_month_suc'),
 ('0.0180', 'max_cumulative_consume_later_1_month'),
 ('0.0142', 'consfin_avg_limit'),
 ('0.0140', 'rank_trad_1_month'),
 ('0.0132', 'trans_amount_3_month'),
 ('0.0128', 'consume_top_time_last_1_month'),
 ('0.0125', 'latest_query_day'),
 ('0.0109', 'historical_trans_amount'),
 ('0.0099', 'trans_top_time_last_1_month'),
 ('0.0099', 'trans_activity_day'),
 ('0.0099', 'historical_trans_day'),
 ('0.0098', 'history_suc_fee'),
 ('0.0094', 'first_transaction_time'),
 ('0.0090', 'loans_latest_day'),
 ('0.0090', 'consfin_credit_limit'),
 ('0.0085', 'loans_count'),
 ('0.0084', 'loans_settle_count'),
 ('0.0083', 'trans_amount_increase_rate_lately'),
 ('0.0083', 'top_trans_count_last_1_month'),
 ('0.0079', 'latest_three_month_loan'),
 ('0.0079', 'first_transaction_day'),
 ('0.0079', 'consume_top_time_last_6_month'),
 ('0.0077', 'trans_days_interval'),
 ('0.0077', 'avg_price_last_12_month'),
 ('0.0075', 'loans_long_time'),
 ('0.0074', 'repayment_capability'),
 ('0.0073', 'consfin_max_limit'),
 ('0.0070', 'trans_top_time_last_6_month'),
 ('0.0070', 'trans_days_interval_filter'),
 ('0.0070', 'pawns_auctions_trusts_consume_last_6_month'),
 ('0.0070', 'latest_three_month_apply'),
 ('0.0068', 'trans_activity_month'),
 ('0.0067', 'loans_avg_limit'),
 ('0.0063', 'pawns_auctions_trusts_consume_last_1_month'),
 ('0.0061', 'consfin_credibility'),
 ('0.0060', 'loans_max_limit'),
 ('0.0058', 'loans_org_count_behavior'),
 ('0.0058', 'latest_six_month_loan'),
 ('0.0058', 'abs'),
 ('0.0054', 'apply_credibility'),
 ('0.0053', 'consume_mini_time_last_1_month'),
 ('0.0050', 'transd_mcc'),
 ('0.0050', 'consfin_product_count'),
 ('0.0049', 'avg_price_top_last_12_valid_month'),
 ('0.0048', 'latest_six_month_apply'),
 ('0.0047', 'take_amount_in_later_12_month_highest'),
 ('0.0047', 'loans_product_count'),
 ('0.0046', 'loans_cash_count'),
 ('0.0046', 'consfin_org_count_current'),
 ('0.0042', 'number_of_trans_from_2011'),
 ('0.0040', 'query_cash_count'),
 ('0.0040', 'loans_org_count_current'),
 ('0.0039', 'query_sum_count'),
 ('0.0039', 'loans_credit_limit'),
 ('0.0038', 'middle_volume_percent'),
 ('0.0036', 'consfin_org_count_behavior'),
 ('0.0035', 'max_consume_count_later_6_month'),
 ('0.0035', 'loans_credibility_behavior'),
 ('0.0034', 'query_finance_count'),
 ('0.0033', 'loans_latest_time_weekday'),
 ('0.0031', 'latest_one_month_apply'),
 ('0.0030', 'query_org_count'),
 ('0.0029', 'loans_credibility_limit'),
 ('0.0025', 'loans_latest_time_month'),
 ('0.0024', 'latest_query_time_weekday'),
 ('0.0018', 'latest_one_month_loan'),
 ('0.0016', 'latest_query_time_month'),
 ('0.0015', 'low_volume_percent'),
 ('0.0015', 'avg_consume_less_12_valid_month'),
 ('0.0011', 'cross_consume_count_last_1_month'),
 ('0.0009', '一线城市'),
 ('0.0009', 'regional_mobility'),
 ('0.0007', 'student_feature'),
 ('0.0004', '三线城市'),
 ('0.0000', '境外'),
 ('0.0000', '其他城市'),
 ('0.0000', '二线城市'),
 ('0.0000', 'railway_consume_count_last_12_month'),
 ('0.0000', 'jewelry_consume_count_last_6_month'),
 ('0.0000', 'is_high_user')]

X_train.head(5)

	low_volume_percent	middle_volume_percent	take_amount_in_later_12_month_highest	trans_amount_increase_rate_lately	trans_activity_month	trans_activity_day	transd_mcc	trans_days_interval_filter	trans_days_interval	regional_mobility	student_feature	repayment_capability	number_of_trans_from_2011	first_transaction_time	historical_trans_amount	historical_trans_day	rank_trad_1_month	trans_amount_3_month	avg_consume_less_12_valid_month	abs	top_trans_count_last_1_month	avg_price_last_12_month	avg_price_top_last_12_valid_month	trans_top_time_last_1_month	trans_top_time_last_6_month	consume_top_time_last_1_month	consume_top_time_last_6_month	cross_consume_count_last_1_month	trans_fail_top_count_enum_last_1_month	trans_fail_top_count_enum_last_6_month	trans_fail_top_count_enum_last_12_month	consume_mini_time_last_1_month	max_cumulative_consume_later_1_month	max_consume_count_later_6_month	pawns_auctions_trusts_consume_last_1_month	pawns_auctions_trusts_consume_last_6_month	first_transaction_day	trans_day_last_12_month	apply_score	apply_credibility	query_org_count	query_finance_count	query_cash_count	query_sum_count	latest_one_month_apply	latest_three_month_apply	latest_six_month_apply	loans_score	loans_credibility_behavior	loans_count	loans_settle_count	loans_overdue_count	loans_org_count_behavior	consfin_org_count_behavior	loans_cash_count	latest_one_month_loan	latest_three_month_loan	latest_six_month_loan	history_suc_fee	history_fail_fee	latest_one_month_suc	latest_one_month_fail	loans_long_time	loans_credit_limit	loans_credibility_limit	loans_org_count_current	loans_product_count	loans_max_limit	loans_avg_limit	consfin_credit_limit	consfin_credibility	consfin_org_count_current	consfin_product_count	consfin_max_limit	consfin_avg_limit	latest_query_day	loans_latest_day	一线城市	三线城市	latest_query_time_month	latest_query_time_weekday	loans_latest_time_month	loans_latest_time_weekday
110	0.01	0.99	4000.0	0.96	1.00	0.405	16.0	29.0	28.0	1.0	1.000000	17570.0	13.0	20170217.0	181770.0	150.0	0.85	15610.0	7.0	2650.0	1.00	1220.0	0.45	0.0	29.0	0.0	29.0	0.0	6.0	9.0	9.0	0.0	220.0	6.0	0.0	10160.0	458.0	99.0	535.0	73.0	16.0	6.0	7.0	24.0	5.0	12.0	15.0	498.0	73.0	92.0	77.0	7.0	27.0	7.0	20.0	1.0	3.0	34.0	85.0	52.0	0.0	3.0	356.0	2400.0	72.0	20.0	22.0	5000.0	1845.0	10600.0	81.0	7.0	7.0	15600.0	8228.0	0.0	9.0	1	0	5	4	4	2
3394	0.03	0.97	500.0	0.87	1.00	0.205	18.0	27.0	27.0	3.0	1.001139	15310.0	12.0	20170331.0	63350.0	74.0	0.65	12200.0	6.0	3460.0	0.40	630.0	0.65	14.0	17.0	14.0	17.0	1.0	1.0	4.0	9.0	0.0	470.0	4.0	470.0	2060.0	416.0	82.0	540.0	81.0	8.0	3.0	3.0	9.0	1.0	3.0	6.0	510.0	76.0	19.0	16.0	3.0	7.0	5.0	2.0	1.0	1.0	5.0	22.0	11.0	1.0	0.0	357.0	2400.0	73.0	2.0	2.0	2600.0	1800.0	16300.0	78.0	5.0	5.0	21600.0	7160.0	30.0	27.0	1	0	4	5	4	1
3052	0.02	0.86	0.0	1.98	0.70	0.205	18.0	53.0	33.0	2.0	1.001139	12240.0	28.0	20141110.0	97190.0	93.0	0.45	33280.0	8.0	1060.0	0.30	930.0	0.55	11.0	21.0	11.0	21.0	0.0	0.0	4.0	21.0	0.0	1950.0	12.0	1950.0	8240.0	1288.0	82.0	516.0	75.0	14.0	8.0	6.0	19.0	5.0	8.0	12.0	482.0	77.0	16.0	16.0	2.0	8.0	5.0	3.0	0.0	0.0	7.0	20.0	5.0	0.0	0.0	314.0	1400.0	66.0	3.0	3.0	2300.0	1500.0	10400.0	82.0	5.0	5.0	13800.0	10320.0	3.0	137.0	1	0	5	4	12	3
490	0.02	0.81	1000.0	1.49	0.73	0.555	23.0	15.0	8.0	4.0	1.000000	4320.0	40.0	20130817.0	373700.0	356.0	0.30	61940.0	8.0	26200.0	0.10	1390.0	0.45	15.0	15.0	15.0	15.0	1.0	8.0	8.0	8.0	42936.0	3090.0	7.0	3140.0	67720.0	1738.0	82.0	491.0	74.0	11.0	6.0	4.0	12.0	1.0	4.0	7.0	448.0	78.0	40.0	22.0	7.0	17.0	11.0	6.0	0.0	3.0	18.0	40.0	78.0	0.0	10.0	356.0	2600.0	76.0	6.0	7.0	4500.0	2500.0	6600.0	78.0	11.0	12.0	17400.0	6418.0	20.0	51.0	0	1	4	1	3	5
1	0.02	0.94	2000.0	1.28	1.00	0.458	19.0	30.0	14.0	4.0	1.000000	16970.0	23.0	20160402.0	302910.0	224.0	0.35	10590.0	5.0	6950.0	0.05	1210.0	0.50	13.0	30.0	13.0	30.0	0.0	0.0	3.0	3.0	330.0	2100.0	9.0	1820.0	15680.0	779.0	84.0	653.0	73.0	7.0	4.0	2.0	8.0	2.0	6.0	8.0	635.0	76.0	37.0	36.0	0.0	17.0	5.0	12.0	2.0	2.0	8.0	49.0	4.0	2.0	1.0	353.0	2000.0	74.0	12.0	12.0	3500.0	1758.0	15100.0	80.0	5.0	6.0	22800.0	9360.0	4.0	2.0	1	0	5	3	5	5

useless=[]
for feature in X_train.columns:
    if feature in [t[1] for t in feature_importance1[30:]]:
        useless.append(feature)

X_train.drop(useless, axis = 1, inplace = True)
X_test.drop(useless, axis = 1, inplace = True)
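Keeping the top-k features can also be written without sorting formatted strings, by putting the importances into a pandas Series. The sketch below runs on synthetic data; `X_demo`, `y_demo` and the choice of k = 2 are assumptions for illustration, whereas the post keeps the top 30 of its own features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f'f{i}' for i in range(6)])
y_demo = ((X_demo['f0'] + X_demo['f1']) > 0).astype(int)   # only f0 and f1 matter

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by column name, then keep the k largest
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top_k = importances.nlargest(2).index.tolist()
X_selected = X_demo[top_k]
print(top_k)
```

Selecting columns by name returns a new frame, so the same `top_k` list can be applied to the test set afterwards.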

5. Train the models

# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,roc_auc_score,roc_curve

lr_model = LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear')  # the default lbfgs solver does not support the L1 penalty
svm_model = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
dt_model = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb_model = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5, 
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5, 
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
lgbm_model = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11, 
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1, seed=27)
gbdt_model=GradientBoostingClassifier(n_estimators=100)

models={'LR':lr_model, 'SVM':svm_model, 'DT':dt_model, 'GBDT':gbdt_model, 'XGBoost':xgb_model, 'LGBM':lgbm_model}


df_result=pd.DataFrame(columns=('model','accuracy','precision','recall','f1_score','auc'))
row=0
# Define the evaluation function
def evaluate(y_pre,y):
    acc=accuracy_score(y,y_pre)
    p=precision_score(y,y_pre)
    r=recall_score(y,y_pre)
    f1=f1_score(y,y_pre)
    return acc,p,r,f1

for name,model in models.items():
    print(name,'start training...')
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    y_proba=model.predict_proba(X_test)
    acc,p,r,f1=evaluate(y_pred,y_test)
    auc=roc_auc_score(y_test,y_proba[:,1])
    df_result.loc[row]=[name,acc,p,r,f1,auc]
    row+=1
print(df_result)

LR start training...
SVM start training...
DT start training...
GBDT start training...
XGBoost start training...
LGBM start training...
     model  accuracy  precision    recall  f1_score       auc
0       LR  0.786966   0.670807  0.300836  0.415385  0.785507
1      SVM  0.774352   0.707865  0.175487  0.281250  0.789644
2       DT  0.755431   0.518797  0.384401  0.441600  0.716664
3     GBDT  0.784163   0.616438  0.376045  0.467128  0.767292
4  XGBoost  0.791170   0.654822  0.359331  0.464029  0.782216
5     LGBM  0.791871   0.646226  0.381616  0.479860  0.777396

References:

1. Algorithm Practice 2

https://zhuanlan.zhihu.com/p/55913000

2. Feature importance with random forests

https://blog.csdn.net/zjuPeco/article/details/77371645?locationNum=7&fps=1

3. IV and WOE in data mining models

https://blog.csdn.net/kevin7658/article/details/50780391

4. Analysis of loan users' overdue behavior

https://blog.csdn.net/a786150017/article/details/84573202