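Task 2: Feature Engineering
Feature selection: select features with IV values and with a random forest, then evaluate the 7 models from [Algorithm Practice] (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM).
1. What is IV? Information Value
This is my first encounter with IV, so here is a quick introduction.
In short, IV measures the predictive power of a variable, much like the Gini coefficient or information gain. When building a model we usually need to screen the independent variables. With, say, 200 candidate variables, we normally do not feed all 200 straight into the model for fitting. Instead we use some method to pick a subset of them, and that subset becomes the list of model input variables.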
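2. Computing IV
First we introduce WOE, Weight of Evidence. The variable is first grouped (also called discretization or binning; they all mean the same thing). After grouping, the WOE of group i is:
$$WOE_i = \ln\frac{py_i}{pn_i} = \ln\frac{\#y_i / \#y_T}{\#n_i / \#n_T} = \ln\frac{\#y_i / \#n_i}{\#y_T / \#n_T}$$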
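Here py_i is the share of all responders in the sample that fall into this group (in a risk model the responders are the defaulting customers; in general they are the individuals whose target variable is "yes", i.e. 1), pn_i is the share of all non-responders in the sample that fall into this group, #y_i is the number of responders in this group, #n_i is the number of non-responders in this group, #y_T is the number of responders in the whole sample, and #n_T is the number of non-responders in the whole sample.
The expression measures how the ratio of responders to non-responders in the current group differs from the same ratio over the whole sample. The difference is expressed as the ratio of these two ratios, followed by a logarithm. A WOE of 0 means the group has the same ratio as the whole sample. The larger (more positive) the WOE, the more likely the samples in this group are to respond; the smaller (more negative) the WOE, the less likely they are to respond.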
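The grouping step that WOE relies on can be sketched as below. This is a minimal illustration, assuming equal-frequency (quantile) bins; `quantile_bins` is a hypothetical helper name and the `amounts` variable is made-up data, neither comes from the dataset used in this post.

```python
import numpy as np

def quantile_bins(x, n_bins=5):
    """Assign each value to one of n_bins equal-frequency bins, labelled 1..n_bins."""
    x = np.asarray(x, dtype=float)
    # Inner cut points, e.g. the 20th, 40th, 60th and 80th percentiles for 5 bins.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # side='left' puts a value equal to a cut point into the lower bin,
    # so every value lands in exactly one bin.
    return np.searchsorted(edges, x, side='left') + 1

rng = np.random.RandomState(0)
amounts = rng.exponential(scale=1000.0, size=1000)  # made-up skewed variable
bins = quantile_bins(amounts)
labels, counts = np.unique(bins, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Printing the per-bin counts is a cheap way to spot groups that are too small before any WOE is computed on them.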
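Computing IV: each group contributes its own IV_i, and the IV of the variable is the sum over all groups.
$$IV_i = (py_i - pn_i) \cdot WOE_i = (py_i - pn_i) \ln\frac{py_i}{pn_i}, \qquad IV = \sum_i IV_i$$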
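As a small worked example of WOE and IV, the sketch below computes both for one variable split into three groups. The counts are made up for illustration and `woe_iv` is a hypothetical helper name, not a library function.

```python
import math

def woe_iv(event_counts, non_event_counts):
    """Return (list of per-group WOE, total IV) from per-group counts.

    event_counts[i] plays the role of #y_i, non_event_counts[i] of #n_i.
    Assumes every count is > 0 so that no logarithm blows up.
    """
    y_total = sum(event_counts)      # #y_T
    n_total = sum(non_event_counts)  # #n_T
    woes, iv = [], 0.0
    for y_i, n_i in zip(event_counts, non_event_counts):
        py_i = y_i / y_total         # share of all responders in group i
        pn_i = n_i / n_total         # share of all non-responders in group i
        woe_i = math.log(py_i / pn_i)
        woes.append(woe_i)
        iv += (py_i - pn_i) * woe_i  # IV_i; both factors share the same sign
    return woes, iv

# Illustrative counts: 3 groups, 100 responders and 900 non-responders overall.
woes, iv = woe_iv([10, 30, 60], [300, 300, 300])
for i, w in enumerate(woes, start=1):
    print(f"group {i}: WOE = {w:+.4f}")
print(f"IV = {iv:.4f}")
```

Because (py_i - pn_i) and WOE_i always have the same sign, every group adds a non-negative amount to IV, which is why IV can be read as an overall strength while WOE keeps the direction.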
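3. IV cannot automatically handle a group whose response rate is 0 or 100%:
(1) When a group contains no responders (#y_i = 0),
$$py_i = 0 \;\Rightarrow\; WOE_i = \ln\frac{0}{pn_i} = -\infty, \qquad IV_i = (0 - pn_i) \cdot (-\infty) = +\infty$$
(2) When a group contains no non-responders (#n_i = 0),
$$pn_i = 0 \;\Rightarrow\; WOE_i = \ln\frac{py_i}{0} = +\infty, \qquad IV_i = (py_i - 0) \cdot (+\infty) = +\infty$$
So what should we do when the response rate is 0 or 100%? The suggestions are:
(1) If possible, turn the group directly into a rule that serves as a precondition or supplementary condition of the model.
(2) Re-discretize or regroup the variable so that no group has a response rate of 0 or 100%. This is strongly recommended when a group is very small (say fewer than 100 individuals), because making a group that small is not very reasonable in the first place.
(3) If neither of the above is possible, manually adjust the counts of the group: if the number of responders is 0, set it to 1; if the number of non-responders is 0, set it to 1.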
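Suggestion (3) can be sketched as follows. `adjusted_woe` is a hypothetical helper and the counts are made up; the only rule it applies is the one stated above, replacing a zero count with 1.

```python
import math

def adjusted_woe(y_i, n_i, y_total, n_total):
    """WOE of one group, bumping a zero count to 1 so the log stays finite."""
    y_i = max(y_i, 1)  # responders: 0 -> 1
    n_i = max(n_i, 1)  # non-responders: 0 -> 1
    return math.log((y_i / y_total) / (n_i / n_total))

# A group with no responders: the raw WOE would be ln(0), i.e. -infinity.
woe_zero_resp = adjusted_woe(0, 50, y_total=100, n_total=900)
# A group with no non-responders: the raw WOE would be +infinity.
woe_zero_nonresp = adjusted_woe(20, 0, y_total=100, n_total=900)
print(woe_zero_resp, woe_zero_nonresp)
```

The IV code later in this post takes a slightly different route: it replaces a zero rate with 0.0001 instead of adjusting the count.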
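4. Feature selection with random forest
The idea behind random forest feature importance is simple: measure how much each feature contributes in every tree of the forest, average those contributions, and compare the averages across features. The contribution is usually measured with the Gini index or with the out-of-bag (OOB) error rate.
Random forests offer two feature selection methods: mean decrease impurity and mean decrease accuracy.
Mean decrease impurity
Impurity is used to choose the best split condition at each node. For classification, Gini impurity or information gain is commonly used; for regression, variance or a least-squares fit.
When a decision tree is trained, we can compute how much each feature reduces the impurity of the tree. For a forest, we can average the impurity reduction of each feature over all trees and use that average as the feature selection score.
[Drawbacks]
1) The method is biased in favor of variables with more categories;
2) When several correlated features exist (any one of which would make a good feature), once one of them is selected the importance of the others drops sharply. This can mislead: the feature selected first looks important and the rest look unimportant, even though they carry much the same information.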
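Drawback 2) can be made visible with a small experiment. The sketch below is only an illustration on synthetic data: it trains the same forest once with a single informative column and once with that column duplicated, and prints scikit-learn's impurity-based `feature_importances_`. The data, the helper name `mdi` and the hyperparameters are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 2000
signal = rng.normal(size=n)                      # informative feature
noise = rng.normal(size=n)                       # unrelated feature
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

def mdi(X):
    """Fit a forest and return its mean-decrease-impurity importances."""
    rf = RandomForestClassifier(n_estimators=50, max_depth=5,
                                max_features=None, random_state=0)
    rf.fit(X, y)
    return rf.feature_importances_

imp_single = mdi(np.column_stack([signal, noise]))
imp_dup = mdi(np.column_stack([signal, signal, noise]))  # signal appears twice
print('signal, noise         :', imp_single.round(3))
print('signal, signal, noise :', imp_dup.round(3))
```

When the informative column is duplicated, its importance is shared between the two copies, so each copy looks weaker than the single column did on its own.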
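Mean decrease accuracy
This method directly measures the effect of each feature on the model's accuracy.
It shuffles the values of each feature in turn and measures how much the shuffling changes the model's accuracy.
For an unimportant variable, shuffling has little effect on accuracy; for an important variable, shuffling lowers it.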
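The shuffling procedure can be written out by hand in a few lines. The sketch below is a minimal version on synthetic data, assuming a held-out validation split is used for scoring rather than OOB samples; all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # column 2 is pure noise

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline = accuracy_score(y_val, rf.predict(X_val))

drops = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    # Shuffle one column to break the link between that feature and the target.
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(baseline - accuracy_score(y_val, rf.predict(X_perm)))

print('baseline accuracy:', round(baseline, 4))
print('accuracy drop per feature:', np.round(drops, 4))
```

Recent scikit-learn versions also ship `sklearn.inspection.permutation_importance`, which repeats the shuffle several times and averages the result.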
1. Import packages & load the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
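from sklearn.preprocessing import LabelBinarizer, OneHotEncoder
# Imputer was replaced by sklearn.impute.SimpleImputer in scikit-learn 0.20 and removed in 0.22
try:
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    from sklearn.preprocessing import Imputer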
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
data_original=pd.read_csv('data.csv',skipinitialspace=True)
data=data_original.copy()
data.head(5)
 | Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | student_feature | repayment_capability | is_high_user | number_of_trans_from_2011 | first_transaction_time | historical_trans_amount | historical_trans_day | rank_trad_1_month | trans_amount_3_month | avg_consume_less_12_valid_month | abs | top_trans_count_last_1_month | avg_price_last_12_month | avg_price_top_last_12_valid_month | reg_preference_for_trad | trans_top_time_last_1_month | trans_top_time_last_6_month | consume_top_time_last_1_month | consume_top_time_last_6_month | cross_consume_count_last_1_month | trans_fail_top_count_enum_last_1_month | trans_fail_top_count_enum_last_6_month | trans_fail_top_count_enum_last_12_month | consume_mini_time_last_1_month | max_cumulative_consume_later_1_month | max_consume_count_later_6_month | railway_consume_count_last_12_month | pawns_auctions_trusts_consume_last_1_month | pawns_auctions_trusts_consume_last_6_month | jewelry_consume_count_last_6_month | status | source | first_transaction_day | trans_day_last_12_month | id_name | apply_score | apply_credibility | query_org_count | query_finance_count | query_cash_count | query_sum_count | latest_query_time | latest_one_month_apply | latest_three_month_apply | latest_six_month_apply | loans_score | loans_credibility_behavior | loans_count | loans_settle_count | loans_overdue_count | loans_org_count_behavior | consfin_org_count_behavior | loans_cash_count | latest_one_month_loan | latest_three_month_loan | latest_six_month_loan | history_suc_fee | history_fail_fee | latest_one_month_suc | latest_one_month_fail | loans_long_time | loans_latest_time | loans_credit_limit | loans_credibility_limit | loans_org_count_current | loans_product_count | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | NaN | 19890 | 0 | 30.0 | 20130817.0 | 149050 | 151.0 | 0.40 | 34030 | 7.0 | 3920 | 0.15 | 1020 | 0.55 | 一线城市 | 4.0 | 19.0 | 4.0 | 19.0 | 1.0 | 1.0 | 2.0 | 2.0 | 5.0 | 2170 | 6.0 | 0.0 | 1970 | 18040 | 0.0 | 1 | xs | 1738.0 | 85.0 | 蒋红 | 583.0 | 79.0 | 8.0 | 2.0 | 6.0 | 10.0 | 2018-04-25 | 2.0 | 5.0 | 8.0 | 552.0 | 73.0 | 37.0 | 34.0 | 2.0 | 10.0 | 1.0 | 9.0 | 1.0 | 1.0 | 13.0 | 37.0 | 7.0 | 1.0 | 0.0 | 341.0 | 2018-04-19 | 2200.0 | 72.0 | 9.0 | 10.0 | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | 1.0 | 16970 | 0 | 23.0 | 20160402.0 | 302910 | 224.0 | 0.35 | 10590 | 5.0 | 6950 | 0.05 | 1210 | 0.50 | 一线城市 | 13.0 | 30.0 | 13.0 | 30.0 | 0.0 | 0.0 | 3.0 | 3.0 | 330.0 | 2100 | 9.0 | 0.0 | 1820 | 15680 | 0.0 | 0 | xs | 779.0 | 84.0 | 崔向朝 | 653.0 | 73.0 | 7.0 | 4.0 | 2.0 | 8.0 | 2018-05-03 | 2.0 | 6.0 | 8.0 | 635.0 | 76.0 | 37.0 | 36.0 | 0.0 | 17.0 | 5.0 | 12.0 | 2.0 | 2.0 | 8.0 | 49.0 | 4.0 | 2.0 | 1.0 | 353.0 | 2018-05-05 | 2000.0 | 74.0 | 12.0 | 12.0 | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | NaN | 9710 | 0 | 9.0 | 20170617.0 | 11520 | 31.0 | 1.00 | 5710 | 5.0 | 840 | 0.65 | 570 | 0.65 | 一线城市 | 0.0 | 68.0 | 0.0 | 68.0 | 0.0 | 3.0 | 6.0 | 6.0 | 0.0 | 0 | 3.0 | 0.0 | 0 | 0 | 0.0 | 1 | xs | 338.0 | 95.0 | 王中云 | 654.0 | 76.0 | 11.0 | 5.0 | 5.0 | 16.0 | 2018-05-05 | 5.0 | 5.0 | 14.0 | 633.0 | 83.0 | 4.0 | 2.0 | 0.0 | 3.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | 157.0 | 2018-05-01 | 1500.0 | 77.0 | 2.0 | 2.0 | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | NaN | 6210 | 0 | 33.0 | 20130516.0 | 491130 | 360.0 | 0.15 | 91690 | 7.0 | 46850 | 0.05 | 1290 | 0.45 | 三线城市 | 6.0 | 8.0 | 6.0 | 8.0 | 0.0 | 1.0 | 8.0 | 8.0 | 31700.0 | 8140 | 9.0 | 0.0 | 2700 | 27970 | 0.0 | 0 | xs | 1831.0 | 82.0 | 何洋洋 | 595.0 | 79.0 | 12.0 | 7.0 | 4.0 | 22.0 | 2018-05-05 | 3.0 | 16.0 | 17.0 | 542.0 | 75.0 | 85.0 | 81.0 | 4.0 | 22.0 | 5.0 | 17.0 | 2.0 | 4.0 | 34.0 | 91.0 | 26.0 | 2.0 | 0.0 | 355.0 | 2018-05-03 | 1800.0 | 74.0 | 17.0 | 18.0 | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | NaN | 11150 | 0 | 12.0 | 20170312.0 | 61470 | 63.0 | 0.65 | 9770 | 6.0 | 760 | 1.00 | 1110 | 0.50 | 一线城市 | 0.0 | 66.0 | 0.0 | 66.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 1000 | 3.0 | 0.0 | 0 | 6410 | 0.0 | 1 | xs | 435.0 | 88.0 | 赵洋 | 541.0 | 75.0 | 11.0 | 3.0 | 4.0 | 14.0 | 2018-04-15 | 6.0 | 8.0 | 9.0 | 479.0 | 73.0 | 37.0 | 32.0 | 6.0 | 12.0 | 2.0 | 10.0 | 0.0 | 0.0 | 10.0 | 36.0 | 25.0 | 0.0 | 0.0 | 360.0 | 2018-01-07 | 1800.0 | 72.0 | 10.0 | 10.0 | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
2. Drop irrelevant features
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
trans_days_interval_filter 4746 non-null float64
trans_days_interval 4752 non-null float64
regional_mobility 4752 non-null float64
student_feature 1756 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4752 non-null float64
first_transaction_time 4752 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4752 non-null float64
rank_trad_1_month 4752 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4752 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4752 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4650 non-null float64
reg_preference_for_trad 4752 non-null object
trans_top_time_last_1_month 4746 non-null float64
trans_top_time_last_6_month 4746 non-null float64
consume_top_time_last_1_month 4746 non-null float64
consume_top_time_last_6_month 4746 non-null float64
cross_consume_count_last_1_month 4328 non-null float64
trans_fail_top_count_enum_last_1_month 4738 non-null float64
trans_fail_top_count_enum_last_6_month 4738 non-null float64
trans_fail_top_count_enum_last_12_month 4738 non-null float64
consume_mini_time_last_1_month 4728 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4746 non-null float64
railway_consume_count_last_12_month 4742 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4742 non-null float64
status 4754 non-null int64
source 4754 non-null object
first_transaction_day 4752 non-null float64
trans_day_last_12_month 4752 non-null float64
id_name 4478 non-null object
apply_score 4450 non-null float64
apply_credibility 4450 non-null float64
query_org_count 4450 non-null float64
query_finance_count 4450 non-null float64
query_cash_count 4450 non-null float64
query_sum_count 4450 non-null float64
latest_query_time 4450 non-null object
latest_one_month_apply 4450 non-null float64
latest_three_month_apply 4450 non-null float64
latest_six_month_apply 4450 non-null float64
loans_score 4457 non-null float64
loans_credibility_behavior 4457 non-null float64
loans_count 4457 non-null float64
loans_settle_count 4457 non-null float64
loans_overdue_count 4457 non-null float64
loans_org_count_behavior 4457 non-null float64
consfin_org_count_behavior 4457 non-null float64
loans_cash_count 4457 non-null float64
latest_one_month_loan 4457 non-null float64
latest_three_month_loan 4457 non-null float64
latest_six_month_loan 4457 non-null float64
history_suc_fee 4457 non-null float64
history_fail_fee 4457 non-null float64
latest_one_month_suc 4457 non-null float64
latest_one_month_fail 4457 non-null float64
loans_long_time 4457 non-null float64
loans_latest_time 4457 non-null object
loans_credit_limit 4457 non-null float64
loans_credibility_limit 4457 non-null float64
loans_org_count_current 4457 non-null float64
loans_product_count 4457 non-null float64
loans_max_limit 4457 non-null float64
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
# Drop identifier-like columns that carry no predictive signal
data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source','id_name'], axis=1, inplace=True)
object_cols = [col for col in data.columns if data[col].dtypes == 'O']
data_obj = data[object_cols].copy()  # copy so the in-place edits below do not act on a view of data
data_num = data.drop(object_cols, axis=1)
# Fill missing values: column mean for numeric features, forward fill for object features
imputer = Imputer(strategy='mean')
mean_num = imputer.fit_transform(data_num)
data_num = pd.DataFrame(mean_num, columns=data_num.columns)
data_obj.ffill(inplace=True)
# One-hot encode reg_preference_for_trad (LabelBinarizer yields one 0/1 column per class)
encoder = LabelBinarizer()
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
# Extract month and weekday from latest_query_time and loans_latest_time, then drop the raw dates
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
data = pd.concat([data_num, data_obj], axis=1)
data.shape
(4754, 90)
from sklearn.model_selection import train_test_split
y=data['status']
X=data.drop(['status'],axis=1)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score
def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predict labels and positive-class probabilities
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]
    # Accuracy
    print('[Accuracy]', end=' ')
    print('train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test:', '%.4f' % accuracy_score(y_test, y_test_pred))
    # AUC, computed from the predicted probabilities with roc_auc_score
    print('[AUC]', end=' ')
    print('train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test:', '%.4f' % roc_auc_score(y_test, y_test_proba))
# Build a baseline model and evaluate it
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(class_weight='balanced', max_iter=5000)
clf.fit(X_train, y_train)
model_metrics(clf, X_train, X_test, y_train, y_test)
[Accuracy] train: 0.4956 test: 0.5032
[AUC] train: 0.5861 test: 0.5765
3. Compute IV values
import math
from scipy import stats
from sklearn.utils.multiclass import type_of_target
def woe(X, y, event=1):
    # Compute the IV of every column of X against the binary target y
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute the WOE and IV of this feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict)
    return iv_dict
def discrete(x):
    # Discretize the feature into 5 equal-frequency bins (quintiles)
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.isin(x, x1)
        res[mask] = i + 1  # label values between the (i*20)th and ((i+1)*20)th percentiles as i + 1
    return res
def woe_single_x(x, y, feature, event=1):
    # event is the label of the positive class
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total
    iv = 0
    woe_dict = {}
    for x1 in set(x):  # iterate over the bins
        # Select labels by position: x is a positional NumPy array, while y keeps the
        # shuffled index from train_test_split, so a label-based reindex picks the wrong rows
        y1 = y.iloc[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total
        # Replace a zero rate with a small constant so the logarithm stays finite
        if rate_event == 0:
            rate_event = 0.0001
        elif rate_non_event == 0:
            rate_non_event = 0.0001
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv
iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key=lambda x: x[1], reverse=True)
iv
[('historical_trans_amount', 2.6609646134512865),
('trans_amount_3_month', 2.5546436077538357),
('repayment_capability', 2.327229251967252),
('pawns_auctions_trusts_consume_last_6_month', 2.220777389641486),
('abs', 1.966985825643712),
('max_cumulative_consume_later_1_month', 1.4598660465564153),
('pawns_auctions_trusts_consume_last_1_month', 0.8530625616084101),
('avg_price_last_12_month', 0.7281431950917352),
('take_amount_in_later_12_month_highest', 0.4407207265219969),
('latest_query_time_month', 0.25139126628755865),
('loans_latest_time_weekday', 0.24326338644309412),
('history_fail_fee', 0.23601952893571299),
('loans_latest_time_month', 0.23316679232272933),
('latest_query_day', 0.23165030755336188),
('history_suc_fee', 0.23132587006862826),
('trans_days_interval', 0.23127346695672282),
('trans_activity_day', 0.23089021521474926),
('latest_six_month_apply', 0.23004076549705482),
('apply_score', 0.22999736959648898),
('loans_avg_limit', 0.22937233933022275),
('loans_credibility_limit', 0.22923404864220617),
('二线城市', 0.22835785178159998),
('low_volume_percent', 0.22831922306127952),
('consfin_credibility', 0.22804472290267083),
('avg_price_top_last_12_valid_month', 0.22804418697211443),
('latest_three_month_loan', 0.22786568449353656),
('historical_trans_day', 0.22785892580201067),
('latest_one_month_loan', 0.2259858987958161),
('trans_day_last_12_month', 0.2258295769673027),
('loans_cash_count', 0.22582167536745912),
('loans_org_count_current', 0.22582167536745912),
('first_transaction_day', 0.22577667440590374),
('first_transaction_time', 0.22558029316437583),
('trans_amount_increase_rate_lately', 0.22553777250765294),
('middle_volume_percent', 0.22535903805135094),
('consume_top_time_last_6_month', 0.2253530270376462),
('query_org_count', 0.22529059153249648),
('一线城市', 0.2250434530120855),
('trans_top_time_last_6_month', 0.22440575809597219),
('trans_fail_top_count_enum_last_1_month', 0.2242031888186113),
('loans_org_count_behavior', 0.22411751635509966),
('latest_six_month_loan', 0.22372060084022297),
('境外', 0.22366745673000382),
('loans_product_count', 0.223611713328623),
('consfin_avg_limit', 0.22347209785006059),
('trans_days_interval_filter', 0.22340299880606143),
('number_of_trans_from_2011', 0.22308989504593096),
('apply_credibility', 0.22274532659739935),
('loans_overdue_count', 0.22262741793816765),
('loans_score', 0.2225002169626543),
('loans_latest_day', 0.22241446845462567),
('consfin_credit_limit', 0.22228309104879887),
('loans_count', 0.22227945107950234),
('loans_credibility_behavior', 0.22203257178296298),
('loans_settle_count', 0.2219171432554008),
('rank_trad_1_month', 0.2218401640065109),
('query_cash_count', 0.2216362408399449),
('loans_long_time', 0.22161254075577275),
('regional_mobility', 0.22150812112017015),
('latest_query_time_weekday', 0.2215008902334139),
('query_sum_count', 0.2210085646317297),
('consume_top_time_last_1_month', 0.2206710654401162),
('consume_mini_time_last_1_month', 0.22038175378437908),
('trans_fail_top_count_enum_last_12_month', 0.22027549211946645),
('consfin_max_limit', 0.22016256897174105),
('trans_activity_month', 0.22015938020797166),
('top_trans_count_last_1_month', 0.22013392802621778),
('latest_one_month_apply', 0.2198386582706197),
('consfin_product_count', 0.21981612729230302),
('max_consume_count_later_6_month', 0.21980267330646783),
('latest_three_month_apply', 0.219745341760752),
('consfin_org_count_behavior', 0.21964389703494608),
('consfin_org_count_current', 0.21964389703494608),
('avg_consume_less_12_valid_month', 0.21883300876505346),
('trans_fail_top_count_enum_last_6_month', 0.21882948763455295),
('cross_consume_count_last_1_month', 0.21869363923411914),
('transd_mcc', 0.21865573796739254),
('其他城市', 0.2185500702073389),
('student_feature', 0.21833192508051125),
('loans_credit_limit', 0.21821927461429466),
('trans_top_time_last_1_month', 0.21803681758281673),
('query_finance_count', 0.21790525591920654),
('loans_max_limit', 0.21760772188869903),
('三线城市', 0.21755723341508837),
('latest_one_month_fail', 0.21753031430667408),
('is_high_user', 0.2175215044170788),
('latest_one_month_suc', 0.21715553601300325),
('railway_consume_count_last_12_month', 0.21687601425054276),
('jewelry_consume_count_last_6_month', 0.21687601425054276)]
threshold = 0.1
data_index = []
for i in range(len(iv)):
    if iv[i][1] < threshold:
        data_index.append(iv[i][0])  # keep the feature name so the list can be passed to drop()
        print(iv[i])
#X_train.drop(data_index, axis=1, inplace=True)
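The intent here is to drop features that are useless or weakly predictive, i.e. those with IV < 0.1. In this run, however, the smallest value is ('jewelry_consume_count_last_6_month', 0.21687601425054276), so no feature falls below the threshold and nothing is dropped at this step. Most features sit in a narrow band between roughly 0.217 and 0.25, which is unusual for IV, so the ranking deserves a sanity check before it is relied on.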
4. Select features with a random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features=9, oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('OOB score:', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)
# Pair each importance (formatted to 4 decimals) with its feature name, largest first
feature_importance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)
OOB score: 0.7868951006913135
[Accuracy] train: 0.8200 test: 0.7800
[AUC] train: 0.9010 test: 0.7686
feature_importance1
[('0.1361', 'trans_fail_top_count_enum_last_1_month'),
('0.0933', 'history_fail_fee'),
('0.0779', 'loans_score'),
('0.0513', 'loans_overdue_count'),
('0.0508', 'apply_score'),
('0.0379', 'latest_one_month_fail'),
('0.0365', 'trans_fail_top_count_enum_last_6_month'),
('0.0284', 'trans_fail_top_count_enum_last_12_month'),
('0.0199', 'trans_day_last_12_month'),
('0.0191', 'latest_one_month_suc'),
('0.0180', 'max_cumulative_consume_later_1_month'),
('0.0142', 'consfin_avg_limit'),
('0.0140', 'rank_trad_1_month'),
('0.0132', 'trans_amount_3_month'),
('0.0128', 'consume_top_time_last_1_month'),
('0.0125', 'latest_query_day'),
('0.0109', 'historical_trans_amount'),
('0.0099', 'trans_top_time_last_1_month'),
('0.0099', 'trans_activity_day'),
('0.0099', 'historical_trans_day'),
('0.0098', 'history_suc_fee'),
('0.0094', 'first_transaction_time'),
('0.0090', 'loans_latest_day'),
('0.0090', 'consfin_credit_limit'),
('0.0085', 'loans_count'),
('0.0084', 'loans_settle_count'),
('0.0083', 'trans_amount_increase_rate_lately'),
('0.0083', 'top_trans_count_last_1_month'),
('0.0079', 'latest_three_month_loan'),
('0.0079', 'first_transaction_day'),
('0.0079', 'consume_top_time_last_6_month'),
('0.0077', 'trans_days_interval'),
('0.0077', 'avg_price_last_12_month'),
('0.0075', 'loans_long_time'),
('0.0074', 'repayment_capability'),
('0.0073', 'consfin_max_limit'),
('0.0070', 'trans_top_time_last_6_month'),
('0.0070', 'trans_days_interval_filter'),
('0.0070', 'pawns_auctions_trusts_consume_last_6_month'),
('0.0070', 'latest_three_month_apply'),
('0.0068', 'trans_activity_month'),
('0.0067', 'loans_avg_limit'),
('0.0063', 'pawns_auctions_trusts_consume_last_1_month'),
('0.0061', 'consfin_credibility'),
('0.0060', 'loans_max_limit'),
('0.0058', 'loans_org_count_behavior'),
('0.0058', 'latest_six_month_loan'),
('0.0058', 'abs'),
('0.0054', 'apply_credibility'),
('0.0053', 'consume_mini_time_last_1_month'),
('0.0050', 'transd_mcc'),
('0.0050', 'consfin_product_count'),
('0.0049', 'avg_price_top_last_12_valid_month'),
('0.0048', 'latest_six_month_apply'),
('0.0047', 'take_amount_in_later_12_month_highest'),
('0.0047', 'loans_product_count'),
('0.0046', 'loans_cash_count'),
('0.0046', 'consfin_org_count_current'),
('0.0042', 'number_of_trans_from_2011'),
('0.0040', 'query_cash_count'),
('0.0040', 'loans_org_count_current'),
('0.0039', 'query_sum_count'),
('0.0039', 'loans_credit_limit'),
('0.0038', 'middle_volume_percent'),
('0.0036', 'consfin_org_count_behavior'),
('0.0035', 'max_consume_count_later_6_month'),
('0.0035', 'loans_credibility_behavior'),
('0.0034', 'query_finance_count'),
('0.0033', 'loans_latest_time_weekday'),
('0.0031', 'latest_one_month_apply'),
('0.0030', 'query_org_count'),
('0.0029', 'loans_credibility_limit'),
('0.0025', 'loans_latest_time_month'),
('0.0024', 'latest_query_time_weekday'),
('0.0018', 'latest_one_month_loan'),
('0.0016', 'latest_query_time_month'),
('0.0015', 'low_volume_percent'),
('0.0015', 'avg_consume_less_12_valid_month'),
('0.0011', 'cross_consume_count_last_1_month'),
('0.0009', '一线城市'),
('0.0009', 'regional_mobility'),
('0.0007', 'student_feature'),
('0.0004', '三线城市'),
('0.0000', '境外'),
('0.0000', '其他城市'),
('0.0000', '二线城市'),
('0.0000', 'railway_consume_count_last_12_month'),
('0.0000', 'jewelry_consume_count_last_6_month'),
('0.0000', 'is_high_user')]
X_train.head(5)
 | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | student_feature | repayment_capability | is_high_user | number_of_trans_from_2011 | first_transaction_time | historical_trans_amount | historical_trans_day | rank_trad_1_month | trans_amount_3_month | avg_consume_less_12_valid_month | abs | top_trans_count_last_1_month | avg_price_last_12_month | avg_price_top_last_12_valid_month | trans_top_time_last_1_month | trans_top_time_last_6_month | consume_top_time_last_1_month | consume_top_time_last_6_month | cross_consume_count_last_1_month | trans_fail_top_count_enum_last_1_month | trans_fail_top_count_enum_last_6_month | trans_fail_top_count_enum_last_12_month | consume_mini_time_last_1_month | max_cumulative_consume_later_1_month | max_consume_count_later_6_month | railway_consume_count_last_12_month | pawns_auctions_trusts_consume_last_1_month | pawns_auctions_trusts_consume_last_6_month | jewelry_consume_count_last_6_month | first_transaction_day | trans_day_last_12_month | apply_score | apply_credibility | query_org_count | query_finance_count | query_cash_count | query_sum_count | latest_one_month_apply | latest_three_month_apply | latest_six_month_apply | loans_score | loans_credibility_behavior | loans_count | loans_settle_count | loans_overdue_count | loans_org_count_behavior | consfin_org_count_behavior | loans_cash_count | latest_one_month_loan | latest_three_month_loan | latest_six_month_loan | history_suc_fee | history_fail_fee | latest_one_month_suc | latest_one_month_fail | loans_long_time | loans_credit_limit | loans_credibility_limit | loans_org_count_current | loans_product_count | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | 一线城市 | 三线城市 | 二线城市 | 其他城市 | 境外 | latest_query_time_month | latest_query_time_weekday | loans_latest_time_month | loans_latest_time_weekday |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110 | 0.01 | 0.99 | 4000.0 | 0.96 | 1.00 | 0.405 | 16.0 | 29.0 | 28.0 | 1.0 | 1.000000 | 17570.0 | 0.0 | 13.0 | 20170217.0 | 181770.0 | 150.0 | 0.85 | 15610.0 | 7.0 | 2650.0 | 1.00 | 1220.0 | 0.45 | 0.0 | 29.0 | 0.0 | 29.0 | 0.0 | 6.0 | 9.0 | 9.0 | 0.0 | 220.0 | 6.0 | 0.0 | 0.0 | 10160.0 | 0.0 | 458.0 | 99.0 | 535.0 | 73.0 | 16.0 | 6.0 | 7.0 | 24.0 | 5.0 | 12.0 | 15.0 | 498.0 | 73.0 | 92.0 | 77.0 | 7.0 | 27.0 | 7.0 | 20.0 | 1.0 | 3.0 | 34.0 | 85.0 | 52.0 | 0.0 | 3.0 | 356.0 | 2400.0 | 72.0 | 20.0 | 22.0 | 5000.0 | 1845.0 | 10600.0 | 81.0 | 7.0 | 7.0 | 15600.0 | 8228.0 | 0.0 | 9.0 | 1 | 0 | 0 | 0 | 0 | 5 | 4 | 4 | 2 |
3394 | 0.03 | 0.97 | 500.0 | 0.87 | 1.00 | 0.205 | 18.0 | 27.0 | 27.0 | 3.0 | 1.001139 | 15310.0 | 0.0 | 12.0 | 20170331.0 | 63350.0 | 74.0 | 0.65 | 12200.0 | 6.0 | 3460.0 | 0.40 | 630.0 | 0.65 | 14.0 | 17.0 | 14.0 | 17.0 | 1.0 | 1.0 | 4.0 | 9.0 | 0.0 | 470.0 | 4.0 | 0.0 | 470.0 | 2060.0 | 0.0 | 416.0 | 82.0 | 540.0 | 81.0 | 8.0 | 3.0 | 3.0 | 9.0 | 1.0 | 3.0 | 6.0 | 510.0 | 76.0 | 19.0 | 16.0 | 3.0 | 7.0 | 5.0 | 2.0 | 1.0 | 1.0 | 5.0 | 22.0 | 11.0 | 1.0 | 0.0 | 357.0 | 2400.0 | 73.0 | 2.0 | 2.0 | 2600.0 | 1800.0 | 16300.0 | 78.0 | 5.0 | 5.0 | 21600.0 | 7160.0 | 30.0 | 27.0 | 1 | 0 | 0 | 0 | 0 | 4 | 5 | 4 | 1 |
3052 | 0.02 | 0.86 | 0.0 | 1.98 | 0.70 | 0.205 | 18.0 | 53.0 | 33.0 | 2.0 | 1.001139 | 12240.0 | 0.0 | 28.0 | 20141110.0 | 97190.0 | 93.0 | 0.45 | 33280.0 | 8.0 | 1060.0 | 0.30 | 930.0 | 0.55 | 11.0 | 21.0 | 11.0 | 21.0 | 0.0 | 0.0 | 4.0 | 21.0 | 0.0 | 1950.0 | 12.0 | 0.0 | 1950.0 | 8240.0 | 0.0 | 1288.0 | 82.0 | 516.0 | 75.0 | 14.0 | 8.0 | 6.0 | 19.0 | 5.0 | 8.0 | 12.0 | 482.0 | 77.0 | 16.0 | 16.0 | 2.0 | 8.0 | 5.0 | 3.0 | 0.0 | 0.0 | 7.0 | 20.0 | 5.0 | 0.0 | 0.0 | 314.0 | 1400.0 | 66.0 | 3.0 | 3.0 | 2300.0 | 1500.0 | 10400.0 | 82.0 | 5.0 | 5.0 | 13800.0 | 10320.0 | 3.0 | 137.0 | 1 | 0 | 0 | 0 | 0 | 5 | 4 | 12 | 3 |
490 | 0.02 | 0.81 | 1000.0 | 1.49 | 0.73 | 0.555 | 23.0 | 15.0 | 8.0 | 4.0 | 1.000000 | 4320.0 | 0.0 | 40.0 | 20130817.0 | 373700.0 | 356.0 | 0.30 | 61940.0 | 8.0 | 26200.0 | 0.10 | 1390.0 | 0.45 | 15.0 | 15.0 | 15.0 | 15.0 | 1.0 | 8.0 | 8.0 | 8.0 | 42936.0 | 3090.0 | 7.0 | 0.0 | 3140.0 | 67720.0 | 0.0 | 1738.0 | 82.0 | 491.0 | 74.0 | 11.0 | 6.0 | 4.0 | 12.0 | 1.0 | 4.0 | 7.0 | 448.0 | 78.0 | 40.0 | 22.0 | 7.0 | 17.0 | 11.0 | 6.0 | 0.0 | 3.0 | 18.0 | 40.0 | 78.0 | 0.0 | 10.0 | 356.0 | 2600.0 | 76.0 | 6.0 | 7.0 | 4500.0 | 2500.0 | 6600.0 | 78.0 | 11.0 | 12.0 | 17400.0 | 6418.0 | 20.0 | 51.0 | 0 | 1 | 0 | 0 | 0 | 4 | 1 | 3 | 5 |
1 | 0.02 | 0.94 | 2000.0 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | 1.000000 | 16970.0 | 0.0 | 23.0 | 20160402.0 | 302910.0 | 224.0 | 0.35 | 10590.0 | 5.0 | 6950.0 | 0.05 | 1210.0 | 0.50 | 13.0 | 30.0 | 13.0 | 30.0 | 0.0 | 0.0 | 3.0 | 3.0 | 330.0 | 2100.0 | 9.0 | 0.0 | 1820.0 | 15680.0 | 0.0 | 779.0 | 84.0 | 653.0 | 73.0 | 7.0 | 4.0 | 2.0 | 8.0 | 2.0 | 6.0 | 8.0 | 635.0 | 76.0 | 37.0 | 36.0 | 0.0 | 17.0 | 5.0 | 12.0 | 2.0 | 2.0 | 8.0 | 49.0 | 4.0 | 2.0 | 1.0 | 353.0 | 2000.0 | 74.0 | 12.0 | 12.0 | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 5 | 3 | 5 | 5 |
# Keep the 30 most important features and drop the rest from both splits
useless = []
for feature in X_train.columns:
    if feature in [t[1] for t in feature_importance1[30:]]:
        useless.append(feature)
X_train.drop(useless, axis=1, inplace=True)
X_test.drop(useless, axis=1, inplace=True)
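The same "keep the top-k features by importance" step can also be expressed with scikit-learn's `SelectFromModel`. The sketch below is an alternative shown on synthetic data, not the method used in this post; `X_demo`, `y_demo`, `rf_demo` and the choice of 8 features are made up for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# threshold=-np.inf disables the importance cutoff, so max_features alone
# decides how many columns survive.
selector = SelectFromModel(rf_demo, threshold=-np.inf, max_features=8)
X_selected = selector.fit_transform(X_demo, y_demo)

print('kept columns:', np.flatnonzero(selector.get_support()))
print('shape after selection:', X_selected.shape)
```

A fitted selector can be applied to the test split with `selector.transform`, which keeps the train and test columns aligned without a manual drop.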
5. Train the models
# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,roc_auc_score,roc_curve
# The l1 penalty needs a solver that supports it; liblinear was the default before scikit-learn 0.22
lr_model = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
svm_model = svm.SVC(C=0.01, kernel='linear', probability=True)
dt_model = DecisionTreeClassifier(max_depth=5, min_samples_split=50, min_samples_leaf=60, max_features=9, random_state=2333)
xgb_model = XGBClassifier(learning_rate=0.1, n_estimators=80, max_depth=3, min_child_weight=5,
                          gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5,
                          objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
lgbm_model = LGBMClassifier(learning_rate=0.1, n_estimators=100, max_depth=3, min_child_weight=11,
                            gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5,
                            nthread=4, scale_pos_weight=1, seed=27)
gbdt_model = GradientBoostingClassifier(n_estimators=100)
models = {'LR': lr_model, 'SVM': svm_model, 'DT': dt_model, 'GBDT': gbdt_model, 'XGBoost': xgb_model, 'LGBM': lgbm_model}
df_result = pd.DataFrame(columns=('model', 'accuracy', 'precision', 'recall', 'f1_score', 'auc'))
row = 0
# Evaluation helper: accuracy, precision, recall and F1 for hard predictions
def evaluate(y_pre, y):
    acc = accuracy_score(y, y_pre)
    p = precision_score(y, y_pre)
    r = recall_score(y, y_pre)
    f1 = f1_score(y, y_pre)
    return acc, p, r, f1
# Fit every model on the training split and score it on the test split
for name, model in models.items():
    print(name, 'start training...')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)
    acc, p, r, f1 = evaluate(y_pred, y_test)
    auc = roc_auc_score(y_test, y_proba[:, 1])
    df_result.loc[row] = [name, acc, p, r, f1, auc]
    row += 1
print(df_result)
LR start training...
SVM start training...
DT start training...
GBDT start training...
XGBoost start training...
LGBM start training...
model accuracy precision recall f1_score auc
0 LR 0.786966 0.670807 0.300836 0.415385 0.785507
1 SVM 0.774352 0.707865 0.175487 0.281250 0.789644
2 DT 0.755431 0.518797 0.384401 0.441600 0.716664
3 GBDT 0.784163 0.616438 0.376045 0.467128 0.767292
4 XGBoost 0.791170 0.654822 0.359331 0.464029 0.782216
5 LGBM 0.791871 0.646226 0.381616 0.479860 0.777396
References:
1. Algorithm Practice 2
https://zhuanlan.zhihu.com/p/55913000
2. Feature importance with random forests
https://blog.csdn.net/zjuPeco/article/details/77371645?locationNum=7&fps=1
3. IV and WOE in data mining models
https://blog.csdn.net/kevin7658/article/details/50780391
4. Analysis of overdue loan users
https://blog.csdn.net/a786150017/article/details/84573202