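[One-Week Algorithm Advancement] Task 2: Feature Engineering

Task 2: Feature Engineering

Feature selection: select features with IV (Information Value) and with a random forest, then evaluate the seven models from [Algorithm Practice] (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM).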

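 1. What is IV? Information Value

This is my first encounter with IV, so here is a short introduction:

Put simply, IV measures the predictive power of a variable, much like the Gini coefficient or information gain. When building a model we usually need to screen the independent variables. With 200 candidate variables, for example, we normally would not feed all 200 into the model for fitting. Instead we use some method to pick a subset of them, and that subset becomes the list of variables entering the model.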

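2. Computing IV

We first introduce WOE, Weight of Evidence. The variable is first grouped (also called discretization or binning; these all mean the same thing). After grouping, the WOE of group i is:

$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{\#y_i / \#y_T}{\#n_i / \#n_T}\right)$$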

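Here py_i is the share of all responders in the sample that fall in this group (in a risk model a responder is a defaulting customer; in general, it is an individual whose target variable takes the value "yes", i.e. 1), and pn_i is the share of all non-responders in the sample that fall in this group. #y_i is the number of responders in the group, #n_i is the number of non-responders in the group, #y_T is the number of responders in the whole sample, and #n_T is the number of non-responders in the whole sample.

The formula compares the ratio of responders to non-responders within the group against the same ratio over the whole sample: it takes the ratio of the two ratios and then the logarithm. A WOE of 0 means the group has the same ratio as the whole sample. The larger the WOE, the more likely the samples in this group are to respond; the smaller (more negative) the WOE, the less likely they are to respond.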
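As a minimal sketch of the WOE definition above, the following computes the WOE of each group from responder and non-responder counts. The function name `woe_per_group` and the counts are made up for illustration, and every count is assumed to be non-zero.

```python
import math

def woe_per_group(events, non_events):
    """Return WOE_i = ln((#y_i / #y_T) / (#n_i / #n_T)) for each group."""
    total_events = sum(events)          # #y_T
    total_non_events = sum(non_events)  # #n_T
    result = []
    for y_i, n_i in zip(events, non_events):
        p_y = y_i / total_events        # py_i
        p_n = n_i / total_non_events    # pn_i
        result.append(math.log(p_y / p_n))
    return result

# Hypothetical counts for three groups of one variable
events = [10, 30, 60]        # responders per group
non_events = [90, 70, 40]    # non-responders per group
woe_values = woe_per_group(events, non_events)
print(woe_values)
```

The third group holds 60% of all responders but only 20% of all non-responders, so its WOE is ln(3), the largest of the three.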

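The IV calculation, where n is the number of groups:

$$IV_i = (py_i - pn_i) \cdot WOE_i = (py_i - pn_i) \ln\left(\frac{py_i}{pn_i}\right), \qquad IV = \sum_{i=1}^{n} IV_i$$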
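A small sketch of the IV calculation, assuming the usual definition IV_i = (py_i - pn_i) * WOE_i summed over all groups. The function name `information_value` and the counts are hypothetical, and all counts are assumed to be non-zero.

```python
import math

def information_value(events, non_events):
    """Sum (py_i - pn_i) * ln(py_i / pn_i) over all groups."""
    total_events = sum(events)
    total_non_events = sum(non_events)
    iv = 0.0
    for y_i, n_i in zip(events, non_events):
        p_y = y_i / total_events
        p_n = n_i / total_non_events
        iv += (p_y - p_n) * math.log(p_y / p_n)
    return iv

# Hypothetical variable whose groups separate responders well
iv_strong = information_value([10, 30, 60], [90, 70, 40])
# Hypothetical variable whose groups have the same mix as the whole sample
iv_none = information_value([10, 20], [30, 60])
print(iv_strong, iv_none)
```

Each term is non-negative because (py_i - pn_i) and ln(py_i / pn_i) always share the same sign, so a variable whose groups all mirror the overall mix gets an IV of 0.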

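3. IV cannot automatically handle a group whose response rate is 0 or 100%. Two cases arise:

(1) When the number of responders in a group is 0:

$$WOE_i = \ln\left(\frac{0 / \#y_T}{\#n_i / \#n_T}\right) = -\infty, \qquad IV_i = (0 - pn_i) \cdot (-\infty) = +\infty$$

(2) When the number of non-responders in a group is 0:

$$WOE_i = \ln\left(\frac{\#y_i / \#y_T}{0 / \#n_T}\right) = +\infty, \qquad IV_i = (py_i - 0) \cdot (+\infty) = +\infty$$

So what should we do when the response rate is 0 or 100%? The recommendations are: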

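(1) If possible, turn the group directly into a rule and use it as a precondition or supplementary condition of the model;

(2) Re-discretize or regroup the variable so that no group has a response rate of 0 or 100%. This is strongly recommended when a group is small (say, fewer than 100 individuals), since making a group that small is not very reasonable in the first place.

(3) If neither approach is feasible, manually adjust the responder and non-responder counts of the group: if the responder count was 0, set it to 1; if the non-responder count was 0, set it to 1.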
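Recommendation (3) can be sketched as follows. The helper name `adjusted_woe` and the counts are made up for illustration; the adjustment simply replaces a zero count with 1 before the WOE is computed.

```python
import math

def adjusted_woe(y_i, n_i, total_y, total_n):
    """WOE of one group after replacing a zero count with 1."""
    y_i = max(y_i, 1)   # responder count 0 -> 1
    n_i = max(n_i, 1)   # non-responder count 0 -> 1
    return math.log((y_i / total_y) / (n_i / total_n))

# Hypothetical group with no responders: 0 of 100 responders, 50 of 200 non-responders
woe_no_response = adjusted_woe(0, 50, total_y=100, total_n=200)
# Hypothetical group with no non-responders: 20 of 100 responders, 0 of 200 non-responders
woe_all_response = adjusted_woe(20, 0, total_y=100, total_n=200)
print(woe_no_response, woe_all_response)
```

Both values are finite, so the group's contribution to IV is finite as well. The adjustment still distorts the statistics of a very small group, which is why regrouping is listed first.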

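4. Feature selection with random forest

The idea behind random-forest feature importance is simple: measure how much each feature contributes in every tree of the forest, average those contributions, and then compare the averages across features. The contribution is usually measured with the Gini index or the out-of-bag (OOB) error rate.

Random forests offer two feature selection methods: mean decrease impurity and mean decrease accuracy.
Mean decrease impurity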

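Impurity is used to choose the best split condition at each node. For classification, Gini impurity or information gain is commonly used; for regression, variance or a least-squares fit.

When training a decision tree, we can compute how much each feature reduces the tree's impurity. For a forest, we can compute how much each feature reduces impurity on average and use that average as the feature's selection score.
[Drawbacks]
1) The method is biased in favor of variables with more categories;
2) When several correlated features relate to the label (any one of which would make a good feature), once one is selected the importance of the others drops sharply. This is misleading: the feature selected first looks very important while the rest look unimportant.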
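The impurity decrease that this method averages over the trees can be sketched for a single split. This is a simplified illustration for binary labels with made-up data, not the internals of any particular library; `gini` and `impurity_decrease` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
# A split that separates the classes completely
perfect = impurity_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
# A split that leaves both children as mixed as the parent
useless = impurity_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(perfect, useless)
```

A feature used for the first kind of split accumulates a large decrease, while one used for the second kind accumulates none. Summing these decreases per feature and averaging over the trees gives the importance score.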

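Mean decrease accuracy

This directly measures each feature's effect on the model's accuracy.

Shuffle the values of one feature at a time and measure how the shuffle affects the model's accuracy.
For an unimportant variable, shuffling has little effect on accuracy; for an important variable, shuffling lowers it.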
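The shuffling procedure can be sketched in plain Python. The data, the stand-in `model`, and the helper names are all hypothetical: feature 0 fully determines the label, feature 1 is noise, and the "model" is a fixed rule rather than a trained forest.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_drop(predict, rows, labels, col, rng):
    """Accuracy lost when the values of column `col` are shuffled across rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

def model(row):
    # Stand-in for a trained classifier: it only looks at feature 0
    return int(row[0] > 0.5)

drop_0 = permutation_drop(model, rows, labels, 0, rng)
drop_1 = permutation_drop(model, rows, labels, 1, rng)
print(drop_0, drop_1)
```

Shuffling feature 1 cannot change any prediction, so its drop is exactly 0, while shuffling feature 0 pushes the accuracy toward chance level.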

1. Import packages & read the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
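from sklearn.preprocessing import LabelBinarizer,OneHotEncoder
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer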

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

data_original=pd.read_csv('data.csv',skipinitialspace=True)
data=data_original.copy()
data.head(5)

[Output of data.head(5): the first five rows of the raw data, 90 columns. The column names appear in the data.info() listing in the next step.]

2. Drop irrelevant features

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
data.drop(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no', 'source','id_name'], axis=1, inplace=True)
object_cols = [col for col in data.columns if data[col].dtypes == 'O']
data_obj=data[object_cols].copy()    # copy, so the in-place edits below do not act on a slice of data
data_num=data.drop(object_cols,axis=1)

# Fill missing values: column mean for numeric features, forward fill for object features
imputer=Imputer(strategy='mean')
mean_num=imputer.fit_transform(data_num)
data_num=pd.DataFrame(mean_num,columns=data_num.columns)
data_obj.ffill(inplace=True)
# One-hot encode the regional preference column
encoder = LabelBinarizer()
# Pass a 1-D Series: LabelBinarizer expects a 1-D array of labels
reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_)
data_obj = pd.concat([data_obj, reg_preference_df], axis=1)

# Derive month and weekday features from latest_query_time and loans_latest_time
data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday

data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday

data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)

data=pd.concat([data_num,data_obj],axis=1)
data.shape

(4754, 90)
from sklearn.model_selection import train_test_split
y=data['status']
X=data.drop(['status'],axis=1)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=2018)
# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predicted labels
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    # Predicted probability of the positive class
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # Accuracy ('准确率' = accuracy, '训练集' = training set, '测试集' = test set)
    print('[准确率]', end = ' ')
    print('训练集:', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('测试集:', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # AUC ('auc值' = AUC value), computed with roc_auc_score
    print('[auc值]', end = ' ')
    print('训练集:', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('测试集:', '%.4f'%roc_auc_score(y_test, y_test_proba))

# Build the model and evaluate it
from sklearn.linear_model import LogisticRegressionCV

clf=LogisticRegressionCV(class_weight='balanced',max_iter=5000)
clf.fit(X_train,y_train)
model_metrics(clf, X_train, X_test, y_train, y_test)
[准确率] 训练集: 0.4956 测试集: 0.5032
[auc值] 训练集: 0.5861 测试集: 0.5765

3. Compute IV

import math
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):
    # Return a dict mapping each feature of X to its IV
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute the WOE and IV of this feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict)

    return iv_dict

def discrete(x):
    # Discretize the feature into 5 blocks bounded by the 0/20/40/60/80/100th percentiles
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        mask = (x >= point1) & (x <= point2)
        res[mask] = i + 1    # label the values in block [i, i+1] as i + 1
    return res

def woe_single_x(x, y, feature, event = 1):
    # event is the label of the positive class
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total

    iv = 0
    woe_dict = {}
    for x1 in set(x):    # iterate over the blocks
        # Select by position: y keeps its shuffled original index after train_test_split,
        # so y.reindex(...) with positional indices would pick the wrong rows
        y1 = y.iloc[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total

        # Avoid log(0) and division by zero when a block has no events or no non-events
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        elif rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv


iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv

[('historical_trans_amount', 2.6609646134512865),
 ('trans_amount_3_month', 2.5546436077538357),
 ('repayment_capability', 2.327229251967252),
 ('pawns_auctions_trusts_consume_last_6_month', 2.220777389641486),
 ('abs', 1.966985825643712),
 ('max_cumulative_consume_later_1_month', 1.4598660465564153),
 ('pawns_auctions_trusts_consume_last_1_month', 0.8530625616084101),
 ('avg_price_last_12_month', 0.7281431950917352),
 ('take_amount_in_later_12_month_highest', 0.4407207265219969),
 ('latest_query_time_month', 0.25139126628755865),
 ('loans_latest_time_weekday', 0.24326338644309412),
 ('history_fail_fee', 0.23601952893571299),
 ('loans_latest_time_month', 0.23316679232272933),
 ('latest_query_day', 0.23165030755336188),
 ('history_suc_fee', 0.23132587006862826),
 ('trans_days_interval', 0.23127346695672282),
 ('trans_activity_day', 0.23089021521474926),
 ('latest_six_month_apply', 0.23004076549705482),
 ('apply_score', 0.22999736959648898),
 ('loans_avg_limit', 0.22937233933022275),
 ('loans_credibility_limit', 0.22923404864220617),
 ('二线城市', 0.22835785178159998),
 ('low_volume_percent', 0.22831922306127952),
 ('consfin_credibility', 0.22804472290267083),
 ('avg_price_top_last_12_valid_month', 0.22804418697211443),
 ('latest_three_month_loan', 0.22786568449353656),
 ('historical_trans_day', 0.22785892580201067),
 ('latest_one_month_loan', 0.2259858987958161),
 ('trans_day_last_12_month', 0.2258295769673027),
 ('loans_cash_count', 0.22582167536745912),
 ('loans_org_count_current', 0.22582167536745912),
 ('first_transaction_day', 0.22577667440590374),
 ('first_transaction_time', 0.22558029316437583),
 ('trans_amount_increase_rate_lately', 0.22553777250765294),
 ('middle_volume_percent', 0.22535903805135094),
 ('consume_top_time_last_6_month', 0.2253530270376462),
 ('query_org_count', 0.22529059153249648),
 ('一线城市', 0.2250434530120855),
 ('trans_top_time_last_6_month', 0.22440575809597219),
 ('trans_fail_top_count_enum_last_1_month', 0.2242031888186113),
 ('loans_org_count_behavior', 0.22411751635509966),
 ('latest_six_month_loan', 0.22372060084022297),
 ('境外', 0.22366745673000382),
 ('loans_product_count', 0.223611713328623),
 ('consfin_avg_limit', 0.22347209785006059),
 ('trans_days_interval_filter', 0.22340299880606143),
 ('number_of_trans_from_2011', 0.22308989504593096),
 ('apply_credibility', 0.22274532659739935),
 ('loans_overdue_count', 0.22262741793816765),
 ('loans_score', 0.2225002169626543),
 ('loans_latest_day', 0.22241446845462567),
 ('consfin_credit_limit', 0.22228309104879887),
 ('loans_count', 0.22227945107950234),
 ('loans_credibility_behavior', 0.22203257178296298),
 ('loans_settle_count', 0.2219171432554008),
 ('rank_trad_1_month', 0.2218401640065109),
 ('query_cash_count', 0.2216362408399449),
 ('loans_long_time', 0.22161254075577275),
 ('regional_mobility', 0.22150812112017015),
 ('latest_query_time_weekday', 0.2215008902334139),
 ('query_sum_count', 0.2210085646317297),
 ('consume_top_time_last_1_month', 0.2206710654401162),
 ('consume_mini_time_last_1_month', 0.22038175378437908),
 ('trans_fail_top_count_enum_last_12_month', 0.22027549211946645),
 ('consfin_max_limit', 0.22016256897174105),
 ('trans_activity_month', 0.22015938020797166),
 ('top_trans_count_last_1_month', 0.22013392802621778),
 ('latest_one_month_apply', 0.2198386582706197),
 ('consfin_product_count', 0.21981612729230302),
 ('max_consume_count_later_6_month', 0.21980267330646783),
 ('latest_three_month_apply', 0.219745341760752),
 ('consfin_org_count_behavior', 0.21964389703494608),
 ('consfin_org_count_current', 0.21964389703494608),
 ('avg_consume_less_12_valid_month', 0.21883300876505346),
 ('trans_fail_top_count_enum_last_6_month', 0.21882948763455295),
 ('cross_consume_count_last_1_month', 0.21869363923411914),
 ('transd_mcc', 0.21865573796739254),
 ('其他城市', 0.2185500702073389),
 ('student_feature', 0.21833192508051125),
 ('loans_credit_limit', 0.21821927461429466),
 ('trans_top_time_last_1_month', 0.21803681758281673),
 ('query_finance_count', 0.21790525591920654),
 ('loans_max_limit', 0.21760772188869903),
 ('三线城市', 0.21755723341508837),
 ('latest_one_month_fail', 0.21753031430667408),
 ('is_high_user', 0.2175215044170788),
 ('latest_one_month_suc', 0.21715553601300325),
 ('railway_consume_count_last_12_month', 0.21687601425054276),
 ('jewelry_consume_count_last_6_month', 0.21687601425054276)]
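threshold = 0.1
data_index = []
for i in range(len(iv)):
    if iv[i][1] < threshold:
        data_index.append(iv[i][0])    # keep only the feature name, so the list can be passed to drop()
        print(iv[i])
#X_train.drop(data_index, axis=1, inplace=True)

Features that are useless or have weak predictive power, i.e. IV < 0.1, would be removed here. The smallest value in the list, however, is ('jewelry_consume_count_last_6_month', 0.21687601425054276), so no feature falls below the threshold and nothing is dropped at this step. Treat this conclusion with caution: nearly every feature in the list, including binary ones such as is_high_user, scores between 0.217 and 0.25. That pattern is consistent with the labels of each block having been looked up by position (as `y.reindex(np.where(...)[0])` does) in a `y_train` that still carries its shuffled original index, so the IV values should be recomputed with positional indexing (`y.iloc`) before they are used for selection.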
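Because `y_train` keeps its original row labels after `train_test_split`, positions returned by `np.where` have to be applied with `.iloc`, not with `reindex`. A tiny sketch with made-up values (the Series and the bin assignment are hypothetical) shows how the two differ:

```python
import numpy as np
import pandas as pd

# Hypothetical labels whose index is shuffled, as it is after train_test_split
y = pd.Series([1, 0, 1, 0, 0], index=[3, 0, 4, 1, 2])
# Hypothetical bin assignment for the same five rows, in the same order
x = np.array(["a", "b", "a", "b", "b"])

positions = np.where(x == "a")[0]    # rows 0 and 2, counted by position

by_label = y.reindex(positions)      # looks up the index labels 0 and 2
by_position = y.iloc[positions]      # takes the rows at positions 0 and 2

print(by_label.tolist())             # labels of unrelated rows
print(by_position.tolist())          # labels of the rows that are in bin "a"
```

Only the positional lookup returns the labels of the rows in bin "a". With a larger shuffled index, `reindex` can also return NaN for positions that do not exist as labels.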

4. Select features with a random forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('袋外分数:', rf.oob_score_)    # '袋外分数' = out-of-bag (OOB) score
model_metrics(rf, X_train, X_test, y_train, y_test)
feature_importance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)
袋外分数: 0.7868951006913135
[准确率] 训练集: 0.8200 测试集: 0.7800
[auc值] 训练集: 0.9010 测试集: 0.7686
feature_importance1
[('0.1361', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0933', 'history_fail_fee'),
 ('0.0779', 'loans_score'),
 ('0.0513', 'loans_overdue_count'),
 ('0.0508', 'apply_score'),
 ('0.0379', 'latest_one_month_fail'),
 ('0.0365', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0284', 'trans_fail_top_count_enum_last_12_month'),
 ('0.0199', 'trans_day_last_12_month'),
 ('0.0191', 'latest_one_month_suc'),
 ('0.0180', 'max_cumulative_consume_later_1_month'),
 ('0.0142', 'consfin_avg_limit'),
 ('0.0140', 'rank_trad_1_month'),
 ('0.0132', 'trans_amount_3_month'),
 ('0.0128', 'consume_top_time_last_1_month'),
 ('0.0125', 'latest_query_day'),
 ('0.0109', 'historical_trans_amount'),
 ('0.0099', 'trans_top_time_last_1_month'),
 ('0.0099', 'trans_activity_day'),
 ('0.0099', 'historical_trans_day'),
 ('0.0098', 'history_suc_fee'),
 ('0.0094', 'first_transaction_time'),
 ('0.0090', 'loans_latest_day'),
 ('0.0090', 'consfin_credit_limit'),
 ('0.0085', 'loans_count'),
 ('0.0084', 'loans_settle_count'),
 ('0.0083', 'trans_amount_increase_rate_lately'),
 ('0.0083', 'top_trans_count_last_1_month'),
 ('0.0079', 'latest_three_month_loan'),
 ('0.0079', 'first_transaction_day'),
 ('0.0079', 'consume_top_time_last_6_month'),
 ('0.0077', 'trans_days_interval'),
 ('0.0077', 'avg_price_last_12_month'),
 ('0.0075', 'loans_long_time'),
 ('0.0074', 'repayment_capability'),
 ('0.0073', 'consfin_max_limit'),
 ('0.0070', 'trans_top_time_last_6_month'),
 ('0.0070', 'trans_days_interval_filter'),
 ('0.0070', 'pawns_auctions_trusts_consume_last_6_month'),
 ('0.0070', 'latest_three_month_apply'),
 ('0.0068', 'trans_activity_month'),
 ('0.0067', 'loans_avg_limit'),
 ('0.0063', 'pawns_auctions_trusts_consume_last_1_month'),
 ('0.0061', 'consfin_credibility'),
 ('0.0060', 'loans_max_limit'),
 ('0.0058', 'loans_org_count_behavior'),
 ('0.0058', 'latest_six_month_loan'),
 ('0.0058', 'abs'),
 ('0.0054', 'apply_credibility'),
 ('0.0053', 'consume_mini_time_last_1_month'),
 ('0.0050', 'transd_mcc'),
 ('0.0050', 'consfin_product_count'),
 ('0.0049', 'avg_price_top_last_12_valid_month'),
 ('0.0048', 'latest_six_month_apply'),
 ('0.0047', 'take_amount_in_later_12_month_highest'),
 ('0.0047', 'loans_product_count'),
 ('0.0046', 'loans_cash_count'),
 ('0.0046', 'consfin_org_count_current'),
 ('0.0042', 'number_of_trans_from_2011'),
 ('0.0040', 'query_cash_count'),
 ('0.0040', 'loans_org_count_current'),
 ('0.0039', 'query_sum_count'),
 ('0.0039', 'loans_credit_limit'),
 ('0.0038', 'middle_volume_percent'),
 ('0.0036', 'consfin_org_count_behavior'),
 ('0.0035', 'max_consume_count_later_6_month'),
 ('0.0035', 'loans_credibility_behavior'),
 ('0.0034', 'query_finance_count'),
 ('0.0033', 'loans_latest_time_weekday'),
 ('0.0031', 'latest_one_month_apply'),
 ('0.0030', 'query_org_count'),
 ('0.0029', 'loans_credibility_limit'),
 ('0.0025', 'loans_latest_time_month'),
 ('0.0024', 'latest_query_time_weekday'),
 ('0.0018', 'latest_one_month_loan'),
 ('0.0016', 'latest_query_time_month'),
 ('0.0015', 'low_volume_percent'),
 ('0.0015', 'avg_consume_less_12_valid_month'),
 ('0.0011', 'cross_consume_count_last_1_month'),
 ('0.0009', '一线城市'),
 ('0.0009', 'regional_mobility'),
 ('0.0007', 'student_feature'),
 ('0.0004', '三线城市'),
 ('0.0000', '境外'),
 ('0.0000', '其他城市'),
 ('0.0000', '二线城市'),
 ('0.0000', 'railway_consume_count_last_12_month'),
 ('0.0000', 'jewelry_consume_count_last_6_month'),
 ('0.0000', 'is_high_user')]
X_train.head(5)
[Output of X_train.head(5): the first five rows of the training features, i.e. all preprocessed columns except status.]
# Keep the 30 most important features and drop the rest
useless = [name for _, name in feature_importance1[30:]]
X_train.drop(useless, axis = 1, inplace = True)
X_test.drop(useless, axis = 1, inplace = True)

5. Train the models

# Standardize the features (fit the scaler on the training set only)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score,f1_score,precision_score,recall_score,roc_auc_score,roc_curve

lr_model = LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear')  # the l1 penalty needs a solver that supports it, e.g. liblinear
svm_model = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
dt_model = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb_model = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5, 
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5, 
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
# Note: gamma is an XGBoost parameter name; LightGBM's counterpart is min_split_gain
lgbm_model = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11, 
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1, seed=27)
gbdt_model=GradientBoostingClassifier(n_estimators=100)

models={'LR':lr_model, 'SVM':svm_model, 'DT':dt_model, 'GBDT':gbdt_model, 'XGBoost':xgb_model, 'LGBM':lgbm_model}


df_result=pd.DataFrame(columns=('model','accuracy','precision','recall','f1_score','auc'))
row=0
# Define the evaluation function
def evaluate(y_pre,y):
    acc=accuracy_score(y,y_pre)
    p=precision_score(y,y_pre)
    r=recall_score(y,y_pre)
    f1=f1_score(y,y_pre)
    return acc,p,r,f1

for name,model in models.items():
    print(name,'start training...')
    model.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    y_proba=model.predict_proba(X_test)
    acc,p,r,f1=evaluate(y_pred,y_test)
    auc=roc_auc_score(y_test,y_proba[:,1])
    df_result.loc[row]=[name,acc,p,r,f1,auc]
    row+=1
print(df_result)

LR start training...
SVM start training...
DT start training...
GBDT start training...
XGBoost start training...
LGBM start training...
     model  accuracy  precision    recall  f1_score       auc
0       LR  0.786966   0.670807  0.300836  0.415385  0.785507
1      SVM  0.774352   0.707865  0.175487  0.281250  0.789644
2       DT  0.755431   0.518797  0.384401  0.441600  0.716664
3     GBDT  0.784163   0.616438  0.376045  0.467128  0.767292
4  XGBoost  0.791170   0.654822  0.359331  0.464029  0.782216
5     LGBM  0.791871   0.646226  0.381616  0.479860  0.777396
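In the table above, accuracy is close to 0.79 for most models while recall stays below 0.4. A minimal sketch of how these metrics follow from the confusion counts, using made-up labels rather than the data of this post (`binary_metrics` is a hypothetical helper, not the `evaluate` function above):

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical imbalanced sample: 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

With mostly negative samples, a model can reach a fair accuracy by predicting the majority class while still missing most positives, which is why recall and F1 are reported alongside accuracy.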

References:

1. Algorithm Practice 2

https://zhuanlan.zhihu.com/p/55913000

2. Feature importance with random forests

https://blog.csdn.net/zjuPeco/article/details/77371645?locationNum=7&fps=1

3. IV and WOE in data mining models

https://blog.csdn.net/kevin7658/article/details/50780391

4. Analysis of loan users' overdue behavior

https://blog.csdn.net/a786150017/article/details/84573202