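Task 2 - Feature Engineering (2 days)
- Feature derivation
- Feature selection: select features using IV (information value), random forest, and similar methods
- ...and any other feature engineering steps you can think of
I had not worked with IV before, and I was also preparing for an interview yesterday, so for now I will complete the feature engineering task in a simple form and refine it in a couple of days when I have time.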
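For reference, the IV of a feature is usually defined over bins as the sum of (share of good samples - share of bad samples) * WOE, where WOE = ln(share of good / share of bad). The following is only a minimal sketch of that definition, not the method used in this post: the helper name `information_value`, the equal-frequency binning into 10 bins, the smoothing constant `eps`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def information_value(feature, target, bins=10, eps=0.5):
    """Hypothetical helper: IV of one numeric feature against a binary target (0 = good, 1 = bad)."""
    feature = pd.Series(feature).reset_index(drop=True)
    target = pd.Series(target).reset_index(drop=True)
    # Equal-frequency binning; duplicates="drop" merges bins that share an edge.
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, target)
    # Make sure both classes have a column even if one never appears.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    # Smooth the counts so an empty cell cannot cause log(0) or a division by zero.
    good = (counts[0] + eps) / (counts[0] + eps).sum()
    bad = (counts[1] + eps) / (counts[1] + eps).sum()
    woe = np.log(good / bad)
    return float(((good - bad) * woe).sum())


# Synthetic stand-in data (assumption): one feature related to the label, one pure noise.
rng = np.random.default_rng(2018)
y_demo = rng.integers(0, 2, size=2000)
informative = y_demo + rng.normal(0, 1, size=2000)
noise = rng.normal(0, 1, size=2000)

iv_informative = information_value(informative, y_demo)
iv_noise = information_value(noise, y_demo)
print("IV informative:", iv_informative)
print("IV noise:", iv_noise)
```

Applied to the real data, the same helper could be called once per column of `X_train`, and the columns ranked by IV. The choice of binning strongly affects the result, so the numbers are only comparable when every feature is binned the same way.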
Feature selection with a random forest
Build a random forest model with sklearn
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feat_labels = X_train.columns
forest = RandomForestClassifier(n_estimators=10,
                                random_state=2018,
                                n_jobs=-1)
forest.fit(X_train, y_train)
Compute the feature importances, set an importance threshold, and keep the features above it
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
ele_important = []
for f in range(X_train.shape[1]):  # X_train.shape[1] = 85
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
    if importances[indices[f]] > 0.01:  # when the importance exceeds 0.01, add the feature to the list
        ele_important.append(feat_labels[indices[f]])
Output:
1) history_fail_fee 0.053606
2) trans_fail_top_count_enum_last_1_month 0.052335
3) loans_score 0.043602
4) latest_one_month_fail 0.023373
5) apply_score 0.021318
6) trans_amount_3_month 0.019262
7) historical_trans_amount 0.019216
8) historical_trans_day 0.017932
9) loans_overdue_count 0.017879
10) trans_day_last_12_month 0.016720
11) trans_amount_increase_rate_lately 0.016153
12) abs 0.016059
13) first_transaction_day 0.016038
14) latest_query_time 0.015707
15) trans_days_interval_filter 0.015685
16) pawns_auctions_trusts_consume_last_6_month 0.015427
17) trans_fail_top_count_enum_last_12_month 0.015320
18) trans_fail_top_count_enum_last_6_month 0.015038
19) loans_avg_limit 0.015004
20) loans_latest_time 0.014227
21) trans_top_time_last_6_month 0.014110
22) consfin_avg_limit 0.014079
23) avg_price_last_12_month 0.013732
24) history_suc_fee 0.013427
25) loans_credit_limit 0.013246
26) transd_mcc 0.013121
27) repayment_capability 0.012875
28) trans_activity_day 0.012699
29) query_sum_count 0.012671
30) first_transaction_time 0.012670
31) consfin_credibility 0.012485
32) number_of_trans_from_2011 0.012200
33) trans_activity_month 0.012156
34) apply_credibility 0.011999
35) max_cumulative_consume_later_1_month 0.011906
36) latest_six_month_apply 0.011830
37) consfin_max_limit 0.011702
38) loans_settle_count 0.011602
39) consfin_credit_limit 0.011546
40) loans_max_limit 0.011419
41) latest_six_month_loan 0.011267
42) rank_trad_1_month 0.011264
43) query_finance_count 0.011121
44) loans_credibility_limit 0.010998
45) avg_price_top_last_12_valid_month 0.010975
46) loans_credibility_behavior 0.010795
47) consume_mini_time_last_1_month 0.010740
48) consume_top_time_last_6_month 0.010649
49) pawns_auctions_trusts_consume_last_1_month 0.010599
50) loans_org_count_behavior 0.010529
51) loans_product_count 0.010267
52) loans_long_time 0.010188
53) trans_top_time_last_1_month 0.010051
54) middle_volume_percent 0.009383
55) query_org_count 0.009220
56) trans_days_interval 0.009068
57) latest_one_month_apply 0.008750
58) loans_count 0.008037
59) latest_three_month_apply 0.007853
60) latest_three_month_loan 0.007771
61) take_amount_in_later_12_month_highest 0.007490
62) loans_org_count_current 0.007430
63) consume_top_time_last_1_month 0.007418
64) consfin_org_count_current 0.007333
65) top_trans_count_last_1_month 0.007296
66) consfin_product_count 0.007164
67) max_consume_count_later_6_month 0.006900
68) query_cash_count 0.006763
69) cross_consume_count_last_1_month 0.006758
70) consfin_org_count_behavior 0.006517
71) latest_one_month_suc 0.006140
72) low_volume_percent 0.005943
73) loans_cash_count 0.005465
74) latest_one_month_loan 0.005230
75) avg_consume_less_12_valid_month 0.003915
76) regional_mobility 0.003877
77) student_feature 0.002383
78) reg_preference_for_trad_一线城市 0.001258
79) reg_preference_for_trad_三线城市 0.001258
80) reg_preference_for_trad_二线城市 0.001069
81) reg_preference_for_trad_境外 0.000674
82) railway_consume_count_last_12_month 0.000492
83) is_high_user 0.000328
84) jewelry_consume_count_last_6_month 0.000000
85) reg_preference_for_trad_其他城市 0.000000
Restrict X_train and X_test to the important features
X_train = X_train[ele_important]
X_test = X_test[ele_important]
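The manual threshold loop above can also be expressed with scikit-learn's `SelectFromModel`. The sketch below is an alternative, not what this post ran: it uses synthetic stand-in data from `make_classification`, and the names `X_demo`, `y_demo`, `forest_demo`, `manual`, and `selected` are made up for illustration. One difference to keep in mind is that `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, while the loop above uses a strict `>`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real training set (assumption, not the post's data).
X_arr, y_demo = make_classification(n_samples=500, n_features=20,
                                    n_informative=5, random_state=2018)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(20)])

# Same forest settings as in the post.
forest_demo = RandomForestClassifier(n_estimators=10, random_state=2018, n_jobs=-1)
forest_demo.fit(X_demo, y_demo)

# Manual selection, mirroring the loop in the post (strict ">").
manual = [col for col, imp in zip(X_demo.columns, forest_demo.feature_importances_)
          if imp > 0.01]

# Equivalent selection with SelectFromModel on the already fitted forest (">=" threshold).
selector = SelectFromModel(forest_demo, threshold=0.01, prefit=True)
selected = list(X_demo.columns[selector.get_support()])

print("manual:  ", manual)
print("selected:", selected)
```

Because `get_support()` returns a boolean mask in column order, the selected names keep the original column order rather than being sorted by importance.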
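Feature selection is done for now; to be refined and explored further.
To do: standardize the data
Standardize the data (zero mean and unit variance)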
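from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and standard deviation from the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print("mean:", np.mean(X_train, axis=0),
      np.mean(X_test, axis=0))  # should be 0 for X_train, only approximately 0 for X_test
print('std:', np.std(X_train, axis=0),
      np.std(X_test, axis=0))  # should be 1 for X_train, only approximately 1 for X_test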
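Since the scaler is fitted on the training set only, the training data ends up with exactly zero mean and unit variance (up to floating-point error), while the test data is only close to those values. A small sketch on synthetic data illustrates this; the arrays `X_train_demo` and `X_test_demo` and their distribution are assumptions for illustration, not the data used in this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): both sets drawn from the same distribution.
rng = np.random.default_rng(2018)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(800, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Fit on the training set only, then apply the same transform to both sets.
scaler_demo = StandardScaler().fit(X_train_demo)
train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

print("train mean:", train_scaled.mean(axis=0))  # zero up to floating-point error
print("train std: ", train_scaled.std(axis=0))   # one up to floating-point error
print("test mean: ", test_scaled.mean(axis=0))   # close to zero, but not exactly
print("test std:  ", test_scaled.std(axis=0))    # close to one, but not exactly
```

Fitting the scaler on the test set as well would make its statistics look cleaner, but it would leak information from the test data into preprocessing.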
--- End ---