DataWhale Team Learning: Data Mining Practice, Task 2

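This post shows how to use a random forest for feature selection in a data mining task: a random forest model is built with the sklearn library, and the important features are selected by their feature importance. It then shows how to standardize the data with StandardScaler so that the training data has zero mean and unit variance, which can help model performance.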

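Task 2 - Feature engineering (2 days)

  • Feature derivation
  • Feature selection: select features with IV values, a random forest, and other methods
  • ... plus any other feature engineering steps you can think of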

 

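I had not worked with IV values before, and yesterday I was also preparing for an interview, so for now I have completed the feature engineering task in a simple form. I will improve it in a couple of days when I have time.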
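For reference, the IV (information value) part of the task, which this post does not implement, could be sketched as follows. This is a minimal sketch and not the method used in this post. It assumes the common definition IV = Σ (good_i/good_total − bad_i/bad_total) × ln((good_i/good_total) / (bad_i/bad_total)) over the bins of a feature, a binary target where 1 means "bad", quantile binning, and a small smoothing constant. The function name `information_value`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd


def information_value(feature, target, bins=5, eps=0.5):
    """IV of one numeric feature against a binary target (assumed: 1 = bad, 0 = good).

    Assumptions: quantile binning and additive smoothing `eps` to avoid log(0).
    """
    df = pd.DataFrame({"x": feature, "y": target})
    # Quantile bins; duplicates="drop" merges identical bin edges on skewed features.
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    bad = grouped["sum"] + eps                      # smoothed count of y == 1 per bin
    good = grouped["count"] - grouped["sum"] + eps  # smoothed count of y == 0 per bin
    bad_dist = bad / bad.sum()
    good_dist = good / good.sum()
    woe = np.log(good_dist / bad_dist)              # weight of evidence per bin
    return float(((good_dist - bad_dist) * woe).sum())


# Synthetic demo data (not the data set used in this post).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=1000)
signal = y_demo + rng.normal(0.0, 0.5, size=1000)  # correlated with the target
noise = rng.normal(size=1000)                      # unrelated to the target

iv_signal = information_value(signal, y_demo)
iv_noise = information_value(noise, y_demo)
print("IV(signal):", iv_signal)
print("IV(noise): ", iv_noise)
```

A feature that carries information about the target should get a clearly higher IV than a pure-noise feature, so features could be ranked by IV and filtered with a cutoff, much like the importance threshold used in the random forest approach.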

 

Feature selection with a random forest

 

Build a random forest model with sklearn

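from sklearn.ensemble import RandomForestClassifier

feat_labels = X_train.columns  # feature names, used to label the importances

forest = RandomForestClassifier(n_estimators=10,    # number of trees
                                random_state=2018,  # fixed seed for reproducibility
                                n_jobs=-1)          # use all CPU cores

forest.fit(X_train, y_train)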

 

Compute the feature importances, set an importance threshold, and select the important features

import numpy as np

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
ele_important = []
for f in range(X_train.shape[1]):    # X_train.shape[1] = 85
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
    if importances[indices[f]] > 0.01:  # when the importance exceeds 0.01, add the feature to the list
        ele_important.append(feat_labels[indices[f]])

Output:

 1) history_fail_fee               0.053606
 2) trans_fail_top_count_enum_last_1_month 0.052335
 3) loans_score                    0.043602
 4) latest_one_month_fail          0.023373
 5) apply_score                    0.021318
 6) trans_amount_3_month           0.019262
 7) historical_trans_amount        0.019216
 8) historical_trans_day           0.017932
 9) loans_overdue_count            0.017879
10) trans_day_last_12_month        0.016720
11) trans_amount_increase_rate_lately 0.016153
12) abs                            0.016059
13) first_transaction_day          0.016038
14) latest_query_time              0.015707
15) trans_days_interval_filter     0.015685
16) pawns_auctions_trusts_consume_last_6_month 0.015427
17) trans_fail_top_count_enum_last_12_month 0.015320
18) trans_fail_top_count_enum_last_6_month 0.015038
19) loans_avg_limit                0.015004
20) loans_latest_time              0.014227
21) trans_top_time_last_6_month    0.014110
22) consfin_avg_limit              0.014079
23) avg_price_last_12_month        0.013732
24) history_suc_fee                0.013427
25) loans_credit_limit             0.013246
26) transd_mcc                     0.013121
27) repayment_capability           0.012875
28) trans_activity_day             0.012699
29) query_sum_count                0.012671
30) first_transaction_time         0.012670
31) consfin_credibility            0.012485
32) number_of_trans_from_2011      0.012200
33) trans_activity_month           0.012156
34) apply_credibility              0.011999
35) max_cumulative_consume_later_1_month 0.011906
36) latest_six_month_apply         0.011830
37) consfin_max_limit              0.011702
38) loans_settle_count             0.011602
39) consfin_credit_limit           0.011546
40) loans_max_limit                0.011419
41) latest_six_month_loan          0.011267
42) rank_trad_1_month              0.011264
43) query_finance_count            0.011121
44) loans_credibility_limit        0.010998
45) avg_price_top_last_12_valid_month 0.010975
46) loans_credibility_behavior     0.010795
47) consume_mini_time_last_1_month 0.010740
48) consume_top_time_last_6_month  0.010649
49) pawns_auctions_trusts_consume_last_1_month 0.010599
50) loans_org_count_behavior       0.010529
51) loans_product_count            0.010267
52) loans_long_time                0.010188
53) trans_top_time_last_1_month    0.010051
54) middle_volume_percent          0.009383
55) query_org_count                0.009220
56) trans_days_interval            0.009068
57) latest_one_month_apply         0.008750
58) loans_count                    0.008037
59) latest_three_month_apply       0.007853
60) latest_three_month_loan        0.007771
61) take_amount_in_later_12_month_highest 0.007490
62) loans_org_count_current        0.007430
63) consume_top_time_last_1_month  0.007418
64) consfin_org_count_current      0.007333
65) top_trans_count_last_1_month   0.007296
66) consfin_product_count          0.007164
67) max_consume_count_later_6_month 0.006900
68) query_cash_count               0.006763
69) cross_consume_count_last_1_month 0.006758
70) consfin_org_count_behavior     0.006517
71) latest_one_month_suc           0.006140
72) low_volume_percent             0.005943
73) loans_cash_count               0.005465
74) latest_one_month_loan          0.005230
75) avg_consume_less_12_valid_month 0.003915
76) regional_mobility              0.003877
77) student_feature                0.002383
78) reg_preference_for_trad_一线城市   0.001258
79) reg_preference_for_trad_三线城市   0.001258
80) reg_preference_for_trad_二线城市   0.001069
81) reg_preference_for_trad_境外     0.000674
82) railway_consume_count_last_12_month 0.000492
83) is_high_user                   0.000328
84) jewelry_consume_count_last_6_month 0.000000
85) reg_preference_for_trad_其他城市   0.000000

 

Keep only the important features in X_train and X_test

X_train = X_train[ele_important]
X_test = X_test[ele_important]
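The same threshold-based selection can also be expressed with sklearn's `SelectFromModel`. The following is a minimal sketch on synthetic data, not the code used in this post. The data from `make_classification`, the column names `f0` to `f19`, and the `_demo` variable names are assumptions made for illustration. Note that `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, whereas the loop above uses a strict `>`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real training data (assumption for illustration).
X_arr, y_demo = make_classification(n_samples=500, n_features=20,
                                    n_informative=5, random_state=2018)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

forest_demo = RandomForestClassifier(n_estimators=100, random_state=2018)
forest_demo.fit(X_demo, y_demo)

# prefit=True reuses the already fitted forest instead of refitting it.
selector = SelectFromModel(forest_demo, threshold=0.01, prefit=True)
mask = selector.get_support()        # boolean mask over the columns
selected = X_demo.columns[mask]      # names of the features that pass the threshold

X_demo_selected = X_demo[selected]
print(len(selected), "of", X_demo.shape[1], "features kept")
```

Selecting columns by name, as in `X_demo[selected]`, keeps the result a DataFrame, so the same column list can be applied to a test set in the same way as `ele_important` above.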

 

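Feature selection is done for now. It still needs refinement and a deeper look.

To do: standardize the data

Standardize the data (zero mean and unit variance)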

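import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and standard deviation from the training set only

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print("mean:", np.mean(X_train, axis=0),
      np.mean(X_test, axis=0))  # 0 for the training set; only close to 0 for the test set
print('std:', np.std(X_train, axis=0),
      np.std(X_test, axis=0))   # 1 for the training set; only close to 1 for the test set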
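Because the scaler is fitted on the training set only, the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized. A minimal sketch on synthetic data illustrates this. The random data, the split parameters, and the `_demo`, `X_tr`, and `X_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit variance.
rng = np.random.default_rng(2018)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(400, 3))
X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=2018)

scaler_demo = StandardScaler().fit(X_tr)  # statistics come from the training split only
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print("train mean:", X_tr_s.mean(axis=0))  # zero up to floating-point error
print("train std: ", X_tr_s.std(axis=0))   # one up to floating-point error
print("test mean: ", X_te_s.mean(axis=0))  # near zero, but not exactly
print("test std:  ", X_te_s.std(axis=0))   # near one, but not exactly
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step, which is why the test statistics are allowed to deviate slightly.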

 

 

--- End ---
