DataWhale Team Study: Data Mining Practice, Task 2


方糖冰红茶


Category: Data Mining

Article link: https://blog.csdn.net/weixin_37855575/article/details/98869298


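This post shows how to use a random forest for feature selection in a data mining task: a random forest model is built with the sklearn library, and features are filtered by their importance. It then uses StandardScaler to standardize the data to zero mean and unit variance, with the aim of improving model performance.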


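Task 2 - Feature Engineering (2 days)

Feature derivation

Feature selection: select features using IV (Information Value), random forest, and similar methods

...and any other feature engineering steps you can think of

I had not worked with IV before, and I was also preparing for an interview yesterday, so for now I am completing the feature engineering task in a simple form. I will refine it in a couple of days when I have time.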
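For reference, IV is commonly computed by binning a feature, comparing each bin's share of events and non-events, and summing (non_event_share - event_share) * ln(non_event_share / event_share) over the bins. The following is a minimal sketch of that formula on synthetic data, not the method used in this post: the helper name `information_value`, the quantile binning, and the `eps` smoothing are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def information_value(feature, target, bins=5, eps=0.5):
    """Return the IV of one numeric feature against a binary 0/1 target.

    Hypothetical helper: quantile binning and the `eps` smoothing are
    illustrative choices, not taken from the original post.
    """
    feature = pd.Series(np.asarray(feature))
    target = pd.Series(np.asarray(target))
    # Integer bin codes from quantile bins; duplicates="drop" merges
    # bins when the feature has few distinct values.
    binned = pd.qcut(feature, q=bins, labels=False, duplicates="drop")
    # Count non-events (target == 0) and events (target == 1) per bin.
    counts = pd.crosstab(binned, target).reindex(columns=[0, 1], fill_value=0)
    # Additive smoothing keeps the logarithm finite when a bin has no events.
    non_event = (counts[0] + eps) / (counts[0] + eps).sum()
    event = (counts[1] + eps) / (counts[1] + eps).sum()
    woe = np.log(non_event / event)  # weight of evidence per bin
    return float(((non_event - event) * woe).sum())


rng = np.random.default_rng(2018)
y = rng.integers(0, 2, size=2000)
informative = y + rng.normal(scale=0.5, size=2000)  # shifts with the label
noise = rng.normal(size=2000)                       # unrelated to the label

iv_informative = information_value(informative, y)
iv_noise = information_value(noise, y)
print(f"informative: {iv_informative:.3f}, noise: {iv_noise:.3f}")
```

Each term of the sum is non-negative, so a feature unrelated to the label scores close to 0 while a feature that separates the classes scores higher. Features could then be ranked by IV in the same way the importances are ranked below.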

Feature selection with a random forest

Build a random forest model with sklearn

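import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train / y_train are assumed to come from an earlier train/test split;
# X_train must be a pandas DataFrame so that .columns is available.
feat_labels = X_train.columns

forest = RandomForestClassifier(n_estimators=10,
                                random_state=2018,
                                n_jobs=-1)

forest.fit(X_train, y_train)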

Compute the feature importances, set an importance threshold, and keep the features above it

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
ele_important = []
for f in range(X_train.shape[1]):    # X_train.shape[1] = 85
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
    if importances[indices[f]] > 0.01:  # keep features whose importance exceeds 0.01
        ele_important.append(feat_labels[indices[f]])

 1) history_fail_fee               0.053606
 2) trans_fail_top_count_enum_last_1_month 0.052335
 3) loans_score                    0.043602
 4) latest_one_month_fail          0.023373
 5) apply_score                    0.021318
 6) trans_amount_3_month           0.019262
 7) historical_trans_amount        0.019216
 8) historical_trans_day           0.017932
 9) loans_overdue_count            0.017879
10) trans_day_last_12_month        0.016720
11) trans_amount_increase_rate_lately 0.016153
12) abs                            0.016059
13) first_transaction_day          0.016038
14) latest_query_time              0.015707
15) trans_days_interval_filter     0.015685
16) pawns_auctions_trusts_consume_last_6_month 0.015427
17) trans_fail_top_count_enum_last_12_month 0.015320
18) trans_fail_top_count_enum_last_6_month 0.015038
19) loans_avg_limit                0.015004
20) loans_latest_time              0.014227
21) trans_top_time_last_6_month    0.014110
22) consfin_avg_limit              0.014079
23) avg_price_last_12_month        0.013732
24) history_suc_fee                0.013427
25) loans_credit_limit             0.013246
26) transd_mcc                     0.013121
27) repayment_capability           0.012875
28) trans_activity_day             0.012699
29) query_sum_count                0.012671
30) first_transaction_time         0.012670
31) consfin_credibility            0.012485
32) number_of_trans_from_2011      0.012200
33) trans_activity_month           0.012156
34) apply_credibility              0.011999
35) max_cumulative_consume_later_1_month 0.011906
36) latest_six_month_apply         0.011830
37) consfin_max_limit              0.011702
38) loans_settle_count             0.011602
39) consfin_credit_limit           0.011546
40) loans_max_limit                0.011419
41) latest_six_month_loan          0.011267
42) rank_trad_1_month              0.011264
43) query_finance_count            0.011121
44) loans_credibility_limit        0.010998
45) avg_price_top_last_12_valid_month 0.010975
46) loans_credibility_behavior     0.010795
47) consume_mini_time_last_1_month 0.010740
48) consume_top_time_last_6_month  0.010649
49) pawns_auctions_trusts_consume_last_1_month 0.010599
50) loans_org_count_behavior       0.010529
51) loans_product_count            0.010267
52) loans_long_time                0.010188
53) trans_top_time_last_1_month    0.010051
54) middle_volume_percent          0.009383
55) query_org_count                0.009220
56) trans_days_interval            0.009068
57) latest_one_month_apply         0.008750
58) loans_count                    0.008037
59) latest_three_month_apply       0.007853
60) latest_three_month_loan        0.007771
61) take_amount_in_later_12_month_highest 0.007490
62) loans_org_count_current        0.007430
63) consume_top_time_last_1_month  0.007418
64) consfin_org_count_current      0.007333
65) top_trans_count_last_1_month   0.007296
66) consfin_product_count          0.007164
67) max_consume_count_later_6_month 0.006900
68) query_cash_count               0.006763
69) cross_consume_count_last_1_month 0.006758
70) consfin_org_count_behavior     0.006517
71) latest_one_month_suc           0.006140
72) low_volume_percent             0.005943
73) loans_cash_count               0.005465
74) latest_one_month_loan          0.005230
75) avg_consume_less_12_valid_month 0.003915
76) regional_mobility              0.003877
77) student_feature                0.002383
78) reg_preference_for_trad_一线城市   0.001258
79) reg_preference_for_trad_三线城市   0.001258
80) reg_preference_for_trad_二线城市   0.001069
81) reg_preference_for_trad_境外     0.000674
82) railway_consume_count_last_12_month 0.000492
83) is_high_user                   0.000328
84) jewelry_consume_count_last_6_month 0.000000
85) reg_preference_for_trad_其他城市   0.000000
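A note on reading these numbers: sklearn normalizes the impurity-based importances of a random forest so that they sum to 1. With 85 features the average importance is therefore 1/85, about 0.0118, so the 0.01 threshold sits just below the average. The sketch below checks this on synthetic data; the `make_classification` dataset is only a stand-in for the real training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption: 85 features, as in the post).
X, y = make_classification(n_samples=500, n_features=85, n_informative=10,
                           random_state=2018)

forest = RandomForestClassifier(n_estimators=10, random_state=2018, n_jobs=-1)
forest.fit(X, y)

importances = forest.feature_importances_
total = importances.sum()                  # normalized, so this is 1 up to rounding
mean_importance = importances.mean()       # therefore 1 / n_features
n_above = int((importances > 0.01).sum())  # features that pass the 0.01 threshold

print(f"sum={total:.6f}, mean={mean_importance:.6f}, kept={n_above}")
```

Because the importances share a fixed total, a fixed threshold such as 0.01 becomes stricter as the number of features grows. A threshold relative to the mean may be easier to carry over to other datasets.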

Keep only the important features in X_train and X_test (the 53 features with importance above 0.01 in the listing above)

X_train = X_train[ele_important]
X_test = X_test[ele_important]
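The same thresholding can also be expressed with sklearn's `SelectFromModel`, which fits the estimator and applies the threshold in one object. This is an alternative sketch on synthetic data, not the code used in this post. Note that `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, while the loop above uses a strict `>`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the real X_train / X_test come from the post's dataset).
X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018)

forest = RandomForestClassifier(n_estimators=10, random_state=2018, n_jobs=-1)
selector = SelectFromModel(forest, threshold=0.01)
selector.fit(X_train, y_train)  # fits a clone of the forest on the training data only

# Apply the same column mask to both splits.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
mask = selector.get_support()   # boolean array, True for kept features

print(X_train.shape, "->", X_train_sel.shape)
```

Fitting the selector on the training split and only transforming the test split keeps the test data out of the selection step, which matches how `ele_important` is derived from `X_train` above.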

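Feature selection is done for now; it still needs refinement and a deeper look.

To do: standardize the data

Standardize the data (zero mean and unit variance)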

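import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and std from the training set only

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# X_train: mean is 0 and std is 1, up to floating-point error.
# X_test: only approximately 0 and 1, because the scaler was fit on X_train.
print("mean:", np.mean(X_train, axis=0),
      np.mean(X_test, axis=0))
print("std:", np.std(X_train, axis=0),
      np.std(X_test, axis=0))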
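To make explicit what `StandardScaler` computes, the sketch below reproduces the transform by hand as z = (x - mean_train) / std_train, using the population standard deviation (NumPy's default `ddof=0`). The random data is only an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2018)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 3))

scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# The same transform by hand, using statistics from the training set only.
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

train_mean = X_train_std.mean(axis=0)  # zero up to floating-point error
train_std = X_train_std.std(axis=0)    # one up to floating-point error
test_mean = X_test_std.mean(axis=0)    # near zero, but not exactly zero

print("train mean:", train_mean)
print("train std: ", train_std)
print("test mean: ", test_mean)
```

The test set is scaled with the training statistics, so its mean and standard deviation land near 0 and 1 without matching them exactly.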

--- End ---
