DataWhale One-Week Advanced Algorithms 2---Feature Engineering (work in progress, still being improved)

大力壮壮

Category: Algorithm Projects

Article link: https://blog.csdn.net/weixin_40363627/article/details/85987930

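I. Task

Feature selection: select features with IV (Information Value) and with a random forest, separately. Then evaluate the 7 models from [Algorithm Practice] (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM).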

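II. Feature Engineering

1. Feature engineering

Feature engineering is the process of turning raw data into training data for a model. Its goal is to obtain better training features: the data and features set the upper bound on what a machine learning model can achieve, and better features let the model get closer to that bound.

It consists of three parts: feature construction, feature extraction, and feature selection.

Feature extraction and feature selection both aim to find the most effective features among the original ones. The difference is that feature extraction derives a set of features with clear physical or statistical meaning by transforming the original features, whereas feature selection picks a subset with clear physical or statistical meaning out of the existing feature set.

Both help reduce feature dimensionality and data redundancy. Feature extraction can sometimes uncover more meaningful feature attributes, while the process of feature selection often reveals how important each feature is for building the model.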
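
To make the contrast concrete, here is a minimal NumPy sketch on made-up data. PCA computed through an SVD stands in for feature extraction (PCA is only one example technique, not one the text prescribes), and plain column indexing stands in for feature selection. The array `X` and the column indices in `keep` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 samples, 5 original features

# Feature extraction: build 2 NEW features as linear combinations of all 5
Xc = X - X.mean(axis=0)                # center each column
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T            # projections onto the top-2 principal axes

# Feature selection: keep 2 of the ORIGINAL columns, unchanged
keep = [0, 3]                          # hypothetical choice of columns
X_selected = X[:, keep]

print(X_extracted.shape, X_selected.shape)
```

Both results have two columns, but only the selected ones are still the original measurements; the extracted ones mix every input column.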

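2. Feature selection

After features have been extracted, not all of them necessarily help with classification. Feature selection is then used to pick the relatively important features for building the classifier. Features are usually judged from two angles:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to 0, the samples barely differ on it, so it is of little use for telling samples apart.

Correlation between the feature and the target: this one is fairly obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, all other methods covered in this article are based on correlation.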
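
The two criteria can be sketched with pandas on toy data. The column names, the variance threshold `1e-8`, and the use of absolute Pearson correlation are all assumptions made for illustration; they are not taken from the article's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="y")    # toy binary target
df = pd.DataFrame({
    "constant": np.ones(n),                             # zero variance: does not diverge
    "noise": rng.normal(size=n),                        # diverges but is unrelated to y
    "signal": y + rng.normal(scale=0.3, size=n),        # diverges and tracks y
})

# Criterion 1: drop features whose variance is (almost) zero
variances = df.var()
divergent = variances[variances > 1e-8].index           # hypothetical threshold

# Criterion 2: rank the remaining features by |correlation| with the target
correlations = df[divergent].corrwith(y).abs().sort_values(ascending=False)

print(variances)
print(correlations)
```

Here `constant` is removed by the variance check, and `signal` ranks above `noise` on correlation.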

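3. Feature selection: IV

IV (Information Value) is a common metric in scorecard models. It is widely used in financial risk control and comes up often in the context of feature selection.

IV measures how strongly a feature influences the target. The basic idea is to compare, bin by bin, the share of bad (black) and good (white) samples that the feature captures with the overall shares of bad and good samples. With $bad_i$ and $good_i$ the bad and good counts in bin $i$, and $bad_T$ and $good_T$ the overall totals, the formula is:

$$WOE_i = \ln\left(\frac{bad_i / bad_T}{good_i / good_T}\right), \qquad IV = \sum_i \left(\frac{bad_i}{bad_T} - \frac{good_i}{good_T}\right) WOE_i$$

Once the IV of a feature has been computed, how should its predictive power be interpreted? In other words, how large does IV have to be before the feature is selected? Here is a rule-of-thumb reference table: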

IV	Predictive power
< 0.02	None
0.02 ~ 0.1	Weak
0.1 ~ 0.3	Medium
0.3 ~ 0.5	Strong
> 0.5	Suspicious

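As the table shows, a larger IV is not always better. When IV exceeds 0.5 the feature deserves suspicion, because it looks too good to be true. Features with an IV between 0.1 and 0.5 are usually the ones selected. Details vary by scenario: some risk-control teams also consider features with IV above 0.05, while some in academia hold that the 0.1~0.3 range is the better choice.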
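
As a small worked example, the sketch below computes IV for one numeric feature and reads it against the rule-of-thumb table. The helper names `iv_from_bins` and `iv_label`, the choice of 5 equal-frequency bins, and the synthetic data are assumptions for illustration only. A bin that contains only good or only bad samples would give an infinite WOE; the sketch does not guard against that.

```python
import numpy as np
import pandas as pd


def iv_from_bins(feature, target, bins=5):
    """IV of a numeric feature over equal-frequency bins (target: 1 = bad, 0 = good)."""
    bucket = pd.qcut(feature, bins, duplicates="drop")
    grp = (pd.DataFrame({"bucket": bucket, "y": target})
           .groupby("bucket", observed=True)["y"]
           .agg(["sum", "count"]))
    bad_share = grp["sum"] / grp["sum"].sum()          # bad_i / bad_T
    good = grp["count"] - grp["sum"]
    good_share = good / good.sum()                     # good_i / good_T
    woe = np.log(bad_share / good_share)               # infinite if a bin is pure
    return float(((bad_share - good_share) * woe).sum())


def iv_label(iv):
    """Rule-of-thumb reading of an IV value, following the reference table."""
    if iv < 0.02:
        return "none"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"


rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=1000))           # toy labels
x = pd.Series(rng.normal(size=1000)) + 0.8 * y         # toy feature shifted by the label

iv = iv_from_bins(x, y)
print(round(iv, 3), iv_label(iv))
```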

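4. Feature selection: random forest

Among the many feature selection methods, the feature importance attribute of a random forest model can be used to filter features and to gauge how relevant each one is to the classification. Because a random forest is inherently random, the model may assign different importance weights to the features on each run. Training it several times gets around this: each round keeps only the intersection of that round's top features with the features kept from the previous rounds. After a fixed number of rounds, what remains is a set of features that contribute substantially to the classification task.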
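
A self-contained sketch of this repeat-and-intersect idea on synthetic data follows. The dataset from scikit-learn's `make_classification`, the feature names `f0` to `f11`, `top_k = 5`, the five seeds, and `n_estimators=50` are all assumptions chosen to keep the example small; they do not come from the article's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones
X_arr, y = make_classification(n_samples=300, n_features=12, n_informative=3,
                               n_redundant=0, shuffle=False, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

top_k = 5                                   # hypothetical number of features kept per round
selected = set(X.columns)                   # start from every feature
for seed in range(5):                       # a different seed gives a different forest
    rfc = RandomForestClassifier(n_estimators=50, random_state=seed)
    rfc.fit(X, y)
    order = np.argsort(rfc.feature_importances_)[::-1]    # most important first
    selected &= set(X.columns[order[:top_k]])              # keep only features seen every round
    print(seed, sorted(selected))
```

The set can only shrink from round to round, so features that reach the top `top_k` in some runs but not others are filtered out.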

III. Code Implementation

1. Python implementation of IV

import numpy as np
import pandas as pd
from scipy import stats


def _add_woe_iv(d3, bad, good):
    """Append bad_rate, group_rate, woe, iv and iv_sum to a bin table that
    already has 'sum' (bad count) and 'total' (sample count) columns."""
    d3 = d3.copy()
    d3['bad_rate'] = d3['sum'] / d3['total']
    d3['group_rate'] = d3['total'] / (bad + good)
    # WOE_i = ln((bad_i / good_i) / (bad / good)); it is infinite for a bin
    # that holds only good or only bad customers
    d3['woe'] = np.log((d3['bad_rate'] / (1 - d3['bad_rate'])) / (bad / good))
    d3['iv'] = (d3['sum'] / bad - (d3['total'] - d3['sum']) / good) * d3['woe']
    d3['iv_sum'] = d3['iv'].sum()
    return d3


def _bin_stats(d1, key):
    """Per-bin min/max of X, bad count ('sum') and sample count ('total')."""
    g = d1.groupby(key, observed=True)      # observed=True skips empty bins
    d3 = pd.DataFrame({'min': g['X'].min(), 'max': g['X'].max(),
                       'sum': g['Y'].sum(), 'total': g['Y'].count()})
    return d3.reset_index(drop=True)


def _encode(X, d3):
    """Map each value of X to the WOE of the bin [min_i, min_{i+1}) it falls in."""
    woe = list(d3['woe'].round(6))
    # Bin edges are left unrounded so that no value falls outside its bin
    cut = list(d3['min']) + [d3['max'].iloc[-1] + 1]
    codes = pd.cut(X, cut, right=False, labels=False)    # integer bin index
    x_woe = codes.map(dict(enumerate(woe)))
    return cut, woe, x_woe


# DF is the loaded data, Y the dependent variable and X the independent
# variable (both passed as pandas Series)
def woe_single(DF, Y, X):
    # DF is unused; it is kept only so the original call signature still works
    bad = Y.sum()              # number of bad customers (Y == 1 is assumed to mean bad)
    good = Y.count() - bad     # number of good customers

    if X.nunique() > 7:        # treat X as continuous
        # 1) Monotonic binning: shrink the number of quantile bins from 6 until
        #    the bin means of X and Y are perfectly rank-correlated
        for n in range(6, 1, -1):
            d1 = pd.DataFrame({'X': X, 'Y': Y,
                               'Bucket': pd.qcut(X, n, duplicates='drop')})
            means = d1.groupby('Bucket', observed=True)[['X', 'Y']].mean()
            if len(means) < 2:
                break
            r, _ = stats.spearmanr(means['X'], means['Y'])
            if np.isclose(abs(r), 1.0):
                d3 = _add_woe_iv(_bin_stats(d1, 'Bucket'), bad, good)
                iv = d3['iv_sum'].iloc[0]
                if iv != 0.0:
                    cut, woe, x_woe = _encode(X, d3)
                    return d3, cut, woe, iv, x_woe
                break

        # 2) Fallback: 10000 equal-width bins, then repeatedly merge the least
        #    populated bin into a neighbour until only 4 bins remain
        dn1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, 10000)})
        dn3 = _bin_stats(dn1, 'Bucket')
        while len(dn3) > 4:
            i = dn3['total'].idxmin()
            j = i - 1 if i != 0 else i + 1    # merge left, or right for the first row
            dn3.loc[j, 'min'] = min(dn3.loc[i, 'min'], dn3.loc[j, 'min'])
            dn3.loc[j, 'max'] = max(dn3.loc[i, 'max'], dn3.loc[j, 'max'])
            dn3.loc[j, 'sum'] += dn3.loc[i, 'sum']
            dn3.loc[j, 'total'] += dn3.loc[i, 'total']
            dn3 = dn3.drop(i).reset_index(drop=True)
        dn3 = _add_woe_iv(dn3, bad, good)
        cut, woe, x_woe = _encode(X, dn3)
        return dn3, cut, woe, dn3['iv_sum'].iloc[0], x_woe

    # X has at most 7 distinct values: start with one bin per value
    g = pd.DataFrame({'X': X, 'Y': Y}).groupby('X')['Y']
    d3 = pd.DataFrame({'sum': g.sum(), 'total': g.count()})
    values = [[v] for v in d3.index]    # original values covered by each row
    d3 = d3.reset_index(drop=True)
    # Walk from the last row up and merge any bin with no bad or no good
    # customers into its neighbour, so that WOE stays finite
    c = len(d3) - 1
    while c >= 0 and len(d3) > 1:
        if d3.loc[c, 'sum'] == 0 or d3.loc[c, 'sum'] == d3.loc[c, 'total']:
            t = c - 1 if c > 0 else 1     # the first row merges into the second
            d3.loc[t, 'sum'] += d3.loc[c, 'sum']
            d3.loc[t, 'total'] += d3.loc[c, 'total']
            values[t] = values[t] + values[c]
            values.pop(c)
            d3 = d3.drop(c).reset_index(drop=True)
        c -= 1

    d3['min'] = [min(v) for v in values]
    d3['max'] = [max(v) for v in values]
    d3 = _add_woe_iv(d3, bad, good)
    d3 = d3[['min', 'max', 'sum', 'total', 'bad_rate', 'group_rate', 'woe', 'iv', 'iv_sum']]

    woe = list(d3['woe'].round(6))
    cut = values    # for discrete X, cut lists the original values in each bin
    x_woe = X.map({v: w for vals, w in zip(values, woe) for v in vals})
    return d3, cut, woe, d3['iv_sum'].iloc[0], x_woe

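2. Feature selection: random forest implementation

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train (a pandas DataFrame) and y_train are assumed to have been prepared beforehand
selected_feat_names = set(X_train.columns)    # start from all features, then intersect
for i in range(10):                           # repeat ten times and keep the intersection
    tmp = set()
    rfc = RandomForestClassifier(n_jobs=-1)
    rfc.fit(X_train, y_train)
    print("training finished")

    importances = rfc.feature_importances_
    indices = np.argsort(importances)[::-1]   # sort in descending order
    for f in range(X_train.shape[1]):
        if f < 50:                            # keep the 50 most important features
            tmp.add(X_train.columns[indices[f]])
        print("%2d) %-*s %f" % (f + 1, 30, X_train.columns[indices[f]], importances[indices[f]]))

    selected_feat_names &= tmp
    print(len(selected_feat_names), "features are selected")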

Results

training finished
 1) jewelry_consume_count_last_6_month 0.682672
 2) history_suc_fee                0.033073
 3) cross_consume_count_last_1_month 0.032572
 4) latest_one_month_suc           0.014849
 5) latest_six_month_apply         0.012684
 6) apply_score                    0.009905
 7) rank_trad_1_month              0.009350
 8) trans_fail_top_count_enum_last_1_month 0.008200
 9) historical_trans_day           0.006529
10) history_fail_fee               0.005333
11) trans_days_interval            0.005151
12) consfin_org_count_current      0.004909
13) consume_mini_time_last_1_month 0.004899
14) Unnamed: 0                     0.004724
15) loans_count                    0.004721
16) latest_six_month_loan          0.004601
17) first_transaction_time         0.004511
18) trans_amount_increase_rate_lately 0.004307
19) trans_top_time_last_1_month    0.003979
20) loans_settle_count             0.003939
21) loans_latest_time              0.003826
22) loans_credibility_behavior     0.003769
23) consfin_max_limit              0.003755
24) latest_one_month_loan          0.003477
25) student_feature                0.003464
26) trans_days_interval_filter     0.003419
27) number_of_trans_from_2011      0.003349
28) middle_volume_percent          0.003332

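IV. Open Issues

1. I will keep looking into the IV problems.
2. The random forest result is this bad: a single feature takes about 0.68 of the total importance, and `Unnamed: 0`, which looks like a saved index column, shows up among the features. The data preprocessing was probably wrong from the start. Keep going.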
