Kaggle Kernel Study: Home Credit Default Risk - Feature Engineering & Baseline

Having covered the basics of data analysis, I want to consolidate that knowledge through hands-on practice and get started with data competitions. This series shares some introductory kernels from Kaggle, along with the resources I used and my learning process.

Home Credit Default Risk

Kernel: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
Author's profile: https://www.kaggle.com/willkoehrsen
Competition background: Home Credit uses a variety of alternative data, including telecom and transaction records, to predict its clients' repayment ability.
Data mirror: https://pan.baidu.com/s/1uaMESw1ca_Y9O3YVfrSrEA
Extraction code: d13i
Content: mostly follows the original author, combined with my own understanding of some of the topics and notes from actually running the code; suitable for beginners.

 

Previous post:

Kaggle Kernel Study: Home Credit Default Risk - EDA

In the previous post we explored the variables with the strongest positive and negative correlations with the target, along with the factors behind them. This post turns to feature engineering and some related machine-learning work.

 

Feature engineering is a general process that can include feature construction (adding new features derived from the existing data) and feature selection (keeping only the most important features, or applying other dimensionality-reduction methods). Many techniques exist for creating and selecting features. This post builds and tunes features along two lines: PolynomialFeatures and domain knowledge features; a small sketch of a selection filter follows below.
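The rest of this post focuses on construction; as a minimal sketch of the selection side (the helper top_k_by_target_corr is hypothetical, not from the original kernel), a simple univariate filter ranks columns by absolute correlation with the target:

import pandas as pd

def top_k_by_target_corr(df, target='TARGET', k=20):
    # rank numeric columns by absolute Pearson correlation with the target
    # and return the k strongest; a crude univariate filter
    corrs = df.select_dtypes('number').corr()[target].drop(target)
    return corrs.abs().sort_values(ascending=False).head(k).index.tolist()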

1. Polynomial feature generation with PolynomialFeatures

In this method we create new features that are powers of the existing strongly correlated features, plus interaction terms between them. For example, we can create the variables EXT_SOURCE_1^2 and EXT_SOURCE_2^2, as well as products such as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on. Features combined from several individual variables are called interaction terms because they capture interactions between those variables. In other words, even if two variables individually have little influence on the target, combining them into one interaction variable may reveal a relationship with the target.
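For instance, with two features and degree 2 (made-up numbers, purely illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])           # one sample with two features a and b
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]] -> 1, a, b, a^2, a*b, b^2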

 

# extract the strongly correlated features plus TARGET
poly_features_train = data_train[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH','TARGET']]
poly_features_test = data_test[['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH']]
# keep TARGET separately
poly_target = poly_features_train['TARGET']
# drop TARGET so train / test have the same columns
poly_features_train = poly_features_train.drop(columns=['TARGET'])

Before applying the transformation, fill in the missing values with Imputer.

from sklearn.preprocessing import Imputer
# median imputation
impt = Imputer(strategy='median')
# fit_transform learns the medians on the training set and fills it in one step;
# the test set is filled with transform so it reuses the training medians.
# Note that the result is a plain numpy array, not a DataFrame
poly_features_train = impt.fit_transform(poly_features_train)
poly_features_test = impt.transform(poly_features_test)
poly_features_train
array([[8.30369674e-02, 2.62948593e-01, 1.39375780e-01, 9.46100000e+03],
       [3.11267311e-01, 6.22245775e-01, 5.35276250e-01, 1.67650000e+04],
       [5.05997931e-01, 5.55912083e-01, 7.29566691e-01, 1.90460000e+04],
       ...,
       [7.44026400e-01, 5.35721752e-01, 2.18859082e-01, 1.49660000e+04],
       [5.05997931e-01, 5.14162820e-01, 6.61023539e-01, 1.19610000e+04],
       [7.34459669e-01, 7.08568896e-01, 1.13922396e-01, 1.68560000e+04]])
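A note on versions: sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. On current scikit-learn the equivalent of the block above (a direct substitution sketch) is:

from sklearn.impute import SimpleImputer

impt = SimpleImputer(strategy='median')
poly_features_train = impt.fit_transform(poly_features_train)  # learn medians on train
poly_features_test = impt.transform(poly_features_test)        # reuse those medians on test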
Apply PolynomialFeatures to transform the data.
from sklearn.preprocessing import PolynomialFeatures

# generate all polynomial and interaction terms up to degree 3
poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features_train)
poly_features_train = poly_transformer.transform(poly_features_train)
poly_features_test = poly_transformer.transform(poly_features_test)

The transform returns a plain array, so the original column names are lost and need to be rebuilt.

# get_feature_names() lists the polynomial terms generated at degree = 3
# (on scikit-learn >= 1.0 this method is get_feature_names_out());
# the input names must match the column order of the DataFrame built above
poly_transformer.get_feature_names(input_features=['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])
['1',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'DAYS_BIRTH',
 'EXT_SOURCE_1^2',
 'EXT_SOURCE_1 EXT_SOURCE_2',
 'EXT_SOURCE_1 EXT_SOURCE_3',
 'EXT_SOURCE_1 DAYS_BIRTH',
 'EXT_SOURCE_2^2',
 'EXT_SOURCE_2 EXT_SOURCE_3',
 'EXT_SOURCE_2 DAYS_BIRTH',
 'EXT_SOURCE_3^2',
 'EXT_SOURCE_3 DAYS_BIRTH',
 'DAYS_BIRTH^2',
 'EXT_SOURCE_1^3',
 'EXT_SOURCE_1^2 EXT_SOURCE_2',
 'EXT_SOURCE_1^2 EXT_SOURCE_3',
 'EXT_SOURCE_1^2 DAYS_BIRTH',
 'EXT_SOURCE_1 EXT_SOURCE_2^2',
 'EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3',
 'EXT_SOURCE_1 EXT_SOURCE_2 DAYS_BIRTH',
 'EXT_SOURCE_1 EXT_SOURCE_3^2',
 'EXT_SOURCE_1 EXT_SOURCE_3 DAYS_BIRTH',
 'EXT_SOURCE_1 DAYS_BIRTH^2',
 'EXT_SOURCE_2^3',
 'EXT_SOURCE_2^2 EXT_SOURCE_3',
 'EXT_SOURCE_2^2 DAYS_BIRTH',
 'EXT_SOURCE_2 EXT_SOURCE_3^2',
 'EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH',
 'EXT_SOURCE_2 DAYS_BIRTH^2',
 'EXT_SOURCE_3^3',
 'EXT_SOURCE_3^2 DAYS_BIRTH',
 'EXT_SOURCE_3 DAYS_BIRTH^2',
 'DAYS_BIRTH^3']
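As a sanity check on the count: with 4 inputs and degree 3, PolynomialFeatures produces one column per monomial of total degree at most 3, i.e. C(4 + 3, 3) = 35 columns including the bias term '1' (math.comb needs Python 3.8+):

from math import comb

# monomials of total degree <= 3 in 4 variables, bias term included
print(comb(4 + 3, 3))  # 35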
# rebuild DataFrames from the arrays using the generated feature names
poly_features_train = pd.DataFrame(poly_features_train,
                                   columns=poly_transformer.get_feature_names([
                                       'EXT_SOURCE_1', 'EXT_SOURCE_2',
                                       'EXT_SOURCE_3', 'DAYS_BIRTH'
                                   ]))
poly_features_test = pd.DataFrame(poly_features_test,
                                  columns=poly_transformer.get_feature_names([
                                      'EXT_SOURCE_1', 'EXT_SOURCE_2',
                                      'EXT_SOURCE_3', 'DAYS_BIRTH'
                                  ]))
poly_features_train.head()
Check how the new features correlate with the target.
# put the target back in
poly_features_train['TARGET'] = poly_target
# correlations of every feature with the target
poly_corrs = poly_features_train.corr()['TARGET'].sort_values()
# look at both extremes
print('head:\n', poly_corrs.head(5))
print('\ntail:\n', poly_corrs.tail(5))
head:
EXT_SOURCE_2 EXT_SOURCE_3                -0.193939
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3   -0.189605
EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH     -0.181283
EXT_SOURCE_2^2 EXT_SOURCE_3              -0.176428
EXT_SOURCE_2 EXT_SOURCE_3^2              -0.172282
Name: TARGET, dtype: float64

tail:
DAYS_BIRTH     -0.078239
DAYS_BIRTH^2   -0.076672
DAYS_BIRTH^3   -0.074273
TARGET          1.000000
1                    NaN
Name: TARGET, dtype: float64
Several interaction terms correlate with TARGET noticeably more strongly than the raw DAYS_BIRTH does (about -0.19 vs -0.078), so they may add signal; the constant column '1' has zero variance, hence its NaN correlation. Now merge the new features back into the original DataFrames.
# use SK_ID_CURR as the join key
poly_features_train['SK_ID_CURR'] = data_train['SK_ID_CURR']
poly_features_test['SK_ID_CURR'] = data_test['SK_ID_CURR']

# the on parameter specifies the key column for the merge
data_train_poly = data_train.merge(poly_features_train, on='SK_ID_CURR', how='left')
data_test_poly = data_test.merge(poly_features_test, on='SK_ID_CURR', how='left')
# align the columns of the two frames (inner join keeps only shared columns)
data_train_poly, data_test_poly = data_train_poly.align(data_test_poly, join='inner', axis=1)
# check the dimensions
print('Training set shape after polynomial features: ', data_train_poly.shape)
print('Testing set shape after polynomial features: ', data_test_poly.shape)
Training set shape after polynomial features:  (307511, 275)
Testing set shape after polynomial features:  (48744, 275)
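A toy illustration (made-up frames, not competition data) of what align with join='inner' along axis=1 does: only columns present in both frames survive, which is why TARGET disappears from data_train_poly here (both aligned frames end up with the same 275 columns) and the labels must be kept aside separately:

import pandas as pd

a = pd.DataFrame({'x': [1], 'y': [2], 'TARGET': [0]})
b = pd.DataFrame({'x': [3], 'z': [4]})
a2, b2 = a.align(b, join='inner', axis=1)
print(list(a2.columns))  # ['x'] - only the shared column survives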

 

2. Domain knowledge features

CREDIT_INCOME_PERCENT: the credit amount as a percentage of the client's income
ANNUITY_INCOME_PERCENT: the loan annuity as a percentage of the client's income
CREDIT_TERM: the length of the payment in months (since the annuity is the monthly amount due)
DAYS_EMPLOYED_PERCENT: days employed as a percentage of the client's age in days
data_train_domain = data_train.copy()
data_test_domain = data_test.copy()

# construct the four ratio features on each set

# training set
data_train_domain['CREDIT_INCOME_PERCENT'] = data_train_domain['AMT_CREDIT'] / data_train_domain['AMT_INCOME_TOTAL']
data_train_domain['ANNUITY_INCOME_PERCENT'] = data_train_domain['AMT_ANNUITY'] / data_train_domain['AMT_INCOME_TOTAL']
data_train_domain['CREDIT_TERM'] = data_train_domain['AMT_ANNUITY'] / data_train_domain['AMT_CREDIT']
data_train_domain['DAYS_EMPLOYED_PERCENT'] = data_train_domain['DAYS_EMPLOYED'] / data_train_domain['DAYS_BIRTH']

# test set
data_test_domain['CREDIT_INCOME_PERCENT'] = data_test_domain['AMT_CREDIT'] / data_test_domain['AMT_INCOME_TOTAL']
data_test_domain['ANNUITY_INCOME_PERCENT'] = data_test_domain['AMT_ANNUITY'] / data_test_domain['AMT_INCOME_TOTAL']
data_test_domain['CREDIT_TERM'] = data_test_domain['AMT_ANNUITY'] / data_test_domain['AMT_CREDIT']
data_test_domain['DAYS_EMPLOYED_PERCENT'] = data_test_domain['DAYS_EMPLOYED'] / data_test_domain['DAYS_BIRTH']
plt.figure(figsize = (8, 16))
# iterate over the four new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    # one subplot per feature
    plt.subplot(4, 1, i + 1)

    # loans repaid on time
    sns.kdeplot(data_train_domain.loc[data_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    # loans that defaulted
    sns.kdeplot(data_train_domain.loc[data_train_domain['TARGET'] == 1, feature], label = 'target == 1')

    # titles and axis labels
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature);
    plt.ylabel('Density');

plt.tight_layout(h_pad = 2.5)

[Figure: KDE distributions of CREDIT_INCOME_PERCENT, ANNUITY_INCOME_PERCENT, CREDIT_TERM and DAYS_EMPLOYED_PERCENT, split by TARGET value]

# correlations of the four domain features with the target
data_corr = data_train_domain[['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT','TARGET']]
data_corr.corr()['TARGET']
CREDIT_INCOME_PERCENT    -0.007727
ANNUITY_INCOME_PERCENT    0.014265
CREDIT_TERM               0.012704
DAYS_EMPLOYED_PERCENT     0.067955
TARGET                    1.000000
Name: TARGET, dtype: float64
A glance at the correlation coefficients suggests it is still hard to say whether these features will actually help...

 

BASELINE

Some of the processing so far was only meant to explore the data and did not fully preprocess it, so we still need to run a proper preprocessing pipeline first.

# import the relevant modules: scaling and imputation
# (on scikit-learn >= 0.22 use SimpleImputer from sklearn.impute instead)
from sklearn.preprocessing import MinMaxScaler, Imputer

# separate out TARGET; keep the labels for fitting the model later
train_labels = data_train['TARGET']
if 'TARGET' in data_train:
    train = data_train.drop(columns = ['TARGET'])
else:
    train = data_train.copy()

# test set
test = data_test.copy()
# median imputation
imputer = Imputer(strategy = 'median')

# rescale each feature to the 0-1 range
scaler = MinMaxScaler(feature_range = (0, 1))

# fit on the training data only
imputer.fit(train)
scaler.fit(train)
# apply the imputer to both sets
train = imputer.transform(train)
test = imputer.transform(test)

# apply the MinMaxScaler to both sets
train = scaler.transform(train)
test = scaler.transform(test)
print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
Training data shape:  (307511, 240)
Testing data shape:  (48744, 240)

 

Make a baseline prediction.
from sklearn.linear_model import LogisticRegression

# no hyperparameter search; just a heavily regularized model (small C)
log_reg = LogisticRegression(C = 0.0001)

# fit on the training data
log_reg.fit(train, train_labels)
LogisticRegression(C=0.0001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)
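The competition is scored on ROC AUC, so it helps to estimate it locally before submitting. A minimal sketch (not part of the original excerpt) using 3-fold cross-validation:

from sklearn.model_selection import cross_val_score

# cross-validated ROC AUC of the baseline model on the training data
cv_auc = cross_val_score(log_reg, train, train_labels, scoring='roc_auc', cv=3)
print('CV ROC AUC: %.4f' % cv_auc.mean())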
 
# predict_proba() gives the probability of each class;
# column 1 is the probability of TARGET == 1, i.e. the probability of default
log_reg_pred = log_reg.predict_proba(test)[:, 1]
# build the submission DataFrame (.copy() avoids a SettingWithCopyWarning)
submit = data_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred
submit.head()
	SK_ID_CURR	TARGET
0	100001		0.087750
1	100005		0.163957
2	100013		0.110238
3	100028		0.076575
4	100038		0.154924
# save to csv
submit.to_csv('log_reg_baseline.csv', index = False)
Concrete model selection and improvement will come later, once I have read more about the algorithms, so as not to mislead anyone.

 
