Predicting Customer Loan Repayment

Goal: predict from the application form whether an applicant will repay the loan on time.

1. Loading the Data

import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
import os
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.font_manager import FontProperties
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # font that can render CJK labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
app_train = pd.read_csv('application_train.csv')
app_train.head()


# check the dataset size
app_train.shape


2. Data Exploration

2.1 Missing Values

def missing_value_table(df):
    # total missing values per column
    mis_val = df.isnull().sum()
    # percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)
    # combine into one table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_rename = mis_val_table.rename(columns={0: 'Missing Count', 1: 'Missing Percent'})
    # drop complete columns and sort by missing percentage
    mis_val_rename = mis_val_rename[mis_val_rename.iloc[:, 1] != 0].sort_values('Missing Percent', ascending=False)
    return mis_val_rename

missing_value_table(app_train)[:10]

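As a quick sanity check of `missing_value_table`, running it on a tiny hypothetical frame (made-up columns `a`, `b`, `c`) shows the complete column being dropped and the rest sorted by missing percentage:

```python
import numpy as np
import pandas as pd

def missing_value_table(df):
    # total missing values per column and their percentage
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    table = pd.concat([mis_val, mis_val_percent], axis=1)
    table = table.rename(columns={0: 'Missing Count', 1: 'Missing Percent'})
    # keep only columns with missing values, most-missing first
    return table[table.iloc[:, 1] != 0].sort_values('Missing Percent', ascending=False)

toy = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],  # 50% missing
    'b': [1, np.nan, 3, 4],       # 25% missing
    'c': [1, 2, 3, 4],            # complete -> excluded from the table
})
result = missing_value_table(toy)
# 'c' is dropped, and 'a' (50%) sorts above 'b' (25%)
```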

2.2 Handling object Columns

Rule: for every object column, use one-hot encoding when it has more than two unique values; otherwise use a LabelEncoder.

# inspect the column dtypes
app_train.dtypes.value_counts()


le = LabelEncoder()
for col in app_train:
    if app_train[col].dtype == 'object' and app_train[col].nunique() <= 2:
        # binary categorical column: label encode in place
        app_train[col] = le.fit_transform(app_train[col])
# one-hot encode the remaining object columns
app_train = pd.get_dummies(app_train)
app_train.shape

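The cardinality rule can be seen on a hypothetical toy frame (column names are made up): the two-level column stays a single integer column, while the three-level column expands into dummy columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'flag': ['Y', 'N', 'Y', 'N'],                        # 2 unique values -> label encode
    'contract': ['cash', 'revolving', 'cash', 'other'],  # 3 unique values -> one-hot
})
le = LabelEncoder()
for col in toy:
    if toy[col].dtype == 'object' and toy[col].nunique() <= 2:
        toy[col] = le.fit_transform(toy[col])
toy = pd.get_dummies(toy)
# 'flag' is now one 0/1 column; 'contract' became 3 dummy columns -> 4 columns total
```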

3. Data Analysis

3.1 Age

train_labels = app_train['TARGET']
(app_train['DAYS_BIRTH']/-365).describe()

The average applicant is a little over 40 years old.

plt.figure(figsize=(18,12))
plt.hist(app_train['DAYS_BIRTH']/-365,edgecolor='k',bins=25)
plt.show()

Applicant ages range from roughly 20 to 70, with the largest group around 40.

# compare the age distributions of on-time and late payers
plt.figure(figsize=(15,11))
sns.kdeplot(app_train.loc[app_train['TARGET']==0,'DAYS_BIRTH']/-365, label='repaid on time')
sns.kdeplot(app_train.loc[app_train['TARGET']==1,'DAYS_BIRTH']/-365, label='did not repay on time')
plt.legend()
plt.show()


3.2 Correlations

# correlation of each feature with the target, to find the most related attributes
correlations = app_train.corr()['TARGET'].sort_values(ascending = False)
correlations.head()   # TARGET: 0 = repaid on time, 1 = did not repay


ext_data = app_train[['TARGET','DAYS_BIRTH','DAYS_EMPLOYED','REGION_RATING_CLIENT_W_CITY','REGION_RATING_CLIENT']]
ext_data_corrs = ext_data.corr()
ext_data_corrs


plt.figure(figsize=(15,10))
sns.heatmap(ext_data_corrs,cmap = plt.cm.RdYlBu_r,annot=True)
plt.show()


3.3 Employment Length

# employment length in years
(app_train['DAYS_EMPLOYED']/-365).describe()

The mean works out to 174 years of employment, which is impossible, so there must be outliers.

app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243  # flag the sentinel outlier
app_train['DAYS_EMPLOYED'].replace({365243:np.nan},inplace=True)        # replace the sentinel with NaN
app_train['DAYS_EMPLOYED'].plot.hist()
plt.show()

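Why a single sentinel wrecks the mean can be seen with made-up numbers: one 365243 placeholder dominates an otherwise ordinary sample, and excluding it (as the NaN replacement does) restores a plausible average.

```python
from statistics import mean

# made-up DAYS_EMPLOYED values (negative = days before application),
# plus the 365243 sentinel that stands in for "missing"
days_employed = [-1000, -2000, -3000, 365243]

raw_mean_years = mean(d / -365 for d in days_employed)
clean = [d for d in days_employed if d != 365243]  # drop the sentinel, as NaN would
clean_mean_years = mean(d / -365 for d in clean)
# the raw mean is pulled to about -246 "years"; the cleaned mean is ~5.5 years
```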

4. Feature Engineering

4.1 Missing-Value Imputation

poly_features = app_train[['TARGET','DAYS_BIRTH','DAYS_EMPLOYED','REGION_RATING_CLIENT_W_CITY','REGION_RATING_CLIENT']].copy()
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing

# fill missing values with the column median
imputer = SimpleImputer(strategy='median')
poly_target = poly_features['TARGET']
poly_features = poly_features.drop(columns=['TARGET'])
poly_features = imputer.fit_transform(poly_features)

4.2 Polynomial Features

poly_transformer = PolynomialFeatures(degree=3)
poly_transformer.fit(poly_features)
poly_features = poly_transformer.transform(poly_features)
poly_features.shape

The transform expands the 4 input features (TARGET was dropped above) into 35 polynomial features.
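The number 35 follows from the standard count of polynomial features: n inputs expanded to degree d (bias term included) yield C(n + d, d) output columns; here n = 4 and d = 3.

```python
from math import comb

n_inputs, degree = 4, 3
n_output_features = comb(n_inputs + degree, degree)
# C(7, 3) = 35, matching poly_features.shape[1]
```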

# wrap the transformed array in a DataFrame so it can be merged back into the full set
# (the transformer was fit on 4 columns, so exactly those 4 names are passed)
poly_features = pd.DataFrame(poly_features,columns=poly_transformer.get_feature_names_out(['DAYS_BIRTH','DAYS_EMPLOYED','REGION_RATING_CLIENT_W_CITY','REGION_RATING_CLIENT']))
poly_features.head()


poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features,on='SK_ID_CURR',how='left')
app_train_poly.head()


# engineered features
# CREDIT_INCOME_PERCENT: credit amount relative to income
# ANNUITY_INCOME_PERCENT: loan annuity relative to income
# CREDIT_TERM: annuity as a fraction of the credit amount (inverse of the payment term)
# DAYS_EMPLOYED_PERCENT: days employed relative to the client's age in days
app_train_domain = app_train.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']
plt.figure(figsize = (12, 20))
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    
    plt.subplot(4, 1, i + 1)
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'repaid on time')
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'did not repay on time')
    plt.legend()
    
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature)
    plt.ylabel('Density')
    
plt.tight_layout(h_pad = 2.5)
plt.show()


4.3 Preprocessing

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

label = app_train['TARGET']
train = app_train.drop(columns = ['TARGET'])
train,test,y_train,y_test= train_test_split(train,label,test_size = 0.2,random_state = 0)
features = list(train.columns)

from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing
imputer = SimpleImputer(strategy='median')
std = StandardScaler()

# impute missing values
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
# standardize
std.fit(train)
train = std.transform(train)
test = std.transform(test)
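The impute-then-scale steps above can also be written as an sklearn Pipeline, which guarantees both transformers are fit on the training split only, avoiding test-set leakage. A minimal sketch on made-up numbers (SimpleImputer stands in for the older Imputer):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# tiny made-up matrices with missing entries
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
X_test = np.array([[2.0, np.nan]])

prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# fit on the training split only, then apply the same statistics to test
X_train_t = prep.fit_transform(X_train)
X_test_t = prep.transform(X_test)
# the test NaN is filled with the *training* median and scaled with training stats
```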

4.4 Model Comparison

4.4.1 Baseline: Logistic Regression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=0.0001)  # C: inverse regularization strength; smaller C = stronger penalty
log_reg.fit(train,y_train)
# predict on the test set and compute the evaluation metric
predictions = log_reg.predict_proba(test)[:,1]

from sklearn.metrics import roc_auc_score
test_auc = roc_auc_score(y_test,predictions)
test_auc

0.74397177809054
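`roc_auc_score` has a useful rank interpretation: the AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties counting half). A small hand computation on made-up scores illustrates this:

```python
from itertools import product

# made-up predicted probabilities
pos_scores = [0.9, 0.6, 0.4]  # true label 1
neg_scores = [0.5, 0.3, 0.2]  # true label 0

# count positive/negative pairs where the positive outranks the negative
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos_scores, neg_scores))
auc = wins / (len(pos_scores) * len(neg_scores))
# 8 of 9 pairs are ranked correctly -> AUC = 8/9 ≈ 0.889
```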

4.4.2 Random Forest

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=1000,random_state=10,n_jobs=-1)
random_forest.fit(train,y_train)

predictions = random_forest.predict_proba(test)[:,1]
test_auc = roc_auc_score(y_test,predictions)
test_auc

0.726788917878467

4.4.3 LightGBM

LightGBM (Light Gradient Boosting Machine) is a framework implementing the GBDT algorithm. It supports efficient parallel training and offers:

  1. Faster training
  2. Lower memory usage
  3. Better accuracy
  4. Distributed support, allowing it to handle massive datasets quickly

On the Higgs dataset, LightGBM trains nearly 10x faster than XGBoost, uses roughly 1/6 of the memory, and also improves accuracy.

import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

model.fit(train, y_train, eval_metric = 'auc',
          eval_set = [(test, y_test), (train, y_train)],
          eval_names = ['test', 'train'],
          # early_stopping_rounds/verbose moved to callbacks in recent LightGBM
          callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])

Training until validation scores don't improve for 100 rounds
[200] train's auc: 0.795708 train's binary_logloss: 0.550817 test's auc: 0.75899 test's binary_logloss: 0.565247
[400] train's auc: 0.824662 train's binary_logloss: 0.522404 test's auc: 0.758982 test's binary_logloss: 0.547585
Early stopping, best iteration is:
[300] train's auc: 0.811231 train's binary_logloss: 0.535615 test's auc: 0.759415 test's binary_logloss: 0.555607

With LightGBM, the test AUC improves to 0.759415.

4.5 Feature Comparison

Repeat the experiment with the dataset that includes the hand-crafted features:

app_train_domain = app_train_domain.drop(columns = ['TARGET'])
train, test, y_train, y_test = train_test_split(app_train_domain, label, test_size=0.2, random_state=100)
features = list(train.columns)

from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing
imputer = SimpleImputer(strategy = 'median')
std = StandardScaler()

# impute, then standardize, fitting on the training split only
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)
std.fit(train)
train = std.transform(train)
test = std.transform(test)

4.5.1 Random Forest

random_forest.fit(train, y_train)

predictions = random_forest.predict_proba(test)[:, 1]
test_auc = roc_auc_score(y_test, predictions)
test_auc

The test AUC edges up from 0.726788917878467 to 0.7268548282953567.

4.5.2 LightGBM

model.fit(train, y_train, eval_metric = 'auc',
          eval_set = [(test, y_test), (train, y_train)],
          eval_names = ['test', 'train'],
          # early_stopping_rounds/verbose moved to callbacks in recent LightGBM
          callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])

Training until validation scores don't improve for 100 rounds
[200] train's auc: 0.802673 train's binary_logloss: 0.543939 test's auc: 0.76511 test's binary_logloss: 0.560088
Early stopping, best iteration is:
[296] train's auc: 0.817961 train's binary_logloss: 0.528392 test's auc: 0.765577 test's binary_logloss: 0.550564

The test AUC rises from 0.759415 to 0.765577.

Reference: LightGBM algorithm summary.
