The basic machine learning modeling process, illustrated with a credit scoring case

machine learning for credit scoring

Exercise 0: Data preprocessing

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

Dataset

Attribute Information:

| Variable Name | Description | Type |
| --- | --- | --- |
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due | integer |
| DebtRatio | Monthly debt payments | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due | integer |
| NumberOfDependents | Number of dependents in family | integer |

Read the data into pandas
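import zipfile

import pandas as pd

# Show up to 500 columns so wide frames are not truncated when displayed.
pd.set_option('display.max_columns', 500)

# Read the CSV straight out of the zip archive; the first column becomes the index.
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    with z.open('KaggleCredit2.csv') as f:
        data = pd.read_csv(f, index_col=0)
data.head()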
|   | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.766127 | 45.0 | 2.0 | 0.802982 | 9120.0 | 13.0 | 0.0 | 6.0 | 0.0 | 2.0 |
| 1 | 0 | 0.957151 | 40.0 | 0.0 | 0.121876 | 2600.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0 | 0.658180 | 38.0 | 1.0 | 0.085113 | 3042.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | 0.233810 | 30.0 | 0.0 | 0.036050 | 3300.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 0.907239 | 49.0 | 1.0 | 0.024926 | 63588.0 | 7.0 | 0.0 | 1.0 | 0.0 | 0.0 |

data.shape
(112915, 11)
Drop missing values (drop NA)
data.isnull().sum(axis=0)
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64
data.dropna(inplace=True)
data.shape
(108648, 11)
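
In the output above the row count falls by exactly 4267 (112915 to 108648), which suggests that the missing age and NumberOfDependents values sit in the same rows. As a minimal sketch of the same two calls, the snippet below uses a small hypothetical DataFrame (the values are made up, not taken from the dataset) to count missing values per column and then drop incomplete rows. It uses the non-inplace form of dropna, which returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [45.0, np.nan, 38.0, 30.0],
    "MonthlyIncome": [9120.0, 2600.0, 3042.0, 3300.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
})

# Count missing values in each column (axis=0 sums down the rows).
missing_per_column = toy.isnull().sum(axis=0)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna()
rows_dropped = len(toy) - len(toy_clean)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

In this toy frame the two missing values fall in different rows, so two rows are dropped.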
Create X and y
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)
y.mean()
0.06742876076872101
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.countplot(x='SeriousDlqin2yrs',data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x24081eb9828>

[Figure: count plot of SeriousDlqin2yrs]

The plot shows that samples with label 1 are scarce (about 6.7% of the rows, per y.mean() above), so the classes are imbalanced.
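
One practical consequence of this imbalance is that plain accuracy can look good for a useless model. The sketch below is an illustration only: it builds 1000 synthetic labels with 67 positives (an assumption chosen to roughly match the 6.7% rate above, not the real data) and shows the class shares together with the accuracy of a hypothetical classifier that always predicts 0.

```python
import numpy as np
import pandas as pd

# Synthetic labels: 67 positives and 933 negatives (assumed, to mimic ~6.7% positives).
labels = pd.Series(
    np.r_[np.ones(67, dtype=int), np.zeros(933, dtype=int)],
    name="SeriousDlqin2yrs",
)

# Share of each class, as a fraction of all rows.
class_share = labels.value_counts(normalize=True)

# Accuracy of a trivial model that predicts 0 for everyone.
always_zero_accuracy = (labels == 0).mean()

print(class_share)
print("accuracy of always predicting 0:", always_zero_accuracy)
```

Because such a trivial model is right for every negative row, metrics that look at the ranking of predicted probabilities, such as ROC AUC, are often a better fit for data like this.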

Exercise 1: Dataset preparation

Split the data into a training set and a test set.

Split the dataset
# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# fall back to the old sklearn.cross_validation module on earlier versions.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
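
The split above is purely random, so the share of positive labels in the training and test sets can drift slightly from the overall rate. One option, which the original code does not use, is the stratify argument of train_test_split, which keeps the class proportions nearly identical in both parts. The sketch below demonstrates this on synthetic data; the arrays X_toy and y_toy and the 7% positive rate are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in data: 3 random features and roughly 7% positive labels.
X_toy = rng.randn(10000, 3)
y_toy = (rng.rand(10000) < 0.07).astype(int)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print("overall positive rate:", y_toy.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```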
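Scale the continuous features
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply the same transform to the test set.
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)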
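
To make explicit what the scaler does, the sketch below repeats the fit/transform pattern on synthetic features with very different ranges (the data and variable names here are assumptions for illustration). It then checks the result against the manual formula z = (x - mean_train) / std_train, where both statistics come from the training set only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Two synthetic features on very different scales (think age vs. monthly income).
X_tr = np.column_stack([rng.uniform(20, 80, 500), rng.uniform(0, 50000, 500)])
X_te = np.column_stack([rng.uniform(20, 80, 200), rng.uniform(0, 50000, 200)])

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learns mean and std from the training data
X_te_std = scaler.transform(X_te)      # reuses the training mean and std

# Manual equivalent, using training-set statistics (population std, ddof=0).
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print("train mean after scaling:", X_tr_std.mean(axis=0))
print("train std after scaling: ", X_tr_std.std(axis=0))
print("matches manual formula:  ", np.allclose(manual, X_te_std))
```

Fitting on the training set alone keeps information about the test set from leaking into preprocessing, which is why the test set is only transformed.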

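Exercise 2: Classification with different models

Classify the data with scikit-learn algorithms such as logistic regression, decision trees, SVM, and KNN. Consult the scikit-learn API documentation to learn what each model's parameters mean, and try different parameter settings.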
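
One way to approach this exercise is to loop over several classifiers and compare them with the same metric. The sketch below is a hedged illustration rather than the original author's solution: it uses synthetic data from make_classification in place of the credit dataset, and the hyperparameters (max_depth=5, n_neighbors=15, and so on) are arbitrary starting points, not tuned values. ROC AUC is used because the task is to predict a probability and the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (about 7% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.93, 0.07], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)
X_te_std = scaler.transform(X_te)

# Candidate models; the settings are illustrative defaults to experiment with.
models = {
    "logistic regression": LogisticRegression(C=1.0, max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM (RBF kernel)": SVC(C=1.0, probability=True, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=15),
}

auc_by_model = {}
for name, model in models.items():
    model.fit(X_tr_std, y_tr)
    # Probability of the positive class, used to rank test samples.
    proba = model.predict_proba(X_te_std)[:, 1]
    auc_by_model[name] = roc_auc_score(y_te, proba)
    print(f"{name}: test AUC = {auc_by_model[name]:.3f}")
```

On the real dataset, replace the synthetic arrays with X_train_std, X_test_std, y_train, and y_test; note that SVC can be slow on roughly 76,000 training rows.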

logistic regression

from sklearn.linear_model import LogisticRegression

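# The L1 penalty needs a solver that supports it. liblinear (the solver shown in the
# output below) is passed explicitly so the call also works where the default differs.
lr = LogisticRegression(penalty='l1', C=1000.0, solver='liblinear', random_state=0)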
lr.fit(X_train_std, y_train)
lr
LogisticRegression(C=1000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
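
In LogisticRegression, C is the inverse of the regularization strength, so C=1000.0 above means very weak regularization. With an L1 penalty, smaller values of C tend to push more coefficients to exactly zero. The sketch below illustrates that effect on synthetic data (an assumption; it does not use the credit dataset) by counting the non-zero coefficients for three values of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which carry signal.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0,
    random_state=0)
X_std = StandardScaler().fit_transform(X_syn)

nonzero_by_C = {}
for C in [0.01, 1.0, 1000.0]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                             random_state=0)
    clf.fit(X_std, y_syn)
    # coef_ has shape (1, n_features) for a binary problem.
    nonzero_by_C[C] = int(np.count_nonzero(clf.coef_))
    print(f"C={C}: {nonzero_by_C[C]} non-zero coefficients")
```

Trying a range of C values like this is a simple way to follow the exercise's suggestion to adjust parameters and observe their effect.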

                