The Basic Machine Learning Modeling Process Through a Credit Scoring Case Study

程序员酱油哥


Article link: https://blog.csdn.net/qintian888/article/details/93752450

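This article uses a credit scoring case study to introduce the basic machine learning modeling process: data preprocessing, model training, evaluation, and tuning. The exercises use models such as logistic regression, SVM, and KNN, and apply feature selection to improve model performance.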

machine learning for credit scoring

Exercise 0: Data preprocessing

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

Dataset

Attribute Information:

Variable Name	Description	Type
SeriousDlqin2yrs	Person experienced 90 days past due delinquency or worse	Y/N
RevolvingUtilizationOfUnsecuredLines	Total balance on credit divided by the sum of credit limits	percentage
age	Age of borrower in years	integer
NumberOfTime30-59DaysPastDueNotWorse	Number of times borrower has been 30-59 days past due	integer
DebtRatio	Monthly debt payments, alimony, and living costs divided by monthly gross income	percentage
MonthlyIncome	Monthly income	real
NumberOfOpenCreditLinesAndLoans	Number of Open loans	integer
NumberOfTimes90DaysLate	Number of times borrower has been 90 days or more past due.	integer
NumberRealEstateLoansOrLines	Number of mortgage and real estate loans	integer
NumberOfTime60-89DaysPastDueNotWorse	Number of times borrower has been 60-89 days past due	integer
NumberOfDependents	Number of dependents in family	integer

Read the data into pandas

import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfDependents
0	1	0.766127	45.0	2.0	0.802982	9120.0	13.0	0.0	6.0	2.0
1	0	0.957151	40.0	0.0	0.121876	2600.0	4.0	0.0	0.0	1.0
2	0	0.658180	38.0	1.0	0.085113	3042.0	2.0	1.0	0.0	0.0
3	0	0.233810	30.0	0.0	0.036050	3300.0	5.0	0.0	0.0	0.0
4	0	0.907239	49.0	1.0	0.024926	63588.0	7.0	0.0	1.0	0.0

data.shape

(112915, 11)
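As a side note, pandas can usually read a single-file ZIP archive directly, because `read_csv` infers the compression from the `.zip` suffix. The sketch below builds a tiny stand-in archive so it runs without the real data; the file name `toy_credit.csv` and its values are hypothetical.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive so the sketch runs without the real Kaggle file.
# The file names and values below are hypothetical.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy_credit.csv")
zip_path = os.path.join(tmp_dir, "toy_credit.csv.zip")

toy = pd.DataFrame(
    {"SeriousDlqin2yrs": [1, 0, 0], "age": [45.0, 40.0, 38.0]}
)
toy.to_csv(csv_path)  # the index is written as the first, unnamed column

with zipfile.ZipFile(zip_path, "w") as z:
    z.write(csv_path, arcname="toy_credit.csv")

# Compression is inferred from the ".zip" suffix; the archive must hold exactly one file.
loaded = pd.read_csv(zip_path, index_col=0)
print(loaded.shape)
```

Opening the archive explicitly with `zipfile`, as the code above does, remains the safer choice when an archive contains more than one file.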

Drop missing values (NaN)

data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

data.dropna(inplace=True)
data.shape

(108648, 11)
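The shapes above show that 112915 - 108648 = 4267 rows were removed, which equals the per-column missing count, so `age` and `NumberOfDependents` appear to be missing on the same rows. A minimal sketch of how to check that kind of overlap before dropping, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the two columns are missing on the same rows.
toy = pd.DataFrame(
    {
        "age": [45.0, np.nan, 38.0, np.nan, 49.0],
        "NumberOfDependents": [2.0, np.nan, 0.0, np.nan, 0.0],
        "DebtRatio": [0.80, 0.12, 0.09, 0.04, 0.02],
    }
)

missing_per_column = toy.isnull().sum(axis=0)            # NaN count per column
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())  # rows dropna() will remove

cleaned = toy.dropna()  # returns a new frame; toy itself is unchanged
rows_dropped = len(toy) - len(cleaned)
print(missing_per_column.to_dict(), rows_with_any_missing, rows_dropped)
```

If the number of rows with any missing value were much larger than each per-column count, dropping would discard more data than the column totals suggest.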

Create X and y

y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

y.mean()

0.06742876076872101

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.countplot(x='SeriousDlqin2yrs',data=data)

<matplotlib.axes._subplots.AxesSubplot at 0x24081eb9828>

[Figure: count plot of SeriousDlqin2yrs, class 0 vs. class 1]

# The plot shows that samples with label 1 are scarce (about 6.7% of rows, per y.mean() above), so the classes are imbalanced
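The imbalance can also be quantified without a plot. With a positive rate of roughly 0.067, a model that always predicts 0 would already reach about 93% accuracy, which is why accuracy alone is a weak yardstick here. The sketch below illustrates the calls on hypothetical toy labels, not the real `y`:

```python
import pandas as pd

# Hypothetical labels: 1 positive out of 10, just to illustrate the calls.
y_toy = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], name="SeriousDlqin2yrs")

class_counts = y_toy.value_counts()                # absolute count per class
class_share = y_toy.value_counts(normalize=True)   # fraction per class
positive_rate = y_toy.mean()  # for 0/1 labels the mean is the share of 1s

# Accuracy of a baseline that always predicts the majority class
majority_baseline_accuracy = class_share.max()
print(class_counts.to_dict(), positive_rate, majority_baseline_accuracy)
```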

Exercise 1: Dataset preparation

Split the data into a training set and a test set.

Split the dataset

# sklearn.model_selection replaced sklearn.cross_validation in scikit-learn 0.18.
# The old module was removed in 0.20 and distutils was removed in Python 3.12,
# so import directly instead of checking the version.
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as the test set; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
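Because the positive class is rare, a purely random split can leave the two subsets with slightly different positive rates. One common option, not used in the original code, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (an assumption) with about 7% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 7% positives, roughly like the data above.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:70] = 1

# stratify=y_toy keeps the class proportions nearly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```

Both printed rates should sit close to the overall 7%; without `stratify` they can drift apart, more so on small datasets.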

Scale the continuous features

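from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train_std = stdsc.fit_transform(X_train)
# Reuse the training statistics on the test set to avoid leaking test information
X_test_std = stdsc.transform(X_test)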
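To make the scaling step concrete: `StandardScaler` computes z = (x - mean) / std per column, where the mean and the population standard deviation come from the data passed to `fit`. The sketch below checks this on synthetic data; the arrays and their sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features with a non-zero mean and a large spread.
rng = np.random.RandomState(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(80, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean_ and scale_
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training statistics

# Equivalent manual computation: z = (x - mean) / std, with population std (ddof=0)
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(np.allclose(manual, X_test_scaled))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the scaled test columns are only approximately so, because they are standardized with the training statistics.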

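Exercise 2: Classify with different models

Classify with sklearn algorithms such as logistic regression, decision trees, SVM, and KNN. Consult the sklearn API documentation to learn what each model's parameters mean, and try adjusting them.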
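One way to approach this exercise is to fit several classifiers in a loop and compare them on the same split. The following is a minimal sketch, not the author's solution: it runs on synthetic data from `make_classification`, and the hyperparameter values are arbitrary starting points chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Kaggle set): about 10% positives.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

# Scale with statistics learned from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_std, X_te_std = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "logistic regression": LogisticRegression(C=1.0, max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM (RBF kernel)": SVC(C=1.0, kernel="rbf", gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr_std, y_tr)
    scores[name] = model.score(X_te_std, y_te)  # mean accuracy on the test split
    print(f"{name}: {scores[name]:.3f}")
```

On the full credit dataset, `SVC` and `KNeighborsClassifier` can be slow because their cost grows quickly with the number of rows, so experimenting on a subsample first may be practical.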

logistic regression


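from sklearn.linear_model import LogisticRegression

# L1 regularization needs a solver that supports it. liblinear was the default
# when this was written; newer scikit-learn defaults to lbfgs, so set it explicitly.
lr = LogisticRegression(penalty='l1', C=1000.0, solver='liblinear', random_state=0)
lr.fit(X_train_std, y_train)
lr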

LogisticRegression(C=1000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
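Since the task is to predict a probability of financial distress, the fitted model's `predict_proba` output is usually more informative than hard class labels, and ROC AUC is a common complement to accuracy on imbalanced data. The sketch below is an illustration under assumptions: it uses synthetic data with about 7% positives instead of the credit dataset, with the same estimator settings as above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): about 7% positives.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

scaler = StandardScaler().fit(X_tr)
X_tr_std, X_te_std = scaler.transform(X_tr), scaler.transform(X_te)

clf = LogisticRegression(penalty="l1", C=1000.0, solver="liblinear", random_state=0)
clf.fit(X_tr_std, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_default = clf.predict_proba(X_te_std)[:, 1]
test_accuracy = accuracy_score(y_te, clf.predict(X_te_std))
test_auc = roc_auc_score(y_te, proba_default)
print(f"accuracy={test_accuracy:.3f}  AUC={test_auc:.3f}")
```

Note that `C` is the inverse of the regularization strength, so `C=1000.0` applies very little L1 penalty; smaller values of `C` push more coefficients in `clf.coef_` to exactly zero.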