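The dataset used here consists of a set of applicant features and a credit-rating label, which makes this a binary classification problem.
This article compares how decision trees, SVM, and random forests perform on this dataset.
Importing the data and checking for missing values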
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_excel('./GermanCredit.xls', sheet_name='Data')  # read the 'Data' sheet of the .xls file
data.head()
num_features = ['DURATION','AMOUNT','INSTALL_RATE','AGE','NUM_CREDITS','NUM_DEPENDENTS']
cat_features = data.columns.drop(num_features + ['OBS#'])
data.isnull().sum()
# No column has missing values
OBS# 0
CHK_ACCT 0
DURATION 0
HISTORY 0
NEW_CAR 0
USED_CAR 0
FURNITURE 0
RADIO/TV 0
EDUCATION 0
RETRAINING 0
AMOUNT 0
SAV_ACCT 0
EMPLOYMENT 0
INSTALL_RATE 0
MALE_DIV 0
MALE_SINGLE 0
MALE_MAR_or_WID 0
CO-APPLICANT 0
GUARANTOR 0
PRESENT_RESIDENT 0
REAL_ESTATE 0
PROP_UNKN_NONE 0
AGE 0
OTHER_INSTALL 0
RENT 0
OWN_RES 0
NUM_CREDITS 0
JOB 0
NUM_DEPENDENTS 0
TELEPHONE 0
FOREIGN 0
RESPONSE 0
dtype: int64
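Discretizing continuous features
DURATION is the loan term, ranging from 4 to 72 months. Its distribution looks like a skewed bell shape; the skew is to the right, since most loans are short and a long tail stretches toward 72 months. A histogram makes this clearer: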
plt.hist(data['DURATION'])
(array([171., 262., 337., 57., 86., 17., 54., 2., 13., 1.]),
array([ 4. , 10.8, 17.6, 24.4, 31.2, 38. , 44.8, 51.6, 58.4, 65.2, 72. ]),
<a list of 10 Patch objects>)
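To check the direction of the skew numerically, one option is to rebuild an approximate sample from the histogram output above and compute its sample skewness. This is only a rough sketch: it assumes every observation sits at the center of its bin, and the names `counts`, `edges`, and `approx_duration` are introduced here purely for illustration. A positive value indicates a longer right tail.

```python
import numpy as np
import pandas as pd

# Bin counts and bin edges copied from the plt.hist output above.
counts = np.array([171, 262, 337, 57, 86, 17, 54, 2, 13, 1])
edges = np.array([4.0, 10.8, 17.6, 24.4, 31.2, 38.0, 44.8, 51.6, 58.4, 65.2, 72.0])

# Approximation: place every observation at the center of its bin.
centers = (edges[:-1] + edges[1:]) / 2
approx_duration = pd.Series(np.repeat(centers, counts))

# Sample skewness: > 0 means a long right tail, < 0 a long left tail.
skewness = approx_duration.skew()
print(f"approximate skewness: {skewness:.2f}")
```

On the real data, `data['DURATION'].skew()` gives the figure directly, without the bin-center approximation.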
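Convert DURATION into a categorical feature by binning it into four intervals of about 20 months each (fixed cut points, not quantiles):
x <= 20 dua_rank = 1
20 < x <= 40 dua_rank = 2
40 < x < 60 dua_rank = 3
x >= 60 dua_rank = 4
The result is stored as a new feature, dua_rank, in new_data. Binning of this kind can also be done with sklearn.preprocessing.KBinsDiscretizer.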
dua_rank = []
duration = data['DURATION']
for i in duration:
    if i <= 20:
        dua_rank.append(1)
    elif i <= 40:
        dua_rank.append(2)
    elif i < 60:
        dua_rank.append(3)
    else:
        dua_rank.append(4)
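The loop above can also be written in vectorized form. The sketch below uses `numpy.select`, which picks the first matching condition for each element, so it mirrors the if/elif chain, including the strict `< 60` on the third branch. The helper name `duration_rank` and the sample values are hypothetical and only serve to show the boundary behavior.

```python
import numpy as np
import pandas as pd

def duration_rank(duration: pd.Series) -> pd.Series:
    """Map loan durations (months) to ranks 1-4 with the same cut points as the loop."""
    # Conditions are checked in order; the first one that is True wins.
    conditions = [duration <= 20, duration <= 40, duration < 60]
    ranks = np.select(conditions, [1, 2, 3], default=4)
    return pd.Series(ranks, index=duration.index, name="dua_rank")

# Hypothetical sample chosen to sit on and around the bin boundaries.
sample = pd.Series([6, 20, 24, 40, 48, 59, 60, 72], name="DURATION")
ranks = duration_rank(sample)
print(ranks.tolist())
```

With the real data this would be called as `duration_rank(data['DURATION'])`.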
Most durations fall into ranks 1 and 2, as the following histogram shows:
plt.hist(dua_rank,bins = 4)
(array([554., 365., 67., 14.]),
array([1. , 1.75, 2.5 , 3.25, 4. ]),
<a list of 4 Patch objects>)
new_data = data.copy()
new_data['dua_rank'] = dua_rank
new_data.head()
(index) | OBS# | CHK_ACCT | DURATION | HISTORY | NEW_CAR | USED_CAR | FURNITURE | RADIO/TV | EDUCATION | RETRAINING | ... | OTHER_INSTALL | RENT | OWN_RES | NUM_CREDITS | JOB | NUM_DEPENDENTS | TELEPHONE | FOREIGN | RESPONSE | dua_rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 6 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 2 | 2 | 1 | 1 | 0 | 1 | 1 |
1 | 2 | 1 | 48 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | 3 |
2 | 3 | 3 | 12 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 1 |
3 | 4 | 0 | 42 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 0 | 1 | 3 |
4 | 5 | 0 | 24 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 0 | 0 | 2 |
5 rows × 33 columns
plt.hist(data['AMOUNT'])
(array([445., 293., 97., 80., 38., 19., 14., 8., 5., 1.]),
array([ 250. , 2067.4, 3884.8, 5702.2, 7519.6, 9337. , 11154.4,
12971.8, 14789.2, 16606.6, 18424. ]),
<a list of 10 Patch objects>)
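We likewise split the AMOUNT feature into ranks 1 to 10, using its deciles as the cut points.
This binning can also be done with sklearn.preprocessing.KBinsDiscretizer.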
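As a minimal sketch of the KBinsDiscretizer route, the snippet below bins a synthetic stand-in for AMOUNT into ten quantile-based ranks. The lognormal sample and the name `amount_rank_kbins` are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for data['AMOUNT']; the values are made up for illustration.
rng = np.random.default_rng(0)
amount = rng.lognormal(mean=7.8, sigma=0.8, size=1000)

# strategy="quantile" puts roughly the same number of samples in each bin;
# encode="ordinal" returns the bin index (0..n_bins-1) instead of one-hot columns.
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")

# The transformer expects a 2-D array of shape (n_samples, n_features).
codes = binner.fit_transform(amount.reshape(-1, 1)).ravel()

# Shift from 0-9 to 1-10 to match the rank convention used in this article.
amount_rank_kbins = codes.astype(int) + 1
print(np.bincount(amount_rank_kbins)[1:])  # number of samples per rank
```

With the real data, `amount` would be replaced by `data['AMOUNT'].to_numpy()`. The manual percentile-based version follows.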
percent = np.percentile(data['AMOUNT'], [i * 10 for i in range(1,10)])
amount_rank = []
for i in data['AMOUNT']:
    if i < percent[0]:
        amount_rank