Data description
First, let's look at the data format.
There are 20 attributes in total, plus 1 class label. The details are as follows:
Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= … < 200 DM
A13 : … >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= … < 500 DM
A63 : 500 <= … < 1000 DM
A64 : … >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= … < 4 years
A74 : 4 <= … < 7 years
A75 : … >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
Foreign worker
A201 : yes
A202 : no
label: (1 = Good, 2 = Bad)
Feature preprocessing
The data is processed into 24 numeric features: the main idea is to discretize each qualitative attribute into several numeric columns so it is easier to work with. The processed data is saved as german_numeric.csv.
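The post does not show the encoding step itself, so here is a minimal sketch of the idea, assuming a simple one-hot scheme for qualitative attributes and pass-through for numerical ones. The category list and example values below are illustrative and do not reproduce the exact 24-column layout of german_numeric.csv.

```python
# Sketch of the discretization idea: a qualitative code such as "A11" becomes
# a set of 0/1 indicator columns, while numerical attributes pass through.

def one_hot(code, categories):
    """Encode a categorical code as a list of 0/1 indicator values."""
    return [1.0 if code == c else 0.0 for c in categories]

# Example: Attribute 1 (checking account status) has four codes.
checking_codes = ["A11", "A12", "A13", "A14"]

# One qualitative attribute plus one numeric attribute (duration in months).
row = one_hot("A12", checking_codes) + [6.0]
# row -> [0.0, 1.0, 0.0, 0.0, 6.0]
```

Applying this to every qualitative attribute (and keeping the numerical ones as-is) is how 20 original attributes can expand into more numeric columns.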
Running on Spark
Start Hadoop, then upload the data to the /user directory on HDFS:
hdfs dfs -put Desktop/german_numeric.csv /user
Start Spark. Once it is up, the PySpark environment still needs to be configured (see my other blog post for the setup steps). Then connect to Spark from a Jupyter notebook:
# Connect to Spark by running PySpark's shell bootstrap script
import os
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())
If the connection succeeds, Spark prints its startup banner and a SparkContext becomes available as sc.
Next we apply some of Spark MLlib's machine learning methods to the data; here, logistic regression is used.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
import numpy
# Load the CSV file from HDFS and split it: 80% for training, 20% for testing
data = sc.textFile("hdfs:///user/german_numeric.csv")
train, test = data.randomSplit([0.8, 0.2], seed=12345)
# print(test.count(), train.count())
# 182 818
# Convert each CSV line into a LabeledPoint
def parsePoint(line):
    values = [float(x.strip()) for x in line.split(',')]
    # Note: labels must start from 0 (1 = Good -> 0, 2 = Bad -> 1), otherwise training fails
    if values[-1] == 1:
        values[-1] = 0
    else:
        values[-1] = 1
    return LabeledPoint(values[-1], values[:24])
train_parsed = train.map(parsePoint)
test_parsed = test.map(parsePoint)
# print(train_parsed.first())
# (0.0,[1.0,6.0,4.0,12.0,5.0,5.0,3.0,4.0,1.0,67.0,3.0,2.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0])
# Build the model
model = LogisticRegressionWithLBFGS.train(train_parsed, iterations=10, numClasses=2)
# Evaluate the training error rate
labelsAndPreds = train_parsed.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(train_parsed.count())
# print("Training Error = " + str(trainErr))
# Training Error = 0.213936430318
# Predict on the test data
predictions = model.predict(test_parsed.map(lambda x: x.features))
# print(predictions.collect())
#[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
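The test error can be computed the same way as the training error once predictions are paired with the true labels. As a plain-Python sketch of that error-rate calculation (in Spark it runs as a filter/count over an RDD of pairs; the pairs below are made up for illustration, not real model output):

```python
# Error rate over (true label, predicted label) pairs:
# the fraction of pairs where the two values disagree.
pairs = [(0, 0), (1, 0), (1, 1), (0, 0), (0, 1)]

err = sum(1 for label, pred in pairs if label != pred) / float(len(pairs))
# err -> 0.4 (2 of the 5 pairs disagree)
```

With the RDDs above, the same computation is the labelsAndPreds-style map/filter/count applied to test_parsed instead of train_parsed.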
That is the entire workflow; each part can be filled in with more detail.