German Credit Risk(德国信用卡违约分析)

5 篇文章 1 订阅
2 篇文章 0 订阅

数据信息

先看下数据格式:
这里写图片描述

总共有20个属性,1个类别特征。信息如下:

Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= … < 200 DM
A13 : … >= 200 DM / salary assignments for at least 1 year
A14 : no checking account

Attribute 2: (numerical)
Duration in month

Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)

Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical)
Credit amount

Attibute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= … < 500 DM
A63 : 500 <= … < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= … < 4 years
A74 : 4 <= … < 7 years
A75 : .. >= 7 years

Attribute 8: (numerical)
Installment rate in percentage of disposable income

Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical)
Present residence since

Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical)
Age in years

Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free

Attribute 16: (numerical)
Number of existing credits at this bank

Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer

Attribute 18: (numerical)
Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name

Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no

label: (1 = Good, 2 = Bad)

特征预处理

这里按照一定方法处理成24个数值特征,主要是把每个特征离散为几个数值属性方便处理。处理后的数据如下:
这里写图片描述

spark运行

启动Hadoop,将数据上传到hdfs user目录下:

hdfs dfs -put Desktop/german_numeric.csv /user

启动spark。启动完后还需要配置下pyspark的环境,配置过程见我的另一篇博客.然后通过jupyter notebook去连接spark:

#连接spark
import os
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))

如果连接成功,会出现如下图所示:
这里写图片描述

下面就是使用Spark Mllib的一些机器学习方法处理数据,这里使用了逻辑回归。

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
import numpy

#从hdfs加载进csv文件并进行数据集划分,80%用于训练,20%用于测试
data = sc.textFile("hdfs:///user/german_numeric.csv")
train, test = data.randomSplit([0.8, 0.2], seed=12345)
#print test.count(),train.count()
#182 818

#将csv文件转变成'LabeledPoint'格式数据
def parsePoint(line):
    values = [float(x.strip()) for x in line.split(',')]
    #这里需要注意下,标签必须从0开始,否则会出错
    if values[-1]==1:
        values[-1]=0
    else:
        values[-1]=1
    return LabeledPoint(values[-1],values[:24])

train_parsed = train.map(parsePoint)
test_parsed = test.map(parsePoint)
#print train_parsed.first()
#(0.0,[1.0,6.0,4.0,12.0,5.0,5.0,3.0,4.0,1.0,67.0,3.0,2.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0])

# 建立模型
model = LogisticRegressionWithLBFGS.train(train_parsed, iterations=10,numClasses=2)

#评估误差率
labelsAndPreds = train_parsed.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(train_parsed.count())
#print("Training Error = " + str(trainErr))
#Training Error = 0.213936430318

#对测试数据做预估
predictions = model.predict(test_parsed.map(lambda x: x.features))
#print predictions.collect()
#[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

整个的流程就如上述所说,每个部分可以再做详细的填充。

  • 0
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 4
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值