Data description
First, let's look at the data format.
There are 20 attributes in total, plus 1 class label. The details are as follows:
Attribute 1: (qualitative)
Status of existing checking account
A11 : … < 0 DM
A12 : 0 <= … < 200 DM
A13 : … >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : … < 100 DM
A62 : 100 <= … < 500 DM
A63 : 500 <= … < 1000 DM
A64 : … >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : … < 1 year
A73 : 1 <= … < 4 years
A74 : 4 <= … < 7 years
A75 : … >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
Foreign worker
A201 : yes
A202 : no
label: (1 = Good, 2 = Bad)
Feature preprocessing
The data is processed into 24 numeric features: the main idea is to discretize each qualitative attribute into several numeric columns so it is easier to work with. The processed data is saved as german_numeric.csv.
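The post does not show the encoding step itself, so here is a minimal sketch of the idea, assuming a simple one-hot scheme for qualitative attributes and pass-through for numerical ones. The category list and example values below are illustrative and do not reproduce the exact 24-column layout of german_numeric.csv.

```python
# Sketch of the discretization idea: a qualitative code such as "A11" becomes
# a set of 0/1 indicator columns, while numerical attributes pass through.

def one_hot(code, categories):
    """Encode a categorical code as a list of 0/1 indicator values."""
    return [1.0 if code == c else 0.0 for c in categories]

# Example: Attribute 1 (checking account status) has four codes.
checking_codes = ["A11", "A12", "A13", "A14"]

# One qualitative attribute plus one numeric attribute (duration in months).
row = one_hot("A12", checking_codes) + [6.0]
# row -> [0.0, 1.0, 0.0, 0.0, 6.0]
```

Applying this to every qualitative attribute (and keeping the numerical ones as-is) is how 20 original attributes can expand into more numeric columns.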
Running on Spark
Start Hadoop, then upload the data to the /user directory on HDFS:
hdfs dfs -put Desktop/german_numeric.csv /user
Start Spark. Once it is up, the PySpark environment still needs to be configured (see my other blog post for the setup steps). Then connect to Spark from a Jupyter notebook:
# Connect to Spark by running PySpark's shell bootstrap script
import os
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())
If the connection succeeds, Spark prints its startup banner and a SparkContext becomes available as sc.
Next we apply some of Spark MLlib's machine learning methods to the data; here, logistic regression is used.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
import numpy
# Load the CSV file from HDFS and split it: 80% for training, 20% for testing
data = sc.textFile("hdfs:///user/german_numeric.csv")
train, test = data.randomSplit([0.8, 0.2], seed=12345)
# print(test.count(), train.count())
# 182 818
# Convert each CSV line into a LabeledPoint
def parsePoint(line):
    values = [float(x.strip()) for x in line.split(',')]
    # Note: labels must start from 0 (1 = Good -> 0, 2 = Bad -> 1), otherwise training fails
    if values[-1] == 1:
        values[-1] = 0
    else:
        values[-1] = 1
    return LabeledPoint(values[-1], values[:24])
train_parsed = train.map(parsePoint)
test_parsed = test.map(parsePoint)
# print(train_parsed.first())
# (0.0,[1.0,6.0,4.0,12.0,5.0,5.0,3.0,4.0,1.0,67.0,3.0,2.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0])
# Build the model
model = LogisticRegressionWithLBFGS.train(train_parsed, iterations=10, numClasses=2)
# Evaluate the training error rate
labelsAndPreds = train_parsed.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(train_parsed.count())
# print("Training Error = " + str(trainErr))
# Training Error = 0.213936430318
# Predict on the test data
predictions = model.predict(test_parsed.map(lambda x: x.features))
# print(predictions.collect())
#[1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
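The test error can be computed the same way as the training error once predictions are paired with the true labels. As a plain-Python sketch of that error-rate calculation (in Spark it runs as a filter/count over an RDD of pairs; the pairs below are made up for illustration, not real model output):

```python
# Error rate over (true label, predicted label) pairs:
# the fraction of pairs where the two values disagree.
pairs = [(0, 0), (1, 0), (1, 1), (0, 0), (0, 1)]

err = sum(1 for label, pred in pairs if label != pred) / float(len(pairs))
# err -> 0.4 (2 of the 5 pairs disagree)
```

With the RDDs above, the same computation is the labelsAndPreds-style map/filter/count applied to test_parsed instead of train_parsed.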
That is the entire workflow; each part can be filled in with more detail.