1. Data preprocessing
- Data cleaning
Missing-value handling: deletion (rows that contain a missing value are dropped).
setwd("G:/!!aaclassnew/R/20181025")
data=read.csv(file = "bank-data.csv",header = TRUE)
View(data)
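n=sum(is.na(data)) # total number of missing values
print(n)
sub=which(is.na(data$income)) # row indices with a missing income (rows 456-461)
print(sub)
new_data=data[!is.na(data$income),] # drop those rows; a logical filter also works when nothing is missing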
print(new_data)
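The deletion step can also be written without naming a specific column. The following is a minimal sketch on a made-up data frame (the `toy` values are invented and only stand in for bank-data.csv). It relies on base R's `complete.cases()`. One R detail worth knowing: indexing with `-which(...)` returns zero rows when nothing matches, which a logical filter avoids.

```r
# Toy data frame standing in for bank-data.csv (all values are made up)
toy <- data.frame(
  id     = c("ID1", "ID2", "ID3", "ID4"),
  income = c(17546.0, NA, 16575.4, NA),
  age    = c(48, 40, 51, 23)
)

# Count missing values per column to see where the gaps are
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing value in any column
cleaned <- toy[complete.cases(toy), ]
print(nrow(cleaned))
```

Checking `colSums(is.na(...))` first shows whether the missing values sit in one column only, which is what the income-only filter in the text assumes.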
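- Data integration
Remove useless attributes: delete the "ID" attribute.
new_data=new_data[,-1] # drop the first column (ID); [,-2] would drop the second column, [-2,] the second row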
print(new_data)
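Dropping a column by position breaks silently if the column order changes. A small sketch of dropping by name instead, assuming the ID column is called `id` (the name and the values below are assumptions, not taken from the data file):

```r
# Toy data frame; the column name "id" is an assumption
toy <- data.frame(
  id     = c("ID1", "ID2"),
  age    = c(48, 40),
  income = c(17546.0, 30085.1)
)

# Keep every column except the one named "id"
toy_no_id <- toy[, setdiff(names(toy), "id"), drop = FALSE]
print(names(toy_no_id))
```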
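- Data transformation
Discretization: convert the "children" attribute into a categorical attribute with two values, "YES" and "NO"; discretize the income attribute at the cut points 12640.3, 17390.1, 29622, and 43228.2.
# 0 children -> "NO", otherwise "YES" (vectorized, so no hard-coded row count is needed)
new_data$children=ifelse(new_data$children==0,"NO","YES")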
print(new_data)
# Map income to classes 1-5; note that the last cut point is inclusive (<=), unlike the first three
for(i in seq_len(nrow(new_data))){
  if(new_data$income[i]<12640.3){
    new_data$income[i]=1;
  }else if(new_data$income[i]<17390.1){
    new_data$income[i]=2;
  }else if(new_data$income[i]<29622){
    new_data$income[i]=3;
  }else if(new_data$income[i]<=43228.2){
    new_data$income[i]=4;
  }else{
    new_data$income[i]=5;
  }
}
print(new_data)
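The two transformation steps above can also be sketched with vectorized base R functions. This is an illustration on made-up vectors, not the code used for the results in this report. One difference to keep in mind: `cut(..., right = FALSE)` uses half-open intervals [a, b) at every cut point, so an income of exactly 43228.2 would land in class 5 here, whereas the loop in the text puts it in class 4.

```r
# Made-up sample values for illustration
children <- c(0, 2, 1, 0)
income   <- c(8000, 12640.3, 20000, 30000, 50000)

# Binarize: 0 children -> "NO", anything else -> "YES"
children_flag <- ifelse(children == 0, "NO", "YES")

# Discretize income at the cut points given in the text.
# labels = FALSE returns the integer bin index (1..5).
cut_points   <- c(12640.3, 17390.1, 29622, 43228.2)
income_class <- cut(income,
                    breaks = c(-Inf, cut_points, Inf),
                    labels = FALSE,
                    right  = FALSE)

print(children_flag)
print(income_class)
```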
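2. Decision tree: of the 600 records in "bank-data.csv", use the first 500 as the training set and save them to a file; use the remaining records as the test set. (The assignment calls for the last 100 records, but removing the 6 rows with missing income leaves 594 rows, so the test set holds 94 records.)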
practice_data=new_data[c(1:500),]
print(practice_data)
write.csv(practice_data,file = "practice_datafile.csv",row.names = FALSE)
test_data=new_data[c(501:594),]
print(test_data)
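A sketch of the same split that derives the test rows from the actual row count instead of hard-coding 594. The `toy` data frame is a placeholder with the same number of rows as the cleaned data; its contents are made up.

```r
# Placeholder for the cleaned data: 594 rows, as in the text
toy <- data.frame(row_id = seq_len(594))

n_train    <- 500
train_rows <- seq_len(n_train)
# Every row that is not a training row becomes a test row
test_rows  <- setdiff(seq_len(nrow(toy)), train_rows)

train_set <- toy[train_rows, , drop = FALSE]
test_set  <- toy[test_rows, , drop = FALSE]

print(c(train = nrow(train_set), test = nrow(test_set)))
```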
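- Program the node selection at each step, then draw the resulting decision tree model with software (Visio, Photoshop, or a photographed hand drawing are all acceptable; Visio is recommended).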
data2=read.csv(file = "practice_datafile.csv",header = TRUE)
View(data2)
library(rpart)
fit=rpart(income ~ age+sex+region+married+children+car+save_act+current_act+mortgage+pep,method = "class",
data=data2,control = rpart.control(minsplit = 1) , parms = list(split="information"))
print(fit)
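Before redrawing the tree by hand or in Visio, it can help to look at a quick plot and at the node table of the fitted object. The sketch below uses the `kyphosis` dataset that ships with rpart as a stand-in, since bank-data.csv is not reproduced here; `fit_demo` and `out_file` are illustrative names.

```r
library(rpart)

# Stand-in model on rpart's bundled kyphosis data, using the same
# entropy-based split criterion as the fit in the text
fit_demo <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(split = "information"))

# Draw the tree into a temporary PDF file
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(fit_demo, uniform = TRUE, margin = 0.1)
text(fit_demo, use.n = TRUE, cex = 0.8)
dev.off()

# One row per node; var == "<leaf>" marks terminal nodes
print(fit_demo$frame[, c("var", "n", "yval")])
```

The row names of `fit_demo$frame` are the node numbers, which follow the same scheme as the "Node number" entries in the `summary()` output (the children of node k are 2k and 2k+1).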
> summary(fit)
Call:
rpart(formula = income ~ age + sex + region + married + children +
car + save_act + current_act + mortgage + pep, data = data2,
method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 1))
n= 500
          CP nsplit rel error    xerror       xstd
1 0.04394427      0 1.0000000 1.0000000 0.03486308
2 0.02572347      3 0.8681672 0.8971061 0.03570696
3 0.02411576      4 0.8424437 0.8810289 0.03578361
4 0.01768489      6 0.7942122 0.8745981 0.03581018
5 0.01286174      8 0.7588424 0.9003215 0.03568987
6 0.01125402     12 0.7073955 0.9035370 0.03567219
7 0.01000000     14 0.6848875 0.9067524 0.03565393
Variable importance
     age save_act children      pep   region  married      car      sex mortgage
      71        9        7        5        3        2        1        1        1
Node number 1: 500 observations, complexity param=0.04394427
predicted class=3 expected loss=0.622 P(node) =1
class counts: 45 77 189 119 70
probabilities: 0.090 0.154 0.378 0.238 0.140
left son=2 (194 obs) right son=3 (306 obs)
Primary splits:
age < 37.5 to the left, improve=121.585200, (0 missing)
save_act splits as LR, improve= 32.714180, (0 missing)
pep splits as LR, improve= 16.722620, (0 missing)
car splits as LR, improve= 7.492084, (0 missing)
region splits as LRLL, improve= 4.601014, (0 missing)
Surrogate splits:
save_act splits as LR, agree=0.622, adj=0.026, (0 split)
Node number 2: 194 observations, complexity param=0.02411576
predicted class=3 expected loss=0.5721649 P(node) =0.388
class counts: 45 55 83 11 0
probabilities: 0.232 0.284 0.428 0.057 0.000
left son=4 (127 obs) right son=5 (67 obs)
Primary splits:
age < 30.5 to the left, improve=25.331970, (0 missing)
pep splits as LR, improve= 6.907268, (0 missing)
car splits as LR, improve= 4.373355, (0 missing)
region splits as LRLR, improve= 3.342221, (0 missing)
current_act splits as LR, improve= 1.781451, (0 missing)
Surrogate splits:
region splits as LLLR, agree=0.66, adj=0.015, (0 split)
Node number 3: 306 observations, complexity param=0.04394427
predicted class=4 expected loss=0.6470588 P(node) =0.612
class counts: 0 22 106 108 70
probabilities: 0.000 0.072 0.346 0.353 0.229
left son=6 (139 obs) right son=7 (167 obs)
Primary splits:
age < 50.5 to the left, improve=34.858180, (0 missing)
save_act splits as LR, improve=24.427330, (0 missing)
pep splits as LR, improve= 8.277378, (0 missing)
region splits as LRLL, improve= 6.204070, (0 missing)
car splits as LR, improve= 2.411510, (0 missing)
Surrogate splits:
region splits as RRLL, agree=0.585, adj=0.086, (0 split)
sex splits as RL, agree=0.578, adj=0.072, (0 split)
save_act splits as LR, agree=0.562, adj=0.036, (0 split)
pep splits as LR, agree=0.559, adj=0.029, (0 split)
mortgage splits as RL, agree=0.549, adj=0.007, (0 split)
Node number 4: 127 observations, complexity param=0.02411576
predicted class=2 expected loss=0.6377953 P(node) =0.254
class counts: 39 46 42 0 0
probabilities: 0.307 0.362 0.331 0.000 0.000
left son=8 (80 obs) right son=9 (47 obs)
Primary splits:
age < 25.5 to the left, improve=7.949958, (0 missing)
pep splits as LR, improve=4.822180, (0 missing)
car splits as RL, improve=2.153993, (0 missing)
region splits as RLLL, improve=1.223215, (0 missing)
current_act splits as LR, improve=1.106185, (0 missing)
Node number 5: 67 observations
predicted class=3 expected loss=0.3880597 P(node) =0.134
class counts: 6 9 41 11 0
probabilities: 0.090 0.134 0.612 0.164 0.000
Node number 6: 139 observations, complexity param=0.02572347
predicted class=3 expected loss=0.5539568 P(node) =0.278
class counts: 0 19 62 52 6
probabilities: 0.000 0.137 0.446 0.374 0.043
left son=12 (54 obs) right son=13 (85 obs)
Primary splits:
age < 42.5 to the left, improve=6.925978, (0 missing)
region splits as LRLL, improve=5.217350, (0 missing)
mortgage splits as LR, improve=5.208289, (0 missing)
save_act splits as LR, improve=5.032890, (0 missing)
pep splits as LR, improve=3.223694, (0 missing)
Surrogate splits:
save_act splits as LR, agree=0.662, adj=0.13, (0 split)
Node number 7: 167 observations, complexity param=0.04394427
predicted class=5 expected loss=0.6167665 P(node) =0.334
class counts: 0 3 44 56 64
probabilities: 0.000 0.018 0.263 0.335 0.383
left son=14 (36 obs) right son=15 (131 obs)
Primary splits:
save_act splits as LR, improve=20.568730, (0 missing)
age < 62.5 to the left, improve=10.528110, (0 missing)
pep splits as LR, improve= 6.160460, (0 missing)
children splits as LR, improve= 3.415047, (0 missing)
current_act splits as LR, improve= 2.966695, (0 missing)
Node number 8: 80 observations, complexity param=0.01286174
predicted class=1 expected loss=0.6 P(node) =0.16
class counts: 32 31 17 0 0
probabilities: 0.400 0.388 0.213 0.000 0.000
left son=16 (53 obs) right son=17 (27 obs)
Primary splits:
pep splits as LR, improve=4.625912, (0 missing)
region splits as RLLL, improve=3.142163, (0 missing)
married splits as LR, improve=2.899320, (0 missing)
age < 19.5 to the left, improve=2.470894, (0 missing)
current_act splits as LR, improve=1.816017, (0 missing)
Node number 9: 47 observations
predicted class=3 expected loss=0.4680851 P(node) =0.094
class counts: 7 15 25 0 0
probabilities: 0.149 0.319 0.532 0.000 0.000
Node number 12: 54 observations
predicted class=3 expected loss=0.3888889 P(node) =0.108
class counts: 0 6 33 15 0
probabilities: 0.000 0.111 0.611 0.278 0.000
Node number 13: 85 observations
predicted class=4 expected loss=0.5647059 P(node) =0.17
class counts: 0 13 29 37 6
probabilities: 0.000 0.153 0.341 0.435 0.071
Node number 14: 36 observations, complexity param=0.01286174
predicted class=4 expected loss=0.4166667 P(node) =0.072
class counts: 0 1 14 21 0
probabilities: 0.000 0.028 0.389 0.583 0.000
left son=28 (17 obs) right son=29 (19 obs)
Primary splits:
car splits as LR, improve=3.958279, (0 missing)
children splits as LR, improve=3.042364, (0 missing)
age < 52.5 to the left, improve=2.290985, (0 missing)
sex splits as RL, improve=1.218503, (0 missing)
pep splits as LR, improve=1.171938, (0 missing)
Surrogate splits:
married splits as RL, agree=0.639, adj=0.235, (0 split)
age < 56.5 to the left, agree=0.611, adj=0.176, (0 split)
sex splits as RL, agree=0.611, adj=0.176, (0 split)
current_act splits as RL, agree=0.611, adj=0.176, (0 split)
region splits as RLRL, agree=0.583, adj=0.118, (0 split)
Node number 15: 131 observations, complexity param=0.01768489
predicted class=5 expected loss=0.5114504 P(node) =0.262
class counts: 0 2 30 35 64
probabilities: 0.000 0.015 0.229 0.267 0.489
left son=30 (62 obs) right son=31 (69 obs)
Primary splits:
pep splits as LR, improve=8.469909, (0 missing)
age < 61.5 to the left, improve=8.012590, (0 missing)
current_act splits as LR, improve=3.684888, (0 missing)
region splits as LLRL, improve=2.561434, (0 missing)
mortgage splits as RL, improve=2.195836, (0 missing)
Surrogate splits:
children splits as LR, agree=0.733, adj=0.435, (0 split)
mortgage splits as RL, agree=0.618, adj=0.194, (0 split)
married splits as RL, agree=0.580, adj=0.113, (0 split)
age < 61.5 to the left, agree=0.557, adj=0.065, (0 split)
region splits as LRRR, agree=0.550, adj=0.048, (0 split)
Node number 16: 53 observations, complexity param=0.01125402
predicted class=1 expected loss=0.5283019 P(node) =0.106
class counts: 25 22 6 0 0
probabilities: 0.472 0.415 0.113 0.000 0.000
left son=32 (19 obs) right son=33 (34 obs)
Primary splits:
children splits as LR, improve=3.6638990, (0 missing)
current_act splits as LR, improve=2.6320070, (0 missing)
age < 21.5 to the left, improve=2.5513420, (0 missing)
region splits as RRLL, improve=2.1774120, (0 missing)
mortgage splits as RL, improve=0.9724508, (0 missing)
Surrogate splits:
age < 24.5 to the right, agree=0.66, adj=0.053, (0 split)
Node number 17: 27 observations, complexity param=0.01286174
predicted class=3 expected loss=0.5925926 P(node) =0.054
class counts: 7 9 11 0 0
probabilities: 0.259 0.333 0.407 0.000 0.000
left son=34 (10 obs) right son=35 (17 obs)
Primary splits:
married splits as LR, improve=4.075579, (0 missing)
region splits as RLRR, improve=3.389759, (0 missing)
current_act splits as RL, improve=2.009758, (0 missing)
mortgage splits as LR, improve=1.968611, (0 missing)
age < 24.5 to the left, improve=1.913873, (0 missing)
Surrogate splits:
region splits as RRRL, agree=0.667, adj=0.1, (0 split)
children splits as LR, agree=0.667, adj=0.1, (0 split)
Node number 28: 17 observations
predicted class=3 expected loss=0.4117647 P(node) =0.034
class counts: 0 1 10 6 0
probabilities: 0.000 0.059 0.588 0.353 0.000
Node number 29: 19 observations
predicted class=4 expected loss=0.2105263 P(node) =0.038
class counts: 0 0 4 15 0
probabilities: 0.000 0.000 0.211 0.789 0.000
Node number 30: 62 observations, complexity param=0.01768489
predicted class=3 expected loss=0.6451613 P(node) =0.124
class counts: 0 2 22 17 21
probabilities: 0.000 0.032 0.355 0.274 0.339
left son=60 (21 obs) right son=61 (41 obs)
Primary splits:
children splits as RL, improve=11.301300, (0 missing)
age < 53 to the left, improve= 4.503433, (0 missing)
current_act splits as LR, improve= 4.345984, (0 missing)
region splits as RLRL, improve= 2.877549, (0 missing)
car splits as RL, improve= 2.330261, (0 missing)
Node number 31: 69 observations
predicted class=5 expected loss=0.3768116 P(node) =0.138
class counts: 0 0 8 18 43
probabilities: 0.000 0.000 0.116 0.261 0.623
Node number 32: 19 observations, complexity param=0.01125402
predicted class=2 expected loss=0.4210526 P(node) =0.038
class counts: 8 11 0 0 0
probabilities: 0.421 0.579 0.000 0.000 0.000
left son=64 (4 obs) right son=65 (15 obs)
Primary splits:
region splits as RRRL, improve=4.2332330, (0 missing)
age < 21.5 to the left, improve=1.9960510, (0 missing)
sex splits as RL, improve=0.8541775, (0 missing)
current_act splits as LR, improve=0.4374060, (0 missing)
car splits as LR, improve=0.1404615, (0 missing)
Node number 33: 34 observations
predicted class=1 expected loss=0.5 P(node) =0.068
class counts: 17 11 6 0 0
probabilities: 0.500 0.324 0.176 0.000 0.000
Node number 34: 10 observations
predicted class=1 expected loss=0.5 P(node) =0.02
class counts: 5 4 1 0 0
probabilities: 0.500 0.400 0.100 0.000 0.000
Node number 35: 17 observations
predicted class=3 expected loss=0.4117647 P(node) =0.034
class counts: 2 5 10 0 0
probabilities: 0.118 0.294 0.588 0.000 0.000
Node number 60: 21 observations, complexity param=0.01286174
predicted class=3 expected loss=0.4761905 P(node) =0.042
class counts: 0 1 11 9 0
probabilities: 0.000 0.048 0.524 0.429 0.000
left son=120 (17 obs) right son=121 (4 obs)
Primary splits:
age < 62 to the left, improve=4.042513, (0 missing)
region splits as RRLR, improve=4.020326, (0 missing)
sex splits as LR, improve=1.593345, (0 missing)
married splits as LR, improve=1.350827, (0 missing)
mortgage splits as RL, improve=1.033341, (0 missing)
Node number 61: 41 observations
predicted class=5 expected loss=0.4878049 P(node) =0.082
class counts: 0 1 11 8 21
probabilities: 0.000 0.024 0.268 0.195 0.512
Node number 64: 4 observations
predicted class=1 expected loss=0 P(node) =0.008
class counts: 4 0 0 0 0
probabilities: 1.000 0.000 0.000 0.000 0.000
Node number 65: 15 observations
predicted class=2 expected loss=0.2666667 P(node) =0.03
class counts: 4 11 0 0 0
probabilities: 0.267 0.733 0.000 0.000 0.000
Node number 120: 17 observations
predicted class=3 expected loss=0.3529412 P(node) =0.034
class counts: 0 1 11 5 0
probabilities: 0.000 0.059 0.647 0.294 0.000
Node number 121: 4 observations
predicted class=4 expected loss=0 P(node) =0.008
class counts: 0 0 0 4 0
probabilities: 0.000 0.000 0.000 1.000 0.000
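To see how a split is chosen at a node, the "improve" value of the root split can be recomputed from the class counts printed for nodes 1, 2, and 3 in the output above. The sketch assumes that, with split="information", the improvement is the drop in entropy (natural log) scaled by node size; `node_deviance` is a helper name introduced here for illustration.

```r
# Size-scaled entropy of a node: -sum(c * log(c / n)), natural log
node_deviance <- function(counts) {
  counts <- counts[counts > 0]   # treat 0 * log(0) as 0
  n <- sum(counts)
  -sum(counts * log(counts / n))
}

# Class counts copied from the summary output
parent <- c(45, 77, 189, 119, 70)   # node 1 (500 observations)
left   <- c(45, 55, 83, 11, 0)      # node 2 (age < 37.5)
right  <- c(0, 22, 106, 108, 70)    # node 3 (age >= 37.5)

# Improvement = parent impurity minus the summed child impurities
improve <- node_deviance(parent) - node_deviance(left) - node_deviance(right)
print(improve)
```

Under that assumption the result should come out close to the improve=121.585200 reported for the `age < 37.5` split. Repeating the calculation for each candidate split and keeping the largest value is the node-selection step.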
- Then use this model to programmatically classify the test data.
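A minimal sketch of the classification step, using rpart's `predict()` with `type = "class"` and a confusion table. The data here is synthetic (the `toy` data frame and its age-based income rule are made up) so the example runs without bank-data.csv. With the real data, `fit` and `test_data` from the text would take the place of `fit_toy` and `test`.

```r
library(rpart)

# Synthetic stand-in data: income class is a made-up function of age
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:67, n, replace = TRUE),
  save_act = factor(sample(c("NO", "YES"), n, replace = TRUE))
)
toy$income <- factor(cut(toy$age, breaks = c(-Inf, 30, 50, Inf), labels = FALSE))

train <- toy[1:150, ]
test  <- toy[151:200, ]

fit_toy <- rpart(income ~ age + save_act, data = train, method = "class",
                 parms = list(split = "information"))

# Predict a class label for each test row
pred <- predict(fit_toy, newdata = test, type = "class")

# Compare predictions with the true labels
confusion <- table(actual = test$income, predicted = pred)
accuracy  <- mean(as.character(pred) == as.character(test$income))

print(confusion)
print(accuracy)
```

Comparing the labels as character strings avoids errors when the predicted factor and the stored labels have different types or level sets, as they do in the text, where income is stored as a number.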