R Lab: Decision Tree Classification

 

1. Data preprocessing

  • Data cleaning

                   Missing-value handling: deletion (rows with missing values are dropped).

setwd("G:/!!aaclassnew/R/20181025")
data=read.csv(file = "bank-data.csv",header = TRUE)
View(data)
n=sum(is.na(data))# total number of missing values
print(n)
sub=which(is.na(data$income))# indices of the rows with a missing income (rows 456-461)
print(sub)
new_data=data[-sub,]# drop the rows with missing values
print(new_data)
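The deletion step above can also be written with `complete.cases()`, which avoids one pitfall of `data[-sub,]`: if `which()` finds nothing, negative indexing with an empty vector returns zero rows. The following is a minimal sketch on a made-up toy data frame (the column values are assumptions, not taken from bank-data.csv):

```r
# Toy data frame standing in for bank-data.csv (values are made up).
toy <- data.frame(
  id     = c("ID1", "ID2", "ID3", "ID4"),
  age    = c(48, 40, NA, 23),
  income = c(17546.0, NA, 16575.4, 20375.4)
)

# complete.cases() is TRUE for rows that contain no NA at all.
keep <- complete.cases(toy)
clean <- toy[keep, ]

# Logical indexing stays correct when nothing is missing, whereas
# toy[-integer(0), ] would return a data frame with zero rows.
no_na <- toy[c(1, 4), ]
still_all <- no_na[complete.cases(no_na), ]

print(clean)
```

Note that `complete.cases()` checks every column, while the original code only looks at `income`; the two agree when `income` is the only column with missing values.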

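  • Data integration

              Remove useless attributes: delete the "ID" attribute.

new_data=new_data[,-1]# drop the first column (ID); [,-2] would drop the second column, [-2,] the second row
print(new_data)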

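  • Data transformation

                  Discretization: convert the "children" attribute into the two categorical values "YES" and "NO"; discretize the income attribute at the cut points 12640.3, 17390.1, 29622 and 43228.2.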

# children: 0 becomes "NO", any other value becomes "YES"
# (vectorized, so the row count 594 no longer has to be hard-coded)
new_data$children=ifelse(new_data$children==0,"NO","YES")
print(new_data)

# Recode income into classes 1-5.
# Note: the upper bound of class 4 is inclusive (<=), unlike the other cut points.
for(i in seq_len(nrow(new_data))){
  if(new_data$income[i]<12640.3){
    new_data$income[i]=1
  }else if(new_data$income[i]<17390.1){
    new_data$income[i]=2
  }else if(new_data$income[i]<29622){
    new_data$income[i]=3
  }else if(new_data$income[i]<=43228.2){
    new_data$income[i]=4
  }else{
    new_data$income[i]=5
  }
}
print(new_data)
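The income loop above can probably be replaced by a single call to `cut()`. This is a sketch, not the original method: the cut points come from the text, the sample incomes are made up, and with `right = FALSE` every bin is left-closed, so a value of exactly 43228.2 would land in class 5 here but in class 4 in the loop.

```r
# Cut points taken from the text; the sample incomes are made up.
breaks <- c(-Inf, 12640.3, 17390.1, 29622, 43228.2, Inf)
income <- c(8000, 12640.3, 20000, 30000, 43228.1, 50000)

# right = FALSE makes every bin left-closed: [lower, upper).
income_class <- cut(income, breaks = breaks, labels = 1:5, right = FALSE)

print(data.frame(income, income_class))
```

`cut()` returns a factor, which is also what `rpart` expects for a classification target, so no later conversion is needed.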

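2. Decision tree: of the 600 records in "bank-data.csv", use the first 500 as the training set and save them to a file; use the last 100 as the test set. (Rows 456-461 were removed during cleaning, so 594 rows remain and the test set below holds rows 501-594, i.e. 94 records.)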

practice_data=new_data[c(1:500),]
print(practice_data)
write.csv(practice_data,file = "practice_datafile.csv",row.names = FALSE)

test_data=new_data[501:nrow(new_data),]# the remaining rows (501-594) form the test set
print(test_data)

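  • Write code that selects the split at each node, then draw the resulting decision tree model (with Visio, Photoshop, or by hand and photographed; Visio is recommended)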
data2=read.csv(file = "practice_datafile.csv",header = TRUE,stringsAsFactors = TRUE)# read text columns as factors (no longer the default since R 4.0)
View(data2)
library(rpart)
fit=rpart(income ~ age+sex+region+married+children+car+save_act+current_act+mortgage+pep,method = "class",
          data=data2,control = rpart.control(minsplit = 1) , parms = list(split="information"))
print(fit)

 

> summary(fit)
Call:
rpart(formula = income ~ age + sex + region + married + children + 
    car + save_act + current_act + mortgage + pep, data = data2, 
    method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 1))
  n= 500 

          CP nsplit rel error    xerror       xstd
1 0.04394427      0 1.0000000 1.0000000 0.03486308
2 0.02572347      3 0.8681672 0.8971061 0.03570696
3 0.02411576      4 0.8424437 0.8810289 0.03578361
4 0.01768489      6 0.7942122 0.8745981 0.03581018
5 0.01286174      8 0.7588424 0.9003215 0.03568987
6 0.01125402     12 0.7073955 0.9035370 0.03567219
7 0.01000000     14 0.6848875 0.9067524 0.03565393

Variable importance
     age save_act children      pep   region  married      car      sex mortgage 
      71        9        7        5        3        2        1        1        1 

Node number 1: 500 observations,    complexity param=0.04394427
  predicted class=3  expected loss=0.622  P(node) =1
    class counts:    45    77   189   119    70
   probabilities: 0.090 0.154 0.378 0.238 0.140 
  left son=2 (194 obs) right son=3 (306 obs)
  Primary splits:
      age      < 37.5 to the left,  improve=121.585200, (0 missing)
      save_act splits as  LR,       improve= 32.714180, (0 missing)
      pep      splits as  LR,       improve= 16.722620, (0 missing)
      car      splits as  LR,       improve=  7.492084, (0 missing)
      region   splits as  LRLL,     improve=  4.601014, (0 missing)
  Surrogate splits:
      save_act splits as  LR, agree=0.622, adj=0.026, (0 split)

Node number 2: 194 observations,    complexity param=0.02411576
  predicted class=3  expected loss=0.5721649  P(node) =0.388
    class counts:    45    55    83    11     0
   probabilities: 0.232 0.284 0.428 0.057 0.000 
  left son=4 (127 obs) right son=5 (67 obs)
  Primary splits:
      age         < 30.5 to the left,  improve=25.331970, (0 missing)
      pep         splits as  LR,       improve= 6.907268, (0 missing)
      car         splits as  LR,       improve= 4.373355, (0 missing)
      region      splits as  LRLR,     improve= 3.342221, (0 missing)
      current_act splits as  LR,       improve= 1.781451, (0 missing)
  Surrogate splits:
      region splits as  LLLR, agree=0.66, adj=0.015, (0 split)

Node number 3: 306 observations,    complexity param=0.04394427
  predicted class=4  expected loss=0.6470588  P(node) =0.612
    class counts:     0    22   106   108    70
   probabilities: 0.000 0.072 0.346 0.353 0.229 
  left son=6 (139 obs) right son=7 (167 obs)
  Primary splits:
      age      < 50.5 to the left,  improve=34.858180, (0 missing)
      save_act splits as  LR,       improve=24.427330, (0 missing)
      pep      splits as  LR,       improve= 8.277378, (0 missing)
      region   splits as  LRLL,     improve= 6.204070, (0 missing)
      car      splits as  LR,       improve= 2.411510, (0 missing)
  Surrogate splits:
      region   splits as  RRLL, agree=0.585, adj=0.086, (0 split)
      sex      splits as  RL,   agree=0.578, adj=0.072, (0 split)
      save_act splits as  LR,   agree=0.562, adj=0.036, (0 split)
      pep      splits as  LR,   agree=0.559, adj=0.029, (0 split)
      mortgage splits as  RL,   agree=0.549, adj=0.007, (0 split)

Node number 4: 127 observations,    complexity param=0.02411576
  predicted class=2  expected loss=0.6377953  P(node) =0.254
    class counts:    39    46    42     0     0
   probabilities: 0.307 0.362 0.331 0.000 0.000 
  left son=8 (80 obs) right son=9 (47 obs)
  Primary splits:
      age         < 25.5 to the left,  improve=7.949958, (0 missing)
      pep         splits as  LR,       improve=4.822180, (0 missing)
      car         splits as  RL,       improve=2.153993, (0 missing)
      region      splits as  RLLL,     improve=1.223215, (0 missing)
      current_act splits as  LR,       improve=1.106185, (0 missing)

Node number 5: 67 observations
  predicted class=3  expected loss=0.3880597  P(node) =0.134
    class counts:     6     9    41    11     0
   probabilities: 0.090 0.134 0.612 0.164 0.000 

Node number 6: 139 observations,    complexity param=0.02572347
  predicted class=3  expected loss=0.5539568  P(node) =0.278
    class counts:     0    19    62    52     6
   probabilities: 0.000 0.137 0.446 0.374 0.043 
  left son=12 (54 obs) right son=13 (85 obs)
  Primary splits:
      age      < 42.5 to the left,  improve=6.925978, (0 missing)
      region   splits as  LRLL,     improve=5.217350, (0 missing)
      mortgage splits as  LR,       improve=5.208289, (0 missing)
      save_act splits as  LR,       improve=5.032890, (0 missing)
      pep      splits as  LR,       improve=3.223694, (0 missing)
  Surrogate splits:
      save_act splits as  LR, agree=0.662, adj=0.13, (0 split)

Node number 7: 167 observations,    complexity param=0.04394427
  predicted class=5  expected loss=0.6167665  P(node) =0.334
    class counts:     0     3    44    56    64
   probabilities: 0.000 0.018 0.263 0.335 0.383 
  left son=14 (36 obs) right son=15 (131 obs)
  Primary splits:
      save_act    splits as  LR,       improve=20.568730, (0 missing)
      age         < 62.5 to the left,  improve=10.528110, (0 missing)
      pep         splits as  LR,       improve= 6.160460, (0 missing)
      children    splits as  LR,       improve= 3.415047, (0 missing)
      current_act splits as  LR,       improve= 2.966695, (0 missing)

Node number 8: 80 observations,    complexity param=0.01286174
  predicted class=1  expected loss=0.6  P(node) =0.16
    class counts:    32    31    17     0     0
   probabilities: 0.400 0.388 0.213 0.000 0.000 
  left son=16 (53 obs) right son=17 (27 obs)
  Primary splits:
      pep         splits as  LR,       improve=4.625912, (0 missing)
      region      splits as  RLLL,     improve=3.142163, (0 missing)
      married     splits as  LR,       improve=2.899320, (0 missing)
      age         < 19.5 to the left,  improve=2.470894, (0 missing)
      current_act splits as  LR,       improve=1.816017, (0 missing)

Node number 9: 47 observations
  predicted class=3  expected loss=0.4680851  P(node) =0.094
    class counts:     7    15    25     0     0
   probabilities: 0.149 0.319 0.532 0.000 0.000 

Node number 12: 54 observations
  predicted class=3  expected loss=0.3888889  P(node) =0.108
    class counts:     0     6    33    15     0
   probabilities: 0.000 0.111 0.611 0.278 0.000 

Node number 13: 85 observations
  predicted class=4  expected loss=0.5647059  P(node) =0.17
    class counts:     0    13    29    37     6
   probabilities: 0.000 0.153 0.341 0.435 0.071 

Node number 14: 36 observations,    complexity param=0.01286174
  predicted class=4  expected loss=0.4166667  P(node) =0.072
    class counts:     0     1    14    21     0
   probabilities: 0.000 0.028 0.389 0.583 0.000 
  left son=28 (17 obs) right son=29 (19 obs)
  Primary splits:
      car      splits as  LR,       improve=3.958279, (0 missing)
      children splits as  LR,       improve=3.042364, (0 missing)
      age      < 52.5 to the left,  improve=2.290985, (0 missing)
      sex      splits as  RL,       improve=1.218503, (0 missing)
      pep      splits as  LR,       improve=1.171938, (0 missing)
  Surrogate splits:
      married     splits as  RL,       agree=0.639, adj=0.235, (0 split)
      age         < 56.5 to the left,  agree=0.611, adj=0.176, (0 split)
      sex         splits as  RL,       agree=0.611, adj=0.176, (0 split)
      current_act splits as  RL,       agree=0.611, adj=0.176, (0 split)
      region      splits as  RLRL,     agree=0.583, adj=0.118, (0 split)

Node number 15: 131 observations,    complexity param=0.01768489
  predicted class=5  expected loss=0.5114504  P(node) =0.262
    class counts:     0     2    30    35    64
   probabilities: 0.000 0.015 0.229 0.267 0.489 
  left son=30 (62 obs) right son=31 (69 obs)
  Primary splits:
      pep         splits as  LR,       improve=8.469909, (0 missing)
      age         < 61.5 to the left,  improve=8.012590, (0 missing)
      current_act splits as  LR,       improve=3.684888, (0 missing)
      region      splits as  LLRL,     improve=2.561434, (0 missing)
      mortgage    splits as  RL,       improve=2.195836, (0 missing)
  Surrogate splits:
      children splits as  LR,       agree=0.733, adj=0.435, (0 split)
      mortgage splits as  RL,       agree=0.618, adj=0.194, (0 split)
      married  splits as  RL,       agree=0.580, adj=0.113, (0 split)
      age      < 61.5 to the left,  agree=0.557, adj=0.065, (0 split)
      region   splits as  LRRR,     agree=0.550, adj=0.048, (0 split)

Node number 16: 53 observations,    complexity param=0.01125402
  predicted class=1  expected loss=0.5283019  P(node) =0.106
    class counts:    25    22     6     0     0
   probabilities: 0.472 0.415 0.113 0.000 0.000 
  left son=32 (19 obs) right son=33 (34 obs)
  Primary splits:
      children    splits as  LR,       improve=3.6638990, (0 missing)
      current_act splits as  LR,       improve=2.6320070, (0 missing)
      age         < 21.5 to the left,  improve=2.5513420, (0 missing)
      region      splits as  RRLL,     improve=2.1774120, (0 missing)
      mortgage    splits as  RL,       improve=0.9724508, (0 missing)
  Surrogate splits:
      age < 24.5 to the right, agree=0.66, adj=0.053, (0 split)

Node number 17: 27 observations,    complexity param=0.01286174
  predicted class=3  expected loss=0.5925926  P(node) =0.054
    class counts:     7     9    11     0     0
   probabilities: 0.259 0.333 0.407 0.000 0.000 
  left son=34 (10 obs) right son=35 (17 obs)
  Primary splits:
      married     splits as  LR,       improve=4.075579, (0 missing)
      region      splits as  RLRR,     improve=3.389759, (0 missing)
      current_act splits as  RL,       improve=2.009758, (0 missing)
      mortgage    splits as  LR,       improve=1.968611, (0 missing)
      age         < 24.5 to the left,  improve=1.913873, (0 missing)
  Surrogate splits:
      region   splits as  RRRL, agree=0.667, adj=0.1, (0 split)
      children splits as  LR,   agree=0.667, adj=0.1, (0 split)

Node number 28: 17 observations
  predicted class=3  expected loss=0.4117647  P(node) =0.034
    class counts:     0     1    10     6     0
   probabilities: 0.000 0.059 0.588 0.353 0.000 

Node number 29: 19 observations
  predicted class=4  expected loss=0.2105263  P(node) =0.038
    class counts:     0     0     4    15     0
   probabilities: 0.000 0.000 0.211 0.789 0.000 

Node number 30: 62 observations,    complexity param=0.01768489
  predicted class=3  expected loss=0.6451613  P(node) =0.124
    class counts:     0     2    22    17    21
   probabilities: 0.000 0.032 0.355 0.274 0.339 
  left son=60 (21 obs) right son=61 (41 obs)
  Primary splits:
      children    splits as  RL,       improve=11.301300, (0 missing)
      age         < 53   to the left,  improve= 4.503433, (0 missing)
      current_act splits as  LR,       improve= 4.345984, (0 missing)
      region      splits as  RLRL,     improve= 2.877549, (0 missing)
      car         splits as  RL,       improve= 2.330261, (0 missing)

Node number 31: 69 observations
  predicted class=5  expected loss=0.3768116  P(node) =0.138
    class counts:     0     0     8    18    43
   probabilities: 0.000 0.000 0.116 0.261 0.623 

Node number 32: 19 observations,    complexity param=0.01125402
  predicted class=2  expected loss=0.4210526  P(node) =0.038
    class counts:     8    11     0     0     0
   probabilities: 0.421 0.579 0.000 0.000 0.000 
  left son=64 (4 obs) right son=65 (15 obs)
  Primary splits:
      region      splits as  RRRL,     improve=4.2332330, (0 missing)
      age         < 21.5 to the left,  improve=1.9960510, (0 missing)
      sex         splits as  RL,       improve=0.8541775, (0 missing)
      current_act splits as  LR,       improve=0.4374060, (0 missing)
      car         splits as  LR,       improve=0.1404615, (0 missing)

Node number 33: 34 observations
  predicted class=1  expected loss=0.5  P(node) =0.068
    class counts:    17    11     6     0     0
   probabilities: 0.500 0.324 0.176 0.000 0.000 

Node number 34: 10 observations
  predicted class=1  expected loss=0.5  P(node) =0.02
    class counts:     5     4     1     0     0
   probabilities: 0.500 0.400 0.100 0.000 0.000 

Node number 35: 17 observations
  predicted class=3  expected loss=0.4117647  P(node) =0.034
    class counts:     2     5    10     0     0
   probabilities: 0.118 0.294 0.588 0.000 0.000 

Node number 60: 21 observations,    complexity param=0.01286174
  predicted class=3  expected loss=0.4761905  P(node) =0.042
    class counts:     0     1    11     9     0
   probabilities: 0.000 0.048 0.524 0.429 0.000 
  left son=120 (17 obs) right son=121 (4 obs)
  Primary splits:
      age      < 62   to the left,  improve=4.042513, (0 missing)
      region   splits as  RRLR,     improve=4.020326, (0 missing)
      sex      splits as  LR,       improve=1.593345, (0 missing)
      married  splits as  LR,       improve=1.350827, (0 missing)
      mortgage splits as  RL,       improve=1.033341, (0 missing)

Node number 61: 41 observations
  predicted class=5  expected loss=0.4878049  P(node) =0.082
    class counts:     0     1    11     8    21
   probabilities: 0.000 0.024 0.268 0.195 0.512 

Node number 64: 4 observations
  predicted class=1  expected loss=0  P(node) =0.008
    class counts:     4     0     0     0     0
   probabilities: 1.000 0.000 0.000 0.000 0.000 

Node number 65: 15 observations
  predicted class=2  expected loss=0.2666667  P(node) =0.03
    class counts:     4    11     0     0     0
   probabilities: 0.267 0.733 0.000 0.000 0.000 

Node number 120: 17 observations
  predicted class=3  expected loss=0.3529412  P(node) =0.034
    class counts:     0     1    11     5     0
   probabilities: 0.000 0.059 0.647 0.294 0.000 

Node number 121: 4 observations
  predicted class=4  expected loss=0  P(node) =0.008
    class counts:     0     0     0     4     0
   probabilities: 0.000 0.000 0.000 1.000 0.000
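The `improve` values in the summary can be checked by hand. Because the tree was grown with `split="information"`, the figure reported for the root split (age < 37.5, improve=121.5852) should equal the parent's entropy times its size minus the size-weighted entropies of the two children, using natural logarithms. The sketch below uses only the class counts printed for nodes 1, 2 and 3; the helper name `entropy` is an illustrative assumption, and the formula is inferred from the printed numbers rather than taken from the rpart source.

```r
# Shannon entropy (natural log) of a vector of class counts.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Class counts copied from the summary: node 1 and its children (nodes 2 and 3).
root  <- c(45, 77, 189, 119, 70)
left  <- c(45, 55, 83, 11, 0)    # age < 37.5
right <- c(0, 22, 106, 108, 70)  # age >= 37.5

# Improvement = n * H(parent) minus the size-weighted entropies of the children.
improve <- sum(root) * entropy(root) -
  sum(left) * entropy(left) - sum(right) * entropy(right)

print(improve)
```

The result is about 121.585, in line with the first primary split listed for node 1. The same calculation applied to each candidate split is what ranks the entries under "Primary splits".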

 

 

  • Then write code that uses the model to classify the test data
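The post gives no code for this step. One possible approach, sketched here on synthetic data so that it runs on its own, is `predict()` with `type = "class"` followed by a confusion matrix. The data frame `toy`, the rule that generates its `income` column, and the names `fit_toy`, `train` and `test` are all assumptions made for illustration; none of them come from bank-data.csv.

```r
library(rpart)

# Synthetic stand-in for the preprocessed bank data (all values are made up).
set.seed(1)
n <- 300
toy <- data.frame(
  age      = sample(18:67, n, replace = TRUE),
  save_act = factor(sample(c("NO", "YES"), n, replace = TRUE))
)
# Made-up rule so the tree has something to learn: income class rises with age.
toy$income <- factor(ifelse(toy$age < 35, 1, ifelse(toy$age < 50, 2, 3)))

train <- toy[1:250, ]
test  <- toy[251:n, ]

fit_toy <- rpart(income ~ age + save_act, data = train, method = "class",
                 parms = list(split = "information"))

# type = "class" returns one predicted label per test row.
pred <- predict(fit_toy, newdata = test, type = "class")

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf <- table(actual = test$income, predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(accuracy)
```

With the objects built earlier in this post, the corresponding call would presumably be `predict(fit, newdata = test_data, type = "class")`, compared against `test_data$income`.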

 

 

 

 

 
