Decision Trees

1. Building a decision tree with the party package

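In decision analysis, a decision tree is a method that, given known probabilities for the possible outcomes, builds a tree to compute the probability that the expected net present value is greater than or equal to zero, and uses that to assess project risk and feasibility. It is a graphical way of applying probability analysis directly.
The name comes from the way the decision branches, drawn as a diagram, resemble the limbs of a tree. In machine learning, a decision tree is a predictive model that represents a mapping from object attributes to object values. Entropy measures how disordered a system is; the tree-growing algorithms ID3, C4.5, and C5.0 use entropy-based criteria to choose splits, and the measure comes from the concept of entropy in information theory.
A decision tree is a tree structure in which each internal node tests one attribute, each branch represents one outcome of that test, and each leaf node represents a class. Classification trees are a very commonly used classification method and a form of supervised learning: given a set of samples, each with a set of attributes and a class label fixed in advance, the learner produces a classifier that can assign a class to previously unseen objects.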
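
The entropy measure mentioned above can be sketched in a few lines of base R. This is only an illustration of the ID3-style idea, not the code of any particular package; the helper names entropy and info_gain are made up here, and the threshold 1.9 is borrowed from the ctree output later in this post. Note that ctree itself builds a conditional inference tree, which chooses splits with statistical tests rather than entropy.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes so log2(0) never appears
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a logical condition
# (hypothetical helper, for illustration only)
info_gain <- function(labels, condition) {
  n <- length(labels)
  left <- labels[condition]
  right <- labels[!condition]
  entropy(labels) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# iris has three equally sized classes, so the root entropy is log2(3)
h_root <- entropy(iris$Species)

# Splitting at Petal.Length <= 1.9 isolates setosa completely
gain <- info_gain(iris$Species, iris$Petal.Length <= 1.9)

print(h_root)
print(gain)
```

A split with a larger gain leaves purer child nodes, which is why entropy-based algorithms pick the attribute and threshold that maximize it.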

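Before modeling, split the iris dataset into two subsets: about 70% of the rows for training and the remaining 30% for testing. The random seed is set to a fixed value so that the results are reproducible.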

> str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> set.seed(1234)
> ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
> trainData <- iris[ind ==1,]
> testData <- iris[ind==2,]
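
Because sample(2, ..., prob = c(0.7, 0.3)) assigns each row independently, the split is only approximately 70/30 (in this run the training set has 112 rows, not 105). If an exact split is wanted, one possible alternative is to sample row indices without replacement. The names train_idx, train_exact, and test_exact below are made up for this sketch.

```r
set.seed(1234)
n <- nrow(iris)

# Draw exactly 70% of the row indices without replacement
train_idx <- sample(n, size = round(0.7 * n))

train_exact <- iris[train_idx, ]
test_exact  <- iris[-train_idx, ]

print(c(train = nrow(train_exact), test = nrow(test_exact)))
```

Either approach works; the exact split only makes the subset sizes predictable.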

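Load the party package, build a decision tree, and then check its predictions. myFormula specifies Species as the target variable and all other variables as predictors.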

> library(party)
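Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Warning messages:
1: package ‘party’ was built under R version 3.3.3 
2: package ‘mvtnorm’ was built under R version 3.3.3 
3: package ‘modeltools’ was built under R version 3.3.2 
4: package ‘strucchange’ was built under R version 3.3.3 
5: package ‘zoo’ was built under R version 3.3.3 
6: package ‘sandwich’ was built under R version 3.3.3 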
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> iris_ctree <- ctree(myFormula, data = trainData)
> iris_ctree

	 Conditional inference tree with 4 terminal nodes

Response:  Species 
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width 
Number of observations:  112 

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)*  weights = 40 
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)*  weights = 21 
    4) Petal.Length > 4.4
      6)*  weights = 19 
  3) Petal.Width > 1.7
    7)*  weights = 32 
> # check the prediction
> table(predict(iris_ctree), trainData$Species)
            
             setosa versicolor virginica
  setosa         40          0         0
  versicolor      0         37         3
  virginica       0          1        31
> plot(iris_ctree)

> plot(iris_ctree, type = "simple")

> # predict on test data
> testPred <- predict(iris_ctree, newdata = testData)
> table(testPred, testData$Species)
            
testPred     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         12         2
  virginica       0          0        14
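
To turn the test-set table above into summary numbers, a small base R sketch follows. The counts are typed in by hand from that table (rows are predictions, columns are true species); with the live objects, conf could instead be table(testPred, testData$Species). The variable names are illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Confusion matrix copied from the test-set table: rows = predicted, columns = actual
conf <- matrix(c(10,  0,  0,
                  0, 12,  2,
                  0,  0, 14),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = classes, actual = classes))

# Overall accuracy: correctly classified rows over all rows
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class recall: correct predictions over the true count of each species
recall <- diag(conf) / colSums(conf)

print(accuracy)
print(recall)
```

Here 36 of the 38 test rows are classified correctly; the two errors are virginica flowers predicted as versicolor.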

2. Building a decision tree with the rpart package

> # the bodyfat dataset ships with the TH.data package (note the dot in the name)
> data("bodyfat", package = "TH.data")
> dim(bodyfat)
[1] 71 10
> attributes(bodyfat)
$names
 [1] "age"          "DEXfat"       "waistcirc"    "hipcirc"     
 [5] "elbowbreadth" "kneebreadth"  "anthro3a"     "anthro3b"    
 [9] "anthro3c"     "anthro4"     

$row.names
 [1] "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56" 
[11] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66" 
[21] "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76" 
[31] "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86" 
[41] "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
[51] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106"
[61] "107" "108" "109" "110" "111" "112" "113" "114" "115" "116"
[71] "117"

$class
[1] "data.frame"

> bodyfat[1:5,]
   age DEXfat waistcirc hipcirc elbowbreadth kneebreadth
47  57  41.68     100.0   112.0          7.1         9.4
48  65  43.29      99.5   116.5          6.5         8.9
49  59  35.41      96.0   108.5          6.2         8.9
50  58  22.79      72.0    96.5          6.1         9.2
51  60  36.42      89.5   100.5          7.1        10.0
   anthro3a anthro3b anthro3c anthro4
47     4.42     4.95     4.50    6.13
48     4.63     5.01     4.48    6.37
49     4.12     4.74     4.60    5.82
50     4.03     4.48     3.91    5.66
51     4.24     4.68     4.15    5.91

Split the bodyfat dataset into a training set and a test set, then build a decision tree on the training set.

> set.seed(1234)
> ind <- sample(2, nrow(bodyfat), replace = TRUE, prob = c(0.7, 0.3))
> bodyfat.train <- bodyfat[ind == 1,]
> bodyfat.test <- bodyfat[ind == 2,]
> # train a decision tree
> library(rpart)
Warning message:
package ‘rpart’ was built under R version 3.3.3 
> myFormula <- DEXfat ~age + waistcirc + hipcirc + elbowbreadth + kneebreadth
> bodyfat_rpart <- rpart(myFormula, data = bodyfat.train, control = rpart.control(minsplit = 10))
> plot(bodyfat_rpart)
> bodyfat_rpart
n= 56 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 56 7265.0290000 30.94589  
   2) waistcirc< 88.4 31  960.5381000 22.55645  
     4) hipcirc< 96.25 14  222.2648000 18.41143  
       8) age< 60.5 9   66.8809600 16.19222 *
       9) age>=60.5 5   31.2769200 22.40600 *
     5) hipcirc>=96.25 17  299.6470000 25.97000  
      10) waistcirc< 77.75 6   30.7345500 22.32500 *
      11) waistcirc>=77.75 11  145.7148000 27.95818  
        22) hipcirc< 99.5 3    0.2568667 23.74667 *
        23) hipcirc>=99.5 8   72.2933500 29.53750 *
   3) waistcirc>=88.4 25 1417.1140000 41.34880  
     6) waistcirc< 104.75 18  330.5792000 38.09111  
      12) hipcirc< 109.9 9   68.9996200 34.37556 *
      13) hipcirc>=109.9 9   13.0832000 41.80667 *
     7) waistcirc>=104.75 7  404.3004000 49.72571 *
> text(bodyfat_rpart, use.n = TRUE)
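
The deviance column in the printout is, for a regression tree like this one, the sum of squared deviations of the response from the node mean, and each split is chosen to reduce it as much as possible. The following base R sketch shows the idea on a single predictor. It is a simplified illustration, not rpart's actual implementation; node_deviance, best_split, and the toy data are made up here.

```r
# Sum of squared deviations from the mean: the "deviance" of a regression node
node_deviance <- function(y) sum((y - mean(y))^2)

# Try every midpoint between neighbouring x values and keep the threshold
# that leaves the smallest total deviance in the two child nodes
best_split <- function(x, y) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  child_dev <- sapply(thresholds, function(threshold) {
    node_deviance(y[x < threshold]) + node_deviance(y[x >= threshold])
  })
  list(threshold = thresholds[which.min(child_dev)],
       deviance_before = node_deviance(y),
       deviance_after = min(child_dev))
}

# Toy data, invented for illustration
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 7, 20, 21, 22)

res <- best_split(x, y)
print(res)
```

In the printed tree, a split such as waistcirc< 88.4 plays the role of the threshold, and the deviances of the two child nodes add up to much less than the deviance of their parent.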

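Select the tree with the smallest prediction error: take the row of cptable with the lowest cross-validated error (xerror) and prune the tree at that row's CP value.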

> opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
> cp <- bodyfat_rpart$cptable[opt, "CP"]
> bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
> bodyfat_prune
n= 56 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 56 7265.02900 30.94589  
   2) waistcirc< 88.4 31  960.53810 22.55645  
     4) hipcirc< 96.25 14  222.26480 18.41143  
       8) age< 60.5 9   66.88096 16.19222 *
       9) age>=60.5 5   31.27692 22.40600 *
     5) hipcirc>=96.25 17  299.64700 25.97000  
      10) waistcirc< 77.75 6   30.73455 22.32500 *
      11) waistcirc>=77.75 11  145.71480 27.95818 *
   3) waistcirc>=88.4 25 1417.11400 41.34880  
     6) waistcirc< 104.75 18  330.57920 38.09111  
      12) hipcirc< 109.9 9   68.99962 34.37556 *
      13) hipcirc>=109.9 9   13.08320 41.80667 *
     7) waistcirc>=104.75 7  404.30040 49.72571 *
> plot(bodyfat_prune)
> text(bodyfat_prune, use.n = TRUE)

Use the tree selected above to make predictions, and compare the predictions with the actual values.

> DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat.test)
> xlim <- range(bodyfat$DEXfat)
> plot(DEXfat_pred ~ DEXfat, data = bodyfat.test, xlab = "Observed", ylab = "Predicted", ylim = xlim, xlim = xlim)
> abline(a=0, b= 1)

A good predictive model should produce predictions equal or close to the actual values, so most points should lie on or near the diagonal.
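
Besides looking at the plot, the distance from the diagonal can be summarized with an error metric. Below is a minimal sketch of two common ones, root mean squared error and mean absolute error, in base R. The function names and the four example numbers are invented for illustration and are not results from the bodyfat model.

```r
# Root mean squared error: penalizes large misses more heavily
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Mean absolute error: average size of the miss, in the units of the response
mae <- function(observed, predicted) mean(abs(observed - predicted))

# Made-up values, for illustration only
observed  <- c(20, 30, 40, 50)
predicted <- c(22, 27, 41, 50)

print(rmse(observed, predicted))
print(mae(observed, predicted))
```

With the objects from this section, the same functions could be called as rmse(bodyfat.test$DEXfat, DEXfat_pred); points lying exactly on the diagonal would give a value of zero.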

