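The regression tree procedure:
Example: there are 10 students, with heights distributed as follows:
R1:
Girls (7): 156, 167, 165, 163, 160, 170, 160
R2:
Boys (3): 172, 180, 176
The mean of the samples falling in R1 is 163, and the mean of the samples falling in R2 is 176. For a new sample, the tree model therefore predicts a height of 163 if the student is a girl and 176 if the student is a boy.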
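The arithmetic above can be checked with a few lines of base R. This is only a minimal sketch: the variable names and the `predict_height` helper are hypothetical and introduced purely for illustration.

```r
# Heights from the example, labelled by the region they fall in
heights <- c(156, 167, 165, 163, 160, 170, 160, 172, 180, 176)
region  <- c(rep("R1", 7), rep("R2", 3))   # R1 = girls, R2 = boys

# The prediction for a region is the mean of the training samples in it
region_means <- tapply(heights, region, mean)
print(region_means)

# Hypothetical helper: look up the prediction for new students by region
predict_height <- function(new_region) as.numeric(region_means[new_region])
predict_height(c("R1", "R2"))
```

A fitted regression tree does the same thing at prediction time: it routes the new sample to a leaf and returns that leaf's training mean.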
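So how are the regions R1 and R2 found (that is, how is the tree model built)?
This requires a top-down greedy algorithm, recursive binary splitting: starting from the root node, it splits step by step downward, and each split produces two branches (a binary split).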
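A minimal sketch of recursive binary splitting, assuming a single numeric predictor and the usual regression-tree criterion of minimizing the residual sum of squares (RSS) of the two child regions. The helpers `rss`, `best_split`, and `grow` are hypothetical names for illustration only, not the internals of any package, and the 0/1 gender coding is an assumption added to the height example.

```r
# Residual sum of squares of a set of responses around their mean
rss <- function(y) sum((y - mean(y))^2)

# Scan every candidate cut point of x and keep the one that minimizes
# RSS(left) + RSS(right); returns NULL when x has a single distinct value
best_split <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)
  cuts  <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  total <- sapply(cuts, function(cut) rss(y[x < cut]) + rss(y[x >= cut]))
  list(cut = cuts[which.min(total)], rss = min(total))
}

# Greedy recursion: split, then split each child, until max_depth is
# reached or no split is possible; leaves store the mean response
grow <- function(x, y, depth = 0, max_depth = 2) {
  s <- if (depth < max_depth) best_split(x, y) else NULL
  if (is.null(s)) return(list(prediction = mean(y)))
  left <- x < s$cut
  list(cut   = s$cut,
       left  = grow(x[left],  y[left],  depth + 1, max_depth),
       right = grow(x[!left], y[!left], depth + 1, max_depth))
}

# Height example with an assumed coding: 0 = girl, 1 = boy
gender  <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
heights <- c(156, 167, 165, 163, 160, 170, 160, 172, 180, 176)
tree <- grow(gender, heights)
str(tree)
```

The algorithm is greedy because each cut is chosen as the best one at that step, without checking whether a different cut would lead to a better tree further down. Real implementations also search across all predictors at each node and apply further stopping rules, such as a minimum node size.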
R packages that can build regression trees: party (through its ctree function), rpart, and tree
> library(rpart)
> library(tree)
Error in library(tree) : there is no package called ‘tree’
> install.packages("tree")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/tree_1.0-37.zip'
Content type 'application/zip' length 122090 bytes (119 KB)
downloaded 119 KB
package ‘tree’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\LLJiang\AppData\Local\Temp\RtmpmMgvpx\downloaded_packages
> library(tree)
Warning message:
package ‘tree’ was built under R version 3.4.3
> dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> dat=subset(dat,store_exp>0&online_exp>0)
> trainx=dat[,grep("Q",names(dat))]
> trainy=dat$store_exp+dat$online_exp
> set.seed(100)
> library(caret)
> rpartTrue=train(trainx,trainy,method="rpart2",tuneLength=10,trControl=trainControl(method="cv"))
> plot(rpartTrue)
>
As the plot above shows, once the maximum tree depth exceeds 2 the RMSE no longer changes, so here we build the tree with a depth of 2.
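For reference, the kind of search that caret's "rpart2" method runs (cross-validated RMSE for each candidate maxdepth) can be approximated by hand. The sketch below is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in so that it runs without downloading SegData.csv, the depth grid 1:5 is arbitrary, and `cv_rmse` is a hypothetical helper, not a caret function.

```r
library(rpart)   # recommended package, included with standard R installations

set.seed(100)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold labels

# Mean held-out RMSE over the k folds for one value of maxdepth
cv_rmse <- function(depth) {
  fold_rmse <- sapply(1:k, function(i) {
    fit  <- rpart(mpg ~ ., data = mtcars[folds != i, ], maxdepth = depth)
    pred <- predict(fit, newdata = mtcars[folds == i, ])
    sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
  })
  mean(fold_rmse)
}

depths  <- 1:5
results <- data.frame(maxdepth = depths, RMSE = sapply(depths, cv_rmse))
print(results)
```

When the RMSE column stops changing beyond some depth, the deeper trees add no further splits, which is why the shallowest depth on the plateau is chosen.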
> rpartTrue=rpart(trainy~.,data=trainx,maxdepth=2)
> print(rpartTrue)
n= 999
node), split, n, deviance, yval
      * denotes terminal node
1) root 999 15812720000 3479.113
  2) Q3< 3.5 799 2373688000 1818.720
    4) Q5< 1.5 250 3534392 705.193 *
    5) Q5><