install.packages("tree")
library(tree)
Pollute <- read.table("c:\\temp\\Pollute.txt",header=T)
attach(Pollute)
names(Pollute)
model <- tree(Pollution ~ ., data = Pollute)
plot(model)
text(model)
The model is fitted using binary recursive partitioning: the data are successively split along the coordinate axes of the explanatory variables so that, at each node, the split chosen is the one that maximally distinguishes the response variable in the left and right branches. Splitting continues until the nodes are pure or the data are too sparse (fewer than six cases, by default; see Breiman et al., 1984).
At each split, every explanatory variable is assessed in turn, and the variable explaining the greatest amount of the deviance in y is selected. Deviance is calculated on the basis of a threshold in the explanatory variable; the threshold produces two mean values for the response, one for the cases above the threshold and one for the cases below it.
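The split criterion just described can be sketched in a few lines of R. This is an illustrative re-implementation, not the code the tree package actually uses internally; the function names split.deviance and best.split, and the rule of taking candidate thresholds at the midpoints between successive ordered x values, are my own choices for the sketch.

```r
# Candidate thresholds: midpoints between successive ordered values of x
mids <- function(x) {
    xs <- sort(unique(x))
    (xs[-1] + xs[-length(xs)]) / 2
}

# Deviance of a candidate split: sum of squared deviations of y about
# the two branch means (one mean below the threshold, one above it)
split.deviance <- function(x, y, threshold) {
    low <- x < threshold
    sum((y[low] - mean(y[low]))^2) + sum((y[!low] - mean(y[!low]))^2)
}

# Search all candidate thresholds and return the one minimizing deviance
best.split <- function(x, y) {
    cand <- mids(x)
    devs <- sapply(cand, function(th) split.deviance(x, y, th))
    c(threshold = cand[which.min(devs)], deviance = min(devs))
}
```

Applied to the air pollution data as best.split(Industry, Pollution), a search of this kind is what produces the Industry threshold reported at the first split of the tree.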
low <- (Industry < 748)
tapply(Pollution, low, mean)
plot(Industry, Pollution, pch = 16)
abline(v = 748, lty = 2)
lines(c(0, 748), c(24.92, 24.92))            # mean Pollution below the threshold
lines(c(748, max(Industry)), c(67, 67))      # mean Pollution above the threshold
model <- tree(Pollution ~ . , Pollute)
print(model)
par(mfrow=c(1,2))
library(rpart)
model <- rpart(Pollution ~ ., data = Pollute)
plot(model)
text(model)
library(tree)
model <- tree(Pollution ~ ., data = Pollute)
plot(model)
text(model)
# Recreate the first two splits of the tree as two-level factors
t2 <- factor(Temp >= 56.25)
i2 <- factor(Industry < 597)
model <- lm(Pollution ~ t2 * i2)
summary(model)
In summary, I prefer the tree function for data inspection, because it shows more detail about the potential
interaction structure in the dataframe. On the other hand, rpart is much better at anticipating the results of
model simplification. I recommend you use them both, and get the benefit of two perspectives on your data
set, before embarking on the time-consuming business of carrying out a comprehensive multiple regression
exercise.