R Language Study Notes
Selected exercises from Chapter 2 of An Introduction to Statistical Learning
I. Exercise 8
8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are
(a) Use the read.csv() function to read the data into R. Call the loaded data college.
#college=read.csv("College.csv",header=T,na.strings="?")
college=read.csv("E:/大三下学期/机器学习/jiqixuxi/data/College.csv",header=T,na.strings="?")
(b) Look at the data using the fix() function.
fix(college)
rownames (college)=college [,1]
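The row-naming step can be tried on a small made-up data frame first. This is only a sketch: the data frame toy and its values are invented for illustration, and dropping the name column afterwards is an optional extra step that the code above does not perform (doing so would shift the column indices used later in these notes by one).

```r
# Hypothetical toy data standing in for College.csv (values are made up)
toy <- data.frame(X = c("College A", "College B", "College C"),
                  Private = c("Yes", "No", "Yes"),
                  Apps = c(1660, 2186, 1428))
rownames(toy) <- toy[, 1]  # use the first column (school names) as row names
toy <- toy[, -1]           # optional: drop the now-redundant name column
toy["College B", "Apps"]   # rows can now be looked up by name
```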
(c)
i. Use the summary() function to produce a numerical summary of the variables in the data set.
summary(college)
ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.
Private=as.factor(college$Private)
college=data.frame(Private,college[,3:11]) # Private plus the next nine columns, without duplicating Private
#pairs(college,main="Scatterplot matrix")
pairs(~Apps+Accept+Enroll+Top10perc+Top25perc+F.Undergrad+P.Undergrad+Outstate,panel=panel.smooth,data=college,main="Scatterplot matrix")
iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.
plot(Private,college$Outstate,ylab="Outstate")
iv. Use the summary() function to see how many elite universities there are.
Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
Elite=rep("No",nrow(college ))
Elite[college$Top10perc>50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college , Elite)
summary(college)
plot(Elite,college$Outstate,ylab="Outstate")
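Counting the elite schools can also be checked on a handful of numbers. The sketch below assumes a made-up vector top10 in place of college$Top10perc and uses ifelse() as a compact alternative to the rep()/index assignment above; it is not the code used in these notes.

```r
# Hypothetical Top10perc values (made up for illustration)
top10 <- c(23, 16, 60, 89, 38, 51, 50)
elite <- factor(ifelse(top10 > 50, "Yes", "No"))  # same rule as above: more than 50%
table(elite)    # counts per level
summary(elite)  # summary() of a factor tabulates its levels, giving the same counts
```

Note that a value of exactly 50 is labelled "No", because the rule is strictly greater than 50.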
v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.
par(mfrow=c(3,3))
college=read.csv("E:/大三下学期/机器学习/jiqixuxi/data/College.csv",header=T,na.strings="?") # reload all columns
name=colnames(college) # extract the column names
for(i in 3:19){ # frequency histograms of the 17 quantitative variables
  hist(college[,i],col=2,breaks=20,xlab=name[i],main="Frequency histogram")
}
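The exercise asks for differing numbers of bins, while the loop above uses breaks=20 throughout. A minimal sketch of varying the bin count, assuming simulated data x as a stand-in for one quantitative column such as Outstate (the numbers are invented):

```r
set.seed(1)                               # reproducible made-up data
x <- rnorm(200, mean = 10000, sd = 3000)  # hypothetical stand-in for one variable
par(mfrow = c(1, 3))                      # three panels side by side
for (b in c(5, 10, 20)) {
  # a single number for breaks is only a suggestion; hist() picks "pretty" break points
  hist(x, breaks = b, col = 2, xlab = "x", main = paste("breaks =", b))
}
par(mfrow = c(1, 1))                      # restore the default layout
```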
vi. Continue exploring the data, and provide a brief summary of what you discover.
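1. Students: the numbers of applications, acceptances, enrolled students and undergraduates are positively and roughly linearly correlated. The percentages of new students from the top 10% and from the top 25% of their high school class are related nonlinearly. Many students did not rank in the top 25% of their high school class: top-25% students make up only about 50% of the intake at a typical school.
2. Faculty: the percentage of faculty with a PhD is left-skewed across schools. At most schools about 80% of faculty hold a PhD, while at a few schools the share is below 50%. The student/faculty ratio is around 15 students per faculty member at most schools (a ratio, not a percentage), which points to good staffing.
3. Costs: room and board is about 4300, books about 500 and personal spending about 1300, so room and board is the largest of these three costs. Out-of-state tuition at private schools is far higher than at public schools and varies somewhat more; only a very small number of public schools charge more than typical private schools. Elite schools (more than 50% of new students from the top 10% of their class) charge higher out-of-state tuition than non-elite schools.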
II. Exercise 9
9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Auto=read.table ("E:/大三下学期/机器学习/jiqixuxi/data/Auto.data", header =T,na.strings ="?")
fix(Auto)
Auto=na.omit(Auto)
#dim(Auto)
(a) Which of the predictors are quantitative, and which are qualitative?
names(Auto)
summary(Auto)
cylinders and origin are qualitative variables (name, a text label, is qualitative as well); all other variables are quantitative.
(b) What is the range of each quantitative predictor? You can answer this using the range() function.
(c) What is the mean and standard deviation of each quantitative predictor?
len=matrix(0,8,4)
for(l in 1:8){
  len[l,1:2]=range(Auto[,l]) # range (min, max) of the variable
  len[l,3]=sd(Auto[,l]) # standard deviation
  len[l,4]=mean(Auto[,l]) # mean
}
len=data.frame(names(Auto)[1:8],len) # combine names and statistics, keeping the statistics numeric
names(len)=c("variable","min","max","sd","mean")
len
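The same summary table can be built without an explicit loop. The following is a sketch using sapply() on a made-up data frame toy (the values are invented and only stand in for the quantitative columns of Auto); it is an alternative to the loop above, not a replacement the notes depend on.

```r
# Hypothetical toy data (made-up values) standing in for columns of Auto
toy <- data.frame(mpg = c(18, 15, 36, 26), weight = c(3504, 3693, 2130, 2264))
# For each column compute min, max, sd and mean, then transpose so each row is a variable
stats <- t(sapply(toy, function(v) c(min = min(v), max = max(v), sd = sd(v), mean = mean(v))))
stats
```

Because the result is a numeric matrix, the statistics can be rounded or compared directly, which is harder once they have been combined with the variable names as text.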
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
Auto2=Auto[-(10:85),] # drop observations 10 through 85 (76 rows), not just rows 10 and 85
dim(Auto2)
len2=matrix(0,8,4)
for(l in 1:8){
  len2[l,1:2]=range(Auto2[,l]) # range (min, max) of the variable
  len2[l,3]=sd(Auto2[,l]) # standard deviation
  len2[l,4]=mean(Auto2[,l]) # mean
}
len2=data.frame(names(Auto2)[1:8],len2) # combine names and statistics
names(len2)=c("variable","min","max","sd","mean")
len2
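Part (d) asks to remove the 10th through 85th observations, so the negative index has to cover the whole range 10:85. A small sketch on a hypothetical 100-row data frame shows how -c(10,85) and -(10:85) differ:

```r
toy <- data.frame(id = 1:100)                 # hypothetical data frame with 100 rows
n_two   <- nrow(toy[-c(10, 85), , drop = FALSE])  # drops only rows 10 and 85
n_range <- nrow(toy[-(10:85), , drop = FALSE])    # drops rows 10 through 85
c(n_two, n_range)
```

The parentheses in -(10:85) matter: -10:85 would be read as the sequence from -10 to 85, which mixes negative and positive indices and is an error.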
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
pairs(Auto[,1:8],main="Auto's matrix scatter plot") # scatterplot matrix for a rough view of the correlations
#pairs(Auto[,1:7],main="Auto's matrix scatter plot")
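displacement is positively and roughly linearly correlated with horsepower and weight, and negatively correlated with acceleration;
horsepower is positively and roughly linearly correlated with weight, and negatively correlated with acceleration and year;
mpg is negatively correlated with horsepower and weight, and positively correlated with acceleration.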
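Directions read off a scatterplot matrix are easy to mix up, so it can help to check the signs numerically with cor(). The sketch below uses mtcars, a data set that ships with R, purely as a stand-in; it is not the Auto data, and its column names (disp, hp, wt, qsec) differ from those in Auto.

```r
# mtcars ships with R; it is used here only as a stand-in for Auto
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
cm <- round(cor(vars), 2)  # pairwise Pearson correlations; the sign gives the direction
cm
```

The same call on the quantitative columns of Auto, for example cor(Auto[,1:7]), would give the corresponding table for this exercise.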
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
library(ggplot2)
library(GGally)
origin=as.factor(Auto$origin) # treat origin as a qualitative variable
Auto3=data.frame(Auto[,1:7],origin)
ggpairs(Auto3,columns=1:8,mapping=aes(color=origin))+ggtitle("Matrix scatter plot - Auto")+theme_bw()
Auto4=Auto3 # same data, kept under the name used below
names(Auto4)
fm=lm(mpg~cylinders+displacement+horsepower+weight+acceleration+year,Auto4) # full model
summary(fm)
lm.step=step(fm,direction='backward') # backward selection by AIC
#lm.step2=step(fm,direction='both')
fm=lm(mpg~weight+year,Auto4) # model kept after selection
fm
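mpg is negatively correlated with weight and positively correlated with year.
The smallest AIC reached during variable selection is 968.66; the regression coefficients of weight and year are -0.006632 and 0.757318.
The regression equation is mpg = -14.347253 - 0.006632*weight + 0.757318*year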
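To see what the fitted equation implies, it can be evaluated by hand for one car. This is a sketch only: the coefficients are copied from the equation above, while the car (weight 3000, model year 76) is a hypothetical example chosen for illustration.

```r
# Coefficients copied from the fitted equation above
b0 <- -14.347253
b_weight <- -0.006632
b_year <- 0.757318
# Hypothetical car: weight 3000, model year 76 (values chosen only for illustration)
mpg_hat <- b0 + b_weight * 3000 + b_year * 76
mpg_hat
```

With the fitted model object available, predict(fm, data.frame(weight=3000, year=76)) would perform the same calculation using the unrounded coefficients.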
Summary
The above are my personal views. My ability is limited, so mistakes are inevitable; corrections are welcome.