An Introduction to Statistical Learning: Answers to Chapter 2

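This post uses R to analyze the College and Auto data sets, covering data import, descriptive statistics, scatterplot matrices, boxplots, and histograms. It examines the relationships among applications, enrollment, and tuition, as well as faculty resources and the cost of attendance. For the Auto data set, it reports the range, mean, and standard deviation of each predictor, describes the correlations among the variables (for example between mpg and horsepower or weight), and fits a model for predicting mpg.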

R Language Study Notes

Selected Exercises from Chapter 2 of An Introduction to Statistical Learning


I. Exercise 8

8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are

(a) Use the read.csv() function to read the data into R. Call the loaded data college.

# college <- read.csv("College.csv", header = TRUE, na.strings = "?")
college <- read.csv("E:/大三下学期/机器学习/jiqixuxi/data/College.csv", header = TRUE, na.strings = "?")

(b)Look at the data using the fix() function.

fix(college)  # opens the data in a spreadsheet-like editor
rownames(college) <- college[, 1]  # use the school names as row names
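
The book's version of this step also drops the name column once it has been copied into the row names, so that R does not treat the names as data. A minimal sketch on a made-up three-row data frame (the `toy` object and its values are hypothetical, not taken from College.csv):

```r
# Hypothetical stand-in for the first columns of College.csv
toy <- data.frame(
  X = c("School A", "School B", "School C"),
  Private = c("Yes", "No", "Yes"),
  Apps = c(1660, 2186, 1428)
)

rownames(toy) <- toy[, 1]  # copy the names into the row names
toy <- toy[, -1]           # drop the now-redundant name column

toy["School B", "Apps"]    # rows can now be looked up by school name
```

The code in this post keeps the name column, so the column indices used later assume it is still in position 1.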

(c)
i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.

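Private <- as.factor(college$Private)
college <- data.frame(Private, college[, 3:11])  # Private plus the next nine columns (avoids a duplicate Private column)
# pairs(college, main = "Scatterplot matrix")
pairs(~ Apps + Accept + Enroll + Top10perc + Top25perc + F.Undergrad + P.Undergrad + Outstate, panel = panel.smooth, data = college, main = "Scatterplot matrix")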

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(Private,college$Outstate,ylab="Outstate")
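
Calling plot() with a factor as the first argument is what turns the picture into side-by-side boxplots. The sketch below shows this on `mtcars`, a data set that ships with R and is used here only as a stand-in for the College data; the `am` labels are my own naming.

```r
# mtcars ships with R; it is only a stand-in for the College data here
am <- factor(mtcars$am, labels = c("automatic", "manual"))

# plot() with a factor on the x axis draws one box per level
plot(am, mtcars$mpg, xlab = "Transmission", ylab = "mpg")

# the formula interface gives the same boxes; plot = FALSE returns the statistics only
stats <- boxplot(mtcars$mpg ~ am, plot = FALSE)
stats$n  # number of observations in each group
```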

iv. Create a new qualitative variable, called Elite, by binning the Top10perc variable, then use the summary() function to see how many elite universities there are.

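Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"  # more than 50% of new students from the top 10%
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
summary(college$Elite)  # number of elite vs. non-elite schools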
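
The same binning can be written in one step with ifelse(). A minimal sketch on made-up percentages (the `top10` values are hypothetical, not taken from College.csv). Note that factor labels must match exactly: a label with a stray leading space, such as " Yes", is a different level from "Yes".

```r
# Hypothetical Top10perc values, not taken from College.csv
top10 <- c(23, 60, 75, 38, 51, 50)

# Elite = "Yes" when more than 50% of new students come from the top 10%
elite <- factor(ifelse(top10 > 50, "Yes", "No"))

table(elite)   # counts per level
levels(elite)  # the labels, in alphabetical order
```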

Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

plot(Elite,college$Outstate,ylab="Outstate")

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.


par(mfrow = c(3, 3))
college <- read.csv("E:/大三下学期/机器学习/jiqixuxi/data/College.csv", header = TRUE, na.strings = "?")
name <- colnames(college)  # extract the column names
for (i in 3:19) {  # frequency histograms of the seventeen quantitative variables
  hist(college[, i], col = 2, breaks = 20, xlab = name[i], main = "Frequency histogram")
}
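
The exercise asks for differing numbers of bins, which can be done by varying the breaks argument. A sketch on simulated data (the vector `x` is made up for illustration and is not a College variable):

```r
set.seed(1)
x <- rnorm(500, mean = 10000, sd = 3000)  # simulated stand-in for a variable such as Outstate

par(mfrow = c(2, 2))
for (b in c(5, 10, 20, 40)) {
  # breaks = b is only a suggestion; hist() rounds to "pretty" cut points
  hist(x, breaks = b, col = 2, xlab = "x", main = paste("breaks =", b))
}
par(mfrow = c(1, 1))

h <- hist(x, breaks = 20, plot = FALSE)
length(h$counts)  # the number of bins actually used
```

Because a single number passed to breaks is treated as a suggestion, the bin count actually drawn can differ from the value requested.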

vi. Continue exploring the data, and provide a brief summary of what you discover.
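1. Students: the numbers of applications, acceptances, enrollments, and full-time undergraduates are positively and roughly linearly correlated. The percentage of new students from the top 10% of their high school class and the percentage from the top 25% are related nonlinearly. At a typical school only about 50% of new students come from the top 25% of their class.

2. Faculty: the percentage of faculty with a PhD is left-skewed. At most schools it is around 80%, and at a few it is below 50%. The student/faculty ratio at most schools is about 15, which suggests adequate staffing.

3. Costs: room and board is about 4,300, books about 500, and personal spending about 1,300, so room and board is the largest of these three items. Out-of-state tuition at private schools is far higher than at public schools and also somewhat more variable; only a handful of public schools charge more than private ones. Elite schools (more than 50% of new students from the top 10% of their class) charge higher out-of-state tuition than non-elite schools.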

II. Exercise 9

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

Auto <- read.table("E:/大三下学期/机器学习/jiqixuxi/data/Auto.data", header = TRUE, na.strings = "?")
fix(Auto)
Auto <- na.omit(Auto)  # remove the rows that contain missing values
# dim(Auto)
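
The na.strings = "?" argument matters because the raw file marks missing values with a question mark; without it the affected column would be read as text. A minimal sketch on three made-up rows (the `raw` string is hypothetical, written in the same whitespace-separated layout):

```r
# Three made-up rows in a whitespace-separated layout
raw <- "mpg horsepower
18 130
25 ?
30 95"

demo <- read.table(text = raw, header = TRUE, na.strings = "?")
is.na(demo$horsepower)  # the "?" entry was read as NA

clean <- na.omit(demo)  # drops every row that contains an NA
dim(clean)
```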

(a) Which of the predictors are quantitative, and which are qualitative?

names(Auto)
summary(Auto)

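cylinders and origin are treated as qualitative here (name, a text label, is qualitative as well); all other variables are quantitative.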
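
One rough way to spot candidates for qualitative variables is to count the distinct values in each column, since a numeric column with only a handful of values often encodes categories. A sketch on `mtcars`, which ships with R and is used here only as a stand-in for Auto; the cutoff of 3 is an arbitrary choice for illustration:

```r
# mtcars ships with R and is used only as a stand-in for Auto
n_unique <- sapply(mtcars, function(col) length(unique(col)))
sort(n_unique)

# columns with very few distinct values are candidates for factors
names(n_unique)[n_unique <= 3]
```

This is a heuristic, not a rule: whether a variable such as cylinders is best treated as a number or a category depends on how it will be used.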

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

(c) What is the mean and standard deviation of each quantitative predictor?

len <- matrix(0, 8, 4)
for (l in 1:8) {
  len[l, 1:2] <- range(Auto[, l])  # range of the variable (min, max)
  len[l, 3] <- sd(Auto[, l])       # standard deviation
  len[l, 4] <- mean(Auto[, l])     # mean
}

len <- data.frame(names(Auto)[1:8], len)  # attach the variable names, keeping the statistics numeric
names(len) <- c("variable", "min", "max", "sd", "mean")
len
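
The same table can be built without an explicit loop by applying one summary function to every column. A sketch on `mtcars` (bundled with R, used as a stand-in so the code runs without the Auto file; the column selection is my own):

```r
# mtcars ships with R; it stands in for Auto so the sketch runs anywhere
quant <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# one row per variable, one column per statistic
stats <- t(sapply(quant, function(x) {
  c(min = min(x), max = max(x), sd = sd(x), mean = mean(x))
}))
round(stats, 2)
```

On the real data, replacing `quant` with `Auto[, 1:8]` should give the table for parts (b) and (c).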

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Auto2 <- Auto[-(10:85), ]  # drop observations 10 through 85, not just rows 10 and 85
dim(Auto2)
len2 <- matrix(0, 8, 4)
for (l in 1:8) {
  len2[l, 1:2] <- range(Auto2[, l])
  len2[l, 3] <- sd(Auto2[, l])
  len2[l, 4] <- mean(Auto2[, l])
}
len2 <- data.frame(names(Auto2)[1:8], len2)
names(len2) <- c("variable", "min", "max", "sd", "mean")
len2
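
Part (d) asks for observations 10 through 85 to be removed, which is a range of 76 rows; a subscript of `-c(10, 85)` would remove only two. The difference is easy to check on a plain vector:

```r
x <- 1:100

length(x[-c(10, 85)])  # drops only elements 10 and 85
length(x[-(10:85)])    # drops elements 10 through 85

# -10:85 is something else again: the sequence from -10 up to 85,
# which mixes negative and positive subscripts and is an error
res <- try(x[-10:85], silent = TRUE)
inherits(res, "try-error")
```

The parentheses in `-(10:85)` are what make the minus sign apply to the whole range.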

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

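pairs(Auto[, 1:8], main = "Auto's matrix scatter plot")  # scatterplot matrix for a rough view of the correlations
# pairs(Auto[, 1:7], main = "Auto's matrix scatter plot")

displacement is positively and roughly linearly related to horsepower and weight, and negatively related to acceleration;
horsepower is positively and roughly linearly related to weight, and negatively related to acceleration and year;
mpg is negatively related to displacement, horsepower, and weight, and positively related to acceleration and year.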
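
Impressions read off a scatterplot matrix can be cross-checked against the correlation matrix. The sketch below uses `mtcars` as a stand-in (it ships with R and has similar variables); on the real data the equivalent call would be `round(cor(Auto[, 1:8]), 2)`.

```r
# mtcars is a stand-in for Auto; on the real data: round(cor(Auto[, 1:8]), 2)
vars <- c("mpg", "disp", "hp", "wt", "qsec")
cm <- round(cor(mtcars[, vars]), 2)
cm

# sign of each variable's correlation with mpg
sign(cm["mpg", -1])
```

A correlation coefficient only measures linear association, so curved relationships such as mpg against horsepower still need the plot.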

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

origin <- as.factor(Auto$origin)
Auto3 <- data.frame(Auto[, 1:7], origin)
library(GGally)
library(ggplot2)
ggpairs(Auto3, mapping = aes(color = origin), columns = 1:8) + ggtitle("Matrix scatter plot - Auto") + theme_bw()
Auto4 <- data.frame(Auto[, 1:7], origin)
name4 <- names(Auto4)
name4
fm <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year, Auto4)  # full model
summary(fm)
lm.step <- step(fm, direction = "backward")  # backward selection by AIC
# lm.step2 <- step(fm, direction = "both")
fm <- lm(mpg ~ weight + year, Auto4)  # refit the selected model
fm

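mpg is negatively related to weight and positively related to year.
The smallest AIC reached during variable selection is 968.66; the regression coefficients of weight and year are -0.006632 and 0.757318.
The fitted regression equation is mpg = -14.347253 - 0.006632 * weight + 0.757318 * year.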
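
To see what the fitted equation implies, it can be wrapped in a small function and evaluated by hand. The function name `predict_mpg` and the example car are hypothetical; the coefficients are the ones reported above.

```r
# Coefficients as reported above for mpg ~ weight + year
predict_mpg <- function(weight, year) {
  -14.347253 - 0.006632 * weight + 0.757318 * year
}

# Hypothetical car: 3000 lb, model year 76
predict_mpg(3000, 76)

# each extra model year adds the year coefficient, holding weight fixed
predict_mpg(3000, 77) - predict_mpg(3000, 76)
```

With the fitted model object available, `predict(fm, data.frame(weight = 3000, year = 76))` should give the same value up to the rounding of the printed coefficients.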

Summary

The above reflects my own understanding. My ability is limited, so mistakes are inevitable, and corrections are welcome.
