An Introduction to Statistical Learning: Chapter 2 Answers

Latest recommended article published 2021-04-25 16:19:53

大媛子's blog

Tags: machine learning, data mining, probability theory, experience sharing

Article link: https://blog.csdn.net/weixin_46333910/article/details/115664984

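This post uses R to analyze the College and Auto data sets, covering data loading, descriptive statistics, scatterplot matrices, boxplots, and histograms. For College, it looks at how applications, enrollment, and tuition relate to one another, and at faculty staffing and the cost of attendance. For Auto, it reports the range, mean, and standard deviation of each predictor, examines correlations among the variables (for example between mpg and horsepower or weight), and fits a linear model for mpg.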

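R Language Study Notes

Selected Exercises from Chapter 2 of An Introduction to Statistical Learning

Contents

R Language Study Notes
I. Exercise 8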
- 8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are
II. Exercise 9
- 9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.
Summary

I. Exercise 8

8. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US. The variables are

(a) Use the read.csv() function to read the data into R. Call the loaded data college.

#college=read.csv("College.csv", header=TRUE, na.strings="?")
college=read.csv("E:/大三下学期/机器学习/jiqixuxi/data/College.csv", header=TRUE, na.strings="?")
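
To see what na.strings="?" does without needing College.csv, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back both ways. The file contents and column names (name, apps, top10) are hypothetical and are not taken from the real data set.

```r
# Hypothetical toy file standing in for College.csv; the columns are made up
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,apps,top10", "A,100,20", "B,?,35", "C,250,?"), tmp)

# With na.strings = "?", every "?" field becomes NA and the column stays numeric
toy <- read.csv(tmp, header = TRUE, na.strings = "?")
str(toy)

# Without it, the "?" prevents the column from being read as a number
raw <- read.csv(tmp, header = TRUE)
is.numeric(toy$apps)
is.numeric(raw$apps)
```

Using a path relative to the working directory (as in the commented-out line above) usually makes a script easier to share than an absolute path.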

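(b) Look at the data using the fix() function.

fix(college)
rownames (college)=college [,1]
college=college[,-1] # drop the name column now that it is stored as row names

(c) i. Use the summary() function to produce a numerical summary of the variables in the data set.

summary(college)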

ii. Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data.

Private=as.factor (college$Private)
college$Private=Private # store the factor in place, so no duplicate Private column is created
#pairs(college,main="Scatterplot matrix")
pairs(~Apps+Accept+Enroll+Top10perc+Top25perc+F.Undergrad+P.Undergrad+Outstate,panel = panel.smooth,data=college,main="Scatterplot matrix")

iii. Use the plot() function to produce side-by-side boxplots of Outstate versus Private.

plot(Private,college$Outstate,ylab="Outstate")

iv. Use the summary() function to see how many elite universities there are.

Elite=rep("No",nrow(college ))
Elite[college$Top10perc >50]="Yes" # label must not carry a leading space
Elite=as.factor(Elite)
college=data.frame(college , Elite)
summary(college)
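
The same binning can be written in one step with ifelse(). The sketch below runs on a small made-up vector of Top10perc values (not the real College data), so the counts it prints say nothing about the actual number of elite schools.

```r
# Hypothetical toy data: percentage of new students from the top 10% of their class
toy <- data.frame(Top10perc = c(12, 55, 50, 78, 30, 51))

# ifelse() builds the labels in one pass; fixing the levels keeps "No" first
toy$Elite <- factor(ifelse(toy$Top10perc > 50, "Yes", "No"),
                    levels = c("No", "Yes"))

# table() counts the schools in each group
elite_counts <- table(toy$Elite)
elite_counts
```

Note that the cutoff is strict: a school with exactly 50% is labelled "No".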

Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.

plot(Elite,college$Outstate,ylab="Outstate")

v. Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables.


par(mfrow=c(3,3))
college=read.csv ("E:/大三下学期/机器学习/jiqixuxi/data/College.csv", header =T,na.strings ="?")
name=colnames(college) # extract the column names
for(i in 3:19){  # frequency histograms of the 17 quantitative variables
  hist(college[,i],col =2, breaks =20,xlab=name[i],main="Frequency histogram")
}
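
The loop above uses breaks=20 everywhere, while the exercise asks for differing numbers of bins. One way to control the bin count exactly is to pass explicit break points, because a single number given to breaks is only treated as a suggestion by hist(). The sketch below uses simulated data and a hypothetical helper make_breaks(); plot = FALSE is set only so that the bin counts can be inspected, and dropping it draws the histograms.

```r
set.seed(1)                           # reproducible toy data
x <- rnorm(200, mean = 50, sd = 10)   # hypothetical stand-in for one quantitative column

# Hypothetical helper: k equal-width bins need k + 1 break points spanning the data
make_breaks <- function(v, k) seq(min(v), max(v), length.out = k + 1)

# Compute histograms with 5, 15 and 40 bins and record how many bins each one has
bin_counts <- sapply(c(5, 15, 40), function(k) {
  h <- hist(x, breaks = make_breaks(x, k), plot = FALSE)
  length(h$counts)
})
bin_counts
```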

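vi. Continue exploring the data, and provide a brief summary of what you discover.

1. Students: the numbers of applications, acceptances, enrolled students, and undergraduates are positively and roughly linearly correlated. The percentages of new students from the top 10% and from the top 25% of their high school class are related nonlinearly. At a typical school only about 50% of new students come from the top 25% of their class, so roughly half of them ranked below the top quarter.

2. Faculty: the percentage of faculty holding a PhD is left-skewed. At most schools it is around 80%, and at a small number of schools it is below 50%. The student-to-faculty ratio is around 15 to 1 at most schools, which points to good staffing.

3. Cost of attendance: room and board costs about 4300, books about 500, and personal spending about 1300, so room and board is the largest of these three items. Out-of-state tuition at private schools is far higher than at public schools and also varies slightly more; only a handful of public schools charge more than private ones. Elite schools (more than 50% of new students from the top 10% of their class) charge higher tuition than non-elite schools.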
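
Statements such as "private tuition is higher and more spread out" can be checked numerically with tapply(), which applies a function within each group. This is only a sketch: the six rows below are invented for illustration, and the real comparison would use college$Outstate and college$Private.

```r
# Hypothetical toy data; the tuition figures are made up
toy <- data.frame(
  Private  = factor(c("Yes", "Yes", "No", "No", "Yes", "No")),
  Outstate = c(12000, 15000, 7000, 9000, 11000, 6500)
)

# Center and spread of tuition within each group
group_median <- tapply(toy$Outstate, toy$Private, median)
group_sd     <- tapply(toy$Outstate, toy$Private, sd)

group_median
group_sd
```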

II. Exercise 9

9. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been removed from the data.

Auto=read.table ("E:/大三下学期/机器学习/jiqixuxi/data/Auto.data", header =T,na.strings ="?")
fix(Auto)
Auto=na.omit(Auto)
#dim(Auto)
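
Before calling na.omit() it can be useful to count how many rows will be lost. A minimal sketch on a made-up three-row data frame (the values are hypothetical, not rows of Auto):

```r
# Hypothetical toy data with missing values in two different rows
toy <- data.frame(mpg        = c(18, NA, 26),
                  horsepower = c(130, 95, NA),
                  year       = c(70, 71, 72))

# complete.cases() is TRUE for rows without any NA
n_incomplete <- sum(!complete.cases(toy))
n_incomplete

# na.omit() drops every row that contains at least one NA
clean <- na.omit(toy)
dim(clean)
```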

(a) Which of the predictors are quantitative, and which are qualitative?

names(Auto)
summary(Auto)

cylinders and origin are treated as qualitative variables (name is qualitative as well); all the other variables are quantitative.
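
A quick way to support this kind of classification is to check each column's storage type together with its number of distinct values. Being stored as a number does not make a variable quantitative, since origin is a numeric code. The sketch below uses a made-up data frame with a few Auto-like column names; the values are hypothetical.

```r
# Hypothetical toy data with Auto-like column names
toy <- data.frame(
  mpg       = c(18, 15, 26.5),
  cylinders = c(8, 8, 4),
  origin    = c(1, 1, 3),
  name      = c("car a", "car b", "car c"),
  stringsAsFactors = FALSE
)

# Storage type of each column
is_num <- sapply(toy, is.numeric)

# Number of distinct values: very few levels hint at a categorical code
n_unique <- sapply(toy, function(col) length(unique(col)))

data.frame(is_numeric = is_num, distinct_values = n_unique)
```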

(b) What is the range of each quantitative predictor? You can answer this using the range() function.

(c) What is the mean and standard deviation of each quantitative predictor?

len=matrix(0,8,4)
for(l in 1:8){
  len[l,1:2]=range(Auto[,l]) # range of the variable (min, max)
  len[l,3]=sd(Auto[,l]) # standard deviation
  len[l,4]=mean(Auto[,l]) # mean
}

len=data.frame(names(Auto)[1:8],len) # add the variable names; the numeric columns stay numeric
names(len)=c("variable","min","max","sd","mean")
len
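
The same table can be built without an explicit loop by applying each statistic column-wise with sapply(). This is a sketch with a hypothetical helper summarise_cols() and a tiny made-up data frame; with the real data it would be called as summarise_cols(Auto[, 1:8]).

```r
# Hypothetical helper: one row per column with min, max, sd and mean
summarise_cols <- function(df) {
  data.frame(
    variable = names(df),
    min      = sapply(df, min),
    max      = sapply(df, max),
    sd       = sapply(df, sd),
    mean     = sapply(df, mean),
    row.names = NULL
  )
}

# Made-up toy data to show the shape of the result
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
stats <- summarise_cols(toy)
stats
```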

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Auto2=Auto[-(10:85),] # drop rows 10 through 85; -c(10,85) would drop only rows 10 and 85
dim(Auto2)

len2=matrix(0,8,4)
for(l in 1:8){
  len2[l,1:2]=range(Auto2[,l])
  len2[l,3]=sd(Auto2[,l])
  len2[l,4]=mean(Auto2[,l])
}
len2=data.frame(names(Auto2)[1:8],len2)
names(len2)=c("variable","min","max","sd","mean")
len2
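
Negative indexing is easy to get wrong here, so the following sketch compares the two spellings on a made-up 100-row data frame (the id column is hypothetical).

```r
# Hypothetical toy data: 100 rows identified by id
toy <- data.frame(id = 1:100)

# -c(10, 85) removes only the two rows 10 and 85
drop_two <- toy[-c(10, 85), , drop = FALSE]

# -(10:85) removes the whole block of rows 10, 11, ..., 85
drop_range <- toy[-(10:85), , drop = FALSE]

# The parentheses matter: -10:85 means (-10):85, which mixes negative
# and positive indices and is an error when used for subsetting
nrow(drop_two)
nrow(drop_range)
```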

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

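pairs(Auto[,1:8],main="Auto's matrix scatter plot") # scatterplot matrix for a rough view of the correlations
#pairs(Auto[,1:7],main="Auto's matrix scatter plot")

displacement is positively and roughly linearly related to horsepower and weight, and negatively related to acceleration;
horsepower is positively and roughly linearly related to weight, and negatively related to acceleration and year;
mpg is negatively related to horsepower and weight, and positively related to acceleration.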
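
Signs read off a scatterplot matrix are easy to mix up, so it helps to confirm them with a correlation matrix. The sketch below uses mtcars, which ships with base R, purely as a stand-in because Auto.data is a local file; the numbers it prints describe mtcars and not the Auto data. With the real data the equivalent call would be cor(Auto[, 1:7]).

```r
# mtcars is only a stand-in here; it is not the Auto data set
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(vars), 2)

# Correlation of mpg with every other variable
cor_mat["mpg", ]
```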

(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

origin=as.factor(Auto$origin)
Auto3=data.frame(Auto[,1:7],origin)

library(GGally)
library(ggplot2)
ggpairs(Auto3, columns=1:8, mapping=aes(color=origin)) + ggtitle("matrix scatter plot - Auto")+theme_bw()

Auto4=data.frame(Auto[,1:7],origin)
name4=names(Auto4)
name4
fm=lm(mpg~cylinders+displacement+horsepower+weight+acceleration+year,Auto4)
summary(fm)
lm.step=step(fm,direction = 'backward')
#lm.step2=step(fm,direction = 'both')
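
For readers without the Auto file at hand, the fit-then-backward-select workflow can be tried on mtcars from base R. This is a sketch under that substitution: the predictors (disp, hp, wt, qsec) are mtcars columns chosen for illustration, and the selected model says nothing about which Auto variables matter. trace = 0 only silences the step-by-step printout.

```r
# Full linear model on the stand-in data
full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Backward elimination by AIC, starting from the full model
reduced <- step(full, direction = "backward", trace = 0)

# Compare what was kept and how the criterion changed
formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Backward selection can only drop terms, so the reduced model never has more coefficients than the full one, and its AIC is never worse.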