Experiment 2: A data-analysis workflow for multiple linear regression problems

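Using a dataset of athletes, this article applies multiple linear regression to study how haemoglobin concentration relates to height, weight, lean body mass, and sex, and then uses the model for prediction. The results show that male athletes tend to have higher haemoglobin concentration than female athletes, and that, with the other variables held fixed, it rises with lean body mass and falls with height.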

I. Problem statement

bdca453d17b44f1a9862e5de3d1d3d53.png

 

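II. Building the model

library(s20x) # load the s20x package, which provides pairs20x() for plotting
## Load the data
athletes.df = read.csv(file = "athletes.csv", header = TRUE)
## Draw the scatterplot matrix (columns 1, 3, 4, 5; column 2 is Sex)
pairs20x(athletes.df[,c(1,3,4,5)])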

f44af47cd2c54448ac94427ba17bf29a.png

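Look at the pairs plot. Hconc is only weakly correlated with Height, while it is fairly strongly correlated with both Weight and LBM.

Next we look at the categorical variable Sex. Calling pairs20x on it directly raises an error: Sex is stored as the strings "F" and "M", which cannot be used in the plot as they are. We therefore recode it first.

athletes.df$Sex <- ifelse(athletes.df$Sex == "F", 0, 1) # recode Sex: "F" becomes 0, everything else ("M") becomes 1
pairs20x(athletes.df[,c(1,2)])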
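
For the regression step itself, the manual 0/1 recoding is not the only option: lm() also accepts a factor and builds the dummy variable internally. The sketch below illustrates this on simulated data, since athletes.csv is not reproduced here; toy.df, Sex01, and SexF are hypothetical names and the numbers are made up.

```r
# Simulated stand-in for the athletes data (made-up values, not the real file)
set.seed(1)
n <- 40
toy.df <- data.frame(Sex = rep(c("F", "M"), each = n / 2))
toy.df$Hconc <- 13.5 + 1.5 * (toy.df$Sex == "M") + rnorm(n, sd = 0.8)

# Approach used in the text: recode by hand to 0/1
toy.df$Sex01 <- ifelse(toy.df$Sex == "F", 0, 1)
fit.numeric <- lm(Hconc ~ Sex01, data = toy.df)

# Alternative: keep Sex as a factor; "F" is set as the baseline level
toy.df$SexF <- factor(toy.df$Sex, levels = c("F", "M"))
fit.factor <- lm(Hconc ~ SexF, data = toy.df)

# Both fits give the same intercept (female mean) and male-minus-female difference
coef(fit.numeric)
coef(fit.factor)
```

With the factor version the coefficient is labelled SexFM, which makes the direction of the comparison (male relative to female) explicit in the output.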

 

8e79ba53078c4ba892f62e67b28c18c5.png

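The plot suggests a clear association between Hconc and Sex.

Next we build the model.

We start with the explanatory variable most strongly correlated with the response. The plots show that this is Sex, so: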

Hconc.fit=lm(Hconc~Sex,data=athletes.df)
summary(Hconc.fit)

This gives:

6f9d5317e68645f4baaebd6b4dc3cad1.png

The fit is reasonably good. Next we try adding further variables to the model.

We pick LBM, the variable with the second-strongest correlation:

Hconc.fit2=lm(Hconc~Sex+LBM,data=athletes.df)
summary(Hconc.fit2)

The result is:

ff9e54a688bd499f87818dc74c39e5d2.png

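The result is poor. This is related to the fairly strong correlation between the two explanatory variables Sex and LBM (collinearity).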

025852eba37e487caf601f5e13187d2e.png
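
One way to put a number on this kind of overlap between explanatory variables is the pairwise correlation, or a variance inflation factor (VIF) computed by hand. The original analysis does not use a VIF; it is shown here only as an illustrative sketch on simulated data, and toy.df and its values are made up.

```r
# Simulated stand-in data in which LBM depends strongly on Sex (made-up values)
set.seed(2)
n <- 100
sex <- rep(c(0, 1), each = n / 2)
lbm <- 55 + 18 * sex + rnorm(n, sd = 6)
height <- 170 + 0.5 * (lbm - 60) + rnorm(n, sd = 5)
toy.df <- data.frame(Sex = sex, LBM = lbm, Height = height)

# Pairwise correlation between the two explanatory variables
cor(toy.df$Sex, toy.df$LBM)

# VIF for LBM: regress it on the other explanatory variables,
# then compute 1 / (1 - R^2) from that auxiliary regression
r2.lbm <- summary(lm(LBM ~ Sex + Height, data = toy.df))$r.squared
vif.lbm <- 1 / (1 - r2.lbm)
vif.lbm
```

A VIF close to 1 means the variable is nearly unrelated to the others; larger values mean its coefficient is estimated less precisely because the other variables carry much of the same information.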

We continue by adding another variable:

Hconc.fit3=lm(Hconc~Sex+LBM+Weight,data=athletes.df)
summary(Hconc.fit3)

115c755e4de14497a082e78505984f5e.png

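Weight does not add much explanatory power either, which is again related to the strong correlation among the explanatory variables. The pairs plot nevertheless shows that these variables are fairly strongly correlated with the response Hconc, so we go on to add the one remaining variable: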

Hconc.fit4=lm(Hconc~Sex+LBM+Height+Weight,data=athletes.df)
summary(Hconc.fit4)

3e10bd9ab40541249dccef28bd3ade0f.png

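Height turns out to have strong explanatory power in the model. Next we refine the model.

From the output above, Weight contributes least to the model (p = 0.752, the largest p-value among the explanatory variables), so we drop it first and refit: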

Hconc.fit5=lm(Hconc~Sex+LBM+Height,data=athletes.df)
summary(Hconc.fit5)

db684aca003f4b02b864fb0eb20b9938.png

In this model every explanatory variable has a p-value below 0.05, and the data appear to satisfy the model assumptions.
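
The drop-one-variable step can also be written with update() and checked with a nested-model F-test. This is a minimal sketch on simulated data; toy.df, full.fit, and reduced.fit are hypothetical names and the coefficients used to generate the data are made up.

```r
# Simulated stand-in for the athletes data (made-up values)
set.seed(3)
n <- 120
toy.df <- data.frame(Sex = rep(c(0, 1), each = n / 2))
toy.df$LBM <- 55 + 18 * toy.df$Sex + rnorm(n, sd = 6)
toy.df$Weight <- toy.df$LBM / 0.85 + rnorm(n, sd = 4)
toy.df$Height <- 150 + 0.4 * toy.df$LBM + rnorm(n, sd = 5)
toy.df$Hconc <- 20 + toy.df$Sex + 0.05 * toy.df$LBM -
  0.05 * toy.df$Height + rnorm(n, sd = 0.8)

# Full model, then the same model with Weight removed
full.fit <- lm(Hconc ~ Sex + LBM + Height + Weight, data = toy.df)
reduced.fit <- update(full.fit, . ~ . - Weight)

# Partial F-test: does Weight improve the fit beyond the reduced model?
cmp <- anova(reduced.fit, full.fit)
cmp
```

When only one term is dropped, the p-value of this F-test is the same as the t-test p-value for that term in the summary of the full model, so the two ways of reading the result agree.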

29153a016c7d42ce81b1a525242bf11f.png

Every Cook's distance is below 0.4, so no observations need to be removed.
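
The same check can be done numerically with base R's cooks.distance(). The sketch below uses simulated data and a one-variable model purely for illustration; the 0.4 cut-off is the one quoted in the text, and toy.df and toy.fit are hypothetical names.

```r
# Simulated stand-in data (made-up values)
set.seed(4)
n <- 80
toy.df <- data.frame(Height = rnorm(n, mean = 175, sd = 8))
toy.df$Hconc <- 22 - 0.04 * toy.df$Height + rnorm(n, sd = 0.8)
toy.fit <- lm(Hconc ~ Height, data = toy.df)

# One Cook's distance per observation
cd <- cooks.distance(toy.fit)

# Largest value, and the indices of any observations above the cut-off
max(cd)
which(cd > 0.4)

# plot(toy.fit, which = 4) would draw the corresponding base-R plot
```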

a61085a8d38f4a8983763b974b446669.png

We check normality; the residuals are approximately normally distributed.
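
The text does not say which normality check was used. As one possible complement to a visual check, the sketch below applies the Shapiro-Wilk test from base R to the residuals of a model fitted to simulated data; toy.df and toy.fit are hypothetical names.

```r
# Simulated stand-in data with normally distributed errors (made-up values)
set.seed(5)
n <- 80
toy.df <- data.frame(Height = rnorm(n, mean = 175, sd = 8))
toy.df$Hconc <- 22 - 0.04 * toy.df$Height + rnorm(n, sd = 0.8)
toy.fit <- lm(Hconc ~ Height, data = toy.df)

# Residuals of the fitted model
res <- residuals(toy.fit)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value

# Visual check (uncomment to draw a normal Q-Q plot)
# qqnorm(res); qqline(res)
```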

 

III. Answering the questions

1. What are the predicted haemoglobin concentration levels for a male and a female athlete, both with height 170 cm, weight 70kg, and lean body mass 60 kg?

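That is, we want predictions with prediction intervals (PI) for Height = 170, LBM = 60, Sex = 0 and for Height = 170, LBM = 60, Sex = 1. Weight is not in the final model, so it is left out.

preHconc1.df<-data.frame(Sex=0,LBM=60,Height=170) # female athlete
predict(Hconc.fit5,preHconc1.df,interval="prediction")
preHconc2.df<-data.frame(Sex=1,LBM=60,Height=170) # male athlete
predict(Hconc.fit5,preHconc2.df,interval="prediction")

We put the values into a data frame and then call predict() to obtain the predicted value and the prediction interval: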

bcbcb4b9023a4f25b781231b00d8af6a.png
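
Both predictions can also be requested in one call by giving predict() a data frame with two rows. The sketch below uses simulated data, so its numbers are not the ones in the figure above; toy.df, toy.fit, and new.df are hypothetical names.

```r
# Simulated stand-in for the athletes data (made-up values)
set.seed(6)
n <- 100
toy.df <- data.frame(Sex = rep(c(0, 1), each = n / 2))
toy.df$LBM <- 55 + 18 * toy.df$Sex + rnorm(n, sd = 6)
toy.df$Height <- 150 + 0.4 * toy.df$LBM + rnorm(n, sd = 5)
toy.df$Hconc <- 20 + toy.df$Sex + 0.05 * toy.df$LBM -
  0.05 * toy.df$Height + rnorm(n, sd = 0.8)
toy.fit <- lm(Hconc ~ Sex + LBM + Height, data = toy.df)

# One row per athlete; LBM and Height are recycled to both rows
new.df <- data.frame(Sex = c(0, 1), LBM = 60, Height = 170)
pred <- predict(toy.fit, new.df, interval = "prediction")
rownames(pred) <- c("female", "male")
pred

# With LBM and Height equal, the two point predictions differ
# by exactly the Sex coefficient
pred["male", "fit"] - pred["female", "fit"]
coef(toy.fit)["Sex"]
```

The columns fit, lwr, and upr hold the point prediction and the lower and upper limits of the 95% prediction interval, which is the default level.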

 

2. Is the relationship between haemoglobin concentration and height different for males than it is for females?

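This question asks whether the effect of Height on Hconc depends on Sex, that is, whether the two explanatory variables interact.

To assess how a categorical variable and a numeric variable jointly affect the response, we fit a model with an interaction term and use an analysis of variance (ANOVA) table to test whether the difference between the groups is significant.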

Hconc.fit6 = lm(Hconc~Sex*Height, data = athletes.df)
anova(Hconc.fit6)

ac978ea70f514a90bfb0ff45aa44ea45.png

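The p-value for the Sex:Height term (the interaction between the two explanatory variables) is greater than 0.05, so we have no evidence that the two variables interact.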
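
Written out, the interaction model fitted as Hconc.fit6 has the general form below. The β symbols are notation introduced here for illustration, with Sex coded 0 for female and 1 for male as in the recoding step.

```latex
\begin{aligned}
\text{Hconc} &= \beta_0 + \beta_1\,\text{Sex} + \beta_2\,\text{Height}
              + \beta_3\,(\text{Sex}\times\text{Height}) + \varepsilon \\[4pt]
\text{Female } (\text{Sex}=0):\quad
\text{Hconc} &= \beta_0 + \beta_2\,\text{Height} + \varepsilon \\
\text{Male } (\text{Sex}=1):\quad
\text{Hconc} &= (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\,\text{Height} + \varepsilon
\end{aligned}
```

The Height slope is β2 for females and β2 + β3 for males, so the question of whether the relationship differs between the sexes is the test of β3 = 0, which is what the Sex:Height row of the ANOVA table reports.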

 

IV. Methods and assumption checks

Looking at the pairs plot, we saw that Hconc was related to a number of our explanatory variables. So, we want to construct a multiple linear regression model with Hconc as the response variable.

We first fitted a model with Sex alone, the explanatory variable most strongly correlated with the response Hconc, and found that it fitted well. To improve the model we then added the other variables in descending order of correlation, but the resulting fits were poor. We therefore removed Weight, which had a high p-value, and refitted to obtain the final model.

All model assumptions were satisfied.

Our final model is:

4a270dd89fdd41dcb2606c6caf3fc493.png
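
In symbols, the structure implied by the formula Hconc ~ Sex + LBM + Height is shown below. The β symbols are notation introduced here; the estimated values are the ones given in the figure above and are not repeated.

```latex
\text{Hconc}_i = \beta_0 + \beta_1\,\text{Sex}_i + \beta_2\,\text{LBM}_i
               + \beta_3\,\text{Height}_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
```

Here Sex is 0 for a female athlete and 1 for a male athlete, so β1 is the expected male-minus-female difference at fixed LBM and Height.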

Our model explains about 54% of the variability in an athlete's haemoglobin concentration in blood.
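
The proportion of variability explained, and interval estimates for each coefficient, can be read from a fitted lm object as sketched below. The data are simulated, so the values will not match the 54% quoted above; toy.df and toy.fit are hypothetical names.

```r
# Simulated stand-in for the athletes data (made-up values)
set.seed(7)
n <- 100
toy.df <- data.frame(Sex = rep(c(0, 1), each = n / 2))
toy.df$LBM <- 55 + 18 * toy.df$Sex + rnorm(n, sd = 6)
toy.df$Height <- 150 + 0.4 * toy.df$LBM + rnorm(n, sd = 5)
toy.df$Hconc <- 20 + toy.df$Sex + 0.05 * toy.df$LBM -
  0.05 * toy.df$Height + rnorm(n, sd = 0.8)
toy.fit <- lm(Hconc ~ Sex + LBM + Height, data = toy.df)

# Multiple R-squared: share of the variability in Hconc explained by the model
r2 <- summary(toy.fit)$r.squared
r2

# 95% confidence intervals for all coefficients
ci <- confint(toy.fit, level = 0.95)
ci

# Manual check for the Sex row: estimate +/- t quantile * standard error
est <- coef(summary(toy.fit))["Sex", "Estimate"]
se <- coef(summary(toy.fit))["Sex", "Std. Error"]
tq <- qt(0.975, df = df.residual(toy.fit))
manual <- c(est - tq * se, est + tq * se)
manual
```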

 

V. Executive summary

We wanted to build a model to explain the haemoglobin concentration in blood of athletes. 

We used multiple linear regression to analyse the relationship between several explanatory variables and the response variable.

Keeping all other variables constant:

• We estimated that male athletes' haemoglobin concentration in blood is 0.87 to 1.91 grams per decalitre higher than that of female athletes, on average.

• We estimated that for each additional kg of an athlete's lean body mass, the haemoglobin concentration in blood increases by 0.02 to 0.07 grams per decalitre, on average.

• We estimated that for each additional cm of an athlete's height, the haemoglobin concentration in blood decreases by 0.03 to 0.08 grams per decalitre, on average.

The predicted haemoglobin concentration for a male athlete with height 170 cm, weight 70 kg, and lean body mass 60 kg is between 14.03 and 17.62 grams per decalitre.

The predicted haemoglobin concentration for a female athlete with height 170 cm, weight 70 kg, and lean body mass 60 kg is between 12.66 and 16.20 grams per decalitre.

According to the ANOVA table, the p-value for the interaction between Sex and Height is greater than 0.05, so we have no evidence that the relationship between haemoglobin concentration and height differs between male and female athletes. Note also that the data describe athletes only, so these conclusions cannot be generalised to differences between males and females in general.

 
