工程数学第二次作业

最新推荐文章于 2024-08-03 18:44:36 发布

渡口归洲

最新推荐文章于 2024-08-03 18:44:36 发布

阅读量388

点赞数 6

文章标签： r语言

本文链接：https://blog.csdn.net/qq_44637676/article/details/127642323

版权

title: ‘工程数学: Homework Assignment 02’
author: “渡口归洲”
date: “`r Sys.Date()`”
output: html_document

knitr::opts_chunk$set(echo = TRUE)

Note

You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one HTML file.
Please use the file name as: 2022XXXXXX_Name_02.html, where 2022XXXXXX means your student ID number.
Grading scheme: $\left\{ 0,~ 1,~ 2~ \right\}$ points per question.
- 2: submitted on time and more or less correct answer
- 1: submitted on time and partially correct answer
- 0: submitted with a completely incorrect answer or late submission (any day after the due date for more than one homework assignment).
Send your answers to: xxxxxx@xxx.edu.cn. Due date: 24:00 November 5, 2022 (Saturday evening).

Question 3

The following questions are based on the anscombe data in $\Re$ .

Plot the 4 data sets $\left( x_1 ,~ y_1 \right)$ , $\left( x_2 ,~ y_2 \right)$ , $\left( x_3 ,~ y_3 \right)$ , $\left( x_4 ,~ y_4 \right)$ on a 2-by-2 grid of plots using ggplot2 and gridExtra package.(2 points)

Add the number of the data set to each plot as the main title on each plot.

Add the axis-lables using bquote to each plot. For example,

library(ggplot2)
library(magrittr)
library(dplyr)
library(gridExtra)
p1 <- ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x1, y1))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(data = anscombe) +
  geom_point(aes(x = x2, y = y2)) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x2, y2))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p3 <- ggplot(data = anscombe) +
  geom_point(aes(x = x3, y = y3)) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x3, y3))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p4 <- ggplot(data = anscombe) +
  geom_point(aes(x = x4, y = y4)) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x4, y4))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1,p2,p3,p4,ncol = 2)

knitr::include_graphics("Homework_1_01.png")

Fit a regression model to the data sets:

$y_1 \sim x_1$
$y_2 \sim x_2$
$y_3 \sim x_3$
$y_4 \sim x_4$

using the command lm. Verify that all the fitted models have the exact same coefficients (up to numerical tolerance). (2 points)

lm1 = lm(y1 ~ x1, data = anscombe)
lm2 = lm(y2 ~ x2, data = anscombe)
lm3 = lm(y3 ~ x3, data = anscombe)
lm4 = lm(y4 ~ x4, data = anscombe)
print(lm1)
print(lm2)
print(lm3)
print(lm4)

Use the command cor, compute the sample correlation for each data set. (2 points)

print(cor(anscombe$x1, anscombe$y1, method = "pearson"))
print(cor(anscombe$x2, anscombe$y2, method = "pearson"))
print(cor(anscombe$x3, anscombe$y3, method = "pearson"))
print(cor(anscombe$x4, anscombe$y4, method = "pearson"))

Fit the same models in 2. but with the $x$ and $y$ reversed. Using the command summary, does anything about the results stay the same when you reverse $x$ and $y$ ? (2 points)

lmx1 = lm(x1 ~ y1, data = anscombe)
lmx2 = lm(x2 ~ y2, data = anscombe)
lmx3 = lm(x3 ~ y3, data = anscombe)
lmx4 = lm(x4 ~ y4, data = anscombe)
summary(lmx1)
summary(lmx2)
summary(lmx3)
summary(lmx4)

Compute the SSE, SST and $R^2$ value for each data set. Use the commands mean, sum, predict and / or resid. (Use the original models, i.e., $y_i \sim x_i$ so only 4 SSE values) (2 points)

#model_1
predict_1 = predict(lm1, anscombe, interval = "predict")
predict_1 = as.data.frame(predict_1)
sse_1 = sum((predict_1$fit - anscombe$y1)^2)
sst_1 = sum((anscombe$y1-mean(anscombe$y1))^2)
ssr_1 = sum((mean(anscombe$y1)-predict_1$fit)^2)
R2_1 = 1-sse_1/sst_1
cat("sse_1为：",sse_1,"\n")
cat("ssr_1为：",ssr_1,"\n")
cat("R2_1为：",R2_1,"\n")

#model_2
predict_2 = predict(lm2, anscombe, interval = "predict")
predict_2 = as.data.frame(predict_2)
sse_2 = sum((predict_2$fit - anscombe$y2)^2)
sst_2 = sum((anscombe$y2-mean(anscombe$y2))^2)
ssr_2 = sum((mean(anscombe$y2)-predict_2$fit)^2)
R2_2 = 1-sse_2/sst_2
cat("sse_2为：",sse_2,"\n")
cat("ssr_2为：",ssr_2,"\n")
cat("R2_2为：",R2_2,"\n")

#model_3
predict_3 = predict(lm3, anscombe, interval = "predict")
predict_3 = as.data.frame(predict_3)
sse_3 = sum((predict_3$fit - anscombe$y3)^2)
sst_3 = sum((anscombe$y3-mean(anscombe$y3))^2)
ssr_3 = sum((mean(anscombe$y3)-predict_3$fit)^2)
R2_3 = 1-sse_3/sst_3
cat("sse_3为：",sse_3,"\n")
cat("ssr_3为：",ssr_3,"\n")
cat("R2_3为：",R2_3,"\n")

#model_4
predict_4 = predict(lm4, anscombe, interval = "predict")
predict_4 = as.data.frame(predict_4)
sse_4 = sum((predict_4$fit - anscombe$y4)^2)
sst_4 = sum((anscombe$y4-mean(anscombe$y4))^2)
ssr_4 = sum((mean(anscombe$y4)-predict_4$fit)^2)
R2_4 = 1-sse_4/sst_4
cat("sse_4为：",sse_4,"\n")
cat("ssr_4为：",ssr_4,"\n")
cat("R2_4为：",R2_4,"\n")

Using the ggplot2 package, replot the data, adding the regression line to each plot. (Use the original models, i.e., $y_i \sim x_i$ so only 4 plots) (2 points)

library(ggplot2)
library(dplyr)
library(gridExtra)
p1 <- ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  geom_abline(slope = lm1$coef[2], intercept=lm1$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x1, y1))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(data = anscombe) +
  geom_point(aes(x = x2, y = y2)) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  geom_abline(slope = lm2$coef[2], intercept=lm2$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x2, y2))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p3 <- ggplot(data = anscombe) +
  geom_point(aes(x = x3, y = y3)) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  geom_abline(slope = lm3$coef[2], intercept=lm3$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x3, y3))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p4 <- ggplot(data = anscombe) +
  geom_point(aes(x = x4, y = y4)) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  geom_abline(slope = lm4$coef[2], intercept=lm4$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x4, y4))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1,p2,p3,p4,ncol = 2)

Question 4

In a recent, exciting, but also controversial Science article, \blc Tomasetti and Vogelstein \bc attempt to explain why cancer incidence varies drastically across tissues (e.g. why one is much more likely to develop lung cancer rather than pelvic bone cancer). The authors show that a higher average lifetime risk for a cancer in a given tissue correlates with the rate of replication of stem cells in that tissue. The main inferential tool for their statistical analysis was a simple linear regression, which we will replicate here.

You can download the dataset as follows:

tomasetti = read.csv("https://stats191.stanford.edu/data/Tomasetti.csv")
head(tomasetti)

The dataset contains information about 31 tumour types. The Lscd (Lifetime stem cell divisions) column refers to the total number of stem cell divisions during the average lifetime, while Risk refers to the lifetime risk for cancer of that tissue type.

Fit a simple linear regression model to the data with log(Risk) as the response variable and log(Lscd) as the predictor variable. (2 points)

log_Risk = log(tomasetti$Risk)
log_Lscd = log(tomasetti$Lscd)
data_define = data.frame(risk = log_Risk, lscd = log_Lscd)
tomasetti.lm = lm(risk~lscd, data = data_define)
summary(tomasetti.lm)

Plot the estimated regression line and the data. (2 points)

  ggplot(data = data_define, aes(x = lscd, y = risk)) +
  geom_point() +
  xlab(bquote(lscd)) +
  ylab(bquote(risk)) +
  geom_abline(slope = tomasetti.lm$coef[2], intercept=tomasetti.lm$coef[1], color='red', size=1)

Add upper and lower 95% prediction bands for the regression line on the plot, using predict. That is, produce one line for the upper limits of the intervals, and one line for the lower limits of the intervals. (2 points)

  confidence_risk = predict(tomasetti.lm, data_define, interval = "confidence")
  head(confidence_risk)
  ggplot(data = data_define, aes(x = lscd, y = risk)) +
  geom_point() +
  xlab(bquote(lscd)) +
  ylab(bquote(risk)) +
  geom_abline(slope = tomasetti.lm$coef[2], intercept=tomasetti.lm$coef[1], color='red', size=1) +
    geom_line(aes(x = lscd, y = as.vector(confidence_risk[ ,"lwr"])), 
 color = "blue", linetype = "dashed", size = 1) + 
 geom_line(aes(x = lscd, y = as.vector(confidence_risk[ ,"upr"])), 
 color = "blue", linetype = "dashed", size = 1)

Test whether the slope in this regression is equal to 0 at level $\alpha = 0.05$ . State the null hypothesis, the alternative, the conclusion and the $p$ -value. (2 points)

tomasetti.lm.reduced = lm(risk~1, data = data_define)
anova(tomasetti.lm.reduced, tomasetti.lm)

#第4问的回答
#假设H0：斜率=0
#    H1：斜率≠0
#由上述命令得到Pr=5.117e-08<α=0.05
#因此拒绝假设H0

What are assumptions you made for question 4. (2 points)

#假设H0：斜率=0
#    H1：斜率≠0

Give a 95% confidence interval for the slope of the regression line. (2 points)

confint(tomasetti.lm, level=0.95)

Report the $R^2$ of the model. Report an estimate of the variance of the errors in the model. (2 points)
$R^2$ 为0.6463，调整之后的 $R^2$ 为0.6341， $σ^2$ 为2.9756

summary(tomasetti.lm)

Plot the basic diagnostic plots of your model. What are your conclusions about the model assumptions. (2 points)

plot(data_define$lscd, data_define$risk, pch = 23, bg = "orange", cex = 2, ylab = "log(risk)", xlab = "log(lscd)")
abline(tomasetti.lm, lwd = 2, col = "red", lty = 2)

plot(data_define$lscd, resid(tomasetti.lm), ylab = "log(risk)", xlab = "log(lscd)", pch = 23, bg = "orange", cex 
= 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)

根据第二个图可以看出分布在0附近，且并没有表现出明显的规律，因此我认为是合理的。