工程数学第二次作业


title: ‘工程数学: Homework Assignment 02’
author: “渡口归洲”
date: “r Sys.Date()
output: html_document

knitr::opts_chunk$set(echo = TRUE)

Note

  • You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

  • Please combine all your answers, the computer code and the figures into one HTML file.

  • Please use the file name as: 2022XXXXXX_Name_02.html, where 2022XXXXXX means your student ID number.

  • Grading scheme: { 0 ,   1 ,   2   } \left\{ 0,~ 1,~ 2~ \right\} {0, 1, 2 } points per question.

    • 2: submitted on time and more or less correct answer

    • 1: submitted on time and partially correct answer

    • 0: submitted with a completely incorrect answer or late submission (any day after the due date for more than one homework assignment).

  • Send your answers to: xxxxxx@xxx.edu.cn. Due date: 24:00 November 5, 2022 (Saturday evening).

Question 3

The following questions are based on the anscombe data in ℜ \Re .

  1. Plot the 4 data sets ( x 1 ,   y 1 ) \left( x_1 ,~ y_1 \right) (x1, y1), ( x 2 ,   y 2 ) \left( x_2 ,~ y_2 \right) (x2, y2), ( x 3 ,   y 3 ) \left( x_3 ,~ y_3 \right) (x3, y3), ( x 4 ,   y 4 ) \left( x_4 ,~ y_4 \right) (x4, y4) on a 2-by-2 grid of plots using ggplot2 and gridExtra package.(2 points)

Add the number of the data set to each plot as the main title on each plot.

Add the axis-lables using bquote to each plot. For example,

library(ggplot2)
library(magrittr)
library(dplyr)
library(gridExtra)
p1 <- ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x1, y1))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(data = anscombe) +
  geom_point(aes(x = x2, y = y2)) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x2, y2))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p3 <- ggplot(data = anscombe) +
  geom_point(aes(x = x3, y = y3)) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x3, y3))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p4 <- ggplot(data = anscombe) +
  geom_point(aes(x = x4, y = y4)) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  ggtitle(paste0("n=", dim(anscombe %>% select(x4, y4))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1,p2,p3,p4,ncol = 2)
knitr::include_graphics("Homework_1_01.png")
  1. Fit a regression model to the data sets:
  • y 1 ∼ x 1 y_1 \sim x_1 y1x1

  • y 2 ∼ x 2 y_2 \sim x_2 y2x2

  • y 3 ∼ x 3 y_3 \sim x_3 y3x3

  • y 4 ∼ x 4 y_4 \sim x_4 y4x4

using the command lm. Verify that all the fitted models have the exact same coefficients (up to numerical tolerance). (2 points)

lm1 = lm(y1 ~ x1, data = anscombe)
lm2 = lm(y2 ~ x2, data = anscombe)
lm3 = lm(y3 ~ x3, data = anscombe)
lm4 = lm(y4 ~ x4, data = anscombe)
print(lm1)
print(lm2)
print(lm3)
print(lm4)
  1. Use the command cor, compute the sample correlation for each data set. (2 points)
print(cor(anscombe$x1, anscombe$y1, method = "pearson"))
print(cor(anscombe$x2, anscombe$y2, method = "pearson"))
print(cor(anscombe$x3, anscombe$y3, method = "pearson"))
print(cor(anscombe$x4, anscombe$y4, method = "pearson"))
  1. Fit the same models in 2. but with the x x x and y y y reversed. Using the command summary, does anything about the results stay the same when you reverse x x x and y y y? (2 points)
lmx1 = lm(x1 ~ y1, data = anscombe)
lmx2 = lm(x2 ~ y2, data = anscombe)
lmx3 = lm(x3 ~ y3, data = anscombe)
lmx4 = lm(x4 ~ y4, data = anscombe)
summary(lmx1)
summary(lmx2)
summary(lmx3)
summary(lmx4)
  1. Compute the SSE, SST and R 2 R^2 R2 value for each data set. Use the commands mean, sum, predict and / or resid. (Use the original models, i.e., y i ∼ x i y_i \sim x_i yixi so only 4 SSE values) (2 points)
#model_1
predict_1 = predict(lm1, anscombe, interval = "predict")
predict_1 = as.data.frame(predict_1)
sse_1 = sum((predict_1$fit - anscombe$y1)^2)
sst_1 = sum((anscombe$y1-mean(anscombe$y1))^2)
ssr_1 = sum((mean(anscombe$y1)-predict_1$fit)^2)
R2_1 = 1-sse_1/sst_1
cat("sse_1为:",sse_1,"\n")
cat("ssr_1为:",ssr_1,"\n")
cat("R2_1为:",R2_1,"\n")
#model_2
predict_2 = predict(lm2, anscombe, interval = "predict")
predict_2 = as.data.frame(predict_2)
sse_2 = sum((predict_2$fit - anscombe$y2)^2)
sst_2 = sum((anscombe$y2-mean(anscombe$y2))^2)
ssr_2 = sum((mean(anscombe$y2)-predict_2$fit)^2)
R2_2 = 1-sse_2/sst_2
cat("sse_2为:",sse_2,"\n")
cat("ssr_2为:",ssr_2,"\n")
cat("R2_2为:",R2_2,"\n")
#model_3
predict_3 = predict(lm3, anscombe, interval = "predict")
predict_3 = as.data.frame(predict_3)
sse_3 = sum((predict_3$fit - anscombe$y3)^2)
sst_3 = sum((anscombe$y3-mean(anscombe$y3))^2)
ssr_3 = sum((mean(anscombe$y3)-predict_3$fit)^2)
R2_3 = 1-sse_3/sst_3
cat("sse_3为:",sse_3,"\n")
cat("ssr_3为:",ssr_3,"\n")
cat("R2_3为:",R2_3,"\n")
#model_4
predict_4 = predict(lm4, anscombe, interval = "predict")
predict_4 = as.data.frame(predict_4)
sse_4 = sum((predict_4$fit - anscombe$y4)^2)
sst_4 = sum((anscombe$y4-mean(anscombe$y4))^2)
ssr_4 = sum((mean(anscombe$y4)-predict_4$fit)^2)
R2_4 = 1-sse_4/sst_4
cat("sse_4为:",sse_4,"\n")
cat("ssr_4为:",ssr_4,"\n")
cat("R2_4为:",R2_4,"\n")
  1. Using the ggplot2 package, replot the data, adding the regression line to each plot. (Use the original models, i.e., y i ∼ x i y_i \sim x_i yixi so only 4 plots) (2 points)
library(ggplot2)
library(dplyr)
library(gridExtra)
p1 <- ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1)) +
  xlab(bquote(x[1])) +
  ylab(bquote(y[1])) +
  geom_abline(slope = lm1$coef[2], intercept=lm1$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x1, y1))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(data = anscombe) +
  geom_point(aes(x = x2, y = y2)) +
  xlab(bquote(x[2])) +
  ylab(bquote(y[2])) +
  geom_abline(slope = lm2$coef[2], intercept=lm2$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x2, y2))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p3 <- ggplot(data = anscombe) +
  geom_point(aes(x = x3, y = y3)) +
  xlab(bquote(x[3])) +
  ylab(bquote(y[3])) +
  geom_abline(slope = lm3$coef[2], intercept=lm3$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x3, y3))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
p4 <- ggplot(data = anscombe) +
  geom_point(aes(x = x4, y = y4)) +
  xlab(bquote(x[4])) +
  ylab(bquote(y[4])) +
  geom_abline(slope = lm4$coef[2], intercept=lm4$coef[1], color='red', size=1)
  ggtitle(paste0("n=", dim(anscombe %>% select(x4, y4))[1])) +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1,p2,p3,p4,ncol = 2)

Question 4

In a recent, exciting, but also controversial Science article, \blc Tomasetti and Vogelstein \bc attempt to explain why cancer incidence varies drastically across tissues (e.g. why one is much more likely to develop lung cancer rather than pelvic bone cancer). The authors show that a higher average lifetime risk for a cancer in a given tissue correlates with the rate of replication of stem cells in that tissue. The main inferential tool for their statistical analysis was a simple linear regression, which we will replicate here.

You can download the dataset as follows:

tomasetti = read.csv("https://stats191.stanford.edu/data/Tomasetti.csv")
head(tomasetti)

The dataset contains information about 31 tumour types. The Lscd (Lifetime stem cell divisions) column refers to the total number of stem cell divisions during the average lifetime, while Risk refers to the lifetime risk for cancer of that tissue type.

  1. Fit a simple linear regression model to the data with log(Risk) as the response variable and log(Lscd) as the predictor variable. (2 points)
log_Risk = log(tomasetti$Risk)
log_Lscd = log(tomasetti$Lscd)
data_define = data.frame(risk = log_Risk, lscd = log_Lscd)
tomasetti.lm = lm(risk~lscd, data = data_define)
summary(tomasetti.lm)
  1. Plot the estimated regression line and the data. (2 points)
  ggplot(data = data_define, aes(x = lscd, y = risk)) +
  geom_point() +
  xlab(bquote(lscd)) +
  ylab(bquote(risk)) +
  geom_abline(slope = tomasetti.lm$coef[2], intercept=tomasetti.lm$coef[1], color='red', size=1) 
  1. Add upper and lower 95% prediction bands for the regression line on the plot, using predict. That is, produce one line for the upper limits of the intervals, and one line for the lower limits of the intervals. (2 points)
  confidence_risk = predict(tomasetti.lm, data_define, interval = "confidence")
  head(confidence_risk)
  ggplot(data = data_define, aes(x = lscd, y = risk)) +
  geom_point() +
  xlab(bquote(lscd)) +
  ylab(bquote(risk)) +
  geom_abline(slope = tomasetti.lm$coef[2], intercept=tomasetti.lm$coef[1], color='red', size=1) +
    geom_line(aes(x = lscd, y = as.vector(confidence_risk[ ,"lwr"])), 
 color = "blue", linetype = "dashed", size = 1) + 
 geom_line(aes(x = lscd, y = as.vector(confidence_risk[ ,"upr"])), 
 color = "blue", linetype = "dashed", size = 1)
  1. Test whether the slope in this regression is equal to 0 at level α = 0.05 \alpha = 0.05 α=0.05. State the null hypothesis, the alternative, the conclusion and the p p p-value. (2 points)
tomasetti.lm.reduced = lm(risk~1, data = data_define)
anova(tomasetti.lm.reduced, tomasetti.lm)
#第4问的回答
#假设H0:斜率=0
#    H1:斜率≠0
#由上述命令得到Pr=5.117e-08<α=0.05
#因此拒绝假设H0
  1. What are assumptions you made for question 4. (2 points)
#假设H0:斜率=0
#    H1:斜率≠0
  1. Give a 95% confidence interval for the slope of the regression line. (2 points)
confint(tomasetti.lm, level=0.95)
  1. Report the R 2 R^2 R2 of the model. Report an estimate of the variance of the errors in the model. (2 points)
    R 2 R^2 R2为0.6463,调整之后的 R 2 R^2 R2为0.6341, σ 2 σ^2 σ2为2.9756
summary(tomasetti.lm)
  1. Plot the basic diagnostic plots of your model. What are your conclusions about the model assumptions. (2 points)
plot(data_define$lscd, data_define$risk, pch = 23, bg = "orange", cex = 2, ylab = "log(risk)", xlab = "log(lscd)")
abline(tomasetti.lm, lwd = 2, col = "red", lty = 2)
plot(data_define$lscd, resid(tomasetti.lm), ylab = "log(risk)", xlab = "log(lscd)", pch = 23, bg = "orange", cex 
= 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)

根据第二个图可以看出分布在0附近,且并没有表现出明显的规律,因此我认为是合理的。

  • 6
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值