STAT 371 S24 Assignment #3

yyddcc33

Article link: https://blog.csdn.net/yyddcc33/article/details/140717705

(Submission deadline: 11:59 pm Fri., July 19th)

In this assignment, we will continue developing a suitable regression model for your CEO dataset from Assignment #2, starting from the model you fitted in 2e) of that assignment (i.e. the model fit without the BACKGRD variate).

1) Plot the residuals vs the fitted values, as well as a QQ plot. Comment on the adequacy of the fitted model, in terms of the model assumptions.

2) One approach to stabilize the variance of the residuals and/or more adequately describe the relationship between a response variate and the explanatory variates is with an appropriate transformation of the response variate.

a) Create a histogram of CEO compensation. What characteristic of this variate might lead you to suspect that a log transformation may be suitable?

b) Refit the data using the (natural) log transformation of compensation.

c) Compare the overall fit of the model and significance of the individual parameters with that of the original (untransformed) model.

d) Replot the two residual plots in 1). Has the transformation helped to address the issues with the adequacy of the (untransformed) model?
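As a rough illustration of why a log transformation is often considered for a right-skewed variate such as compensation, the sketch below compares the sample skewness of a variate before and after taking natural logs. The figures are hypothetical and are not the assignment's CEO data; the assignment itself is meant to be done in R, and this Python sketch only illustrates the arithmetic.

```python
import math

def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical compensation figures (in $1000s), NOT the CEO dataset.
comp = [350, 420, 480, 510, 600, 750, 900, 1200, 2500, 8000]
log_comp = [math.log(c) for c in comp]

skew_raw = sample_skewness(comp)
skew_log = sample_skewness(log_comp)
print(f"skewness of COMP:      {skew_raw:.2f}")
print(f"skewness of log(COMP): {skew_log:.2f}")
```

A long right tail shows up as a large positive skewness. The log pulls in the largest values, so the transformed variate is typically closer to symmetric.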

3) We can also investigate the suitability of transformations of one or more of the explanatory variates by looking at scatterplots of the variates vs the response (log(COMP), in this case).

a) Create a scatterplot of SALES vs log(COMP). Does a linear model seem appropriate for these two variates?

b) Create a scatterplot of log(SALES) vs log(COMP). Comment.

c) Refit the model once again, this time taking the log transformation of compensation as well as of the variates SALES, VAL, PCNTOWN and PROF. We will use this model going forward. Comment on the effect these transformations have on the overall fit of the model, and on the p-values of the associated variates.

4) Plot the residuals vs the fitted values and the QQ plot for the model in 3c). Comment on the effect of the transformations on the model assumptions.

5) Replot the plots in 4) using the studentized residuals. Do you notice any major changes in these plots? Are there any outliers present?

6) Plot the hat values vs index (observation number). Are there any high leverage points?
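To make the idea of leverage concrete, here is a minimal sketch for the special case of one explanatory variate, where the hat value has the closed form h_i = 1/n + (x_i - x̄)² / Sxx. The data are hypothetical. The cutoff 2(k+1)/n is a common rule of thumb assumed here for illustration; it is not stated in the assignment, so use whatever convention the course gives. For the multi-variate model in the assignment, R's hatvalues function returns the corresponding values.

```python
def hat_values(x):
    """Leverages for simple linear regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical log(SALES) values; the last firm sits far from the others.
log_sales = [6.1, 6.4, 6.8, 7.0, 7.3, 7.5, 7.9, 8.2, 8.4, 11.5]

h = hat_values(log_sales)
p = 2                                # parameters: intercept + one slope
threshold = 2 * p / len(log_sales)   # assumed rule of thumb, 2(k+1)/n
flagged = [i + 1 for i, hi in enumerate(h) if hi > threshold]

for i, hi in enumerate(h, start=1):
    print(f"obs {i:2d}: h = {hi:.3f}")
print("high-leverage observations:", flagged)
```

The hat values always sum to the number of parameters, which is a quick sanity check on any leverage calculation.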

7) Investigate the observation with the highest leverage for a possible cause.

8) Plot the Cook’s Distance values. Are there any influential cases?
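Cook's distance measures how much all the fitted values move when one observation is deleted. The sketch below computes it directly from that definition, D_i = Σ_j (ŷ_j - ŷ_j(i))² / (p σ̂²), by refitting a simple linear regression without each observation in turn. The data and the single-variate setup are hypothetical simplifications. In R, the cooks.distance function gives the values for a fitted lm object without any refitting.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def cooks_distances(x, y):
    """Cook's distance for each observation via leave-one-out refits."""
    n, p = len(x), 2                      # p = intercept + slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = sse / (n - p)                # residual variance of the full fit
    distances = []
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a0 + a1 * xj)) ** 2 for fi, xj in zip(fitted, x))
        distances.append(shift / (p * sigma2))
    return distances

# Hypothetical data: the last point is far out in x AND off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
y = [0.6, 0.9, 1.6, 2.1, 2.4, 3.1, 3.4, 4.1, 4.6, 2.0]

distances = cooks_distances(x, y)
for i, d in enumerate(distances, start=1):
    print(f"obs {i:2d}: D = {d:.3f}")
```

An observation needs both a sizeable residual and high leverage to produce a large distance, which is why the last point dominates here.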

9) Now that we have obtained a more adequate model through transformation of the response and some of the explanatory variates, we can further improve the model by using model selection methods to choose which subset of variates to include.

a) Use backward selection to arrive at a reasonable model (use α = .15). Show your work.

b) Use the leaps function in R to select a model, based on Mallows' Cp and adjusted R-squared. (You may first need to install and load the leaps package.) Select the model that yields the largest adjusted R-squared and meets the Mallows' Cp criterion (Cp < k+1, where k is the number of explanatory variates in the candidate model). Comment on the overall fit and the significance of the model parameters.

c) Confirm the Mallows' Cp value for this model by calculating the value from information in the summary output of this model and of the full model.

d) Did the model selection procedures in a) and b) arrive at the same model?

e) Perform an additional sum of squares test on the full model (model in 3c) and reduced model (model in 9b) using the anova function. Be sure to state the conclusion in the context of the study.

f) Plot the studentized residuals vs the fitted values and the QQ plot of the studentized residuals to confirm that your preferred model is adequate in terms of the model assumptions.

g) Finally, recalculate the 95% prediction interval for the CEO in 2e) of Assignment #2, based on your preferred model. Be sure to back-transform to the original units.
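Parts c), e) and g) each reduce to a short calculation, sketched below under stated assumptions. The sketch recovers SSE from a summary's residual standard error and degrees of freedom, computes Mallows' Cp as SSE_reduced / MSE_full - n + 2p (with p the number of parameters in the reduced model, intercept included), forms the additional sum of squares F statistic, and exponentiates interval endpoints computed on the log scale. All numbers are hypothetical placeholders, not results from the CEO data, and the function names are illustrative only.

```python
import math

def sse_from_summary(resid_std_error, resid_df):
    """SSE = (residual standard error)^2 * residual degrees of freedom."""
    return resid_std_error ** 2 * resid_df

def mallows_cp(sse_reduced, sse_full, df_full, n, p_reduced):
    """Cp = SSE_reduced / MSE_full - n + 2p."""
    return sse_reduced / (sse_full / df_full) - n + 2 * p_reduced

def extra_ss_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for the additional sum of squares test."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    return numerator / (sse_full / df_full)

def back_transform(lower_log, upper_log):
    """Map an interval for log(COMP) back to the units of COMP."""
    return math.exp(lower_log), math.exp(upper_log)

# Hypothetical summary output: n = 100, full model with 10 parameters,
# reduced model with 5 parameters.
n = 100
sse_full = sse_from_summary(0.50, 90)
sse_red = sse_from_summary(0.49, 95)

cp = mallows_cp(sse_red, sse_full, df_full=90, n=n, p_reduced=5)
f_stat = extra_ss_f(sse_red, 95, sse_full, 90)
lower, upper = back_transform(math.log(500), math.log(2000))

print(f"Cp = {cp:.3f}, F = {f_stat:.4f} on (5, 90) df")
print(f"interval in original units: ({lower:.0f}, {upper:.0f})")
```

Applying mallows_cp to the full model itself returns its own number of parameters, which is a useful check on the formula. The p-value for the F statistic needs the F distribution, which R's anova function reports directly.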