STA302 Final Report

"Towards the prediction of admission rates in different universities"

Jiayi YANG (1004244212)

Introduction
When I applied to universities, I noticed that admission rates can differ greatly from one school to another. Some schools have admission rates above 90%, while others admit fewer than 10% of applicants. This could be due to various reasons, including campus size, geographic location and so on. Thus, it is worth investigating an efficient model for the admission rate.
The given dataset covers 1508 schools. It collects data on 1 response variable, admission rate, and 29 predictor variables; some are qualitative predictors and others are quantitative. The primary purpose of this study is to determine which of these factors best explain the variation in admission rates across universities. In addition, this project uses these variables to construct a model for estimating the admission rate if certain changes are experienced by a school.

Methods
To obtain an interpretable model with high prediction accuracy, I use the methods below:

1. Model validation
To fit an efficient model, we need to narrow down the list of predictors. However, the model selection procedure might change the statistical properties of the estimated regression coefficients. Thus, model validation is needed.
I divided the original dataset into two independent sets: a training dataset and a validation dataset.
All the variable selection, model building and diagnostics were done on the training dataset. After constructing a final model on the training dataset through variable selection, model optimization and model diagnostics, I used the validation dataset to evaluate the performance of the model.
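A minimal sketch of this split in R, assuming the data are already loaded into a data frame named data (the name is illustrative; the project keeps the first 1131 of 1508 rows for training, roughly a 75/25 split):

n <- nrow(data)
cut <- floor(0.75 * n)               # split point (illustrative)
training   <- data[1:cut, ]          # used for selection, building, diagnostics
validation <- data[(cut + 1):n, ]    # held out to evaluate the final model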

2. Variable selection
As mentioned above, it is important to determine which factors are necessary for the model.
A good model should avoid the problem of multicollinearity. Thus, I first use pairwise correlations and the variance inflation factor to filter out some predictors.

2.1 Pair correlation
If there is a strong correlation between two predictor variables, they provide similar information to the regression. So, to select useful variables, I first check the correlations between the quantitative predictors.
If the absolute value of the correlation is higher than 0.8, I only keep one from the pair.
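An illustrative sketch of this filter, assuming training is the training data frame and num_vars is a character vector naming its quantitative predictors (both hypothetical names):

cors <- cor(training[, num_vars], use = "complete.obs")
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
# Each row of `high` indexes one highly correlated pair;
# keep only one variable from each pair
data.frame(var1 = rownames(cors)[high[, "row"]],
           var2 = colnames(cors)[high[, "col"]],
           r    = cors[high])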

2.2 Variance Inflation Factor (VIF)
I then use the VIF. It measures how much the variance of an estimated coefficient is inflated by correlation among the predictors.
I fit a model on the remaining quantitative variables and check their VIFs. Here, I filter out the predictor variables with VIF larger than 5.
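A sketch of this check using car::vif(), assuming kept is a hypothetical vector naming the quantitative predictors that survived the correlation filter:

library(car)
fit <- lm(ADM_RATE ~ ., data = training[, c("ADM_RATE", kept)])
v <- vif(fit)
v[v > 5]   # predictors flagged for elimination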

The problem of multicollinearity is now largely solved. Next, we need a model selection method that compares models with different combinations of predictors.

2.3 Akaike's information criterion (AIC) & Bayesian information criterion (BIC)
"AIC is a statistic which balances the goodness of fit of the model with a penalty term reflecting how complex the model is." It is one of the criteria we can use to select models. A smaller value of AIC indicates a better model. BIC is similar, but it carries a heavier penalty term than AIC, so it tends to select a simpler model.

2.4 Forward selection
In this stepwise selection method, I begin with no predictors and add one predictor at a time. At each step, I consider the models with the same number of predictors, select the one with the lowest information criterion, and carry it to the next step. The process continues until no variables are left or the value of AIC/BIC increases.
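A sketch of this procedure with MASS::stepAIC(), where k = 2 gives AIC and k = log(n) gives BIC (the data frame name tr is illustrative):

library(MASS)
null_model <- lm(ADM_RATE ~ 1, data = tr)   # start with no predictors
full_model <- lm(ADM_RATE ~ ., data = tr)   # upper bound of the search
fwd_bic <- stepAIC(null_model,
                   scope = list(upper = full_model),
                   direction = "forward",
                   k = log(nrow(tr)))       # use k = 2 for AIC instead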

As AIC usually tends to overfit, I choose the BIC model for the following steps.

3. Model optimization
Linearity is one of the most important model assumptions. Transforming the response and predictors is one way to correct non-linearity.

3.1 Transformation – Box-Cox
The Box-Cox method provides an analytical way of selecting a reasonable transformation. Based on the suggested transformations, I fit a new multiple linear regression model.
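A sketch of this step with car::powerTransform(), applied jointly to the (strictly positive) response and quantitative predictors; the variable list and data frame name tr are illustrative:

library(car)
pt <- powerTransform(cbind(ADM_RATE, NUMBRANCH, AVGFACSAL, PFTFAC) ~ 1,
                     data = tr)
summary(pt)   # the rounded estimated powers suggest the transformations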

This new model is the optimized model.

4. Diagnostics
After transformation, it is crucial to check whether the model does a reasonable job of fitting the data by running several diagnostics.
4.1 Leverage points
A leverage point sits far from the rest of the observations with respect to its predictor values. An observation is identified as a leverage point if:
$h_{ii} > 2\,\frac{p+1}{n}$, where $p$ is the number of predictors and $n$ is the number of observations.

4.2 Outliers
In a dataset of this moderate size, an outlier is an observation whose standardized residual $r_i \notin [-2, 2]$.

4.3 Influential points
There are three measures that quantify the amount of influence a single observation has on the regression line.
4.3.1 Cook’s distance
If Cook's distance $D_i > F_{0.5}(p+1,\, n-(p+1))$, the median of the corresponding $F$ distribution, then observation $i$ is influential.
4.3.2 DFFITS
It measures the effect of a single observation on its own fitted value.
If $|DFFITS_i| > 2\sqrt{\frac{p+1}{n}}$, then observation $i$ is considered influential.
4.3.3 DFBETAS
It measures the influence of a single observation on each estimated coefficient.
If $|DFBETAS_{j(i)}| > \frac{2}{\sqrt{n}}$, then observation $i$ is considered influential.
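The cutoffs in 4.1–4.3 translate directly into R; a minimal sketch, assuming a fitted lm object m:

n  <- nobs(m)
p1 <- length(coef(m))                                 # p + 1
which(hatvalues(m) > 2 * p1 / n)                      # 4.1 leverage points
which(abs(rstandard(m)) > 2)                          # 4.2 outliers
which(cooks.distance(m) > qf(0.5, p1, n - p1))        # 4.3.1 Cook's distance
which(abs(dffits(m)) > 2 * sqrt(p1 / n))              # 4.3.2 DFFITS
which(abs(dfbetas(m)) > 2 / sqrt(n), arr.ind = TRUE)  # 4.3.3 DFBETAS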

4.4 Residual plots & Scale-Location & normal QQ-plots
To verify the model assumptions, I make a number of residual plots, a scale-location plot and a QQ-plot. If the residual plots show distinct patterns, the predictors and the response have a non-linear relationship. For the scale-location plot, if the residuals are spread equally along the ranges of the predictors, homoscedasticity holds. For the normal QQ-plot, if the residuals lie well along the straight dashed line, the residuals are normally distributed.
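These checks correspond to the built-in lm diagnostic plots; a minimal sketch, assuming a fitted lm object m:

par(mfrow = c(2, 2))
plot(m, which = 1)   # residuals vs fitted: patterns suggest non-linearity
plot(m, which = 3)   # scale-location: even spread supports homoscedasticity
plot(m, which = 2)   # normal QQ-plot: points near the line support normality
par(mfrow = c(1, 1))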

Results
There are 1508 observations in total; the first 1131 observations form the training set and the rest form the validation set.

In the original training dataset there are 10 qualitative predictors and 19 quantitative predictors. Appendix 1 contains a scatter plot matrix of the response and the 18 numeric variables. From Appendix 1, some of the relationships between predictors appear to be linear, so we should consider the problem of multicollinearity.

Appendix 2 contains the pair correlations between the 18 numeric variables. There are 5 pairs of variables with absolute correlation over 0.8. Keeping only one from each pair leaves 13 numeric variables.

Table 1 shows the VIF of the remaining quantitative variables. INC_PCT_LO, PCT_WHITE, PCT_BLACK and PCT_ASIAN have VIF larger than 5, so I eliminate these 4 variables. As UNITID, INSTNM and STABBR have no plausible relationship with admission rate by common sense, I also eliminate these 3. Thus, before starting model selection, 16 variables remain.

Forward selection using AIC selects 12 variables: AVGFACSAL, CONTROL, HBCU, FEMALE, NUMBRANCH, PFTFAC, PCT_BORN_US, PCTPELL, COSTT4_A, HIS, PCT_BA and REGION.

Forward selection using BIC selects 7 variables: AVGFACSAL, CONTROL, HBCU, FEMALE, NUMBRANCH, PFTFAC and PCT_BORN_US.

Usually, AIC tends to overfit the model, so I choose the resulting model from BIC.

I fit a naive linear model relating these 7 predictors to the response, admission rate.

From Figure 1 (response against fitted values), it is apparent that there is a quadratic relationship between the response and the fitted values. Thus, a transformation is needed.
(Figure 1: response against fitted values)
By applying the transformations suggested by Box-Cox to the model, our final multiple linear regression model is:
$$\widehat{ADM\_RATE^{2.87}} = 0.95 - 0.058\,NUMBRANCH^{-6.62} - 0.090\,CONTROL - 0.16\,HBCU - 0.026\,AVGFACSAL^{0.33} - 0.096\,PFTFAC + 0.23\,FEMALE^{1.3} + 2.8\times 10^{-18}\,PCT\_BORN\_US^{8.4}$$

Checking for extreme observations, a number of observations are classified as leverage points and outliers. Moreover, no points are influential based on Cook's distance, but a handful are identified by DFFITS and DFBETAS.

From Figure 2, all pairs of quantitative predictors appear to have linear relationships.
There also appears to be a simple functional relationship between the mean response and the predictors, as seen in Figure 3.
To verify the model assumptions, Figure 3 contains scatter plots of the residuals against each predictor and the fitted values, along with a normal QQ-plot and a scale-location plot for the final model. From these plots, barring a few extreme observations, the model assumptions appear reasonably satisfied.

Thus, the final model appears to be a valid model for the training dataset.

Finally, I used the validation dataset to evaluate the performance of the model; the fit was only moderate.

Discussion
By this analysis, not all the variables are needed. The final model shows that the estimated mean of $ADM\_RATE^{2.87}$ is 0.95 when all the predictors are 0. The estimated response decreases by 0.058 for a unit increase in $NUMBRANCH^{-6.62}$ when all other predictors are held constant. Similarly, it decreases by 0.026 for a unit increase in $AVGFACSAL^{0.33}$, decreases by 0.096 for a unit increase in PFTFAC, increases by 0.23 for a unit increase in $FEMALE^{1.3}$, and increases by $2.8\times 10^{-18}$ for a unit increase in $PCT\_BORN\_US^{8.4}$, holding the others constant. If the school is a historically black college or university (HBCU), the estimated response falls by 0.16. If the institution is public, the estimated response falls by 0.090; if it is private non-profit, it falls by 0.18; if it is private for-profit, it falls by 0.27, holding the others constant.
I also observed several limitations of the model. First, the variance of the standardized residuals is not entirely constant, so the homoscedasticity assumption fails to some degree. Second, I only acted on the influence results based on Cook's distance. Although Cook's distance flags no influential points, a number of influential observations are suggested by DFFITS and DFBETAS. Ignoring these points reduces the accuracy of the model, as seen in the model validation: the final model does not fit the testing dataset very well.

(Word count for content only: 1484)

Appendix
(Appendix 1: scatter plot matrix of the response and the numeric variables)
(Appendix 2: pair correlations between the numeric variables)

(Appendix 3: code)

knitr::opts_chunk$set(echo = TRUE)
# Load useful libraries
library(car)    # vif(), powerTransform()
library(leaps)
library(MASS)   # stepAIC()
# Load data
data <- read.csv("C:/Users/安静的蛙w/Desktop/FP_dataset.csv", header=T)


#summary variables
str(data)
summary(data)
#training set and testing set
training <- data[1:1131,]
testing <- data[1132:1508,]
str(training)
summary(training)
#scatterplot
pairs(training[,13:31])
#check multicollinearity
cor(cbind(training[,14:31],training$ADM_RATE))
modfull <- lm(ADM_RATE ~ NUMBRANCH + COSTT4_A + INC_PCT_LO +
                AVGFACSAL + PFTFAC + PCTPELL +
                UG25ABV + PAR_ED_PCT_1STGEN + FEMALE +
                PCT_WHITE + PCT_BLACK + PCT_ASIAN +
                PCT_BA + PCT_BORN_US + UNEMP_RATE,
              data = training)
vif(modfull)
#new dataset: tr1
tr1 <- training[, c(5:18, 20, 21, 27, 29, 31)]
#Selecting variables to fit the best model by AIC and BIC
#forward AIC
stepAIC(lm(ADM_RATE ~ 1, data=tr1), 
        scope=list(upper=lm(ADM_RATE ~ ., data = tr1)), 
        direction = "forward", k=2)
#Selecting variables to fit the best model by AIC and BIC
#forward BIC
stepAIC(lm(ADM_RATE ~ 1, data=tr1), 
        scope=list(upper=lm(ADM_RATE ~ ., data = tr1)), 
        direction = "forward",  k=log(nrow(tr1)))
#model from BIC result
MB <- lm(ADM_RATE ~ AVGFACSAL + CONTROL + HBCU + FEMALE + 
    NUMBRANCH + PFTFAC + PCT_BORN_US, data = tr1) 


plot(rstandard(MB) ~ fitted(MB), xlab="Fitted", ylab="Residuals")
plot(rstandard(MB) ~ tr1$NUMBRANCH, xlab="NUMBRANCH", ylab="Residuals")
plot(rstandard(MB) ~ tr1$CONTROL, xlab="CONTROL", ylab="Residuals")
plot(rstandard(MB) ~ tr1$HBCU, xlab="HBCU", ylab="Residuals")
plot(rstandard(MB) ~ tr1$AVGFACSAL, xlab="AVGFACSAL", ylab="Residuals")
plot(rstandard(MB) ~ tr1$PFTFAC, xlab="PFTFAC", ylab="Residuals")
plot(rstandard(MB) ~ tr1$FEMALE, xlab="FEMALE", ylab="Residuals")
plot(rstandard(MB) ~ tr1$PCT_BORN_US, xlab="PCT_BORN_US", ylab="Residuals")

plot(tr1$ADM_RATE ~ MB$fitted.values, xlab="Fitted Values", ylab="ADM_RATE")
abline(a = 0, b = 1, lty=2)
lines(lowess(MB$fitted.values, tr1$ADM_RATE))
legend("topleft", legend=c("Identity", "Smooth Fit"), lty=c(2, 1))

#Variable Transformation for forward BIC
trans <- powerTransform(lm(cbind(tr1[,9]+1, tr1$NUMBRANCH, tr1$AVGFACSAL, 
                                 tr1$PFTFAC, tr1$FEMALE, tr1$PCT_BORN_US) ~ 1))
summary(trans)

#best model 
newM <- lm(I(ADM_RATE^2.87) ~ I(NUMBRANCH^(-6.62)) + CONTROL + HBCU + I(AVGFACSAL^0.33) + 
            PFTFAC + I(FEMALE^1.3) + I(PCT_BORN_US^8.40), data = tr1) 
summary(newM)
# Model Diagnostics
#leverage points
h <- hatvalues(newM)
threshold <- 2 * (length(newM$coefficients)/nrow(tr1))
w <- which(h > threshold)
w
#outliers
r <- rstandard(newM)
w <- which(r > 2 | r < -2)
tr1[w,]
r[w]
#influential points
D <- cooks.distance(newM)
cutoff <- qf(0.5, 8, nrow(tr1)-8, lower.tail=T)
d <- which(D > cutoff)
d

fits <- dffits(newM)
cutoff <- 2*sqrt(8/nrow(tr1))
f <- which(abs(fits) > cutoff)
f

dfb <- dfbetas(newM)
cutoff <- 2/sqrt(nrow(tr1))
b <- which(abs(dfb[,8]) > cutoff)
b
# Model Diagnostics
#plots
#residual plots
plot(newM, 1)
plot(newM, 3)
plot(newM, 4)

pairs(tr1[,c(9,1,2,4,11,12,16,18)])

par(mfrow=c(2,3))

plot(rstandard(newM) ~ fitted(newM), xlab="Fitted", ylab="Residuals")
plot(rstandard(newM) ~ I((tr1$NUMBRANCH)^(-6.62)), xlab="NUMBRANCH^(-6.62)", ylab="Residuals")
plot(rstandard(newM) ~ tr1$CONTROL, xlab="CONTROL", ylab="Residuals")
plot(rstandard(newM) ~ tr1$HBCU, xlab="HBCU", ylab="Residuals")
plot(rstandard(newM) ~ I((tr1$AVGFACSAL)^0.33), xlab="AVGFACSAL^0.33", ylab="Residuals")
plot(rstandard(newM) ~ tr1$PFTFAC, xlab="PFTFAC", ylab="Residuals")
plot(rstandard(newM) ~ I((tr1$FEMALE)^1.3), xlab="FEMALE^1.3", ylab="Residuals")
plot(rstandard(newM) ~ I((tr1$PCT_BORN_US)^8.40), xlab="PCT_BORN_US^8.40", ylab="Residuals")



qqnorm(rstandard(newM))
abline(a = 0, b = 1)
plot(tr1$ADM_RATE ~ fitted(newM), xlab="Fitted", ylab="ADM_RATE")
abline(a = 0, b = 1, lty=2)
par(mfrow=c(1,1))
plot(tr1$ADM_RATE ~ newM$fitted.values, xlab="Fitted Values", ylab="ADM_RATE")

#Fit the same model on the validation dataset
str(testing)
summary(lm(I(ADM_RATE^2.87) ~ I(NUMBRANCH^(-6.62)) + CONTROL + HBCU + I(AVGFACSAL^0.33) + 
            PFTFAC + I(FEMALE^1.3) + I(PCT_BORN_US^8.40), data = testing))

References:
Daignault, K. STA302W6A_slides [PowerPoint presentation]. Retrieved from https://q.utoronto.ca/courses/154219/pages/week-6-part-a-materials?module_item_id=1427861

Daignault, K. STA302W6B_slides [PowerPoint presentation]. Retrieved from https://q.utoronto.ca/courses/154219/pages/week-6-part-b-materials?module_item_id=1427862

Daignault, K. STA302W5B_slides [PowerPoint presentation]. Retrieved from https://q.utoronto.ca/courses/154219/pages/week-5-part-b-materials?module_item_id=1414867

Sheather, S. J. (2010). A Modern approach to regression with R. New York: Springer.
