How Does Adjusted R Squared Avoid Model Overfitting (in Linear Regression)

$R^2_{adj}$ Formula

  • $R^2_{adj}$ is a statistical measure that represents the proportion of the variance of a dependent variable ($Y$) that is explained by an independent variable ($X$) in a regression model given by $E(Y|X=x)=\beta_0+\beta_1 x+\beta_2 x^2+...+\beta_p x^p$. Moreover, $y-E(Y|X=x)=\epsilon$ for each $x$ value, where $\epsilon$ is the error caused by randomness, assumed to have constant variance $\sigma^2$ across all $x$ values.
  • The formula is given by: $R^2_{adj}=1-\frac{RSS/(n-p-1)}{TSS/(n-1)}$ (a worked computation is sketched after this list).
  • The denominator is the sample variance of the dependent variable, $TSS/(n-1)=s^2=\frac{1}{n-1}\sum (y_i-\bar{y})^2$, and therefore measures the total variation in the dependent variable.
    • This is a constant for a given dataset $\{(x_1,y_1),...,(x_n,y_n)\}$.
  • The numerator is the residual sum of squares divided by its degrees of freedom, which is the estimated variance of the random error $\epsilon$. Specifically, $RSS/(n-p-1)=MSE=\hat\sigma^2$, where $RSS=\sum (\hat{f}(x_i)-y_i)^2$ is the sum of squared residuals. Since the squared residuals measure the deviation of the fitted $y$ values from the observed $y$ values, the numerator measures the part of the variance in the dependent variable that is left unexplained by the proposed regression model.
  • Thus, one minus the ratio of the two gives the proportion of the variance in the data that is explained by the model (after adjusting for degrees of freedom), which is exactly what $R_{adj}^2$ intends to measure.
  • A side note regarding the degrees of freedom: $n-p-1$ is the degrees of freedom of $RSS$. Here, $n$ is the sample size and $p$ is the number of slope coefficients in the model ($\hat{\beta_i}, i=1,...,p$), subtracted because each of these $p$ coefficients is an estimate of the corresponding parameter (i.e. the “truth”). The extra $1$ is subtracted because $\hat{\beta_0}$, the estimate of the intercept, is given by $\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}-\hat{\beta_2}\bar{x^2}-...-\hat{\beta_p}\bar{x^p}$, where $\bar{y}$ is yet another estimate, reducing the degrees of freedom of $RSS$ by one more. (Every time you estimate a parameter, you lose one $df$.)
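
To make the pieces of the formula concrete, here is a minimal sketch in Python (using numpy and synthetic example data; both are illustrative assumptions, not code from the original post) that fits a degree-$p$ polynomial by least squares and computes $R^2_{adj}$ exactly as defined above, with $RSS$, $TSS$, and their degrees of freedom spelled out.

```python
import numpy as np

def adjusted_r_squared(x, y, degree):
    """Fit a degree-p polynomial by least squares and return (R^2, R^2_adj)."""
    n, p = len(y), degree
    # Fit E(Y|X=x) = b0 + b1*x + ... + bp*x^p by ordinary least squares.
    coeffs = np.polyfit(x, y, deg=p)
    y_hat = np.polyval(coeffs, x)

    rss = np.sum((y - y_hat) ** 2)      # residual sum of squares (unexplained)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares (total variation)

    r2 = 1 - rss / tss                                   # plain R^2
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))   # adjusted R^2
    return r2, r2_adj

# Hypothetical data: a noisy quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=2.0, size=x.size)

print(adjusted_r_squared(x, y, degree=2))
```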

Comparing Different Linear Models

  • Suppose we fit $10$ linear models, $m_1,...,m_{10}$, to the same data, where
    • $m_1$ is a first-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x$
    • $m_2$ is a second-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x+\beta_2 x^2$
    • $m_{10}$ is a $10$th-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x+...+\beta_{10} x^{10}$
  • When we compare these models’ fit to the data, keep in mind that the goal of a regression task is to capture the “truth” as fully and as accurately as possible. This means that in addition to examining how well the model reduces $RSS$ (i.e. how well it explains the total variation in the data), we also want to avoid overfitting so that our model honestly reflects the real trend in the data. Comparing $R_{adj}^2$ lets us weigh both criteria at once.
  • To see why, first notice that the denominator of $R_{adj}^2$ is a constant for a given dataset, so the magnitude of $R_{adj}^2$ depends only on the numerator.
  • Fact: $RSS$ always decreases as the degree of the polynomial model gets higher or as you add more predictors, because added terms always explain at least a little more variation in the dataset. Because of this, simply using $1-\frac{RSS}{TSS}=R^2$ is not enough for model comparison: more complex polynomial models will always win, yet overly complex models cause the problem of overfitting.
  • This is exactly why we divide $RSS$ by its degrees of freedom in $R_{adj}^2$: it takes model complexity into account (through the $p$ in the $df$). This normalization of $RSS$ cancels the complexity advantage whenever $RSS$ does not drop by enough as $p$ is increased. For example, if $RSS_{m_{10}}=497$ with $df_{m_{10}}=89$ while $RSS_{m_1}=500$ with $df_{m_1}=98$, then $RSS_{m_{10}}/df_{m_{10}} \approx 5.58 > RSS_{m_1}/df_{m_1} \approx 5.10$, so ${R_{adj}^2}_{m_{10}}$ is smaller than ${R_{adj}^2}_{m_1}$, leading us to prefer $m_1$ of the two because it captures the truth nearly as well with much less complexity.
  • Ultimately, when comparing $m_1,...,m_{10}$, we can plot $R_{adj}^2$ against the polynomial degree, find the point where $R_{adj}^2$ stops increasing with the degree, and the corresponding model is likely the one we want (a sketch of this comparison follows below).
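
Below is a minimal sketch of that comparison, assuming the same `(x, y)` data and the `adjusted_r_squared` helper from the earlier snippet (again illustrative, not the author's code). It fits $m_1$ through $m_{10}$ and prints both metrics, so you can watch $R^2$ climb monotonically while $R^2_{adj}$ flattens or drops once the extra degrees stop paying for themselves.

```python
# Compare m_1, ..., m_10 on the same (x, y) data from the previous sketch.
results = []
for degree in range(1, 11):
    r2, r2_adj = adjusted_r_squared(x, y, degree)
    results.append((degree, r2, r2_adj))
    print(f"degree {degree:2d}:  R^2 = {r2:.4f}   R^2_adj = {r2_adj:.4f}")

# Pick the degree with the highest adjusted R^2; in practice, also inspect a
# plot of R^2_adj vs. degree and stop where the curve no longer increases.
best_degree = max(results, key=lambda t: t[2])[0]
print("preferred degree by adjusted R^2:", best_degree)
```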
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值