How Does Adjusted R Squared Avoid Model Overfitting (in Linear Regression)

$R^2_{adj}$ Formula

  • $R^2_{adj}$ is a statistical measure that represents the proportion of the variance of a dependent variable ($Y$) that is explained by an independent variable ($X$) in a regression model given by $E(Y|X=x)=\beta_0+\beta_1 x+\beta_2 x^2+...+\beta_p x^p$. Moreover, $y-E(Y|X=x)=\epsilon$ for each $x$ value, where $\epsilon$ is the error caused by randomness, assumed to have constant variance $\sigma^2$ across all $x$ values.
  • The formula is given by: $R^2_{adj}=1-\frac{RSS/(n-p-1)}{TSS/(n-1)}$ (a worked computation is sketched after this list).
  • The denominator is the sample variance of the dependent variable, $TSS/(n-1)=s^2=\frac{1}{n-1}\sum (y_i-\bar{y})^2$, and therefore measures the total variation in the dependent variable.
    • This is a constant for a given dataset $\{(x_1,y_1),...,(x_n,y_n)\}$.
  • The numerator is the residual sum of squares divided by its degrees of freedom, which is the estimated variance of the random error $\epsilon$. Specifically, $RSS/(n-p-1)=MSE=\hat\sigma^2$, where $RSS=\sum (\hat{f}(x_i)-y_i)^2$ is the sum of squared residuals. Since the squared residuals measure the deviation of the fitted $y$ values from the observed $y$ values, the numerator measures the part of the variance in the dependent variable that is left unexplained by the proposed regression model.
  • Thus, one minus the ratio of the two gives the proportion of the variance in the data that is explained by the model (after adjusting for degrees of freedom), which is exactly what $R_{adj}^2$ intends to measure.
  • A side note regarding the degrees of freedom: $n-p-1$ is the degrees of freedom of $RSS$. Here, $n$ is the sample size and $p$ is the number of slope coefficients in the model ($\hat{\beta_i}, i=1,...,p$), subtracted because each of these $p$ coefficients is an estimate of the corresponding parameter (i.e. the “truth”). The extra $1$ is subtracted because $\hat{\beta_0}$, the estimate of the intercept, is given by $\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}-\hat{\beta_2}\bar{x^2}-...-\hat{\beta_p}\bar{x^p}$, where $\bar{y}$ is yet another estimate, reducing the degrees of freedom of $RSS$ by one more. (Every time you estimate a parameter, you lose one $df$.)
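
To make the pieces of the formula concrete, here is a minimal sketch in Python (using numpy and synthetic example data; both are illustrative assumptions, not code from the original post) that fits a degree-$p$ polynomial by least squares and computes $R^2_{adj}$ exactly as defined above, with $RSS$, $TSS$, and their degrees of freedom spelled out.

```python
import numpy as np

def adjusted_r_squared(x, y, degree):
    """Fit a degree-p polynomial by least squares and return (R^2, R^2_adj)."""
    n, p = len(y), degree
    # Fit E(Y|X=x) = b0 + b1*x + ... + bp*x^p by ordinary least squares.
    coeffs = np.polyfit(x, y, deg=p)
    y_hat = np.polyval(coeffs, x)

    rss = np.sum((y - y_hat) ** 2)      # residual sum of squares (unexplained)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares (total variation)

    r2 = 1 - rss / tss                                   # plain R^2
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))   # adjusted R^2
    return r2, r2_adj

# Hypothetical data: a noisy quadratic trend.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=2.0, size=x.size)

print(adjusted_r_squared(x, y, degree=2))
```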

Comparing Different Linear Models

  • Suppose we fit $10$ linear models, $m_1,...,m_{10}$, to the same data, where
    • $m_1$ is a first-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x$
    • $m_2$ is a second-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x+\beta_2 x^2$
    • $m_{10}$ is a $10$th-order polynomial model $E(Y|X=x)=\beta_0+\beta_1 x+...+\beta_{10} x^{10}$
  • When we compare these models’ fit to the data, keep in mind that the goal of a regression task is to capture the “truth” as fully and as accurately as possible. This means that in addition to examining how well the model reduces $RSS$ (i.e. how well it explains the total variation in the data), we also want to avoid overfitting so that our model honestly reflects the real trend in the data. Comparing $R_{adj}^2$ lets us weigh both criteria at once.
  • To see why, first notice that the denominator of $R_{adj}^2$ is a constant for a given dataset, so the magnitude of $R_{adj}^2$ depends only on the numerator.
  • Fact: $RSS$ always decreases as the degree of the polynomial model gets higher or as you add more predictors, because added terms always explain at least a little more variation in the dataset. Because of this, simply using $1-\frac{RSS}{TSS}=R^2$ is not enough for model comparison: more complex polynomial models will always win, yet overly complex models cause the problem of overfitting.
  • This is exactly why we divide $RSS$ by its degrees of freedom in $R_{adj}^2$: it takes model complexity into account (through the $p$ in the $df$). This normalization of $RSS$ cancels the complexity advantage whenever $RSS$ does not drop by enough as $p$ is increased. For example, if $RSS_{m_{10}}=497$ with $df_{m_{10}}=89$ while $RSS_{m_1}=500$ with $df_{m_1}=98$, then $RSS_{m_{10}}/df_{m_{10}} \approx 5.58 > RSS_{m_1}/df_{m_1} \approx 5.10$, so ${R_{adj}^2}_{m_{10}}$ is smaller than ${R_{adj}^2}_{m_1}$, leading us to prefer $m_1$ of the two because it captures the truth nearly as well with much less complexity.
  • Ultimately, when comparing $m_1,...,m_{10}$, we can plot $R_{adj}^2$ against the polynomial degree, find the point where $R_{adj}^2$ stops increasing with the degree, and the corresponding model is likely the one we want (a sketch of this comparison follows below).
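
Below is a minimal sketch of that comparison, assuming the same `(x, y)` data and the `adjusted_r_squared` helper from the earlier snippet (again illustrative, not the author's code). It fits $m_1$ through $m_{10}$ and prints both metrics, so you can watch $R^2$ climb monotonically while $R^2_{adj}$ flattens or drops once the extra degrees stop paying for themselves.

```python
# Compare m_1, ..., m_10 on the same (x, y) data from the previous sketch.
results = []
for degree in range(1, 11):
    r2, r2_adj = adjusted_r_squared(x, y, degree)
    results.append((degree, r2, r2_adj))
    print(f"degree {degree:2d}:  R^2 = {r2:.4f}   R^2_adj = {r2_adj:.4f}")

# Pick the degree with the highest adjusted R^2; in practice, also inspect a
# plot of R^2_adj vs. degree and stop where the curve no longer increases.
best_degree = max(results, key=lambda t: t[2])[0]
print("preferred degree by adjusted R^2:", best_degree)
```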
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值