What Will Happen When Adding a New Variable to a Multiple Linear Regression Model
- Conclusions:
- This new variable (assumed different from the existing ones) will never increase the residual sum of squares ($RSS$); in practice it almost always strictly decreases it.
- However, a large $p$-value from the partial t-test on this new variable leads us to conclude that the variable is not statistically significant and thus should not be included in the model.
- Explanations:
- Formula: $RSS=\sum_{i}(y_i-\hat{y}_i)^2$, where $y_i$ is the observed value of the response variable at $x_i$, and $\hat{y}_i$ is the estimated mean value of the unobservable random variable $Y$ at $x_i$, as computed by the fitted regression model. Notice that we assume no error in the observed values of the explanatory variable(s) used as predictor(s), i.e. we treat the predictors’ values as fixed at all times.
- From the formula, we can see that $RSS$ measures how much of the variability in the observed values of the response variable (i.e. in the dataset used) is NOT explained by the fitted regression model.
- Adding a new variable and estimating the model parameters by minimizing $RSS$ will always make the regression model explain at least as large a proportion of the variation in the observed $y$ values (see https://stats.stackexchange.com/questions/179244/is-rss-decreasing-or-non-increasing).
- This means that, regardless of whether a newly added variable makes sense in the model, $RSS$ will decrease (or at least not increase), which causes $R^2$ to increase (or at least not decrease). This leads to the caveat that using $R^2$ to decide whether to add a new variable to the model is not appropriate.
- The proper way to test whether the new variable is really statistically significant is through the $p$-value produced by the partial t-test on this new variable (or an equivalent test such as the partial $F$-test on the new variable). The null hypothesis is always the original model, and the alternative hypothesis is the original model plus the new variable (the new model). If $p$ is large, then we fail to reject the null hypothesis and conclude that the new variable does not significantly improve on the original model, so we still use the “old model”.
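The $RSS$/$R^2$ point above can be sketched numerically. Below is a minimal NumPy demonstration on simulated data (the dataset, variable names, and random seed are illustrative assumptions, not from the original text): even a pure-noise predictor, unrelated to the response, does not increase $RSS$, and so $R^2$ does not decrease.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: y depends only on x1.
n = 100
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

def fit_rss(X, y):
    """Ordinary least squares fit; returns RSS = sum((y_i - yhat_i)^2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

X_old = np.column_stack([np.ones(n), x1])   # intercept + x1
x_new = rng.normal(size=n)                  # pure noise, unrelated to y
X_new = np.column_stack([X_old, x_new])     # old model + noise predictor

rss_old = fit_rss(X_old, y)
rss_new = fit_rss(X_new, y)

# R^2 = 1 - RSS / TSS, with TSS the total sum of squares.
tss = float(((y - y.mean()) ** 2).sum())
r2_old = 1.0 - rss_old / tss
r2_new = 1.0 - rss_new / tss

# RSS never increases and R^2 never decreases when a column is added,
# regardless of whether the new variable is meaningful.
assert rss_new <= rss_old
assert r2_new >= r2_old
```

The assertions hold by construction: the old model's column space is contained in the new one's, so the minimized $RSS$ cannot go up.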
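The partial test in the last bullet can also be sketched in code. This is a minimal version of the partial $F$-test (equivalent to the partial t-test when exactly one variable is added, since $t^2 = F$), again on illustrative simulated data; it uses `scipy.stats.f` for the $F$ distribution's survival function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical simulated data: y depends only on x1; x_new is pure noise.
n = 100
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)
x_new = rng.normal(size=n)

def rss_of(X, y):
    """Ordinary least squares fit; returns the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

X_old = np.column_stack([np.ones(n), x1])   # H0: original model
X_new = np.column_stack([X_old, x_new])     # H1: original model + new variable

rss0 = rss_of(X_old, y)
rss1 = rss_of(X_new, y)

df_num = 1                    # number of added variables
df_den = n - X_new.shape[1]   # residual degrees of freedom of the larger model
F = ((rss0 - rss1) / df_num) / (rss1 / df_den)
p_value = stats.f.sf(F, df_num, df_den)

# Since x_new is noise, p_value is typically large here, so we would
# fail to reject H0 and keep the original model.
```

Note that $RSS$ still dropped ($rss1 \le rss0$); the test asks whether the drop is larger than chance alone would produce.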