1. The Bias-Variance Tradeoff
Ideally, a model should include all variables that explain the dependent variable and exclude all that do not. In practice, a regression model typically suffers from one of two problems:
Omitted variables
Extraneous included variables
1.1 Omitted Variables
An omitted variable is one that has a non-zero coefficient but is not included in the model. Omitting a variable has two effects.
Effects of omitted variable bias
√ The remaining variables absorb the effects of the omitted variable attributable to common variation.
The regression coefficients can no longer be interpreted as pure marginal effects.
√ The estimated residuals are larger in magnitude than the true shocks.
This is because the residuals contain both the true shock and some part of the omitted variable.
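Both effects can be seen in a small simulation (a sketch with NumPy; the variable names and coefficient values are my own, not from the notes). When x2 is omitted and is correlated with x1, the coefficient on x1 absorbs part of x2's effect, and the residuals are inflated:

```python
# Sketch (illustrative values, not from the notes): omitted-variable bias.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True model: y = 1.0*x1 + 0.5*x2 + eps, with x2 correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # cov(x1, x2) = 0.6
eps = rng.normal(size=n)
y = 1.0 * x1 + 0.5 * x2 + eps

# Full model: regress y on both x1 and x2.
X_full = np.column_stack([x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified model: omit x2; x1 absorbs 0.5 * 0.6 of its effect.
X_omit = x1[:, None]
b_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

resid_full = y - X_full @ b_full
resid_omit = y - X_omit @ b_omit

print(b_full)   # should be close to [1.0, 0.5]
print(b_omit)   # biased: should be close to 1.0 + 0.5 * 0.6 = 1.3
print(resid_omit.std() > resid_full.std())   # residuals inflated
```

The classic bias formula is visible here: the short-regression coefficient converges to the true coefficient plus the omitted coefficient times the regression of the omitted variable on the included one.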
1.2Extraneous included variables
Anextraneous variable is one that is included in the model but is not needed.
Effects of including irrelevant variables
√ Does not bias coefficients
In large samples, the coefficient on an extraneous variable converges to its population value of zero.
√ Increases the uncertainty of the estimated model parameters
√ Increases R² but decreases adjusted R²
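The R² effect can be checked numerically (a sketch; the helper name `r2_and_adj` and the data-generating values are my own). R² never falls when a regressor is added, while adjusted R² penalizes the extra parameter and typically falls when the added variable is irrelevant:

```python
# Sketch: adding an irrelevant regressor raises R^2 but tends to lower
# adjusted R^2 (names and values are illustrative assumptions).
import numpy as np

def r2_and_adj(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X plus an intercept."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)                   # extraneous variable

r2_small, adj_small = r2_and_adj(y, x[:, None])
r2_big, adj_big = r2_and_adj(y, np.column_stack([x, noise]))

print(r2_big >= r2_small)   # R^2 never decreases when a regressor is added
```

Adjusted R² falls exactly when the added variable's absolute t-statistic is below 1, which is the usual case for pure noise.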
2.1 Tradeoff between bias and variance
On one hand, models with more explanatory variables have more estimation error but also more explanatory power.
On the other hand, models with fewer explanatory variables have less estimation error but also less explanatory power.
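This tradeoff is often summarized by the standard decomposition of expected squared prediction error (not stated in the notes, but standard):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\!\left(\hat{f}(x)\right)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Omitting relevant variables raises the bias term; including extraneous variables raises the variance term.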
2.2 Two approaches to finding the appropriate model complexity
2.2.1 General-to-specific model selection
1. First include all relevant candidate variables.
2. Remove the variable whose coefficient has the smallest absolute t-statistic, provided it is statistically insignificant.
3. Re-estimate using the remaining explanatory variables and again remove the least significant variable.
4. Repeat the steps above until the model contains no coefficients that are statistically insignificant.
5. Common choices for the significance level α are 1% and 0.1% (critical |t| values of at least 2.57 or 3.29, respectively).
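The iterative procedure can be sketched in plain NumPy (all function names and data values here are my own illustrative choices, not from the notes):

```python
# Sketch of general-to-specific (backward elimination) model selection.
import numpy as np

def ols_t_stats(y, X):
    """OLS coefficients and their classical t-statistics."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - k)              # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)
    return b, b / np.sqrt(np.diag(cov))

def general_to_specific(y, X, t_crit=2.57):
    """Drop the regressor with the smallest |t| until all |t| >= t_crit."""
    keep = list(range(X.shape[1]))
    while keep:
        _, t = ols_t_stats(y, X[:, keep])
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_crit:
            break
        keep.pop(worst)                        # remove, then re-estimate
    return keep

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # cols 2, 3 irrelevant

print(general_to_specific(y, X))   # should retain columns 0 and 1
```

The default `t_crit=2.57` corresponds to the 1% significance level mentioned above; use 3.29 for 0.1%.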
2.2.2 M-fold cross-validation
M-fold cross-validation is designed to select a model that performs well in fitting observations not used to estimate the parameters (out-of-sample prediction).
Steps of m-fold cross-validation
1. Determine a set of candidate models. If a dataset has n candidate explanatory variables, then there are 2^n possible model specifications.
2. Split the data into m equal-sized blocks; parameters are estimated using m-1 blocks (the training set) and residuals are computed with data in the excluded block (the validation set).
3. Repeat the process of estimating parameters and computing residuals a total of m times, ensuring each block is used to compute residuals exactly once.
4. Compute the sum of squared errors across the m validation blocks and choose the model with the smallest out-of-sample sum of squared residuals.
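The steps above can be sketched as follows (a minimal NumPy implementation; the function names and data-generating values are my own assumptions, and only the 2^p - 1 non-empty specifications are enumerated):

```python
# Sketch of m-fold cross-validation for choosing among candidate models.
from itertools import combinations
import numpy as np

def cv_sse(y, X, m=5):
    """Out-of-sample sum of squared residuals from m-fold cross-validation."""
    n = len(y)
    folds = np.array_split(np.arange(n), m)
    sse = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # the other m-1 blocks
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[fold] - X[fold] @ b              # validation residuals
        sse += resid @ resid
    return sse

rng = np.random.default_rng(3)
n, p = 300, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)             # only column 0 matters

# Enumerate the non-empty candidate specifications and pick the lowest CV SSE.
candidates = [cols for r in range(1, p + 1)
              for cols in combinations(range(p), r)]
best = min(candidates, key=lambda cols: cv_sse(y, X[:, list(cols)]))
print(best)
```

Each observation lands in exactly one validation block, so every data point contributes exactly once to the out-of-sample error of each candidate model.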
Summary: the above is a theoretical overview emphasizing how to choose variables by trading off bias against variance; the central idea of these algorithms is that tradeoff. Two approaches to variable selection are provided: general-to-specific model selection and m-fold cross-validation.