1. The Bias-Variance Tradeoff
Ideally, a model should include all variables that explain the dependent variable and exclude all that do not. In practice, a regression model typically suffers from one of two problems:
Omitted variables
Extraneous included variables
1.1 Omitted Variables
An omitted variable is one that has a non-zero coefficient but is not included in the model. Omitting a variable has two effects.
Effects of omitted variable bias
√ The remaining variables absorb the effects of the omitted variable attributable to common variation.
The regression coefficients can no longer be interpreted as pure marginal effects.
√ The estimated residuals are larger in magnitude than the true shocks.
This is because the residuals contain both the true shock and some part of the omitted variable.
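Both effects can be seen in a small simulation (a sketch with NumPy; the variable names and coefficient values are my own, not from the notes). When x2 is omitted and is correlated with x1, the coefficient on x1 absorbs part of x2's effect, and the residuals are inflated:

```python
# Sketch (illustrative values, not from the notes): omitted-variable bias.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True model: y = 1.0*x1 + 0.5*x2 + eps, with x2 correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # cov(x1, x2) = 0.6
eps = rng.normal(size=n)
y = 1.0 * x1 + 0.5 * x2 + eps

# Full model: regress y on both x1 and x2.
X_full = np.column_stack([x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified model: omit x2; x1 absorbs 0.5 * 0.6 of its effect.
X_omit = x1[:, None]
b_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

resid_full = y - X_full @ b_full
resid_omit = y - X_omit @ b_omit

print(b_full)   # should be close to [1.0, 0.5]
print(b_omit)   # biased: should be close to 1.0 + 0.5 * 0.6 = 1.3
print(resid_omit.std() > resid_full.std())   # residuals inflated
```

The classic bias formula is visible here: the short-regression coefficient converges to the true coefficient plus the omitted coefficient times the regression of the omitted variable on the included one.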
1.2Extraneous included variables
Anextraneous variable is one that is included in the model but is not needed.
Effects of including irrelevant variables
√ Does not bias coefficients
In large samples, the coefficient on an extraneous variable converges to its population value of zero.
√ Increases the uncertainty of the estimated model parameters
√ Increases R² but decreases adjusted R²
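The R² effect can be checked numerically (a sketch; the helper name `r2_and_adj` and the data-generating values are my own). R² never falls when a regressor is added, while adjusted R² penalizes the extra parameter and typically falls when the added variable is irrelevant:

```python
# Sketch: adding an irrelevant regressor raises R^2 but tends to lower
# adjusted R^2 (names and values are illustrative assumptions).
import numpy as np

def r2_and_adj(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X plus an intercept."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)                   # extraneous variable

r2_small, adj_small = r2_and_adj(y, x[:, None])
r2_big, adj_big = r2_and_adj(y, np.column_stack([x, noise]))

print(r2_big >= r2_small)   # R^2 never decreases when a regressor is added
```

Adjusted R² falls exactly when the added variable's absolute t-statistic is below 1, which is the usual case for pure noise.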
2.1 Tradeoff between bias and variance
On one hand, models with more explanatory variables have more estimation error but also more explanatory power.
On the other hand, models with fewer explanatory variables have less estimation error but also less explanatory power.
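This tradeoff is often summarized by the standard decomposition of expected squared prediction error (not stated in the notes, but standard):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\!\left(\hat{f}(x)\right)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Omitting relevant variables raises the bias term; including extraneous variables raises the variance term.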
2.2 Two approaches to finding the appropriate model complexity
2.2.1 General-to-specific model selection
1. First include all relevant candidate variables.
2. Remove the variable whose coefficient has the smallest absolute t-statistic, provided it is statistically insignificant.
3. Re-estimate using the remaining explanatory variables and again remove the least significant variable.
4. Repeat the steps above until the model contains no coefficients that are statistically insignificant.
5. Common choices for the significance level α are 1% and 0.1% (critical |t| values of at least 2.57 or 3.29, respectively).
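The iterative procedure can be sketched in plain NumPy (all function names and data values here are my own illustrative choices, not from the notes):

```python
# Sketch of general-to-specific (backward elimination) model selection.
import numpy as np

def ols_t_stats(y, X):
    """OLS coefficients and their classical t-statistics."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - k)              # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)
    return b, b / np.sqrt(np.diag(cov))

def general_to_specific(y, X, t_crit=2.57):
    """Drop the regressor with the smallest |t| until all |t| >= t_crit."""
    keep = list(range(X.shape[1]))
    while keep:
        _, t = ols_t_stats(y, X[:, keep])
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_crit:
            break
        keep.pop(worst)                        # remove, then re-estimate
    return keep

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # cols 2, 3 irrelevant

print(general_to_specific(y, X))   # should retain columns 0 and 1
```

The default `t_crit=2.57` corresponds to the 1% significance level mentioned above; use 3.29 for 0.1%.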
2.2.2 M-fold cross-validation
M-fold cross-validation is designed to select a model that performs well in fitting observations not used to estimate the parameters (out-of-sample prediction).
Steps of m-fold cross-validation
1. Determine a set of candidate models. If a dataset has n candidate explanatory variables, then there are 2^n possible model specifications.
2. Split the data into m equal-sized blocks; parameters are estimated using m-1 blocks (the training set) and residuals are computed with data in the excluded block (the validation set).
3. Repeat the process of estimating parameters and computing residuals a total of m times, ensuring each block is used to compute residuals exactly once.
4. Compute the sum of squared errors across the m validation blocks and choose the model with the smallest out-of-sample sum of squared residuals.
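The steps above can be sketched as follows (a minimal NumPy implementation; the function names and data-generating values are my own assumptions, and only the 2^p - 1 non-empty specifications are enumerated):

```python
# Sketch of m-fold cross-validation for choosing among candidate models.
from itertools import combinations
import numpy as np

def cv_sse(y, X, m=5):
    """Out-of-sample sum of squared residuals from m-fold cross-validation."""
    n = len(y)
    folds = np.array_split(np.arange(n), m)
    sse = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # the other m-1 blocks
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[fold] - X[fold] @ b              # validation residuals
        sse += resid @ resid
    return sse

rng = np.random.default_rng(3)
n, p = 300, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)             # only column 0 matters

# Enumerate the non-empty candidate specifications and pick the lowest CV SSE.
candidates = [cols for r in range(1, p + 1)
              for cols in combinations(range(p), r)]
best = min(candidates, key=lambda cols: cv_sse(y, X[:, list(cols)]))
print(best)
```

Each observation lands in exactly one validation block, so every data point contributes exactly once to the out-of-sample error of each candidate model.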
Summary: the above is a theoretical overview emphasizing how to choose variables by trading off bias against variance; the central idea of these algorithms is that tradeoff. Two approaches to variable selection are provided: general-to-specific model selection and m-fold cross-validation.