* Saying the observations are identically distributed means we are drawing random observations without any systematic bias.
* Different sets of observations will give us different point estimates of the coefficients.
In each case, the red star marks the true values of beta_0 and beta_1, and every set of observations we draw produces one green dot (one pair of estimates).
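The picture above can be reproduced with a quick simulation: draw many datasets from the same true model and fit each one. The true model here (y = 2 + 0.5x with unit Gaussian noise) is a hypothetical choice for illustration, not from the notes.

```python
import numpy as np

# Hypothetical true model: y = beta_0 + beta_1 * x + Gaussian noise.
rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 0.5            # the "red star" (true coefficients)
x = np.linspace(0, 10, 30)

estimates = []
for _ in range(1000):                # each loop = one set of observations
    y = beta_0 + beta_1 * x + rng.normal(0, 1, size=x.size)
    b1_hat, b0_hat = np.polyfit(x, y, deg=1)   # one "green dot"
    estimates.append((b0_hat, b1_hat))

estimates = np.array(estimates)
print("mean estimate:", estimates.mean(axis=0))    # close to (2.0, 0.5)
print("std of estimates:", estimates.std(axis=0))  # spread of the green dots
```

The mean of the green dots sits near the red star (low bias), while their standard deviation measures the variance of the estimator.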
If we knew the true values of the betas, we could quantify the bias and variance of our fitted model with respect to those true values. This tells us which kinds of algorithms produce high bias or high variance, both of which we want to avoid.
For example, if we fit the observations with a 10th-order polynomial, our model changes drastically from one dataset to the next. With such a high-order polynomial, the fitted coefficients swing widely across different sets of observations, so the algorithm is very sensitive to the particular data we have and gives us high variance.
In the opposite extreme, if we fit our model with a horizontal line, it always predicts a constant no matter what observations we feed in. We then have zero variance but an obvious bias, whose size depends on the value of the constant the line represents.
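These two extremes can be compared numerically. The sketch below again assumes a hypothetical linear true model (y = 2 + 0.5x with unit Gaussian noise) and compares predictions at one test point across many datasets; the specific numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
x0 = 0.8                                   # point at which we compare predictions
true_y0 = 2.0 + 0.5 * x0

preds_poly, preds_const = [], []
for _ in range(500):                       # 500 independent sets of observations
    y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)
    coeffs = np.polyfit(x, y, deg=10)      # flexible 10th-order fit
    preds_poly.append(np.polyval(coeffs, x0))
    preds_const.append(y.mean())           # horizontal line: always a constant

bias_poly = np.mean(preds_poly) - true_y0
bias_const = np.mean(preds_const) - true_y0
# The polynomial's predictions scatter widely (high variance, low bias);
# the horizontal line barely moves (near-zero variance) but sits off target.
print(f"degree-10: bias {bias_poly:+.3f}, variance {np.var(preds_poly):.3f}")
print(f"constant:  bias {bias_const:+.3f}, variance {np.var(preds_const):.3f}")
```

The high-order fit shows much larger variance, while the horizontal line shows much larger bias, matching the trade-off described above.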