Understanding the Bias and Variance Tradeoff

Latest recommended article published on 2022-10-28 15:23:07

xmsheji

1. Bias and Variance

Understanding how different sources of error lead to bias and variance helps us improve the data-fitting process, resulting in more accurate models. We define bias and variance in three ways: conceptually, graphically, and mathematically.

1.1 Conceptual Definition

Error due to Bias: The error due to bias is the difference between the expected (or average) prediction of our model and the correct value we are trying to predict. Of course, you only have one model, so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model-building process more than once: each time you gather new data and run a new analysis, creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off, in general, these models' predictions are from the correct value.

Error due to Variance: The error due to variance is the variability of a model's prediction for a given data point. Again, imagine you can repeat the entire model-building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.
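The thought experiment above (rebuild the model many times on fresh data, then look at the predictions for one point) can be sketched as a small simulation. This is only an illustration under assumptions that are not in the original text: the true function `true_f`, the noise level, the query point, and the two toy models (a constant "mean" model and a 1-nearest-neighbor model) are all hypothetical choices.

```python
import math
import random
import statistics


def true_f(x):
    # Hypothetical "correct value" we are trying to predict.
    return math.sin(x)


def sample_training_set(rng, n=30, noise_sd=0.3):
    # One freshly gathered, noisy training set of (x, y) pairs.
    xs = [rng.uniform(0.0, math.pi) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]


def fit_mean_model(data):
    # Very rigid model: always predicts the average training target.
    mean_y = statistics.fmean([y for _, y in data])
    return lambda x: mean_y


def fit_nearest_neighbor(data):
    # Very flexible model: copies the target of the closest training point.
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def bias_and_variance(fit, x0, n_repeats=2000, seed=0):
    # Repeat the whole model-building process and collect predictions at x0.
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng))(x0) for _ in range(n_repeats)]
    bias = statistics.fmean(preds) - true_f(x0)   # average prediction minus truth
    variance = statistics.pvariance(preds)        # spread of predictions at x0
    return bias, variance


if __name__ == "__main__":
    x0 = math.pi / 2
    for name, fit in [("mean model", fit_mean_model),
                      ("1-nearest neighbor", fit_nearest_neighbor)]:
        bias, variance = bias_and_variance(fit, x0)
        print(f"{name}: bias={bias:+.3f}, variance={variance:.3f}")
```

Under these assumptions the rigid model should show the larger bias and the flexible model the larger variance, which is the contrast the two definitions describe.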

1.2 Graphical Definition

We can create a graphical visualization of bias and variance using a bulls-eye diagram. Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model-building process to get a number of separate hits on the target. Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data, so we predict very well and land close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values, resulting in poorer predictions. These different realizations result in a scatter of hits on the target.

We can plot four different cases representing combinations of both high and low bias and variance.
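As a rough numerical stand-in for the bulls-eye picture, the four cases can be simulated by drawing hits around a point that is either on or off the target center, with either a tight or a wide spread. The offsets, spreads, and the names `simulate_hits`, `summarize`, and `CASES` below are illustrative assumptions, not values from the original article.

```python
import math
import random


def simulate_hits(offset, spread, n_hits=500, seed=0):
    # Each hit is one model realization; the bulls-eye sits at (0, 0).
    # `offset` shifts the cloud off-center (bias); `spread` widens it (variance).
    rng = random.Random(seed)
    return [(offset + rng.gauss(0.0, spread), rng.gauss(0.0, spread))
            for _ in range(n_hits)]


def summarize(hits):
    # Distance of the average hit from the bulls-eye, and the
    # root-mean-square scatter of the hits around their own average.
    cx = sum(x for x, _ in hits) / len(hits)
    cy = sum(y for _, y in hits) / len(hits)
    center_distance = math.hypot(cx, cy)
    scatter = math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in hits) / len(hits)
    )
    return center_distance, scatter


# Hypothetical (offset, spread) pairs for the four combinations.
CASES = {
    "low bias, low variance": (0.0, 0.2),
    "low bias, high variance": (0.0, 1.0),
    "high bias, low variance": (2.0, 0.2),
    "high bias, high variance": (2.0, 1.0),
}

if __name__ == "__main__":
    for name, (offset, spread) in CASES.items():
        center_distance, scatter = summarize(simulate_hits(offset, spread))
        print(f"{name}: center distance={center_distance:.2f}, "
              f"scatter={scatter:.2f}")
```

The first number plays the role of bias (how far the cloud of hits sits from the center) and the second the role of variance (how widely the hits scatter), so each case can be read off without drawing the target.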
