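Abstract: This article is the original video transcript of Lesson 87, "Regularization and Bias/Variance", from Chapter 11, "Advice for Applying Machine Learning", of Andrew Ng's Machine Learning course. I recorded it while studying the videos and revised it to be more concise and easier to read, so that I can refer back to it later. I'm sharing it here; if you find any errors, corrections are welcome and sincerely appreciated. I hope it also helps with your own studies.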
————————————————
You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with and is affected by the regularization of your learning algorithm.
Suppose we're fitting a high-order polynomial like the one shown here, but to prevent overfitting we're going to use regularization, like that shown here. So we have this regularization term to try to keep the values of the parameters small. And as usual, the regularization sum runs from $j = 1$ to $n$ rather than from $j = 0$ to $n$. Let's consider three cases. The first is the case of a very large value of the regularization parameter $\lambda$, such as $\lambda = 10000$, some huge value. In this case, all of these parameters $\theta_1$, $\theta_2$, $\theta_3$ and so on will be heavily penalized, so we end up with most of these parameters being close to 0, and the hypothesis $h_\theta(x)$ will be approximately equal to $\theta_0$. So we end up with a hypothesis that more or less looks like that: a flat, constant straight line. This hypothesis has high bias and badly underfits this data set; a horizontal straight line is just not a very good model for it. At the other extreme is a very small value of $\lambda$, such as $\lambda = 0$. In that case, given that we're fitting a high-order polynomial basically without regularization or with very minimal regularization, we end up with our usual high variance, overfitting setting. It's only if we have some intermediate value of $\lambda$, neither too large nor too small, that we end up with parameters $\theta$ that give us a reasonable fit to this data. So how can we automatically choose a good value for the regularization parameter $\lambda$?
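The two extreme cases described above can be reproduced with a small sketch. Everything here is an assumption made for illustration: the synthetic sine data, the degree-8 polynomial, and the helper names `poly_features` and `fit_regularized`. The closed-form solve is just one convenient way to minimize the regularized cost; it is not claimed to be the method used in the lecture.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Closed-form minimizer of the regularized squared-error cost."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # theta_0 is not penalized: the sum starts at j = 1
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Synthetic data (an assumption for illustration, not the lecture's data set).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)
X = poly_features(x, degree=8)

theta_large = fit_regularized(X, y, lam=10000.0)  # very large lambda
theta_zero = fit_regularized(X, y, lam=0.0)       # no regularization

# With a huge lambda, theta_1..theta_8 are pushed toward 0, so the
# hypothesis is roughly the constant theta_0 (a flat line).
print("lambda = 10000:", np.round(theta_large, 4))
# With lambda = 0 the coefficients are free to grow and chase the noise.
print("lambda = 0:    ", np.round(theta_zero, 2))
```

In this sketch the large-$\lambda$ fit collapses to roughly the mean of `y`, while the unregularized fit achieves a much lower training error on the same points.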
Just to reiterate, here's our model and here's our learning algorithm's objective. For the setting where we're using regularization, let me define $J_{train}(\theta)$ to be something different: the optimization objective but without the regularization term. Previously, in an earlier video, when we were not using regularization, I defined $J_{train}(\theta)$ to be the same as $J(\theta)$, the cost function. But when we're using regularization with this extra $\lambda$ term, we're going to define $J_{train}(\theta)$ to be just the sum of squared errors on the training set, or the average squared error on the training set, without taking into account that regularization term. And similarly, I'm also going to define the cross validation set error and the test set error as before, to be the average sum of squared errors on the cross validation and test sets. So just to summarize, my definitions of $J_{train}(\theta)$, $J_{cv}(\theta)$ and $J_{test}(\theta)$ are just the average squared error, or one half of the average squared error, on my training, validation and test sets, without the extra regularization term.
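Written out, the definitions above would look roughly as follows. This is a reconstruction from the spoken description (one half of the average squared error, regularization sum starting at $j = 1$); the set sizes $m_{cv}$ and $m_{test}$ and the subscripted examples are notation assumed here, since the original slide is not reproduced in the text.

```latex
\begin{align*}
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
             + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \\
J_{train}(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
J_{cv}(\theta) &= \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2 \\
J_{test}(\theta) &= \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2
\end{align*}
```

Only $J(\theta)$ carries the $\lambda$ term; the three error measures are used for evaluation, not for fitting.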
So, this is how we can automatically choose the regularization parameter $\lambda$. What I usually do is have some range of values of $\lambda$ I want to try. So I might consider not using regularization, or here are a few values I might try, such as $\lambda$ equal to 0.01, 0.02, 0.04 and so on. I usually step these up in multiples of two until some larger value. If I were to do these in multiples of two, I'd end up with 10.24 instead of exactly 10, but this is close enough. So this gives me maybe 12 different models that I'm trying to select among, corresponding to 12 different values of the regularization parameter $\lambda$. Of course, you can go to values less than 0.01 or values larger than 10, but I've just truncated it here for convenience.
Given each of these 12 models, what we can do is the following. We can take the first model, with $\lambda = 0$, and minimize my cost function $J(\theta)$. This gives me some parameter vector $\theta$, and similar to the earlier video, let me denote it $\theta^{(1)}$. Then I take my second model, with $\lambda = 0.01$, and minimize my cost function to get some different parameter vector $\theta^{(2)}$. For $\lambda = 0.02$ I end up with $\theta^{(3)}$, and so on, until for my final model, with $\lambda = 10$, I end up with $\theta^{(12)}$.
Next I can take all of these hypotheses, all of these parameters, and use my cross validation set to evaluate them. So I can look at my first model, my second model and so on, fit with these different values of the regularization parameter, and evaluate them on my cross validation set, basically measuring the average squared error of each of these parameter vectors $\theta$ on my cross validation set. I would then pick whichever of these 12 models gives me the lowest error on the cross validation set. Let's say, for the sake of this example, that I end up picking $\theta^{(5)}$ because that has the lowest cross validation error. Having done that, finally, if I want to report a test set error, I take the parameter $\theta^{(5)}$ that I've selected and look at how well it does on my test set. And once again, it is as if we've fit this parameter $\theta$ to my cross validation set, which is why I set aside a separate test set that I use to get a better estimate of how well my parameter vector $\theta$ will generalize to previously unseen examples. So, that's model selection applied to selecting the regularization parameter $\lambda$.
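The selection procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the synthetic data, the 36/12/12 split, the degree-8 polynomial, and the helper names are all made up for the example, and the closed-form solve stands in for whatever optimizer you use to minimize $J(\theta)$.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized cost J(theta); theta_0 is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def avg_squared_error(X, y, theta):
    """One half of the average squared error, with no regularization term."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data and split (assumptions for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=60)
X = poly_features(x, degree=8)
X_train, y_train = X[:36], y[:36]
X_cv, y_cv = X[36:48], y[36:48]
X_test, y_test = X[48:], y[48:]

# 12 candidates: 0, then 0.01 doubled up to 10.24 (close enough to 10).
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

# Fit each model on the training set using the regularized objective.
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]

# Evaluate every parameter vector on the cross validation set.
cv_errors = [avg_squared_error(X_cv, y_cv, theta) for theta in thetas]
best = int(np.argmin(cv_errors))

# Report generalization error on the untouched test set.
test_error = avg_squared_error(X_test, y_test, thetas[best])
print(f"selected lambda = {lambdas[best]}, J_cv = {cv_errors[best]:.4f}, "
      f"J_test = {test_error:.4f}")
```

Note that $\lambda$ appears only inside `fit_regularized`; the cross validation and test errors are computed without the regularization term, matching the definitions given earlier.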
The last thing I'd like to do in this video is get a better understanding of how the cross validation and training errors vary as we vary the regularization parameter $\lambda$. Just a reminder, that was our original cost function $J(\theta)$. But for this purpose, we're going to define the training error without the regularization term, and the cross validation error without the regularization term. What I'd like to do is plot $J_{train}(\theta)$ and $J_{cv}(\theta)$, meaning just how well my hypothesis does on the training set and how well it does on my cross validation set, as I vary my regularization parameter $\lambda$. As we saw earlier, if $\lambda$ is small, then we're not using much regularization, and we run a larger risk of overfitting. Whereas if $\lambda$ is large, that is, if we're on the right part of this horizontal axis, then we run a high risk of having a bias problem.
So if you plot $J_{train}(\theta)$ and $J_{cv}(\theta)$, what you find is that for small values of $\lambda$, you can fit the training set relatively well because you're barely regularizing. For small values of $\lambda$, the regularization term basically goes away, and you're just minimizing pretty much the squared error. So when $\lambda$ is small, you end up with a small value for $J_{train}(\theta)$; whereas if $\lambda$ is large, then you have a high bias problem, you might not fit your training set well, and you end up with a value up there. So $J_{train}(\theta)$ will tend to increase when $\lambda$ increases, because a large value of $\lambda$ corresponds to high bias, where you might not even fit your training set well, whereas a small value of $\lambda$ corresponds to being able to freely fit, let's say, very high degree polynomials to your data.
As for the cross validation error, we end up with a figure like this. Over here on the right, if we have a large value of $\lambda$, we may end up underfitting. This is the bias regime, and the cross validation error will be high, because with high bias we won't be doing well on the cross validation set. On the left is the high variance regime, where if we have too small a value of $\lambda$, we may be overfitting the data, and the cross validation error will also be high. So this is what the cross validation error and the training error may look like as we vary the regularization parameter $\lambda$. And once again, it will often be some intermediate value of $\lambda$ that is just right, or that works best in terms of having a small cross validation error or a small test set error.
The curves I've drawn here are somewhat cartoonish and idealized, so on a real data set the curves you get may look a little messier and noisier than this. But for some data sets you will really see these broad sorts of trends, and by looking at a plot of the cross validation error, you can either manually or automatically select the point that minimizes the cross validation error and select the value of $\lambda$ corresponding to low cross validation error. When I'm trying to pick the regularization parameter $\lambda$ for a learning algorithm, I often find that plotting a figure like this one helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter $\lambda$.
So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm. By now you've seen bias and variance from a lot of different perspectives. What I'd like to do in the next video is take all the insights we've gone through and build on them to put together a diagnostic called learning curves, which is a tool I often use to diagnose whether a learning algorithm may be suffering from a bias problem, a variance problem, or a little of both.
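As a companion to the discussion of how the training and cross validation errors change with the regularization parameter, here is a rough sketch that tabulates both errors over the same doubling grid of values. The data, split, polynomial degree and helper names are assumptions chosen for illustration, and a real data set may give noisier numbers than the idealized curves described in the video.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize the regularized cost; theta_0 is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def avg_squared_error(X, y, theta):
    """One half of the average squared error, with no regularization term."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data and split (assumptions for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=48)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=48)
X = poly_features(x, degree=8)
X_train, y_train = X[:32], y[:32]
X_cv, y_cv = X[32:], y[32:]

lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
j_train, j_cv = [], []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)
    j_train.append(avg_squared_error(X_train, y_train, theta))
    j_cv.append(avg_squared_error(X_cv, y_cv, theta))

# One row per lambda: these are the two curves one would plot.
for lam, jt, jc in zip(lambdas, j_train, j_cv):
    print(f"lambda = {lam:6.2f}  J_train = {jt:.4f}  J_cv = {jc:.4f}")
```

With an exact minimizer like this one, the training error column never decreases as the regularization parameter grows. The cross validation column is not guaranteed to form a clean U shape on a small noisy sample, which is why it helps to inspect the whole table or plot rather than a single value.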