Advice for applying machine learning - Model selection and training/validation/test sets

This article explains how to handle the model selection problem in machine learning, in particular choosing the polynomial degree or the regularization parameter. It introduces the idea of splitting the data into training, validation, and test sets to avoid overfitting and unfair estimates of generalization error: the cross validation set is used to select the model, and the test set is used to evaluate the final model's performance.

Abstract: This article is the transcript of Lesson 85, "Model selection and training/validation/test sets", from Chapter 11, "Advice for applying machine learning", of Andrew Ng's Machine Learning course. I wrote it down while watching the videos and lightly edited it to make it more concise and easier to read, for later reference. I'm sharing it here; if you spot any mistakes, corrections are very welcome and sincerely appreciated. I hope it helps with your studies.
————————————————
Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include to give you a learning algorithm. Or suppose you'd like to choose the regularization parameter \lambda for a learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into training and test sets, but how to split your data into what we'll discover is called the training, validation and test sets. We'll see in this video just what these things are, and how to use them to do model selection.

We've already seen, a number of times, the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters \theta _{0}, \theta _{1}, \theta _{2} and so on, to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. And the more general principle is that once your parameters are fit to some set of data (maybe the training set, maybe something else), then the error of your hypothesis measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, how well the hypothesis will generalize to new examples.
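To make this concrete, here is a minimal Python sketch of my own (not part of the lecture, and with an invented toy data-generating function): a high-degree polynomial fit to a small training set can have a training error close to zero while its error on fresh examples from the same source is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Toy data: a sine curve plus a little noise (purely illustrative)."""
    x = rng.uniform(-1.0, 1.0, m)
    y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(m)
    return x, y

x_train, y_train = sample(12)      # small training set
x_new, y_new = sample(1000)        # new examples the model never saw

theta = np.polyfit(x_train, y_train, deg=9)   # fit a 9th-degree polynomial

def half_mse(x, y):
    """Squared-error cost on the given examples."""
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2

print("training error:       ", half_mse(x_train, y_train))  # near zero
print("error on new examples:", half_mse(x_new, y_new))      # substantially larger
```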

Now let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to data. Should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10^{th}-order polynomial? So it's as if there's one extra parameter in this algorithm, which I'm going to denote d: the degree of polynomial you want to pick. So in addition to the \theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d=1 if you fit a linear function. We can choose d=2, d=3, all the way up to d=10. So we'd like to fit this extra parameter, which I'm denoting by d. And concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these 10 models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do. You could first take your first model and minimize the training error, and this will give you some parameter vector \theta. You could then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector \theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript (1), (2) there, where \theta ^{(1)} just means the parameters I get by fitting the first model to my training data, and \theta ^{(2)} just means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get parameters \theta ^{(3)}, up to, say, \theta ^{(10)}. One thing we could do then is take these parameters and look at the test set error. So I can compute on my test set J_{test}(\theta ^{(1)}), J_{test}(\theta ^{(2)}) and so on; that is, take each of my hypotheses with the corresponding parameters and just measure their performance on the test set. Then, in order to select one of these models, I could see which model has the lowest test set error. And let's just say, for this example, that I end up choosing the fifth-order polynomial. So this seems reasonable so far. But now let's say I want to take my fifth hypothesis, this fifth-order model, and ask how well it generalizes. One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. The reason is that we've fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set: namely, we chose the value of d that gave us the best possible performance on the test set. And so the performance of my parameter vector \theta ^{(5)} on the test set is likely to be an overly optimistic estimate of generalization error. Because I fit the parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set; I've chosen the degree d of polynomial using the test set, and so my hypothesis is likely to do better on this test set than it would on new examples it hasn't seen before, and that's what I really care about.
So just to reiterate: on the previous slide, we saw that if we fit some set of parameters, say \theta _{0}, \theta _{1}, \theta _{2} and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because these parameters were fit to the training set, so they're likely to do well on the training set, even if they don't do well on other examples. And in the procedure I just described on this slide, we did the same thing. Specifically, we fit the parameter d to the test set. And by having fit the parameter to the test set, the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.
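Here is a short sketch of that flawed procedure (my own code, not from the lecture; the helper names fit_poly and cost are made up). The degree d is chosen by its performance on the test set, so the reported error for the winning model is optimistically biased.

```python
import numpy as np

def fit_poly(x, y, d):
    """Minimize the training error for a degree-d polynomial; returns theta^(d)."""
    return np.polyfit(x, y, d)

def cost(theta, x, y):
    """Squared-error cost J(theta) measured on the given examples."""
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2

def select_degree_using_test_set(x_train, y_train, x_test, y_test):
    # Fit theta^(1) ... theta^(10) on the training set.
    thetas = {d: fit_poly(x_train, y_train, d) for d in range(1, 11)}
    # Choose d by its performance on the TEST set: this is the flaw.
    test_errors = {d: cost(thetas[d], x_test, y_test) for d in range(1, 11)}
    d_best = min(test_errors, key=test_errors.get)
    # Because d was fit to the test set, this error is an overly optimistic
    # estimate of how the chosen model will do on genuinely new examples.
    return d_best, test_errors[d_best]
```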

To address this problem in a model selection setting, if we want to evaluate a hypothesis, this is what we usually do instead. Given the data set, instead of just splitting it into a training set and a test set, we're going to split it into three pieces. The first piece is going to be called the training set as usual. The second piece of the data I'm going to call the cross validation set, and I'm going to abbreviate cross validation as cv. Sometimes it's also called the validation set instead of the cross validation set. And the last part I'm going to call the usual test set. A pretty typical ratio at which to split these would be to send 60% of your data to the training set, maybe 20% to the cross validation set, and 20% to the test set. These numbers can vary a little bit, but this sort of ratio is pretty typical. So our training set will now be only maybe 60% of the data, and our cross validation set, or validation set, will have some number of examples, which I'm going to denote m_{cv}. So that's the number of cross validation examples. Following our earlier notational convention, I'm going to use (x^{(i)}_{cv}, y^{(i)}_{cv}) to denote the i^{th} cross validation example. And finally, we also have a test set over here, with m_{test} being the number of test examples.
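A minimal sketch of that 60/20/20 split (my own code, with assumed array inputs X and y). The data is shuffled first so that each piece is representative of the whole.

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle and split into ~60% training, ~20% cross validation, ~20% test."""
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)
    n_train = int(0.6 * m)
    n_cv = int(0.2 * m)
    train = idx[:n_train]
    cv = idx[n_train:n_train + n_cv]
    test = idx[n_train + n_cv:]          # the remaining ~20%
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```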

So, now that we've defined the training, validation (or cross validation), and test sets, we can also define the training error, cross validation error, and test error. Here's my training error, which I'm just writing as J_{train}(\theta ). This is pretty much the same thing as the J(\theta ) that I've written so far; it's just the training set error, as measured on the training set. Then J_{cv}(\theta ) is my cross validation error, which is pretty much what you'd expect: just like the training error, except measured on the cross validation data set. And here's my test set error, same as before.
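For reference, using the squared-error cost from earlier in the course, these three quantities can be written in the lecture's notation as:

J_{train}(\theta ) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta }(x^{(i)}) - y^{(i)} \right)^{2}

J_{cv}(\theta ) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}\left( h_{\theta }(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^{2}

J_{test}(\theta ) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left( h_{\theta }(x_{test}^{(i)}) - y_{test}^{(i)} \right)^{2}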

So when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or the cross validation set, to select the model. Concretely, we first take our first hypothesis, take the first model and minimize the cost function, and this gives me some parameter vector \theta for the linear model. As before, I'm going to put a superscript (1) just to denote that this is the parameter for the linear model. We do the same thing for the quadratic model and get some parameter vector \theta ^{(2)}, then get some parameter vector \theta ^{(3)}, and so on, down to, say, the tenth-order polynomial. And what I'm going to do is, instead of testing these hypotheses on the test set, test them on the cross validation set: I will measure J_{cv} to see how well each of these hypotheses does on my cross validation set. Then I'm going to pick the hypothesis with the lowest cross validation error. So for this example, let's say, for the sake of argument, that it was my 4th-order polynomial that had the lowest cross validation error. In that case, I'm going to pick this 4th-order polynomial model. And finally, what this means is that the parameter d (remember, d was the degree of polynomial) has been fit, and we'll set d equal to 4, and we did so using the cross validation set. So this degree of polynomial, this parameter, is no longer fit to the test set. We've saved away the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this procedure.
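Here is a sketch of that corrected procedure (again my own code, reusing the hypothetical fit_poly and cost helpers from the earlier sketch): d is selected on the cross validation set, and the test set is used only once, to estimate the generalization error of the chosen model.

```python
def select_degree_using_cv(x_train, y_train, x_cv, y_cv, x_test, y_test):
    """Pick d on the cross validation set; estimate generalization on the test set."""
    # Fit theta^(1) ... theta^(10) on the training set only.
    thetas = {d: fit_poly(x_train, y_train, d) for d in range(1, 11)}
    # Select the degree with the lowest CROSS VALIDATION error.
    cv_errors = {d: cost(thetas[d], x_cv, y_cv) for d in range(1, 11)}
    d_best = min(cv_errors, key=cv_errors.get)      # e.g. d = 4 in the lecture's example
    # The test set was never used to choose d, so this is a fairer estimate
    # of how the selected model will do on new examples.
    generalization_estimate = cost(thetas[d_best], x_test, y_test)
    return d_best, generalization_estimate
```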

So, that was model selection, and how you can take your data, split it into a training, validation, and test set, use your cross validation data to select the model, and evaluate it on the test set. One final note: in machine learning as it's practiced today, there are many people who will do that earlier thing I talked about and said isn't such a good idea: selecting your model using the test set, and then reporting the error on the test set as though that were a good estimate of generalization error. Unfortunately, many people do follow that sort of practice. If you have a massive test set, it's maybe not a terrible thing to do, but many practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. As I just warned you, sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they only have a training set and a test set; that's not really considered good practice, but you will see some people do it. If possible, I would recommend against doing that yourself.

<end>
