Machine learning system design - Trading off precision and recall

In the last video, we talked about precision and recall as evaluation metrics for classification problems with skewed classes. For many applications, we'll want to somehow control the trade off between precision and recall. Let me tell you how to do that, and also show you some even more effective ways to use precision and recall as an evaluation metric for a learning algorithm.

As a reminder, here are the definitions of precision and recall from the previous video. Let's continue our cancer classification example, where y=1 if the patient has cancer and y=0 otherwise. And let's say we've trained a logistic regression classifier, which outputs probabilities between zero and one. As usual, we predict y=1 if h_{\theta }(x)>=0.5 and predict y=0 if h_{\theta }(x)<0.5. This classifier will give us some value for precision and some value for recall.

But now, suppose we want to predict that a patient has cancer only if we're very confident that they really do. Because if you go to a patient and tell them that they have cancer, it's going to give them a huge shock, since this is seriously bad news, and they may end up going through a pretty painful treatment process. So maybe we want to tell someone we think they have cancer only if we're very confident. One way to do this is to modify the algorithm so that, instead of setting the threshold at 0.5, we predict y=1 only if h_{\theta }(x)>=0.7. This means we tell someone they have cancer only if we think there's a greater than or equal to 70% chance that they do. If you do this, you're predicting someone has cancer only when you're more confident, so you end up with a classifier that has higher precision: of all the patients we go to and say "we think you have cancer", a higher fraction will actually have cancer. But in contrast, this classifier will have lower recall, because we're now predicting y=1 for a smaller number of patients. We can take this even further: instead of setting the threshold at 0.7, we can set it at 0.9 and predict y=1 only if we're more than 90% certain that the patient has cancer.

Now consider a different example. Suppose we want to avoid missing too many actual cases of cancer; that is, we want to avoid false negatives. In particular, if a patient actually has cancer but we fail to tell them so, that could be really bad. In this case, rather than setting a higher probability threshold, we might instead lower the threshold to, say, 0.3. By doing so, we're saying that if we think there's more than a 30% chance that the patient has cancer, we'd rather be conservative and tell them they may have cancer, so they can seek treatment if necessary. What we get in this case is a higher recall classifier, because we're correctly flagging a higher fraction of all the patients that actually do have cancer, but we end up with lower precision, because a higher fraction of the patients we flag will turn out not to have cancer at all.

And by the way, just as an aside, when I talk about this with students, what's pretty amazing is that the story can be told both ways: why we might want higher precision, or why we might want higher recall, and both versions sound reasonable. The more general principle is that, depending on whether you want higher precision with lower recall, or higher recall with lower precision, you can end up predicting y=1 when h_{\theta }(x) is greater than or equal to some threshold. And so in general, for most classifiers, there is going to be a trade off between precision and recall.
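As a concrete illustration of this thresholding, here is a minimal Python sketch. The probabilities and labels are made-up values, and the helper precision_recall_at_threshold is hypothetical, not something from the lecture; it just applies the rule "predict y=1 if h_{\theta }(x) >= threshold" and computes precision and recall from the resulting true positives, false positives, and false negatives.

```python
# Minimal sketch: precision and recall as a function of the prediction threshold.
# The probabilities and labels below are made-up illustrative values.

def precision_recall_at_threshold(probs, labels, threshold):
    """Predict y=1 whenever h_theta(x) >= threshold, then compute precision and recall."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical classifier outputs h_theta(x) and true labels y.
probs  = [0.95, 0.80, 0.65, 0.45, 0.30, 0.20, 0.10, 0.75]
labels = [1,    1,    0,    1,    0,    0,    0,    1]

for t in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at_threshold(probs, labels, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On these made-up numbers, raising the threshold pushes precision up and recall down, and lowering it does the opposite, which is exactly the trade off described above.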
And as you vary the value of this threshold, you can plot a curve that trades off precision and recall. At one end of the curve, corresponding to a very high value of the threshold, maybe 0.99, you get higher precision and relatively lower recall. At the other end, corresponding to a much lower threshold, maybe 0.01, you end up with a much lower precision and higher recall classifier. And as you vary the threshold, you can trace out a curve showing the range of different values of precision and recall your classifier can achieve. By the way, the precision recall curve can take many different shapes, depending on the classifier. So this raises another interesting question: is there a way to choose this threshold automatically?
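If you want to trace out such a curve numerically, one simple approach (sketched below with the same kind of made-up probabilities and labels as before) is to sweep the threshold over a grid and record the precision and recall at each value. In practice, a library routine such as scikit-learn's precision_recall_curve performs essentially this sweep for you.

```python
# Sketch: trace out a precision recall curve by sweeping the decision threshold.
# probs and labels are hypothetical classifier outputs and true labels.
probs  = [0.95, 0.80, 0.65, 0.45, 0.30, 0.20, 0.10, 0.75]
labels = [1,    1,    0,    1,    0,    0,    0,    1]

curve = []
for i in range(1, 100):
    t = i / 100.0                      # thresholds 0.01, 0.02, ..., 0.99
    preds = [1 if p >= t else 0 for p in probs]
    tp = sum(1 for yh, y in zip(preds, labels) if yh == 1 and y == 1)
    fp = sum(1 for yh, y in zip(preds, labels) if yh == 1 and y == 0)
    fn = sum(1 for yh, y in zip(preds, labels) if yh == 0 and y == 1)
    if tp + fp == 0:                   # no positive predictions at this threshold
        continue
    curve.append((tp / (tp + fn), tp / (tp + fp)))   # (recall, precision)

# Each distinct pair is one point on the curve; plotting recall on the x-axis
# and precision on the y-axis shows the trade off as the threshold varies.
for recall, precision in sorted(set(curve)):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```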

Or more generally, if we have a few different algorithms, or a few different ideas for algorithms, how do we compare different precision recall numbers? Concretely, suppose we have three different learning algorithms, or maybe the same algorithm with three different values for the threshold. How do we decide which of these is best? One of the things we talked about earlier is the importance of a single real number evaluation metric: the idea of having one number that just tells you how well your classifier is doing. But by switching to precision and recall as our metrics, we've actually lost that; we now have two real numbers. So we often end up facing situations like this: if we're trying to compare algorithm 1 to algorithm 2, is a precision of 0.5 and a recall of 0.4 better or worse than a precision of 0.7 and a recall of 0.1? If every time you try out a new algorithm you have to sit around and think, well, maybe 0.5 and 0.4 is better than 0.7 and 0.1, or maybe not, I don't know, that really slows down your decision making process for which changes are useful to incorporate into your algorithm. Whereas if we had a single real number evaluation metric, a number that just tells us whether algorithm 1 or algorithm 2 is better, that helps us decide much more quickly which algorithm to go with, and helps us evaluate much more quickly the different changes we may be contemplating for an algorithm.

So, how can we get a single real number evaluation metric? One natural thing you might try is to look at the average of precision and recall. Using P and R to denote precision and recall, you could just compute the average (P + R)/2 and pick whichever classifier has the highest average value. But this turns out not to be such a good solution because, similar to the example we had earlier, a classifier that predicts y=1 all the time gets a very high recall but a very low value of precision. Conversely, a classifier that predicts y=0 almost all the time, that is, one that predicts y=1 very sparingly (this corresponds to setting a very high threshold, using the notation from earlier), can end up with very high precision but very low recall. So neither extreme, a very high threshold or a very low threshold, gives a particularly good classifier, and the way we recognize that is by noticing that we end up with either a very low precision or a very low recall. If you just take the average of P and R, the average is actually highest for algorithm 3, even though you can get that sort of performance just by predicting y=1 all the time. So algorithms 1 and 2 will be more useful than algorithm 3, but in this example algorithm 3 has a higher average of precision and recall than algorithms 1 and 2. So we usually think of this average of precision and recall as not a particularly good way to evaluate a learning algorithm.

In contrast, there is a different way of combining precision and recall, called the F score, defined as 2PR/(P+R). Computing the F scores in this example, algorithm 1 has the highest F score, algorithm 2 has the second highest, and algorithm 3 has the lowest.
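To make the comparison concrete, here is a small sketch. The (precision, recall) pairs for algorithms 1 and 2 are the ones mentioned above, (0.5, 0.4) and (0.7, 0.1); the pair used for algorithm 3, (0.02, 1.0), is an assumed illustrative value for a classifier that predicts y=1 nearly all the time, not a number from the lecture.

```python
# Sketch: comparing the simple average of precision and recall with the F1 score.
# Algorithms 1 and 2 use the (precision, recall) pairs mentioned in the text;
# algorithm 3's pair is an assumed value for a "predict y=1 always" classifier.

def f1_score(p, r):
    """F1 = 2PR / (P + R); taken to be 0 when both precision and recall are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),   # hypothetical: predicts y=1 on almost everything
}

for name, (p, r) in algorithms.items():
    average = (p + r) / 2
    print(f"{name}: average={average:.3f}  F1={f1_score(p, r):.3f}")

# The plain average ranks the degenerate algorithm 3 highest (0.510), while the
# F1 score ranks algorithm 1 first and algorithm 3 last.
```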
So if we go by the F score, we would pick algorithm 1 over the others. The F score, which is also called the F1 score, is usually written as the F1 score, but often people will just say F score. It's a little bit like taking the average of precision and recall, but it gives the lower of the two values, whichever it is, a higher weight. You can see in the numerator that the F score takes the product of precision and recall, so if either precision is 0 or recall is 0, the F score will be 0. In that sense it combines precision and recall, but for the F score to be large, both precision and recall have to be pretty large. I should say that there are many different possible formulas for combining precision and recall; this F score formula is just one out of a much larger number of possibilities, but historically or traditionally this is what people in machine learning use. And the term F score doesn't really mean anything, so don't worry about why it's called the F score or the F1 score. But it usually gives you the effect that you want: if either precision is 0 or recall is 0, you get a very low F score. So if P=R=0, then the F1 score equals 0, and if P=R=1, then the F1 score equals 1. For values in between 0 and 1, the F score usually gives a reasonable rank ordering of different classifiers.
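One way to see why the F score gives the smaller of precision and recall a higher weight (a standard identity, not something stated explicitly in the lecture) is to rewrite it as the harmonic mean of the two: F_1 = \frac{2PR}{P+R} = \frac{2}{\frac{1}{P}+\frac{1}{R}}. A very small P or a very small R drags the whole score down, whereas the plain average (P+R)/2 does not.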

So in this video, we talked about the notion of trading off between precision and recall, and how we can vary the threshold that we use to decide whether to predict y=1 or y=0: the threshold that says whether we need to be at least 70% confident, or 90% confident, or whatever, before we predict y=1. By varying the threshold, you can control the trade off between precision and recall. We then talked about the F score, which takes precision and recall and gives you a single real number evaluation metric. And of course, if your goal is to set that threshold automatically, one pretty reasonable way to do that is to try a range of different values of the threshold, evaluate these different thresholds on, say, your cross validation set, and then pick whatever value of the threshold gives you the highest F score on your cross validation set. That would be a pretty reasonable way to automatically choose the threshold for your classifier as well.
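Finally, here is a sketch of what that automatic threshold selection might look like in code. The arrays cv_probs and cv_labels are hypothetical placeholders standing in for the classifier's outputs and the true labels on your cross validation set, and the helper functions are made up for illustration; the recipe itself is the one just described: sweep a range of thresholds, compute the F score on the cross validation set at each, and keep the threshold with the highest score.

```python
# Sketch: pick the threshold that maximizes the F1 score on a cross validation set.
# cv_probs are hypothetical h_theta(x) values on the cross validation set;
# cv_labels are the corresponding true labels. Both are placeholders.

def precision_recall(probs, labels, threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yh, y in zip(preds, labels) if yh == 1 and y == 1)
    fp = sum(1 for yh, y in zip(preds, labels) if yh == 1 and y == 0)
    fn = sum(1 for yh, y in zip(preds, labels) if yh == 0 and y == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

cv_probs  = [0.92, 0.85, 0.70, 0.55, 0.40, 0.35, 0.15, 0.05]
cv_labels = [1,    1,    1,    0,    1,    0,    0,    0]

best_threshold, best_f1 = None, -1.0
for i in range(1, 100):
    t = i / 100.0
    p, r = precision_recall(cv_probs, cv_labels, t)
    score = f1(p, r)
    if score > best_f1:
        best_threshold, best_f1 = t, score

print(f"chosen threshold={best_threshold:.2f}  cross-validation F1={best_f1:.3f}")
```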
