Coursera | Andrew Ng (03-week2-2.2) — Cleaning Up Incorrectly Labeled Data

This series only adds personal study notes and supplementary derivations on top of the original course; corrections and feedback are welcome. Having taken Andrew Ng's course, I organized it into text to make review and lookup easier. Since I have been studying English, the series is primarily in English, and I suggest readers also rely mainly on the English with Chinese as support, to lay the groundwork for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom


Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79167410


2.2 Cleaning up incorrectly labeled data

(Subtitle source: NetEase Cloud Classroom)


The data for your supervised learning problem comprises input X and output labels Y. What if, going through your data, you find that some of these output labels Y are incorrect, so you have data which is incorrectly labeled? Is it worth your while to go in and fix up some of these labels? Let's take a look. In the cat classification problem, Y equals one for cats and zero for non-cats. So, let's say you're looking through some data: that's a cat, that's not a cat, that's a cat, that's a cat, that's not a cat, that's a cat. No, wait, that's actually not a cat. So this is an example with an incorrect label. I've used the term "mislabeled examples" to refer to cases where your learning algorithm outputs the wrong value of Y. But I'm going to say "incorrectly labeled examples" to refer to cases where, in the data you have in the training set, the dev set, or the test set, the label for Y, whatever label a human assigned to this piece of data, is actually incorrect. That's actually a dog, so the Y really should have been zero, but maybe the labeler got that one wrong. So if you find that your data has some incorrectly labeled examples, what should you do?


Well, first, let's consider the training set. It turns out that deep learning algorithms are quite robust to random errors in the training set. So long as your errors, your incorrectly labeled examples, are not too far from random, maybe the labeler just wasn't paying attention or accidentally hit the wrong key on the keyboard, then it's probably okay to just leave the errors as they are and not spend too much time fixing them. There's certainly no harm in going into your training set, examining the labels, and fixing them. Sometimes that is worth doing, but you might be okay even if you don't, so long as the total data set size is big enough and the actual percentage of errors is not too high. I've seen a lot of machine learning systems trained even when we know there are some mistakes in the training set labels, and they usually work okay. There is one caveat to this: deep learning algorithms are robust to random errors, but they are less robust to systematic errors. For example, if your labelers consistently label white dogs as cats, that is a problem, because your classifier will learn to classify all white-colored dogs as cats. Random or near-random errors, though, are usually not too bad for most deep learning algorithms. Now, this discussion has focused on what to do about incorrectly labeled examples in your training set. How about incorrectly labeled examples in your dev set or test set?


If you're worried about the impact of incorrectly labeled examples on your dev set or test set, what I recommend you do is, during error analysis, add one extra column so that you can also count up the number of examples where the label Y was incorrect. For example, when you count up the impact on 100 mislabeled dev set examples, you'll find 100 examples where your classifier's output disagrees with the label in your dev set, and sometimes, for a few of those examples, your classifier disagrees with the label because the label was wrong, rather than because your classifier was wrong. So maybe in this example you find that the labeler missed a cat in the background; put a check mark there to signify that example 98 had an incorrect label. And maybe for this one, the picture is actually a drawing of a cat rather than a real cat, and you want the labeler to have labeled it Y equals zero rather than Y equals one; so put another check mark there. Just as you count up the percentage of errors due to other categories, as we saw in the previous video, you also count up the percentage of errors due to incorrect labels, where the Y value in your dev set was wrong, and that accounted for why your learning algorithm made a prediction that differed from the label on your data.
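The extra "incorrect label" column described above amounts to a simple tally over the manually inspected errors. A minimal sketch, where the category names and counts are illustrative rather than taken from the lecture:

```python
from collections import Counter

# Hypothetical error-analysis log: one entry per misclassified dev-set
# example, recording the cause found by manual inspection. "incorrect
# label" is the extra category added for mislabeled ground truth.
error_log = [
    "dog", "incorrect label", "blurry", "great cat", "dog",
    "incorrect label", "blurry", "dog", "great cat", "dog",
]

counts = Counter(error_log)
total = len(error_log)
for cause, n in counts.most_common():
    print(f"{cause:16s} {n:3d}  ({100 * n / total:.0f}% of errors)")
```

Sorting by `most_common()` surfaces the biggest error category first, which is exactly the prioritization signal error analysis is meant to give you.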


So the question now is: is it worthwhile going in to try to fix up this 6% of incorrectly labeled examples? My advice is, if it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix the incorrect labels. But if it doesn't make a significant difference to your ability to use the dev set to evaluate classifiers, then it might not be the best use of your time. Let me show you an example that illustrates what I mean. There are three numbers I recommend you look at to decide whether it's worth going in and reducing the number of mislabeled examples. First, look at the overall dev set error. In the example from the previous video, we said that maybe our system has 90% overall accuracy, so 10% error. Then look at the number or percentage of errors that are due to incorrect labels. In this case, 6% of the errors are due to incorrect labels, and 6% of 10% is 0.6%. Finally, look at the errors due to all other causes. If you make 10% error on your dev set and 0.6% of that is because the labels are wrong, then the remainder, 9.4%, is due to other causes, such as misrecognizing dogs as cats, great cats, and blurry images. So in this case, there's 9.4% worth of error that you could focus on fixing, whereas the errors due to incorrect labels are a relatively small fraction of the overall set of errors. By all means, go in and fix these incorrect labels if you want, but it's maybe not the most important thing to do right now.


Now, let's take another example. Suppose you've made a lot more progress on your learning problem, so instead of 10% error, you've brought the error down to 2%, but still 0.6% of your overall errors are due to incorrect labels. So now, if you examine the set of mislabeled dev set images, the 2% of dev set data your classifier is getting wrong, then a very large fraction of them, 0.6 divided by 2, which is 30% rather than 6%, of your errors are actually due to incorrectly labeled examples. Errors due to other causes are now only 1.4%. When such a high fraction of your measured dev set mistakes are due to incorrect labels, it seems much more worthwhile to fix up the incorrect labels in your dev set. And if you remember the goal of the dev set, its main purpose is to help you select between two classifiers A and B. Say you're trying out two classifiers A and B, and one has 2.1% error and the other has 1.9% error on your dev set. You can no longer trust your dev set to correctly tell you whether this classifier is actually better than that one, because 0.6% of these mistakes are due to incorrect labels. So then there's a good reason to go in and fix the incorrect labels in your dev set, because in the example on the right, incorrect labels are having a very large impact on the overall assessment of the algorithm's errors, whereas in the example on the left, the percentage impact on your algorithm is much smaller.
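The two scenarios above come down to one ratio: the share of dev-set mistakes attributable to bad labels. A quick sketch of the arithmetic (the function name is ours, not from the lecture):

```python
def label_error_share(overall_error, label_error):
    """Fraction of dev-set mistakes attributable to incorrect labels."""
    return label_error / overall_error

# Scenario 1 from the lecture: 10% overall error, 0.6% from bad labels.
s1 = label_error_share(0.10, 0.006)
# Scenario 2: error driven down to 2%, with the same 0.6% from bad labels.
s2 = label_error_share(0.02, 0.006)

print(f"At 10% error: {s1:.0%} of mistakes are label noise; "
      f"other causes account for {0.10 - 0.006:.1%}")
print(f"At  2% error: {s2:.0%} of mistakes are label noise; "
      f"other causes account for {0.02 - 0.006:.1%}")
```

The absolute label noise (0.6%) never changed; only the denominator shrank, which is why fixing labels becomes much more worthwhile as the classifier improves.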


Now, if you decide to go into your dev set and manually re-examine and fix up some of the labels, here are a few additional guidelines or principles to consider. First, I would encourage you to apply whatever process you use to both your dev and test sets at the same time. We've talked previously about why you want the dev and test sets to come from the same distribution: the dev set defines the target you're aiming at, and when you hit it, you want that to generalize to the test set, so your team works more efficiently when the dev and test sets come from the same distribution. So if you go in to fix something on the dev set, apply the same process to the test set to make sure they continue to come from the same distribution. If you hire someone to examine the labels more carefully, have them do that for both your dev and test sets. Second, I would urge you to consider examining examples your algorithm got right as well as ones it got wrong. It is easy to look at the examples your algorithm got wrong and see if any of those need to be fixed, but it's possible that there are some examples it got right that should also be fixed. If you only fix the ones your algorithm got wrong, you end up with a more biased estimate of the error of your algorithm; it gives your algorithm a little bit of an unfair advantage. If you only double-check what it got wrong and not what it got right, you may miss an example it got right just by luck, where fixing the label would cause it to go from being right to being wrong. This second bullet isn't always easy to do, so it's not always done. The reason is that if your classifier is very accurate, it's getting far fewer things wrong than right. If your classifier has 98% accuracy, then it's getting 2% of things wrong and 98% of things right, so it's much easier to examine and validate the labels on 2% of the data, and it takes much longer to validate the labels on 98% of the data. So this isn't always done; it's just something to consider.
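The first guideline, applying the same correction process to both splits, can be sketched as running one correction mapping over dev and test alike. The example ids, labels, and structure here are hypothetical:

```python
# `corrections` maps example id -> fixed label, built during the manual
# re-examination described above.
def apply_corrections(examples, corrections):
    """Return a copy of `examples` with corrected labels applied."""
    return [
        (ex_id, corrections.get(ex_id, label))
        for ex_id, label in examples
    ]

dev_set = [("img_098", 1), ("img_099", 1), ("img_100", 0)]
test_set = [("img_201", 1), ("img_200", 0)]
corrections = {"img_098": 0, "img_201": 0}  # e.g. drawings, not real cats

# Crucially, the SAME process is applied to BOTH splits, so they
# continue to come from the same distribution.
dev_fixed = apply_corrections(dev_set, corrections)
test_fixed = apply_corrections(test_set, corrections)
```

Fixing only the dev set would silently shift its label distribution away from the test set's, undermining the very comparison the dev set exists to support.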


Finally, if you go into your dev and test data to correct some of the labels there, you may or may not decide to apply the same process to the training set. Remember, we said earlier that it's actually less important to correct the labels in your training set, and it's quite possible you decide to correct the labels only in your dev and test sets, which are also often smaller than the training set, rather than investing all the extra effort needed to correct the labels in a much larger training set. This is actually okay. We'll talk later this week about some processes for handling the case where your training data comes from a different distribution than your dev and test data. Learning algorithms are quite robust to that. It's super important that your dev and test sets come from the same distribution, but if your training set comes from a slightly different distribution, often that's a pretty reasonable thing to do. I'll talk more about how to handle this later this week.


So I'd like to wrap up with just a couple of pieces of advice. First, deep learning researchers sometimes like to say things like, "I just fed the data to the algorithm, I trained it, and it worked." There is a lot of truth to that in the era of deep learning: there is more of feeding data to an algorithm and just training it, and less hand-engineering and use of human insight. But I think that in building practical systems, there's often more manual error analysis and more human insight that goes into the systems than deep learning researchers sometimes like to acknowledge.


Second, I've seen some engineers and researchers be reluctant to manually look at the examples. Maybe it's not the most interesting thing to do, to sit down and look at a hundred or a couple hundred examples to count the number of errors, but this is something I do myself. When I'm leading a machine learning team and I want to understand what mistakes it is making, I actually go in and look at the data myself and try to count up the fractions of errors. And because those minutes, or maybe a small number of hours, of counting data can really help you prioritize where to go next, I find this a very good use of your time, and I urge you to consider doing it if you're working on a machine learning system and trying to decide what ideas or directions to prioritize. So that's it for the error analysis process. In the next video, I want to share some thoughts on how error analysis fits into how you might go about starting out on a new machine learning project.



Key takeaways:

Cleaning up incorrectly labeled examples

Let's again use the cat classification problem for the analysis, looking at a few labeled samples.


Case 1:

Deep learning algorithms are quite robust to random errors in the training set.

As long as the mislabeled examples amount to random error, e.g., the labeler made a careless mistake or hit the wrong key, label errors of this kind can generally be left alone without causing problems.

So for this kind of error we don't need to spend a lot of time and effort on corrections; as long as the dataset is big enough, the actual error will not change much because of these random errors.

Case 2:

Although deep learning algorithms are quite robust to random errors, the same is not true for systematic errors.

If the labelers consistently label white dogs, as in the example, as cats, the resulting classifier will end up making that mistake too.

Incorrectly labeled data in the dev and test sets:

If incorrect labels appear in the dev or test set, we can add "incorrect label" as a cause during error analysis, analyze the mislabeled data, and work out the value of fixing those labels.


Correcting mislabeled examples in the dev and test sets:

  • Check the dev set and test set data together, making sure they come from the same distribution, so that with the dev set as the target, the algorithm carries over correctly to the test set.
  • Consider the examples the algorithm classified correctly as well as those it classified incorrectly. (This is usually quite hard and rarely done.)
  • The training set may come from a slightly different distribution than the dev/test sets.

Personal notes:

0. Clean up incorrectly labeled data.
1. Which split is the bad data in? How should it be handled in the training set versus the dev/test sets?
2. Count and analyze; it depends on the situation. If the impact is severe and the share is large, fix it; if the impact is minor and the share is small, it can be left alone.
3. The dev and test sets must come from the same distribution.

References:

[1] 大树先生. Distilled notes on Andrew Ng's Coursera Deep Learning course DeepLearning.ai (3-2): Machine Learning Strategy (2)


