YOLO/YOLOv5: Diagnosing Training Problems, with Notes on "Stop Feeding Garbage To Your Model"

Preface: the model I previously trained with YOLOv5 did not perform well, so I started looking for the cause and for ways to improve it.

Task: understand the problems with the first two training runs

Status: mostly done

Finding the problem through hands-on use:

Solving a problem starts with stating and locating it, and problems are discovered through use, so I first tried out my own model:

First, real-time recognition through the webcam. The results were disappointing: many gestures were not detected at all, and those that were detected were often misclassified, frequently as "door" even when they were not. Things that were not gestures were often recognised as gestures, too. Still images behaved the same way (I skipped further video tests): my own images were highly error-prone.

Next I tried images from the purchased source dataset, and recognition on that dataset's val split was excellent, with no errors.

Taken together: the model only works well on the seller's own images and generalises poorly to mine. It is clearly overfitting.

Hypothesis and verification:

I took another look at the seller's images and noticed that almost all of them show gestures against white walls and black clothing. Could the environment be what is destroying generalisation?

Since the gestures in those images all sit directly against black clothing, I changed my local background to black and ran recognition again. Detection rate and precision both rose sharply: the model detected gestures far more often and classified them correctly far more often. Sure enough, the uniform background had taught my gesture model to treat blackness as a key feature of a gesture. The seller's dataset turns out to be a fairly poor one.

Takeaway: pinning down the cause, namely the background environment of the seller's gesture images, taught me what a "feature" means in object detection: a feature is an important structural component of the target, the thing that distinguishes the target from everything else.

Tip: it seems more productive to reframe "why is a person recognised as door" into "why is the gesture not recognised", i.e. to keep the analysis firmly anchored on the gesture, since this is gesture recognition. That keeps the reasoning straight.

Since my understanding of what makes a dataset good or bad is clearly insufficient, the next step is to learn what a good dataset looks like.

Learning what makes a good dataset: "Stop Feeding Garbage To Your Model":

Study material: "Stop Feeding Garbage To Your Model! — The 6 biggest mistakes with datasets and how to avoid them", https://hackernoon.com/stop-feeding-garbage-to-your-model-the-6-biggest-mistakes-with-datasets-and-how-to-avoid-them-3cb7532ad3b7 (blocked in mainland China, needs a proxy). Every sentence in it hits home.

Introduction

If you haven’t heard it already, let me tell you a truth that you should, as a data scientist, always keep in a corner of your head:

“Your results are only as good as your data.”

Many people make the mistake of trying to compensate for their ugly dataset by improving their model. This is the equivalent of buying a supercar because your old car doesn’t perform well with cheap gasoline. It makes much more sense to refine the oil instead of upgrading the car. In this article, I will explain how you can easily improve your results by enhancing your dataset.

A memorable line: "Your results are only as good as your data." That makes clear where the centre of my current work lies: in a sense, the dataset is everything.

1. Not enough data

If your dataset is too small, your model doesn’t have enough examples to find discriminative features that will be used to generalize. It will then overfit your data, resulting in a low training error but a high test error.

Solution #1: gather more data. You can try to find more from the same source as your original dataset, or from another source if the images are quite similar or if you absolutely want to generalize.

Caveats: This is usually not an easy thing to do, at least without investing time and money. Also, you might want to do an analysis to determine how much additional data you need. Compare your results with different dataset sizes, and try to extrapolate.

Solution #2: augment your data by creating multiple copies of the same image with slight variations. This technique works wonders and it produces tons of additional images at a really low cost. You can try to crop, rotate, translate or scale your image. You can add noise, blur it, change its colors or obstruct parts of it. In all cases, you need to make sure the data is still representing the same class.

All these images still represent the "cat" category

This can be extremely powerful, as stacking these effects gives exponentially numerous samples for your dataset. Note that this is still usually inferior to collecting more raw data.

Combined data augmentation techniques. The class is still “cat” and should be recognized as such.

Caveats: all augmentation techniques might not be usable for your problem. For example, if you want to classify Lemons and Limes, don’t play with the hue, as it would make sense that color is important for the classification.

This type of data augmentation would make it harder for the model to find discriminating features.

Too little data means the model cannot find structural, discriminative features to generalise from; it overfits, giving a low training error but a high test error.

Option 1: collect more image data. This takes time and money and may be impractical, given the quantities involved (tens or hundreds of thousands of images to reach the goal). If you do collect more, first compare performance at several dataset sizes and extrapolate to estimate how many extra images you actually need.

Option 2: the data augmentation I mentioned earlier. Create same-label copies of an image by scaling, cropping, rotating, adding noise, blurring or occluding it. This is very effective, producing an exponentially large number of samples quickly, although it is usually still inferior to collecting more raw data.

Tip: augmentation must not damage the features used to tell the classes apart. Lemons and limes are distinguished mainly by colour, so hue augmentation there is just asking for trouble.
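As a concrete illustration of option 2, here is a minimal pure-Python sketch of stacking label-preserving transforms (flip, brightness shift, crop) on a toy image stored as a list of pixel rows. The 8x8 "image" is made up for illustration; a real pipeline would use a library such as Albumentations or torchvision on actual image tensors.

```python
import random

def hflip(img):
    """Horizontally flip an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in img]

def adjust_brightness(img, delta):
    """Shift every pixel value by delta, clamped to [0, 255]."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

def random_crop(img, size, rng):
    """Cut a size x size window out of the image at a random position."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, rng):
    """Stack label-preserving transforms to create one extra sample."""
    out = img
    if rng.random() < 0.5:
        out = hflip(out)
    return adjust_brightness(out, rng.randint(-30, 30))

rng = random.Random(0)
image = [[(r * 10 + c) % 256 for c in range(8)] for r in range(8)]  # toy 8x8 image
copies = [augment(image, rng) for _ in range(4)]   # four cheap extra samples
patch = random_crop(image, 6, rng)                 # cropping changes the size
print(len(copies), len(patch), len(patch[0]))      # 4 6 6
```

Each copy still depicts the same class, which is the invariant that matters; for detection tasks like YOLOv5's, geometric transforms would also have to update the bounding boxes.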

2. Low quality classes

It’s an easy one, but take time to go through your dataset if possible, and verify the label of each sample. This might take a while, but having counter-examples in your dataset will be detrimental to the learning process.

Also, choose the right level of granularity for your classes. Depending on the problem, you might need more or less classes. For example, you can classify the image of a kitten with a global classifier to determine it’s an animal, then run it through an animal classifier to determine it’s a kitten. A huge model could do both, but it would be much harder.

Two stage prediction with specialized classifiers.

First, if at all possible, go through the dataset image by image and check that every label is correct; it is well worth the time.

Second, choose an appropriate granularity of classes. For cats, say: first a global classifier decides the image is an animal, then an animal classifier decides it is a cat. Two stages with specialised classifiers like this are much easier than one huge general-purpose model.
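The two-stage routing can be sketched as below. The classifiers are hypothetical stubs keyed on made-up "attribute" sets purely to show the control flow; in practice each stage would be a trained model returning a class from an image.

```python
# Stand-in stub classifiers: each inspects a set of made-up attributes.
# Real code would run a trained coarse model, then a specialised one.
def coarse_classifier(image):
    return "animal" if "furry" in image else "object"

def animal_classifier(image):
    return "cat" if "whiskers" in image else "dog"

def object_classifier(image):
    return "door" if "hinge" in image else "other"

def predict(image):
    """Route through the coarse class first, then the matching specialist."""
    if coarse_classifier(image) == "animal":
        return animal_classifier(image)
    return object_classifier(image)

print(predict({"furry", "whiskers"}))  # cat
print(predict({"hinge"}))              # door
```

Each specialist only ever sees inputs from its own coarse class, so each sub-problem is smaller and easier than one flat classifier over every class.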

3. Low quality data

As said in the introduction, low quality data will only lead to low quality results.

You might have samples in your dataset that are too far from what you want to use. These might be more confusing for the model than helpful.

Solution: remove the worst images. This is a lengthy process, but will improve your results.

Sure, these three images represent cats, but the model might not be able to work with them.

Another common issue is when your dataset is made of data that doesn’t match the real world application. For instance if the images are taken from completely different sources.

Solution: think about the long term application of your technology, and which means will be used to acquire data in production. If possible, try to find/build a dataset with the same tools.

Using data that doesn’t represent your real world application is usually a bad idea. Your model is likely to extract features that won’t work in the real world.

Low quality data can only yield low quality results. Remove the worst images; it is a tedious job, but it improves the training outcome.

The three cat images shown are unsuitable for classification. (Personally I would call them interference with the feature structure.)

Also, pick the dataset with the long-term production application in mind, i.e. with the style and form of the data that will actually be acquired in deployment. If possible, build the dataset with the same tools.

4. Unbalanced classes

If the number of sample per class isn’t roughly the same for all classes, the model might have a tendency to favor the dominant class, as it results in a lower error. We say that the model is biased because the class distribution is skewed. This is a serious issue, and also why you need to take a look at precision, recall or confusion matrixes.

Solution #1: gather more samples of the underrepresented classes. However, this is often costly in time and money, or simply not feasible.

Solution #2: over/under-sample your data. This means that you remove some samples from the over-represented classes, and/or duplicate samples from the under-represented classes. Better than duplication, use data augmentation as seen previously.

We need to augment the under-represented class (cat) and leave aside some samples from the over-represented class (lime). This will give a much smoother class distribution.

If the classes in the training set do not have roughly equal numbers of images, the model may develop a preference for a particular class: predicting the dominant class simply produces fewer errors. A biased model is a serious problem, and it shows up in precision, recall and confusion matrices.
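A toy calculation shows why precision and recall expose the bias that plain accuracy hides. The counts are made up (95 "lime" samples, 5 "cat" samples) and the model is the degenerate one that always predicts the dominant class:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), with 0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# "Always predict lime" on 95 limes + 5 cats:
lime_p, lime_r = precision_recall(tp=95, fp=5, fn=0)  # lime: all 95 found, 5 cats mislabelled lime
cat_p, cat_r = precision_recall(tp=0, fp=0, fn=5)     # cat: never predicted at all
accuracy = 95 / 100
print(accuracy, lime_p, lime_r, cat_p, cat_r)  # 0.95 0.95 1.0 0.0 0.0
```

95% accuracy looks great, but cat recall is zero: the per-class metrics, not the aggregate accuracy, reveal that the model has simply learned the skewed class distribution.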

Option 1: collect more images for the under-represented classes.

Option 2: augment the under-represented classes to add images, and delete images from the over-represented classes.

Either way, the goal is a much more balanced contribution from every class.
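A minimal sketch of option 2, resampling every class to a common target count. The dataset is a made-up list of (id, class) pairs, and the over-sampling here is plain duplication; in practice you would augment the duplicated samples rather than copy them verbatim, as the article recommends.

```python
import random
from collections import Counter

def rebalance(data, target, rng):
    """Under-sample classes above `target` and over-sample classes below it."""
    by_class = {}
    for sample in data:
        by_class.setdefault(sample[1], []).append(sample)
    balanced = []
    for cls, samples in by_class.items():
        if len(samples) > target:
            balanced += rng.sample(samples, target)                   # drop extras
        else:
            balanced += [rng.choice(samples) for _ in range(target)]  # duplicate
    return balanced

rng = random.Random(0)
# Skewed toy dataset: 100 "lime" samples but only 10 "cat" samples.
dataset = [(i, "lime") for i in range(100)] + [(i, "cat") for i in range(10)]
balanced = rebalance(dataset, 50, rng)
print(Counter(cls for _, cls in balanced))  # 50 of each class
```

The same effect can also be obtained at training time with a weighted sampler instead of physically rewriting the dataset.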

5. Unbalanced data

If your data doesn’t have a specific format, or if the values don’t lie in the certain range, your model might have trouble dealing with it. You will have better results with image that are in aspect ratio and pixel values.

Solution #1: Crop or stretch the data so that it has the same aspect or format as the other samples.

Two possibilities to improve a badly formatted image.

Your data should share a common format, otherwise it is hard to work with. A consistent image format, e.g. a consistent aspect ratio, gives better training results.

Tip: the YOLOv5 pipeline I use already performs this step itself.
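The aspect-ratio step can be sketched as a letterbox computation: scale the image to fit the target square without distortion, then pad the remainder. This is a simplified illustration of the idea, not YOLOv5's actual letterbox code (which also handles stride alignment and fills the padding with grey).

```python
def letterbox_shape(h, w, target):
    """Return (new_h, new_w, pad_top, pad_left) for fitting an h x w image
    into a target x target square while preserving its aspect ratio."""
    scale = min(target / h, target / w)       # never upscale one axis past the box
    new_h, new_w = round(h * scale), round(w * scale)
    pad_h, pad_w = target - new_h, target - new_w
    return new_h, new_w, pad_h // 2, pad_w // 2  # split padding over both sides

# A 480x640 landscape frame into a 640x640 square: scaled by 1.0,
# with 80 px of padding above and below.
print(letterbox_shape(480, 640, 640))  # (480, 640, 80, 0)
```

Because the scale factor is uniform, object shapes are preserved; only padding is added, which is why detectors prefer letterboxing over naive stretching.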

6. No validation or testing

Once your dataset has been cleaned, augmented and properly labelled, you need to split it. Many people split it the following way: 80% for training, and 20% for testing, which allows you to easily spot overfitting. However, if you are trying multiple models on the same testing set, something else happens. By picking the model giving the best test accuracy, you are in fact overfitting the testing set. This happens because you are manually selecting a model not for its intrinsic value, but for its performance on a specific set of data.

Solution: split the dataset in three: training, validation and testing. This shields your testing set from being overfitted by the choice of the model. The selection process becomes:

1.Train your models on the training set.

2.Test them on the validation set to make sure you aren’t overfitting.

3.Pick the most promising model. Test it on the testing set, this will give you the true accuracy of your model.

Note: Once you have chosen your model for production, don’t forget to train it on the whole dataset! The more data the better!

Do not use the test set to check for overfitting: model selection is itself a fitting process, so you can end up overfitting the test set and corrupting the test results. The test set should therefore be used only once. I did not fully follow the article here; another author's explanation of the three-way split is easier to understand:

"Why split the dataset into three parts?

In deep learning it is very important to divide the dataset into training, validation and test sets.

The main reason is to evaluate the model's performance on unseen data. With separate sets, the model is trained on the training set, its hyperparameters are tuned on the validation set, and it is finally evaluated on the test set to estimate its generalisation.

The training set trains the model by adjusting its parameters with an optimisation algorithm such as backpropagation.

The validation set is used to tune hyperparameters, such as the learning rate, the number of hidden layers and the number of neurons per layer, to improve performance on the validation set.

The test set evaluates the model's final performance after the hyperparameters have been tuned on the validation set; it serves as an estimate of how the model behaves on unseen data.

Splitting the data into these three sets helps prevent overfitting, which occurs when a model performs well on the training set but poorly on unseen data. Evaluating on separate validation and test sets confirms that the model has not overfitted and can generalise to unseen data.

…"
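The three-way split itself is only a few lines. This sketch uses a 70/15/15 ratio, which is an arbitrary choice for illustration; in practice a stratified split (preserving class proportions in each subset) is often preferable.

```python
import random

def split_dataset(samples, rng, train=0.7, val=0.15):
    """Shuffle and split samples into train / validation / test subsets."""
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

rng = random.Random(42)
data = list(range(1000))  # stand-in for 1000 sample ids
train_set, val_set, test_set = split_dataset(data, rng)
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

Shuffling before slicing matters: without it, any ordering in the source data (e.g. all of one class first) would leak into the split.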
