Training and test sets with different data distributions
by Nezar Assawiel
What to do when your training and testing data come from different distributions
To build a well-performing machine learning (ML) model, it is essential to train the model on, and test it against, data that come from the same target distribution.
However, sometimes only a limited amount of data from the target distribution can be collected. It may not be sufficient to build the needed train/dev/test sets.
Yet similar data from other data distributions might be readily available. What to do in such a case? Let us discuss some ideas!
Some background knowledge
To better follow the discussion here, you can read up on the following basic ML concepts, if you are not familiar with them already:
Train, dev (development), and test sets: Note that the dev set is also called the validation set or the hold-out set. This post is a good short introduction to the topic.
Bias (underfitting) and variance (overfitting) errors: This is a great simple explanation of these errors.
How the train/dev/test split is correctly made: For a short background on this topic, you may refer to this post that I wrote earlier.
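For readers who want something concrete for the last item, here is a minimal sketch of a random train/dev/test split. It assumes all examples come from a single distribution, and the function name `train_dev_test_split` and the 80/10/10 fractions are illustrative choices, not something prescribed by the posts mentioned above.

```python
import random


def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the examples once, then cut them into train, dev, and test lists.

    The name, the default fractions, and the fixed seed are hypothetical
    choices made for this illustration.
    """
    items = list(examples)
    # Shuffle with a seeded generator so the split is reproducible.
    random.Random(seed).shuffle(items)

    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)

    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


# Stand-in data: integers instead of real labeled images.
train, dev, test = train_dev_test_split(range(1000))
print(len(train), len(dev), len(test))
```

Shuffling before cutting is what makes the three sets share a distribution; it only works as intended when every example was drawn from the same source in the first place.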
Scenario
Say you are building a dog-image classifier application that determines if an image is of a dog or not.
The application is intended for users in rural areas, who can take pictures of animals with their mobile devices and have the application classify the animals for them.
Studying the target data dis