Model Selection with Large Neural Networks and Small Data

The title statement is certainly a bold claim, and I suspect many of you are shaking your heads right now.

Under the classical teachings of statistical learning, this contradicts the well-known bias-variance tradeoff: the theory defines a sweet spot beyond which any further increase in model complexity tends to increase generalization error (the typical U-shaped test error curve).

You would think this effect is more pronounced for small datasets where the number of parameters, p, is larger than the number of observations, n, but this is not necessarily the case.

In a recent ICML 2020 paper by Deepmind (Bornschein, 2020), it was shown that one can train on a smaller subset of the training data while maintaining generalizable results, even for large overparameterized models. If this is true, we can reduce the computational overhead in model selection and hyperparameter tuning significantly.

Think for a moment about the implications of this. It could dramatically alter how we select optimal models or tune hyperparameters (for example, in Kaggle competitions), since we could include significantly more models in our grid search (or the like).

Is this too good to be true? And how can we prove it?

Here are the main takeaways before we get started:

  • Model selection is possible using only a subset of your training data, thus saving computational resources (relative ranking-hypothesis)

  • Large overparameterized neural networks can generalize surprisingly well (double descent)

  • After reaching a minimum, test cross-entropy tends to gradually increase over time while test accuracy improves (overconfidence). This can be avoided using temperature scaling.

Let’s get started.

1. Review of classical theory on the bias-variance trade-off

Before we get started, I will offer you two options. If you are tired of hearing about the bias-variance trade-off for the 100th time, please read the TLDR at the end of Section 1 and then move on to Section 2. Otherwise, I will briefly introduce the bare minimum needed to understand the basics before moving on with the actual paper.

The predictive error of any supervised learning algorithm can be broken into three (theoretical) parts, which are essential for understanding the bias-variance tradeoff: 1) bias, 2) variance, and 3) irreducible error (or noise).

The irreducible error (sometimes called noise) is a component that is independent of the chosen model and can never be reduced. It arises from an imperfect framing of the problem, meaning we will never be able to capture the true relationship in the data, no matter how good our model is.

The bias term is generally what people think of when they refer to model (predictive) errors. In short, it measures the difference between the “average” model prediction and the ground truth. Average might seem strange in this case as we typically only train one model. Think of it this way. Due to small perturbations (randomness) in our data, we can get slightly different predictions even with the same model. By averaging the range of predictions we get due to these perturbations, we obtain the bias term. High bias is a sign of poor model fit (underfitting), as it will have a large prediction error on both the training and test set.

Finally, the variance term refers to the variability of the model prediction for a given data point. It might sound similar, but the key difference lies in the “average” versus “data point”. High variance implies high generalization error. For example, while a model might be relatively accurate on the training set, it can achieve a considerably poor fit on the test set. This latter scenario (high variance, low bias) is typically the most likely when training overparameterized neural networks, i.e., what we refer to as overfitting.

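For completeness (this decomposition is standard textbook material and is not spelled out in the original post), the expected squared test error at a point x can be written as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}
```

where f is the true function, f̂ is the learned model (with expectations taken over training sets), and σ² is the variance of the noise.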

In practice, bias and variance must be balanced against each other (hence the name trade-off), typically by controlling model complexity. The ultimate goal is to obtain both low bias and low variance. This gives rise to the typical U-shaped test error curve you might have seen before.

[Figure: the typical U-shaped test error curve arising from the bias-variance trade-off. Source: https://www.digitalvidya.com/blog/bias-variance-tradeoff/]

Alright, I will assume you now know enough about the bias-variance trade-off to understand why the claim that overparameterized neural networks do not necessarily imply high variance is indeed puzzling.

TLDR; high variance, low bias is a sign of overfitting. Overfitting happens when a model achieves high accuracy on the training set but low accuracy on the test set. This typically happens for overparameterized neural networks.

2. Modern Regime: Larger models are better!

In practice, we typically optimize the bias-variance trade-off using a validation set with (for example) early stopping. Interestingly, this approach might be completely wrong. Over the past few years, researchers have found that if you keep fitting increasingly flexible models, you obtain what is termed double descent, i.e., generalization error will start to decrease again after reaching an intermediary peak. This finding is empirically validated in Nakkiran et al. (2019) for modern neural network architectures on established and challenging datasets. See the following figure from OpenAI, which shows this scenario;

[Figure: test error as a function of model size, showing deep double descent. Source: https://openai.com/blog/deep-double-descent/]
  • Test error initially declines until it reaches a (local) minimum, and then starts increasing again with increasing complexity. In the critical regime, it is important that we keep adding model complexity, as the test error will start to decline again, eventually reaching a (global) minimum.

These findings imply that larger models are generally better due to the double descent phenomenon, which challenges the long-held viewpoint on overfitting for overparameterized neural networks.

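To make the double descent picture concrete, here is a minimal PyTorch sketch (our own illustration, not code from any of the cited papers) that sweeps the hidden width of a one-hidden-layer MLP on a fixed small MNIST subset and records the test error for each width; plotting test error against width is how curves like the one above are produced. The subset size, widths, and training budget are arbitrary choices.

```python
# Sketch: sweep model width on a fixed small training subset to probe double descent.
# Assumes torchvision can download MNIST; all hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.ToTensor()
train_full = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
train_subset = Subset(train_full, range(4000))        # small fixed subset (assumption)
test_loader = DataLoader(test_set, batch_size=512)

def test_error(model):
    model.eval()
    wrong, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x.view(x.size(0), -1).to(device)).argmax(dim=1)
            wrong += (pred != y.to(device)).sum().item()
            total += y.size(0)
    return wrong / total

for width in [4, 16, 64, 256, 1024, 4096]:            # increasing capacity
    model = nn.Sequential(nn.Linear(784, width), nn.ReLU(), nn.Linear(width, 10)).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loader = DataLoader(train_subset, batch_size=256, shuffle=True)
    for _ in range(30):                                # train long enough to (nearly) fit the subset
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(x.view(x.size(0), -1).to(device)), y.to(device))
            loss.backward()
            opt.step()
    print(f"width={width:5d}  test error={test_error(model):.4f}")
```

Whether the intermediate peak is visible depends on the dataset, label noise, and training budget, but the sweep itself is all that is needed to trace the curve.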

3. Relative Ranking-Hypothesis

Having established that large overparameterized neural networks can generalize well, we want to take it one step further. Enter the relative ranking hypothesis. Before we explain the hypothesis, we note that if proven true, then you can potentially perform model selection and hyperparameter tuning on a small subset of your training dataset for your next experiment, and by doing so save computational resources and valuable training time.

We will briefly introduce the hypothesis followed by a few experiments to validate the claim. As an additional experiment not included in the literature (as far as we know), we will investigate one setting that could potentially invalidate the relative ranking hypothesis, which is imbalanced datasets.

a) Theory

One of the key hypotheses of Bornschein (2020) is:

“overparameterized model architectures seem to maintain their relative ranking in terms of generalization performance, when trained on arbitrarily small subsets of the training set”.

They call this observation the relative ranking-hypothesis.

In layman's terms: let's say we have 10 models to choose from, numbered from 1 to 10. We train our models on a 10% subset of the training data and find that model 6 is the best, followed by 4, then 3, and so on.

The ranking hypothesis postulates that as we gradually increase the subset percentage from 10% all the way up to 100%, we should obtain the exact same ordering of optimal models.

If this hypothesis is true, we can essentially perform model selection on a small subset of the original data, with the added benefit of much faster convergence. If this were not controversial enough, the authors take it one step further: they found experiments where training on small subsets led to more robust model selection (less variance), which certainly seems counterintuitive given that we would expect relatively more noise for smaller datasets.

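As a practical illustration (our own sketch, not code from the paper), the hypothesis can be checked by scoring every candidate model on nested subsets of the training data and comparing each subset's ranking with the full-data ranking, e.g., via a Spearman rank correlation. Here, train_and_score is a hypothetical placeholder for your own training routine that returns a test metric such as cross-entropy:

```python
# Sketch: does model ranking on small subsets agree with the full-data ranking?
# `train_and_score(model_fn, X, y)` is a hypothetical placeholder that trains the
# candidate model on (X, y) and returns its test cross-entropy (lower is better).
import numpy as np
from scipy.stats import spearmanr

def ranking_agreement(model_fns, X_train, y_train, train_and_score,
                      fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))          # one fixed shuffle, so the subsets are nested
    scores = {}
    for frac in fractions:
        idx = order[: int(frac * len(X_train))]
        scores[frac] = [train_and_score(fn, X_train[idx], y_train[idx]) for fn in model_fns]
    full = scores[max(fractions)]
    for frac in fractions:
        rho, _ = spearmanr(scores[frac], full)     # 1.0 means an identical ranking
        print(f"subset {frac:.0%}: Spearman rank correlation vs. full data = {rho:.2f}")
    return scores
```

If the hypothesis holds, the correlation stays close to 1 even at small fractions, so the cheap 10% runs are enough to shortlist models before training the finalists on the full data.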

b) Temperature calibration

One strange phenomenon when training neural network classifiers is that cross-entropy error tends to increase while classification error decreases. This seems counterintuitive but is simply due to models becoming overconfident in their predictions (Guo et al. (2017)). We can use something called temperature scaling, which calibrates the cross-entropy estimates on a small held-out dataset. This yields more generalizable and well-behaved results compared to classical cross-entropy, which is especially relevant for overparameterized neural networks. As a rough analogy, you can think of this as producing fewer "false negatives" regarding the number of overfitting cases.

While Bornschein (2020) does not provide explicit details on the exact softmax temperature calibration procedure used in the paper, we use the following procedure for our experiments (a code sketch follows the list below):

  • We define a held-out calibration dataset, C, equivalent to 10% of the training data.

  • We initialize the temperature scalar to be 1.5 (like in Guo et al. (2017))

For each epoch:

  • 1) Calculate cross-entropy loss on our calibration set C

  • 2) Optimize the temperature scalar using gradient descent on the calibration set (see this Github repo by Guo et al. (2017))

  • 3) Use the updated temperature scalar to calibrate the regular cross-entropy during gradient descent

  • After training for 50 epochs, we calculate the calibrated test error, which should no longer show signs of overconfidence.

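Since the exact calibration code is not given in Bornschein (2020), here is a minimal sketch of how the steps above can be implemented, loosely following the Guo et al. (2017) repository but simplified. The choices of plain gradient descent and of parameterizing log T (to keep the temperature positive) are our own assumptions:

```python
# Sketch: learn a single temperature scalar T on the held-out calibration set C,
# then report temperature-calibrated cross-entropy. Simplified from Guo et al. (2017);
# the epoch-level scheduling described above is our own choice, not from Bornschein (2020).
import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    def __init__(self, init_temperature=1.5):          # initialized to 1.5, as in Guo et al. (2017)
        super().__init__()
        self.log_t = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, logits):
        return logits / torch.exp(self.log_t)           # divide logits by T > 0

def fit_temperature(scaler, model, calib_loader, steps=50, lr=0.01, device="cpu"):
    """Steps 1-2 above: optimize T by gradient descent on the calibration cross-entropy."""
    model.eval()
    opt = torch.optim.SGD(scaler.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in calib_loader:
            with torch.no_grad():
                logits = model(x.to(device))            # the model is frozen; only T is trained
            loss = nn.functional.cross_entropy(scaler(logits), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.exp(scaler.log_t).item()

def calibrated_cross_entropy(scaler, model, loader, device="cpu"):
    """Evaluate cross-entropy on temperature-scaled logits (used for the final test error)."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            logits = scaler(model(x.to(device)))
            total += nn.functional.cross_entropy(logits, y.to(device), reduction="sum").item()
            n += y.size(0)
    return total / n
```

In step 3 of the procedure, the same updated temperature is also applied to the cross-entropy computed during the main model's gradient descent.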

Let us now turn to the experimental setting.

4. Experiments

We will conduct two experiments in this post. One for validating the relative ranking-hypothesis on the MNIST dataset, and one for evaluating how our conclusions change if we synthetically make MNIST imbalanced. This latter experiment is not included in the Bornschein (2020) paper, and could potentially invalidate the relative ranking-hypothesis for imbalanced datasets.

MNIST

We start by replicating the Bornschein (2020) study on MNIST, before moving on with the imbalanced dataset experiment. This is not meant to disprove any of the claims in the paper, but simply to ensure we have replicated their experimental setup as closely as possible (with some modifications).

  • Split of 90%/10% for the training and calibration sets, respectively

  • Random sampling (as balanced subset sampling did not provide any added benefit according to the paper)

  • 50 epochs

  • Adam with fixed learning rate [10e-4]

  • Batch size = 256

  • Fully connected MLPs with 3 hidden layers and 2048 units each

  • Without dropout (made our results too unstable to include)

  • A simple convolutional network with 4 layers, 5x5 spatial kernel, stride 1 and 256 channels

  • Logistic regression

  • 10 different seeds to visualize uncertainty bands (30 in the original paper)

The authors also mention experimenting with replacing ReLU with tanh, batch-norm, layer-norm, etc., but it is unclear whether these tests were included in their final results. Thus, we only consider the experiment using the above settings (a sketch of this setup is shown below).

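To make the setup above concrete, here is a sketch of the fully connected model and training loop as we configured it. The data pipeline is an assumption, and we read the listed learning rate "[10e-4]" as 1e-4:

```python
# Sketch of the MLP configuration listed above: 3 hidden layers of 2048 units, ReLU,
# no dropout, Adam, batch size 256, 50 epochs, 90/10 train/calibration split.
# Interpreting the listed "[10e-4]" learning rate as 1e-4 is our assumption.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

def make_mlp(hidden=2048, n_layers=3, n_classes=10):
    layers, in_dim = [nn.Flatten()], 28 * 28
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, n_classes))
    return nn.Sequential(*layers)

train_full = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
n_calib = len(train_full) // 10
train_set, calib_set = random_split(train_full, [len(train_full) - n_calib, n_calib],
                                    generator=torch.Generator().manual_seed(0))
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
calib_loader = DataLoader(calib_set, batch_size=256)   # used for the temperature scaling step above

model = make_mlp()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
```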

Experiment 1: How does temperature scaling during gradient descent affect generalization?

As an initial experiment, we want to validate why temperature scaling is needed. For this, we train an MLP with ReLU activations and 3 hidden layers of 2048 units each. We do not include dropout, and we train for 50 epochs.

Our hypothesis is: after an initial decline, test cross-entropy should gradually increase over time while test accuracy keeps improving (this is the motivation for temperature scaling in the first place, i.e., model overconfidence).

Here are the results of this initial experiment:

[Figure: test cross-entropy and test accuracy over 50 training epochs for the MLP on MNIST]

Clearly, the test cross-entropy does decline initially and then gradually increases over time, while test accuracy keeps improving. This is evidence in favor of the hypothesis above. Figure 3 in Guo et al. (2017) demonstrates the exact same effect on CIFAR-100. Note: we have smoothed the results a bit (5-window rolling mean) to make the effect more visible.

Conclusions from Experiment 1:

  • If we keep training large neural networks for sufficiently long, we start to see overconfident probabilistic predictions, making them less useful out-of-sample.

To remedy this effect, we can incorporate temperature scaling, which

  • a) ensures probabilistic forecasts are more stable and reliable out-of-sample, and

  • b) improves generalization by scaling training cross-entropy during gradient descent.

Balanced Dataset

Having shown that temperature scaling is needed, we now turn to the primary experiment: how does test cross-entropy vary as a function of the size of the training dataset? Our results look as follows:

[Figure: test cross-entropy as a function of the size of the training set for MNIST]

Note that we do not obtain exactly the same "smooth" results as Bornschein (2020). This is most likely because we have not replicated their experiment completely; for example, they use many more seeds. Nevertheless, we can draw the following conclusions:

  • Interestingly, the relatively large ResNet-18 model does not overfit more than logistic regression at any point during training!

  • The relative ranking-hypothesis is confirmed

  • Beyond 25000 observations (roughly half of the MNIST train dataset), the significantly larger ResNet model is only marginally better than the relatively faster MLP model.

Imbalanced Dataset

We will now conduct an experiment for the case of imbalanced datasets, which is not included in the actual paper, as it could be a setting where the tested hypothesis is invalid.

We sample an artificially imbalanced version of MNIST, similar to Guo et al. (2019). The procedure is as follows: for each class in our dataset, we subsample between 0 and 100 percent of that class's examples in the original training and test sets. We use the following GitHub repo for this sampling procedure.

Then, we select our calibration dataset similar to the previous experiment, i.e., random 90/10% split between training and calibration.

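For readers who prefer not to pull in the external repo, a minimal per-class subsampling sketch (our own illustration, not the referenced repository) looks as follows; for each class a keep-fraction between 0 and 1 is drawn uniformly at random and only that fraction of the class's examples is retained:

```python
# Sketch: build an artificially imbalanced dataset by keeping a random fraction
# (between 0 and 100 percent) of the examples of each class. Illustrative only.
import numpy as np

def imbalance_indices(labels, n_classes=10, seed=0, min_keep=0.0, max_keep=1.0):
    """Return indices of a per-class subsample of `labels` (a 1-D integer array)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for c in range(n_classes):
        cls_idx = np.flatnonzero(labels == c)
        frac = rng.uniform(min_keep, max_keep)          # keep-fraction drawn for this class
        n_keep = int(round(frac * len(cls_idx)))
        keep.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Usage (assuming a torchvision MNIST dataset `train_full` with a `targets` tensor):
# idx = imbalance_indices(train_full.targets.numpy(), seed=42)
# imbalanced_train = torch.utils.data.Subset(train_full, idx)
```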

We include a visualization of the class distribution for the original MNIST training dataset

[Figure: frequency count for each class in the original MNIST]

and the imbalanced version

[Figure: frequency count for each class in the imbalanced MNIST]

Given this large difference in the frequency distribution, you can clearly see how this version is much more imbalanced compared to the original MNIST.

While a plethora of different methods for overcoming the problem of imbalanced datasets exist (see the following review paper), we want to investigate and isolate the effects of having an imbalanced dataset for the relative ranking hypothesis, i.e., does the relative ranking-hypothesis still hold in the imbalanced data setting?

We run all our models again using this synthetically imbalanced MNIST dataset, and obtain the following results:

[Figure: test cross-entropy as a function of the size of the training set for imbalanced MNIST]

So has the conclusion changed?

Not really!

This is quite an optimistic result, as we are now more confident that the relative ranking-hypothesis is mostly true in the case of imbalanced datasets. We believe this could also be the reason behind the following quote from the Bornschein (2020) paper regarding the sampling strategy:

“We experimented with balanced subset sampling, i.e. ensuring that all subsets always contain an equal number of examples per class. But we did not observe any reliable improvements from doing so and therefore reverted to a simple i.i.d sampling strategy.”

The primary difference between the balanced and imbalanced versions is the "jumpier" results, which makes sense given that, for the smaller subsets, the test set might contain classes that the chosen models never saw during training.

5. Summary

To sum up our findings:

  • Due to the relative ranking-hypothesis, we can perform model selection using only a subset of our training data for both balanced and imbalanced datasets, thus saving computational resources

  • Large overparameterized neural networks can generalize surprisingly well, even on small datasets (double descent)

  • We can avoid overconfidence by applying temperature scaling

I hope that you might be able to apply these findings in your next machine learning experiments and remember, larger is (almost) always better.

Thank you for reading!

6. References

[1] J. Bornschein, F. Visin, and S. Osindero, Small Data, Big Decisions: Model Selection in the Small-Data Regime (2020), in International Conference on Machine Learning (ICML).

[2] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, Deep Double Descent: Where Bigger Models and More Data Hurt (2019), arXiv preprint arXiv:1912.02292.

[3] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, On Calibration of Modern Neural Networks (2017), arXiv preprint arXiv:1706.04599.

[4] T. Guo, X. Zhu, Y. Wang, and F. Chen, Discriminative Sample Generation for Deep Imbalanced Learning (2019), in International Joint Conferences on Artificial Intelligence Organization (IJCAI) (pp. 2406–2412).

Originally published at https://holmdk.github.io on August 14, 2020.

Translated from: https://towardsdatascience.com/model-selection-with-large-neural-networks-and-small-data-955b4d929d55
