Understanding Deep Learning Requires Rethinking Generalization

The following analysis is my interpretation of the ICLR 2017 paper “Understanding Deep Learning Requires Rethinking Generalization” (arXiv link). The paper received one of the three Best Paper Awards at ICLR 2017. I enjoyed reading it because it questions the conventional understanding of generalization in learning models and shows that regularization is not the only source of generalization.

Key Findings

The paper provides insight through a comprehensive experimental study showing that our notion of generalization may be flawed. Regarding training on random labels, “Intuition suggests that this impossibility should manifest itself clearly during training, e.g., by training not converging or slowing down substantially.” However, the paper shows experimentally that this is not the case.

The two main findings of the paper are emphasized in italics.

1. ‘Deep neural networks easily fit random labels.’

2. ‘Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.’

Following the first statement, the relationship between the number of parameters and the size of the dataset was initially unclear, since any network can overfit when the number of data samples is very small compared to the number of trainable parameters. The results would therefore not be well justified without a clear comparison against the number of data samples. This is later made precise by a theoretical guarantee: if the number of parameters p is at least 2n + d, the network can represent any function on a sample of size n in d dimensions. This O(n + d) guarantee was insightful and new to me; it is quite surprising that the required number of parameters is only linear in n and d.

The work shows that even when the data is randomly labelled, or when the true pixels are randomly shuffled, standard classification models such as Inception and AlexNet can achieve more than 99% training accuracy. The same observation holds even with regularization, which corroborates the claim that regularization is not the only factor behind the generalization of a model.

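As a toy illustration of this first finding (not the paper's actual experiment, which trains Inception-scale models on CIFAR-10 and ImageNet), the sketch below fits completely random labels with a two-layer ReLU network whose first layer is random and frozen. With many more hidden units than samples, even fitting only the output layer by least squares interpolates the random labels perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, width = 100, 32, 512           # samples, input dim, hidden units
X = rng.standard_normal((n, d))      # "images": pure noise inputs
y = rng.choice([-1.0, 1.0], size=n)  # completely random labels

# Two-layer ReLU network; the first layer is random and frozen,
# so only the output layer is fit (a random-features model).
W1 = rng.standard_normal((d, width)) / np.sqrt(d)
H = np.maximum(X @ W1, 0.0)          # hidden activations, shape (n, width)

# With width >> n, H has full row rank almost surely, so least squares
# interpolates the random labels exactly.
w2, *_ = np.linalg.lstsq(H, y, rcond=None)

train_acc = np.mean(np.sign(H @ w2) == y)
print(f"train accuracy on random labels: {train_acc:.2f}")  # → 1.00
```

The same memorization happens, more slowly, when both layers are trained with SGD; the point is that the model's capacity, not any structure in the labels, determines whether training error can be driven to zero.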
Coming to the second point, the authors highlight an important statement: ‘in contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error.’

Effective Capacity of Neural Networks

To gain insight into the capacity of a network model, extensive experiments were performed in which the true labels were replaced with random labels, and the true pixels of the images were randomly shuffled. Defying the intuition that random labelling should noticeably impair convergence, these transformations of the labels hardly interfered with the training process, even with explicit and implicit regularization in use.

Theoretical bounds such as the VC dimension and Rademacher complexity are commonly used as measures of the capacity of learning models. However, the authors argue, based on their experimental results, that such bounds do not seem to explain generalization.

Role of Regularization

Commonly used explicit regularization techniques such as weight decay, data augmentation, and dropout were evaluated on standard architectures, and the results are shown in the table. Even with regularization, the architectures achieve an accuracy of over 99%, concluding that regularization is not sufficient to explain the generalization of models; or, put simply, that our naive understanding of generalization is flawed.

Early stopping was also shown to act as an implicit regularizer on some convex problems, and hence could be a potential technique for improving generalization. In summary, observations on both explicit and implicit regularizers consistently suggest that regularizers, when properly tuned, can help improve generalization performance. However, it is unlikely that regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.

Finite-Sample Expressivity

Coming to the important question: how can the expressiveness of an architecture be defined? The authors take a non-traditional approach, claiming that what is more relevant in practice is the expressive power of neural networks on a finite sample of size n. Theoretically, they deduce that

There exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions.

However, the proof is sketched only for the ReLU activation function, and it would be interesting to know what the corresponding result is for the ‘tanh’ or ‘sigmoid’ activations, or whether ‘Leaky ReLU’ or ‘ELU’ give similar results. The statement “It is difficult to say that the regularizers count as a fundamental phase change in the generalization capability of deep nets” was again a new learning for me, since regularization was thought of as a way to make neural nets more general. In addition, it was unclear why AlexNet failed to converge when regularization was added, despite the theoretical guarantees.

Implicit Regularization: An Appeal to Linear Models

To understand the source of generalization, the authors use simple linear models from which insights can be drawn. They pose a serious question: do all global minima generalize equally well, and is there a way to determine whether one global minimum will generalize better than another?

Applying the simple ‘kernel trick’ to uniquely identify a solution, and using preprocessing techniques such as random convolution layers, can yield quite surprising results: the test errors are fairly low even in the absence of regularization. This poses serious questions about traditional notions of generalization. However, computing the Gram matrix (for the kernel trick) over n data samples requires a large amount of computation, which is often infeasible; this is exactly why we shifted to iterative procedures in the first place.

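The paper observes that for an underdetermined linear least-squares problem, (stochastic) gradient descent initialized at zero converges to the minimum-norm interpolant, which is exactly the solution the kernel trick produces with a linear kernel. A small sketch, using full-batch gradient descent for simplicity (each update is a linear combination of the rows of X, so the iterate never leaves their span):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 50                        # more parameters than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm interpolant via the (linear-)kernel trick:
# alpha solves the n x n Gram system, and w = X^T alpha.
K = X @ X.T                          # Gram matrix
w_kernel = X.T @ np.linalg.solve(K, y)

# Full-batch gradient descent on 0.5 * ||Xw - y||^2, starting from zero.
w = np.zeros(d)
lr = 0.01                            # below 2 / lambda_max(X^T X), so stable
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

print(np.allclose(w, w_kernel, atol=1e-6))  # → True
```

Among the infinitely many global minima of this problem, gradient descent implicitly selects a particular one, which is one concrete sense in which the optimization algorithm itself acts as a regularizer.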
Conclusion

In the end, the paper provides great insight into the preconceived notion of generalization, showing how it is flawed and cannot be used to explain the behaviour of different architectures. The models in use are rich enough to memorize the data. This situation poses a conceptual challenge to statistical learning theory, as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. A formal measure of generalization is thus yet to be understood. This paper can be a great precursor for research into the explainability of neural architectures.

Translated from: https://medium.com/@sarthakchakraborty_69770/understanding-deep-learning-requires-re-thinking-generalization-fe94889bb14c
