Obtaining Top Neural Network Performance Without Any Training


How do neural networks generalize when they are so overparametrized?

As we try to answer this question with recent research, we will find that we know much less about neural networks than we thought, and understand why a random initialized network can perform just as well as a trained one.

In more standard machine learning practice, it’s conventional to minimize the number of parameters in a model to prevent overfitting and ensure true learning instead of memorization. On the other hand, machine learning engineers simply keep on stuffing neural networks to become larger and larger, which somehow works. This violates what should be common sense.

One should not increase, beyond what is necessary, the number of entities required to explain anything.

- Occam’s Razor

It’s not uncommon for modern neural networks to achieve 99.9 percent or even 100 percent accuracy on the training set — which usually would be a warning of overfitting. Surprisingly, however, neural networks can achieve similarly high test set scores.

One common answer proposed as to why neural networks don’t overfit is the role of regularization. Unfortunately, this isn’t the case — in a study conducted by Zhang et al., an Inception architecture without various regularization methods didn’t perform much worse than one with regularization. Thus, one cannot argue that regularization is the basis for generalization.

[Figure: Zhang et al. Image free to share.]

Neural network pruning, however, offers a glimpse of one convincing answer, and it certainly shifts perspectives on how neural networks work.

With neural network pruning, over 90 percent (in some cases 95 or even 99 percent) of a neural network's weights and neurons can be eliminated with little to no loss in performance. This raises the question: if a network can perform equivalently with only 5 percent of the weights, what purpose does the remaining 95 percent actually serve?

[Figure: TensorFlow. Image free to share.]
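
As a rough illustration of what magnitude-based pruning does, here is a minimal NumPy sketch (the 95 percent sparsity level and the layer shape are arbitrary choices for illustration, not values taken from any particular paper):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.95):
    """Zero out the smallest-magnitude weights, keeping only the largest (1 - sparsity) fraction."""
    threshold = np.quantile(np.abs(weights), sparsity)   # magnitude below which weights are dropped
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

# Example: prune a hypothetical trained dense layer to 95% sparsity.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 256))                 # stand-in for a trained weight matrix
w_pruned, mask = magnitude_prune(w, sparsity=0.95)
print("fraction of weights kept:", mask.mean())  # roughly 0.05
```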

This gives one partial answer to the generalization question — neural networks are overparametrized, but somehow don’t use the majority of the weights. Instead, a few weights carry most of the information and the rest serve as some combination of noise and passive components. Perhaps they store memorized information only pertaining to the training set (neural networks can obtain perfect accuracy with completely random labels).

Luckily, this idea has been formalized as the Lottery Ticket Hypothesis. Simply put, a neural network is a massive random lottery — weights are randomly initialized. As the optimizer determines how to update certain weights considering their current values, certain subnetworks are delegated to carry most of the information flow simply because their initialized weights were the right values to spark growth.

幸运的是,这个想法已经正式化为彩票假说。 简而言之,神经网络是一个庞大的随机彩票-权重是随机初始化的。 当优化器确定如何根据当前值更新某些权重时,某些子网被委派来承载大多数信息流,这仅仅是因为它们的初始化权重是引发增长的正确值。

These subnetworks are said to be ‘winning tickets’. Analyzing neural networks through the lens of the Lottery Ticket Hypothesis answers many questions about neural networks and deep learning:

  • Q: Why do larger neural networks generally perform better?

    A: They operate larger lotteries with more connections, and hence have a higher chance of containing successful winning tickets.

  • Q: Why does training a sparse network directly perform worse than training a large dense network first and then pruning it?

    A: A dense network gives the optimizer many more candidate subnetworks to explore. The eventual goal of training is to find an optimal sparse subnetwork, but it cannot begin with one.

  • Q: Why does initializing all weights to 0 not perform well?

    A: Every subnetwork is identical, and hence each is individually weak. There is no initial diversity for the optimizer to identify and grow strong subnetworks (winning tickets).

  • Q: Why can overparametrized, massive neural networks still learn?

    A: Only a segment of the neural network, the winning tickets, provides the generalizing ‘learning’ component. These are identified and developed by the optimizer because they begin at convenient values.

The authors provide a formal definition:

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.

- The Lottery Ticket Hypothesis

Rephrased, if one were to take only the subnetwork and train it, the performance would match or beat the original network under the same training conditions.

In order to find these tickets, the authors propose Iterative Magnitude Pruning (IMP), which operates by the following steps:

  1. Begin with a dense initialization and train the network.

  2. Determine the s% smallest-magnitude weights (these are the weights that do not hold important values and hence carry little information).

  3. Create a binary mask (multiply by 1 to keep, multiply by 0 to discard) to prune these weights.

  4. Retrain the sparse network from its original initial weights, with the mask applied so that the pruned weights no longer take part in the forward pass.

  5. Repeat the pruning process (steps 2–4) until the desired sparsity is reached or the test error has grown too large.
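
A minimal PyTorch-style sketch of this loop is below. It is a simplification rather than the authors' exact implementation: `iterative_magnitude_pruning` and `train_fn` are placeholder names, `train_fn` stands in for a normal training loop that re-applies the masks after each optimizer step, and the 20 percent per-round pruning fraction is an arbitrary choice.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Train, prune the smallest surviving weights, rewind to the original init, and repeat."""
    init_state = copy.deepcopy(model.state_dict())                      # remember the dense initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)                                          # step 1/4: train the (masked) network

        for name, param in model.named_parameters():
            if name not in masks:
                continue
            surviving = param.data[masks[name].bool()].abs()            # step 2: magnitudes of surviving weights
            threshold = torch.quantile(surviving, prune_frac)
            masks[name] *= (param.data.abs() > threshold).float()       # step 3: extend the binary mask

        model.load_state_dict(init_state)                               # step 4: rewind to the initial weights
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])                             # keep pruned weights at zero
    return model, masks
```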

This idea, however, can be taken even further. Ramanujan et al. found that within a randomly weighted Wide ResNet-50 lies a randomly weighted subnetwork that is smaller than, yet matches the performance of, a ResNet-34 trained on ImageNet. From the very start, networks contain subnetworks that can achieve impressive performance without their weights ever touching the data.

The researchers propose a complementary conjecture to the Lottery Ticket Hypothesis: “within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy”.

An ‘edge-popup’ algorithm is used to find these randomly weighted winning tickets. While each edge’s (connection’s) weight remains unchanged, it is associated with a score that may change.

On each iteration, the edges corresponding to the top k% of scores are selected and the rest are pruned. Based on the performance of this sparse subnetwork, stochastic gradient descent is used to update all of the scores; this allows edges that were pruned mistakenly to come back later.

[Figure: Ramanujan et al. Image free to share.]

Essentially, this pruning technique trains the network in the same way one would train a regular neural network; the difference is that the scores, which determine the architecture of the subnetwork, are tuned instead of the weights. Using this algorithm, random subnetworks can be chosen so effectively that their performance is comparable with top-performing architectures.

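A simplified PyTorch sketch of this idea for a single fully connected layer is shown below. The class names `TopKMask` and `SubnetLinear` are made up for illustration, and the real edge-popup implementation also handles convolutions, rescales the frozen weights, and initializes the scores more carefully; this is only meant to show the mechanism of training scores over frozen random weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    """Keep the top-k fraction of edges by score; pass gradients straight through to the scores."""
    @staticmethod
    def forward(ctx, scores, k):
        mask = torch.zeros_like(scores)
        flat_mask = mask.flatten()                    # view into `mask`
        _, idx = scores.flatten().sort()              # indices in ascending order of score
        num_dropped = int((1 - k) * scores.numel())
        flat_mask[idx[num_dropped:]] = 1.0            # keep the highest-scoring edges
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                      # straight-through estimator for the scores

class SubnetLinear(nn.Module):
    """A linear layer whose random weights are frozen; only the per-edge scores are trained."""
    def __init__(self, in_features, out_features, k=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1,
                                   requires_grad=False)   # weights never receive gradient updates
        self.scores = nn.Parameter(torch.rand(out_features, in_features))
        self.k = k

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.k)     # choose the current subnetwork
        return F.linear(x, self.weight * mask)
```

Training then proceeds as usual, except that only `scores` receives gradient updates; as the scores change, the mask, and therefore the chosen subnetwork, changes with them.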

As intuition, the authors offer an example: consider an untrained neural network N of arbitrary width (the number of channels, in the case of CNNs) whose weights are initialized from a normal distribution (via standard procedures like Xavier or Kaiming initialization). Let τ be a network of the same architecture trained such that it attains good performance.

Let q be the probability that a given subnetwork of N has weights close enough to those of τ to obtain the same (or better) performance. Intuitively, q is very small, but it is nevertheless not zero (it is possible). Hence, the probability that a given subnetwork of N does not obtain similar performance to τ is (1 − q), which is fairly large.

The probability, on the other hand, that no subnetwork of N obtains similar performance to τ is (1 − q)ˢ, where s is the number of subnetworks: treating the subnetworks as roughly independent, every single one of them must fail to match τ. As any simple calculator experiment will show, raising fractions, even ones as high as 0.999999, to large exponents yields arbitrarily small values.

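A quick back-of-the-envelope check of this argument (the value of q and the subnetwork counts below are made-up numbers, purely for illustration):

```python
# Probability that *no* subnetwork of N matches the trained network tau: (1 - q) ** s
q = 1e-7                        # assumed (tiny) chance that any single subnetwork matches tau
for s in (10**3, 10**7, 10**9):
    print(f"s = {s:>10}:  P(no winning subnetwork) = {(1 - q) ** s:.4g}")
# s = 10**3 -> ~0.9999  (a match is very unlikely)
# s = 10**7 -> ~0.37
# s = 10**9 -> ~4e-44   (a matching subnetwork is essentially guaranteed to exist)
```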

Thus, as s increases with the width and depth of the network, the chance that some subnetwork performs just as well as a well-trained network increases drastically, given that each subnetwork is distinct. This extends the reasoning behind the Lottery Ticket Hypothesis.

The authors experimented with convolutional neural networks of different depths on the CIFAR-10 dataset. Their plots compare the accuracy of dense networks with learned weights (trained with Adam and SGD) against the accuracy of random subnetworks found from different initialization methods, across the percentage of top-scoring weights kept (the x-axis; this is k in the edge-popup algorithm).

Note that as the depth of the network increases, a subnetwork with random weights actually performs better than a trained network. Given the intuition discussed earlier, this at least makes some sense, and it undoubtedly shakes up how we view the value and process of neural network training.

Practically, this fascinating finding has little immediate value, but it reveals the need to dig further into the key factors that drive learning in neural networks.

These recent advances in pruning, the Lottery Ticket Hypothesis, the finding of top-performing weights, and model compression are unravelling large parts of the black-box nature of neural networks that has for too long been ignored. With these findings, researchers may be able to find and develop even more efficient optimization and learning methods.

Summary

  • The success of neural network pruning, among other studies that debunk the idea that regularization is at the root of generalization, points to an important question: exactly how and why do neural networks appear to generalize while simultaneously appearing to overfit?

  • The Lottery Ticket Hypothesis states that neural networks are really running large lotteries, in which subnetworks initialized to convenient values are indirectly chosen and developed by the optimizer to carry most of the information flow.

  • In a sufficiently large untrained, randomly initialized network, one can find a subnetwork with random weights that performs as well as an unpruned, trained network.

Thanks for reading!

Translated from: https://towardsdatascience.com/obtaining-top-neural-network-performance-without-any-training-5af0af464c59
