Study Notes on the Google Tuning Playbook


  1. Why study this tuning guide?

Textbooks and papers say very little about hyperparameter tuning; what advice exists is scattered across blogs and similar places. This makes it hard for practitioners to learn the subject systematically.

  2. When to start tuning
    1. Enough of the basic work (problem formulation, data cleaning, etc.) has been done that spending time on the model architecture and training configuration makes sense.
    2. A pipeline for training, testing, and evaluation has been set up.
    3. Appropriate model evaluation metrics have been chosen and can be verified in the deployment environment.
  3. Preliminary work
    1. Choose an appropriate model architecture.
    2. Choose an optimizer: use a mature, popular optimizer. At the start of a project it is best to begin with a simpler optimizer (e.g., SGD with fixed momentum, or Adam with fixed ε, β₁, and β₂) and only later switch to a more general one.

    3. Choose the batch size

Increasing the batch size can reduce training time:

As long as all hyperparameters are well tuned (especially the learning rate and regularization hyperparameters) and the number of training steps is sufficient, the same final performance should be attainable with any batch size.

How should the batch size be chosen to fit in GPU memory?

> The simplest solution is usually to run training jobs at different batch sizes (e.g., increasing in powers of 2) for a small number of steps until one of the jobs exceeds the available memory.
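A minimal sketch of this power-of-2 sweep, assuming a PyTorch pipeline; `build_model` and `run_steps` are hypothetical stand-ins for your own model constructor and short training loop:

```python
import torch

def largest_feasible_batch_size(build_model, run_steps, start=16, max_bs=65536):
    """Double the batch size until a short training run exceeds GPU memory."""
    feasible = None
    bs = start
    while bs <= max_bs:
        try:
            model = build_model().cuda()
            run_steps(model, batch_size=bs, num_steps=10)  # a few steps suffice
            feasible = bs       # this size fits; try twice as large next
            bs *= 2
        except RuntimeError as e:  # PyTorch surfaces CUDA OOM as RuntimeError
            if "out of memory" not in str(e):
                raise
            break
        finally:
            torch.cuda.empty_cache()  # release cached blocks between attempts
    return feasible
```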

What does changing the batch size affect?

The hyperparameters that interact most strongly with the batch size are the optimizer hyperparameters (e.g., learning rate, momentum) and the regularization hyperparameters, so it is most important to tune these separately for each batch size.

  4. Initial configuration

Principle: keep it simple.

    1. For example, start with a constant learning rate before adding a fancy decay schedule.
    2. Start with a smaller model.
    3. Training for more epochs makes tuning easier.
    4. Training for fewer epochs allows faster tuning iterations.

  5. Incremental tuning strategy
    1. The problem with fully automatic hyperparameter search

The space of possible configurations is enormous, and no algorithm is yet sophisticated enough to search it efficiently without human guidance.

    2. Best strategy: use automated search within each round of tuning, and keep updating the search space as our understanding grows.
      • Each round of experiments should have a clear optimization goal.

    3. Classifying hyperparameters

For a given goal, every hyperparameter falls into one of three classes:

    1. Scientific hyperparameters: those whose effect on the result/metric we are trying to measure.

For example, if our goal is to "determine whether a model with more hidden layers has lower validation error", then the number of hidden layers is a scientific hyperparameter.

    2. Nuisance hyperparameters: those that need to be optimized over for a fair comparison.

The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each depth (the optimal learning rate generally depends on the model architecture).

    3. Fixed hyperparameters: those held fixed in the current round of experiments.

The activation function could be a fixed hyperparameter if previous experiments showed that the best choice is insensitive to model depth, or if we are willing to limit our conclusions about the number of hidden layers to a specific activation function. Alternatively, it could be a nuisance hyperparameter if we are prepared to tune it separately for each number of hidden layers.

The same hyperparameter can move between these three classes across rounds of experiments.

For example, the activation function could be a scientific hyperparameter (is ReLU or tanh the better choice for our problem?), a nuisance hyperparameter (is the best 5-layer model better than the best 6-layer model when several different activation functions are allowed?), or a fixed hyperparameter (for ReLU networks, does adding batch normalization in a particular position help?).

★ Strategy (when designing a new round of experiments):

First identify the scientific hyperparameters for the experiment's goal → treat all other hyperparameters as nuisance hyperparameters → then convert some of the nuisance hyperparameters into fixed hyperparameters.

★ Rules of thumb

  • The choice of optimizer is typically a scientific or fixed hyperparameter.
  • Hyperparameters introduced by a regularization technique are typically nuisance hyperparameters. For example, dropout adds code complexity, so when deciding whether to include it we would make "no dropout" vs. "dropout" a scientific hyperparameter and the dropout rate a nuisance hyperparameter.
  • Architectural hyperparameters are typically scientific or fixed hyperparameters, because architecture changes can affect serving and training costs, latency, and memory requirements. For example, the number of layers is usually a scientific or fixed hyperparameter, since it tends to have dramatic consequences for training speed and memory usage.

  6. Striking a balance between informative and affordable experiments

Three goals when allocating a limited experimental budget: compare enough different values of the scientific hyperparameters, tune the nuisance hyperparameters well enough for fair comparisons, and sample the search space densely enough.

  7. Extracting insight from experimental results
    1. Examine the training curves
  • When validation performance starts to degrade during training, the model may be starting to overfit.
  • If any of the best trials exhibit problematic overfitting, we usually want to try additional regularization techniques and/or better tune the existing regularization parameters before comparing values of the scientific hyperparameters.
  • Reducing overfitting with common regularization techniques that add minimal code complexity or extra computation (e.g., dropout, label smoothing, weight decay) is usually straightforward, so adding one or more of them to the next round of experiments is typically no big deal. For example, if the scientific hyperparameter is "number of hidden layers" and the best trial with the most hidden layers overfits problematically, we would usually prefer to retry with additional regularization rather than immediately select a smaller number of layers.
    2. Is there high step-to-step variance in the training or validation error late in training?
  • If so, this can interfere with our ability to compare different values of the scientific hyperparameters (because each trial randomly ends on a "lucky" or "unlucky" step) and with our ability to reproduce the best result in production (because the production model may not end on the same "lucky" step as in the study).
  • The most likely causes of step-to-step variance are batch variance (from randomly sampling examples from the training set for each batch), small validation sets, and using a learning rate that is too high late in training.
  • Possible remedies include increasing the batch size, obtaining more validation data, using learning rate decay, or using Polyak averaging.
    3. Observe whether the metric (e.g., the loss) is still improving late in training or stopped improving much earlier, and use that to decide whether to increase or decrease the number of training steps.

  8. Determining whether to adopt a training pipeline change or hyperparameter configuration
    1. Sources of inconsistency between training results:
  • Training procedure variance, retrain variance, or trial variance: the variation between training runs that use the same hyperparameters but different random seeds. For example, different random initializations, training-data shuffles, dropout masks, data-augmentation patterns, and orderings of parallel arithmetic operations are all potential sources of trial variance. Therefore, before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance (see the sketch after this list).
  • Hyperparameter search variance, or study variance: the variation in results caused by our procedure for selecting the hyperparameters. For example, we might run the same experiment on a particular search space but with two different seeds for quasi-random search and end up selecting different hyperparameter values.
  • Data collection and sampling variance: the variance from any random split of the data into training, validation, and test sets, or variance due to the training-data generation process.
  • However, we should only adopt changes that produce improvements that outweigh any complexity they add.
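A minimal sketch of characterizing trial variance by rerunning the best trial N times, assuming a hypothetical `train_and_evaluate(config, seed)` that returns a validation metric:

```python
import statistics

def trial_variance(train_and_evaluate, best_config, n=5):
    """Retrain the best configuration n times with different seeds."""
    scores = [train_and_evaluate(best_config, seed=s) for s in range(n)]
    return statistics.mean(scores), statistics.stdev(scores)

# A candidate change is only convincing if its improvement clearly
# exceeds the run-to-run standard deviation measured here.
```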

  9. FAQ

What is the best learning rate decay schedule family?

  • It's an open problem. It's not clear how to construct a set of rigorous experiments to confidently answer what the "best" LR decay schedule is.
  • Although we don't know the best schedule family, we're confident that it's important to have some (non-constant) schedule and that tuning it matters.
  • Different learning rates work best at different times during the optimization process. Having some sort of schedule makes it more likely for the model to hit a good learning rate.

Which learning rate decay should I use as a default?

  • Our preference is either linear decay or cosine decay, and a bunch of other schedule families are probably good too.
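For reference, a minimal sketch of linear and cosine decay as functions of the training step; the function names and signatures are illustrative, not taken from the playbook:

```python
import math

def linear_decay(step, max_steps, base_lr, final_lr=0.0):
    """Linearly anneal from base_lr to final_lr over max_steps."""
    frac = min(step, max_steps) / max_steps
    return base_lr + (final_lr - base_lr) * frac

def cosine_decay(step, max_steps, base_lr, final_lr=0.0):
    """Cosine-anneal from base_lr to final_lr over max_steps."""
    frac = min(step, max_steps) / max_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * frac))

# Both reach half of base_lr at the midpoint of a 10,000-step run:
print(linear_decay(5000, 10000, 0.1))  # 0.05
print(cosine_decay(5000, 10000, 0.1))  # 0.05
```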

Why do some papers have complicated learning rate schedules?

  • It's not uncommon to see papers with complicated piecewise learning rate (LR) decay schedules.
  • Readers often wonder how the authors arrived at such a complicated schedule.
  • Many complicated LR decay schedules are the result of tuning the schedule as a function of validation set performance in an ad hoc way:
    1. Start a single training run with some simple LR decay (or a constant learning rate).
    2. Keep training until the performance seems to stagnate. If this happens, pause training, then resume it from that point with a perhaps steeper LR decay schedule (or a smaller constant learning rate). Repeat this process until the conference/launch deadline.
  • Blithely copying the resulting schedule is generally not a good idea, since the best particular schedule will be sensitive to a host of other hyperparameter choices.
    • It is better to copy the algorithm that produced the schedule, although this is rarely possible when arbitrary human judgment produced it.
  • This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them.
    • Before publishing results that used such a schedule, please try to make it fully reproducible.

How should Adam's hyperparameters be tuned?

  • As discussed above, making general statements about search spaces and how many points one should sample from the search space is very difficult. Note that not all the hyperparameters in Adam are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study (sketched as a search-space helper below).
    • If < 10 trials in a study, only tune the (base) learning rate.
    • If 10-25 trials, tune the learning rate and β₁.
    • If 25+ trials, tune the learning rate, β₁, and ε.
    • If one can run substantially more than 25 trials, additionally tune β₂.
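One way to encode these budget tiers as a search-space helper; the interval choices below are common illustrative defaults, not prescriptions from the playbook:

```python
def adam_search_space(num_trials):
    """Pick which Adam hyperparameters to tune for a given trial budget.

    The ranges are illustrative log/linear intervals, not recommendations.
    """
    space = {"learning_rate": ("log", 1e-5, 1e-1)}   # always tuned
    if num_trials >= 10:
        space["beta1"] = ("linear", 0.8, 0.99)       # tune beta1 at 10-25 trials
    if num_trials >= 25:
        space["epsilon"] = ("log", 1e-10, 1e-3)      # add epsilon at 25+ trials
    if num_trials > 50:                              # "substantially more than 25"
        space["beta2"] = ("linear", 0.9, 0.9999)     # finally tune beta2
    return space
```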

Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?

  • Quasi-random search (based on low-discrepancy sequences) is our preference over fancier black box optimization tools when used as part of an iterative tuning process intended to maximize insight into the tuning problem (what we refer to as the "exploration phase"). Bayesian optimization and similar tools are more appropriate for the exploitation phase.
  • Quasi-random search based on randomly shifted low-discrepancy sequences can be thought of as "jittered, shuffled grid search", since it uniformly, but randomly, explores a given search space and spreads out the search points more than random search.
  • The advantages of quasi-random search over more sophisticated black box optimization tools (e.g. Bayesian optimization, evolutionary algorithms) include:
    1. Sampling the search space non-adaptively makes it possible to change the tuning objective in post hoc analysis without rerunning experiments.
      • For example, we usually want to find the best trial in terms of validation error achieved at any point in training. But the non-adaptive nature of quasi-random search makes it possible to find the best trial based on final validation error, training error, or some alternative evaluation metric without rerunning any experiments.
    2. Quasi-random search behaves in a consistent and statistically reproducible way.
      • It should be possible to reproduce a study from six months ago even if the implementation of the search algorithm changes, as long as it maintains the same uniformity properties. If using sophisticated Bayesian optimization software, the implementation might change in an important way between versions, making it much harder to reproduce an old search. It isn't always possible to roll back to an old implementation (e.g. if the optimization tool is run as a service).
    3. Its uniform exploration of the search space makes it easier to reason about the results and what they might suggest about the search space.
      • For example, if the best point in the traversal of quasi-random search is at the boundary of the search space, this is a good (but not foolproof) signal that the search space bounds should be changed. However, an adaptive black box optimization algorithm might have neglected the middle of the search space because of some unlucky early trials, even if it happens to contain equally good points, since it is this exact sort of non-uniformity that a good optimization algorithm needs to employ to speed up the search.
    4. Running different numbers of trials in parallel versus sequentially will not produce statistically different results when using quasi-random search (or other non-adaptive search algorithms), unlike with adaptive algorithms.
    5. More sophisticated search algorithms may not always handle infeasible points correctly, especially if they aren't designed with neural network hyperparameter tuning in mind.
    6. Quasi-random search is simple and works especially well when many tuning trials will be running in parallel.
      • Anecdotally, it is very hard for an adaptive algorithm to beat a quasi-random search that has 2X its budget, especially when many trials need to be run in parallel (and thus there are very few chances to make use of previous trial results when launching new trials).
      • Without expertise in Bayesian optimization and other advanced black box optimization methods, we might not achieve the benefits they are, in principle, capable of providing. It is hard to benchmark advanced black box optimization algorithms in realistic deep learning tuning conditions. They are a very active area of current research, and the more sophisticated algorithms come with their own pitfalls for inexperienced users. Experts in these methods are able to get good results, but in high-parallelism conditions the search space and budget tend to matter a lot more.
  • That said, if our computational resources only allow a small number of trials to run in parallel and we can afford to run many trials in sequence, Bayesian optimization becomes much more attractive despite making our tuning results harder to interpret.
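A minimal sketch of quasi-random sampling with a scrambled Halton sequence via `scipy.stats.qmc`; the 2-D search-space bounds here are made up for illustration:

```python
from scipy.stats import qmc

# Illustrative 2-D search space: log10(learning rate) and Adam's beta1.
lower, upper = [-5.0, 0.8], [-1.0, 0.99]

sampler = qmc.Halton(d=2, scramble=True, seed=0)        # scrambled low-discrepancy sequence
points = qmc.scale(sampler.random(n=20), lower, upper)  # 20 trial points in the bounds

for log_lr, beta1 in points:
    lr = 10.0 ** log_lr
    print(f"launch trial with lr={lr:.3g}, beta1={beta1:.3f}")  # stand-in for a trial launcher
```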

Where can I find an implementation of quasi-random search?

How many trials are needed to get good results with quasi-random search?

Figure 3: A ResNet-50 was tuned on ImageNet with 100 trials. Via bootstrapping, different amounts of tuning budget were simulated. Box plots of the best performance for each trial budget are plotted above.

  • There is no way to answer this question in general, but we can look at specific examples.
  • As Figure 3 shows, the number of trials in a study can have a substantial impact on the results (the bootstrap procedure is sketched after this list).
    • Notice how large the interquartile ranges are when 6 trials were sampled, versus when 20 trials were sampled.
    • Even with 20 trials, it is likely that the difference between especially lucky and unlucky studies will be larger than the typical variation between re-trains of this model on different random seeds, with fixed hyperparameters, which for this workload might be around +/- 0.1% on a validation error rate of ~23%.
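The bootstrap simulation behind Figure 3 can be sketched in a few lines: repeatedly sample k trials (with replacement) from the completed study and record the best result. `errors` below is a stand-in for the real per-trial validation errors:

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.uniform(0.23, 0.30, size=100)  # stand-in for 100 real trial results

def simulate_budget(errors, k, num_resamples=1000):
    """Distribution of best-trial error when only k of the trials are run."""
    picks = rng.choice(errors, size=(num_resamples, k), replace=True)
    return picks.min(axis=1)  # best (lowest) error within each simulated study

for k in (6, 20, 50):
    best = simulate_budget(errors, k)
    print(f"budget={k:3d}: median best error {np.median(best):.4f}, "
          f"IQR {np.percentile(best, 75) - np.percentile(best, 25):.4f}")
```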

How can optimization failures be debugged and mitigated?

Summary: If the model is experiencing optimization difficulties, it's important to fix them before trying other things. Diagnosing and correcting training failures is an active area of research.

Figure 4: Changing the strides in a single residual block (2x2 -> 1x1) in a WideResnet results in training instability. This does not degrade performance at low learning rates, but high learning rates no longer train well due to the instability. Applying 1000 steps of learning rate warmup resolves this particular instance of instability, allowing stable training at a max learning rate of 0.1.

Identifying unstable workloads

  • Any workload will become unstable if the learning rate is too large. Instability is only an issue when it forces us to use a learning rate that's too small.
  • There are at least two types of training instability worth distinguishing:
    1. Instability at initialization/early in training.
    2. Sudden instability in the middle of training.
  • We can take a systematic approach to identifying stability issues in our workload:
    1. Do a learning rate sweep and find the best learning rate lr*.
    2. Plot training loss curves for learning rates just above lr*.
    3. If the learning rates > lr* show loss instability (the loss goes up, not down, during periods of training), then fixing the instability will likely result in better training.
  • Log the L2 norm of the full loss gradient during training; outlier values can result in spurious instability in the middle of training. This can inform how to pick a gradient/update clipping threshold (see the sketch below).
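A minimal sketch of logging the global gradient L2 norm in PyTorch, called after `loss.backward()` each step; the plain-list sink is illustrative:

```python
import torch

grad_norm_history = []

def log_grad_norm(model):
    """Compute and record the L2 norm of the full loss gradient."""
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().float().norm(2).item() ** 2
    grad_norm_history.append(total_sq ** 0.5)

# In the training loop, after loss.backward() and before optimizer.step():
#     log_grad_norm(model)
```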

NOTE: Some models show very early instability followed by a recovery that results in slow but stable training. Common evaluation schedules can miss these issues by not evaluating frequently enough!

To check for this, we can train for an abbreviated run of just ~500 steps using lr = 2 * current best, but evaluate every step.

Figure 5: Illustration of the value of more frequent evaluations at the start of training. Useful if there's a suspicion that the model suffers from early training instability.

Potential fixes for common instability patterns

  • Apply learning rate warmup
    • Best for early training instability.
  • Apply gradient clipping
    • Good for both early and mid-training instability; may fix some bad inits that warmup cannot.
  • Try a new optimizer
    • Sometimes Adam can handle instabilities that Momentum can't. This is an active area of research.
  • Ensure we're using best practices/initializations for our model architecture (see the sketch after this list).
    • Add residual connections and normalization if the model doesn't contain them already.
    • Normalization should be the last operation before the residual, e.g. x + Norm(f(x)).
    • Norm(x + f(x)) is known to cause issues.
    • Try initializing residual branches to 0 (e.g. ReZero init).
  • Lower the learning rate
    • This is a last resort.
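A sketch of the recommended ordering, x + Norm(f(x)), as a PyTorch module; the sublayer `f` below is a stand-in for whatever the architecture actually uses:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with normalization as the last op before the residual add."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.norm(self.f(x))    # recommended: x + Norm(f(x))
        # return self.norm(x + self.f(x))  # Norm(x + f(x)): known to cause issues
```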

Learning rate warmup

Figure 6: An example of instability during a warmup period (note the horizontal-axis log scale). 40k steps of warmup were needed for successful training in this case.

When to apply learning rate warmup

Figure 7a: An example of a hyperparameter axis plot for a model exhibiting training instability. The best learning rate is at the edge of what is feasible. An "infeasible" trial is defined as one that either produces NaNs or uncharacteristically high values of the loss.

Figure 7b: The training loss of a model trained with a learning rate where we see instability.

  • Figure 7a shows a hyperparameter axis plot that indicates a model experiencing optimization instabilities, because the best learning rate is right at the edge of instability.
  • Figure 7b shows how this can be double-checked by examining the training loss of a model trained with a learning rate either 5x or 10x larger than this peak. If that plot shows a sudden rise in the loss after a steady decline (e.g. at step ~10k in the figure above), then the model likely suffers from optimization instability.

How to apply learning rate warmup

Figure 8: Beneficial effect of learning rate warmup on addressing training instabilities.

  • Using the section immediately above, we assume that the practitioner has already identified the learning rate at which the model becomes unstable. This is the unstable_base_learning_rate.
  • Warmup involves prepending a learning rate schedule that ramps up the learning rate from 0 to some stable base_learning_rate that is at least one order of magnitude larger than unstable_base_learning_rate. The default would be to try a base_learning_rate that's 10x unstable_base_learning_rate, although note that it'd be possible to run this entire procedure again for something like 100x unstable_base_learning_rate. The specific schedule is (see the sketch after this list):
    • Ramp up from 0 to base_learning_rate over warmup_steps.
    • Train at a constant rate for post_warmup_steps.
  • Our goal is to find the shortest number of warmup_steps that allows us to access peak learning rates that are much higher than unstable_base_learning_rate.
  • So for each base_learning_rate, we need to tune warmup_steps and post_warmup_steps. It's usually fine to set post_warmup_steps to be 2*warmup_steps.
  • Warmup can be tuned independently of an existing decay schedule. warmup_steps should be swept at a few different orders of magnitude. For example, an example study could try [10, 10^3, 10^4, 10^5]. The largest feasible point shouldn't be more than 10% of max_train_steps.
  • Once a warmup_steps that doesn't blow up training at base_learning_rate has been established, it should be applied to the baseline model. Essentially, we prepend this schedule onto the existing schedule, and use the optimal checkpoint selection discussed above to compare this experiment to the baseline. For example, if we originally had 10,000 max_train_steps and did warmup_steps for 1000 steps, the new training procedure should run for 11,000 steps total.
  • If long warmup_steps are required for stable training (>5% of max_train_steps), max_train_steps may need to be increased to account for this.
  • There isn't really a "typical" value across the full range of workloads. Some models only need 100 steps, while others (particularly transformers) may need 40k+.
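A minimal sketch of prepending linear warmup to a constant rate, following the recipe above; warmup_steps and post_warmup_steps are the knobs to tune:

```python
def warmup_then_constant(step, base_learning_rate, warmup_steps, post_warmup_steps):
    """Ramp linearly from 0 to base_learning_rate, then hold it constant."""
    if step < warmup_steps:
        return base_learning_rate * (step + 1) / warmup_steps  # linear ramp
    return base_learning_rate  # constant for post_warmup_steps (and beyond)

# Example: base LR set to 10x the unstable LR, warmed up over 1000 steps,
# then held constant for post_warmup_steps = 2 * warmup_steps.
unstable_base_learning_rate = 0.01
schedule = [warmup_then_constant(s, 10 * unstable_base_learning_rate, 1000, 2000)
            for s in range(3000)]
```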

Gradient clipping

Figure 9: Illustration of gradient clipping correcting early training instability.

  • Gradient clipping is most useful when large or outlier gradient issues occur.
  • Clipping can fix either early training instability (large gradient norm early), or mid-training instabilities (sudden gradient spikes mid-training).
  • Sometimes longer warmup periods can correct instabilities that clipping does not: see the warmup section above.
    • Open question: what about clipping during warmup?
  • The ideal clip thresholds are just above the "typical" gradient norm.
  • Here's an example of how gradient clipping could be done (see the sketch after this list):
    • If the norm of the gradient |g| is greater than the gradient clipping threshold λ, then replace the gradient with g' = λ × g / |g|.
  • Log the unclipped gradient norm during training. By default, generate:
    • A plot of gradient norm vs. step
    • A histogram of gradient norms aggregated over all steps
  • Choose a gradient clipping threshold based on the 90th percentile of gradient norms.
    • The threshold will be workload dependent, but 90% is a good starting point. If it doesn't work, this threshold can be tuned.
    • Open question: what about some sort of adaptive strategy?
  • If we try gradient clipping and the instability issues remain, we can try it harder (i.e. make the threshold smaller).
  • Extremely aggressive gradient clipping is in essence a strange way of reducing the learning rate. If we find ourselves using extremely aggressive clipping, we probably should just cut the learning rate instead.
  • We would usually consider having >50% of the updates getting clipped somehow as "extremely aggressive".
  • If we need extremely aggressive gradient clipping to deal with our instability issues, then we might as well reduce the learning rate.
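A sketch combining the rule above with the 90th-percentile heuristic; PyTorch's built-in `clip_grad_norm_` applies g' = λ × g / |g| whenever |g| > λ, and the logged norms would come from a sketch like the one in the previous section:

```python
import numpy as np
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                      # stand-in model
grad_norm_history = [0.5, 0.7, 1.2, 0.6, 9.0]  # stand-in logged unclipped norms

# Pick the clipping threshold from the 90th percentile of logged norms.
clip_threshold = float(np.percentile(grad_norm_history, 90))

# In the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_threshold)
```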

Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.

  • It is true that the term "hyperparameter" has a precise meaning in Bayesian machine learning, and referring to the learning rate and most of the other parameters we tune in deep learning as "hyperparameters" is an abuse of terminology.
  • We would prefer to use the term "metaparameter" for learning rates, architectural parameters, and all the other things we tune in deep learning, since it avoids the potential for confusion that comes from misusing the word "hyperparameter" (confusion that is especially likely when discussing Bayesian optimization, where the probabilistic response surface models have their own true hyperparameters).
  • Unfortunately, although potentially confusing, the term hyperparameter has become extremely common in the deep learning community.
  • Therefore, for a document such as this one, intended for a wide audience that includes many people who are unlikely to be aware of this technicality, we made the choice to contribute to one source of confusion in the field in hopes of avoiding another.
  • That said, we might make a different choice when publishing a research paper, and we would encourage others to use "metaparameter" instead in most contexts.

Why shouldn't the batch size be tuned to directly improve validation set performance?

  • Changing the batch size without changing any other details of the training pipeline will often affect the validation set performance.
  • However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.
  • The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
    • Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques.
  • In addition, the number of training steps may need to be adjusted when changing the batch size.
  • Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see Shallue et al. 2018).
