Google Tuning Playbook study notes


  1. Why study this tuning guide?


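  1. When to start tuning
    1. Enough of the essential work of problem formulation, data cleaning, etc. has already been done that spending time on the model architecture and training configuration makes sense.
    2. A pipeline for training, testing, and evaluation has already been set up.
    3. Appropriate evaluation metrics have been selected and implemented, and they can be measured and confirmed in the deployment environment.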
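  2. Some preliminary work
    1. Choose a suitable model architecture
    2. Choose the optimizer: use a well-established, popular optimizer. In the initial stages of a project, it is best to start with a simpler optimizer (e.g. SGD with fixed momentum, or Adam with fixed $\epsilon$, $\beta_1$, and $\beta_2$) and switch to a more general optimizer later.
    3. Choose the batch size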

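Increasing the batch size often reduces training time.

How do I choose a batch size that fits in GPU memory?

> The simplest solution is usually to run training jobs at different batch sizes (e.g. increasing powers of 2) for a small number of steps until one of the jobs exceeds the available memory.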
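The probing procedure quoted above can be sketched in a few lines. This is a minimal illustration, not a drop-in tool: `try_batch` is a hypothetical callable that runs a few training steps at a given batch size and raises `MemoryError` when the batch does not fit. Real frameworks raise their own out-of-memory errors, so a wrapper would need to translate them.

```python
def largest_feasible_batch_size(try_batch, start=8, limit=65536):
    """Double the batch size until `try_batch` reports an out-of-memory failure.

    `try_batch(batch_size)` is a hypothetical callable: it should run a small
    number of training steps and raise MemoryError if the batch does not fit.
    Returns the largest batch size that ran, or None if even `start` failed.
    """
    best = None
    batch_size = start
    while batch_size <= limit:
        try:
            try_batch(batch_size)
        except MemoryError:
            break
        best = batch_size
        batch_size *= 2  # move to the next power-of-2 multiple
    return best


def fake_try_batch(batch_size):
    # Stand-in for a real training job: pretend that anything above
    # 300 examples per batch runs out of memory.
    if batch_size > 300:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(largest_feasible_batch_size(fake_try_batch))
```

With the stand-in job, the probe stops at the first power-of-2 multiple that fails and reports the last one that succeeded.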

What is the effect of changing the batch size?


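    4. Some initial configuration


  1. For example, start with a constant learning rate before adding fancy decay schedules.
  2. Start with a smaller model.
  3. Training for more epochs makes tuning easier.
  4. Training for fewer epochs allows faster iteration when tuning.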

  1. Incremental tuning strategy
    1. Problems with fully automated search of the hyperparameter space


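  1. Best strategy: in each round of tuning we use an automated search algorithm, and we keep updating the search space as our understanding grows.
    1. Each round of experiments should have a clearly defined goal

    1. Categories of hyperparameters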


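  1. Scientific hyperparameters: those whose effect on the experimental results (the metric) we are trying to measure.


  1. Nuisance hyperparameters: those that must be optimized over so that different values of the scientific hyperparameters are compared fairly.


  1. Fixed hyperparameters: those held constant in the current round of experiments.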



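For example, the choice of activation function can be a scientific hyperparameter (is ReLU or tanh a better choice for our problem?), a nuisance hyperparameter (is the best 5-layer model better than the best 6-layer model when we allow several different possible activation functions?), or a fixed hyperparameter (for ReLU nets, does adding batch normalization in a particular position help?).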




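  • The choice of optimizer is typically a scientific hyperparameter or a fixed hyperparameter.
  • Hyperparameters introduced by a regularization technique are typically nuisance hyperparameters. For example, dropout adds code complexity, so when deciding whether to include it we would make "no dropout" vs "dropout" a scientific hyperparameter and the dropout rate a nuisance hyperparameter.
  • Architectural hyperparameters are often scientific or fixed hyperparameters because architecture changes can affect serving and training costs, latency, and memory requirements. For example, the number of layers is typically a scientific or fixed hyperparameter since it tends to have dramatic consequences for training speed and memory usage.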
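To make the three roles concrete, here is a small sketch of a fair comparison: for each value of the scientific hyperparameter, the nuisance hyperparameter is searched, and only the best result per value is compared. Everything in it is an assumption made for illustration: the function names, the choice of number of layers as the scientific hyperparameter, the learning rate as the nuisance hyperparameter, and the toy error function that stands in for a real training run (all other settings are treated as fixed).

```python
import math
import random


def best_per_scientific_value(scientific_values, sample_nuisance, evaluate,
                              trials_per_value=20, seed=0):
    """For each scientific value, search the nuisance space and keep the best trial."""
    rng = random.Random(seed)
    best = {}
    for value in scientific_values:
        for _ in range(trials_per_value):
            nuisance = sample_nuisance(rng)
            error = evaluate(value, nuisance)
            if value not in best or error < best[value][0]:
                best[value] = (error, nuisance)
    return best


def sample_learning_rate(rng):
    # Nuisance hyperparameter: learning rate drawn log-uniformly from [1e-5, 1e-1].
    return {"learning_rate": 10 ** rng.uniform(-5.0, -1.0)}


def toy_validation_error(num_layers, nuisance):
    # Toy stand-in for training a model: each depth has its own error floor
    # and its own best learning rate, so a shared learning rate would be unfair.
    floor = {5: 0.25, 6: 0.20}[num_layers]
    best_log_lr = {5: -2.0, 6: -3.0}[num_layers]
    distance = math.log10(nuisance["learning_rate"]) - best_log_lr
    return floor + 0.01 * distance ** 2


results = best_per_scientific_value([5, 6], sample_learning_rate, toy_validation_error)
for num_layers, (error, nuisance) in sorted(results.items()):
    print(num_layers, round(error, 4), nuisance)
```

Comparing the two depths at a single shared learning rate could favor whichever depth happens to like that rate, which is why the nuisance hyperparameter is re-optimized for each scientific value.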

    1. Striking a balance between informative and affordable experiments


  1. Extracting insight from experimental results
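    1. Examine the training curves
  • When validation performance starts to degrade, training may have started to overfit.
  • If any of the best trials exhibits problematic overfitting, we usually want to re-run the experiment with additional regularization techniques and/or better-tuned existing regularization parameters before comparing the values of the scientific hyperparameters.
  • Reducing overfitting is often straightforward using common regularization techniques that add minimal code complexity or extra computation (e.g. dropout, label smoothing, weight decay), so it is usually no big deal to add one or more of these to the next round of experiments. For example, if the scientific hyperparameter is "number of hidden layers" and the best trial that uses the largest number of hidden layers exhibits problematic overfitting, then we would usually prefer to try it again with additional regularization rather than immediately select the smaller number of hidden layers.
5.2 Is there high step-to-step variance in the training or validation error late in training?
  • If so, this could interfere with our ability to compare different values of the scientific hyperparameters (since each trial randomly ends on a "lucky" or "unlucky" step) and our ability to reproduce the result of the best trial in production (since the production model might not end on the same "lucky" step as in the study).
  • The most likely causes of step-to-step variance are batch variance (from randomly sampling examples from the training set for each batch), small validation sets, and using a learning rate that is too high late in training.
  • Possible remedies include increasing the batch size, obtaining more validation data, using learning rate decay, or using Polyak averaging.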
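Of the remedies listed, parameter averaging is the easiest to show in isolation. The sketch below uses an exponential moving average of the parameters, a common variant of Polyak averaging (the classical form is a uniform average of the iterates). The decay value, the flat-list parameter layout, and the noisy "training step" are all assumptions chosen for illustration.

```python
import random


def update_average(avg_params, params, decay=0.999):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]


def spread(values):
    return max(values) - min(values)


rng = random.Random(0)
params = [1.0]
avg_params = list(params)
raw_tail, avg_tail = [], []
for step in range(2000):
    # Stand-in for a noisy training step: the parameter jitters around 1.0.
    params = [1.0 + rng.gauss(0.0, 0.1)]
    avg_params = update_average(avg_params, params, decay=0.99)
    if step >= 1000:
        raw_tail.append(params[0])
        avg_tail.append(avg_params[0])

print("spread of raw parameters:     ", spread(raw_tail))
print("spread of averaged parameters:", spread(avg_tail))
```

Evaluating the averaged parameters instead of the raw ones makes the result depend less on which step training happens to stop at.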

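    1. Check whether the metric (e.g. the loss) is still improving late in training or stopped improving long before the end, and use that to decide whether to increase or decrease the number of training steps.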

  1. Determining whether to adopt a training pipeline change or hyperparameter configuration

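    1. Sources of inconsistency in training results:
  • Training procedure variance, retrain variance, or trial variance: the variation we see between training runs that use the same hyperparameters but different random seeds. For example, different random initializations, training data shuffles, dropout masks, patterns of data augmentation operations, and orderings of parallel arithmetic operations are all potential sources of trial variance. Therefore, before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.
  • Hyperparameter search variance, or study variance: the variation in results caused by our procedure for selecting the hyperparameters. For example, we might run the same experiment with a particular search space but with two different seeds for quasi-random search and end up selecting different hyperparameter values.
  • Data collection and sampling variance: the variance from any sort of random split into training, validation, and test data, or variance due to the training data generation process more generally.
  • However, we should only adopt changes that produce improvements that outweigh any complexity they add.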
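A rough sketch of what "characterizing the run-to-run trial variance" might look like in practice: repeat the baseline N times with different seeds, then check whether a candidate's improvement is large relative to that spread. The two-standard-deviation threshold is a hypothetical rule of thumb and not something the notes prescribe, and the error values are made-up numbers for illustration.

```python
import statistics


def characterize_trial_variance(errors):
    """Mean and sample standard deviation of repeated runs of the same configuration."""
    return statistics.mean(errors), statistics.stdev(errors)


def improvement_exceeds_noise(baseline_errors, candidate_errors, num_std=2.0):
    """True if the candidate's mean error beats the baseline's by more than
    `num_std` baseline standard deviations (an assumed, adjustable threshold)."""
    base_mean, base_std = characterize_trial_variance(baseline_errors)
    improvement = base_mean - statistics.mean(candidate_errors)
    return improvement > num_std * base_std


# Made-up validation error rates from N = 5 reruns of the best trial.
baseline = [0.231, 0.229, 0.230, 0.232, 0.228]
small_gain = [0.229, 0.228, 0.230]
large_gain = [0.220, 0.221, 0.219]

print(characterize_trial_variance(baseline))
print(improvement_exceeds_noise(baseline, small_gain))
print(improvement_exceeds_noise(baseline, large_gain))
```

Even when a gain clears the noise, the last bullet above still applies: the improvement has to be worth the complexity the change adds.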


What is the best learning rate decay schedule family?

  • It’s an open problem. It’s not clear how to construct a set of rigorous experiments to confidently answer what the "best" LR decay schedule is.
  • Although we don't know the best schedule family, we're confident that it’s important to have some (non-constant) schedule and that tuning it matters.
  • Different learning rates work best at different times during the optimization process. Having some sort of schedule makes it more likely for the model to hit a good learning rate.

Which learning rate decay should I use as a default?

  • Our preference is either linear decay or cosine decay, and a bunch of other schedule families are probably good too.
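For reference, both default families can be written as small functions of the step count. This is a minimal sketch; the function names and the `final_lr` parameter are my own choices, and libraries differ in details such as whether the schedule ends at exactly zero.

```python
import math


def linear_decay(step, total_steps, base_lr, final_lr=0.0):
    """Interpolate linearly from base_lr at step 0 to final_lr at total_steps."""
    frac = min(step, total_steps) / total_steps
    return base_lr + (final_lr - base_lr) * frac


def cosine_decay(step, total_steps, base_lr, final_lr=0.0):
    """Follow half a cosine period from base_lr at step 0 to final_lr at total_steps."""
    frac = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * frac))


for step in (0, 25, 50, 75, 100):
    print(step, linear_decay(step, 100, 0.1), cosine_decay(step, 100, 0.1))
```

Both schedules start at `base_lr` and end at `final_lr`; cosine decay stays higher early on and drops faster in the second half.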

Why do some papers have complicated learning rate schedules?

  • It’s not uncommon to see papers with complicated piecewise learning rate (LR) decay schedules.
  • Readers often wonder how the authors arrived at such a complicated schedule.
  • Many complicated LR decay schedules are the result of tuning the schedule as a function of the validation set performance in an ad hoc way:
    1. Start a single training run with some simple LR decay (or a constant learning rate).
    2. Keep training running until the performance seems to stagnate. If this happens, pause training. Resume it with a perhaps steeper LR decay schedule (or smaller constant learning rate) from this point. Repeat this process until the conference/launch deadline.
  • Blithely copying the resulting schedule is generally not a good idea since the best particular schedule will be sensitive to a host of other hyperparameter choices.
    • Better to copy the algorithm that produced the schedule, although this is rarely possible when arbitrary human judgment produced the schedule.
  • This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them.
    • Before publishing results that used such a schedule, please try to make it fully reproducible.

How should Adam’s hyperparameters be tuned?

  • As discussed above, making general statements about search spaces and how many points one should sample from the search space is very difficult. Note that not all the hyperparameters in Adam are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study.
    • If < 10 trials in a study, only tune the (base) learning rate.
    • If 10-25 trials, tune the learning rate and $\beta_1$.
    • If 25+ trials, tune the learning rate, $\beta_1$, and $\epsilon$.
    • If one can run substantially more than 25 trials, additionally tune $\beta_2$.
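The budget tiers above can be encoded as a small lookup, using Adam's standard hyperparameter names (learning rate, beta1, epsilon, beta2). Two details are assumptions on my part: a budget of exactly 25 trials is placed in the "25+" tier, and because "substantially more than 25" is not quantified in the text, the caller must supply that cutoff explicitly through the hypothetical `substantially_more` argument.

```python
def adam_hyperparameters_to_tune(num_trials, substantially_more=None):
    """Return the names of the Adam hyperparameters worth tuning for a trial budget.

    `substantially_more` is a caller-chosen cutoff for also tuning beta2;
    the source text gives no number, so there is no default.
    """
    names = ["learning_rate"]
    if num_trials >= 10:
        names.append("beta1")
    if num_trials >= 25:  # assumption: exactly 25 counts as "25+"
        names.append("epsilon")
    if substantially_more is not None and num_trials >= substantially_more:
        names.append("beta2")
    return names


for budget in (5, 15, 40):
    print(budget, adam_hyperparameters_to_tune(budget))
print(200, adam_hyperparameters_to_tune(200, substantially_more=100))
```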

Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?

  • Quasi-random search (based on low-discrepancy sequences) is our preference over fancier black box optimization tools when used as part of an iterative tuning process intended to maximize insight into the tuning problem (what we refer to as the "exploration phase"). Bayesian optimization and similar tools are more appropriate for the exploitation phase.
  • Quasi-random search based on randomly shifted low-discrepancy sequences can be thought of as "jittered, shuffled grid search", since it uniformly, but randomly, explores a given search space and spreads out the search points more than random search.
  • The advantages of quasi-random search over more sophisticated black box optimization tools (e.g. Bayesian optimization, evolutionary algorithms) include:
    1. Sampling the search space non-adaptively makes it possible to change the tuning objective in post hoc analysis without rerunning experiments.
      • For example, we usually want to find the best trial in terms of validation error achieved at any point in training. But the non-adaptive nature of quasi-random search makes it possible to find the best trial based on final validation error, training error, or some alternative evaluation metric without rerunning any experiments.
    2. Quasi-random search behaves in a consistent and statistically reproducible way.
      • It should be possible to reproduce a study from six months ago even if the implementation of the search algorithm changes, as long as it maintains the same uniformity properties. If using sophisticated Bayesian optimization software, the implementation might change in an important way between versions, making it much harder to reproduce an old search. It isn’t always possible to roll back to an old implementation (e.g. if the optimization tool is run as a service).
    3. Its uniform exploration of the search space makes it easier to reason about the results and what they might suggest about the search space.
      • For example, if the best point in the traversal of quasi-random search is at the boundary of the search space, this is a good (but not foolproof) signal that the search space bounds should be changed. This section goes into more depth. However, an adaptive black box optimization algorithm might have neglected the middle of the search space because of some unlucky early trials even if it happens to contain equally good points, since it is this exact sort of non-uniformity that a good optimization algorithm needs to employ to speed up the search.
    4. Running different numbers of trials in parallel versus sequentially will not produce statistically different results when using quasi-random search (or other non-adaptive search algorithms), unlike with adaptive algorithms.
    5. More sophisticated search algorithms may not always handle infeasible points correctly, especially if they aren't designed with neural network hyperparameter tuning in mind.
    6. Quasi-random search is simple and works especially well when many tuning trials will be running in parallel.
      • Anecdotally, it is very hard for an adaptive algorithm to beat a quasi-random search that has 2X its budget, especially when many trials need to be run in parallel (and thus there are very few chances to make use of previous trial results when launching new trials).
      • Without expertise in Bayesian optimization and other advanced black box optimization methods, we might not achieve the benefits they are, in principle, capable of providing. It is hard to benchmark advanced black box optimization algorithms in realistic deep learning tuning conditions. They are a very active area of current research, and the more sophisticated algorithms come with their own pitfalls for inexperienced users. Experts in these methods are able to get good results, but in high-parallelism conditions the search space and budget tend to matter a lot more.
  • That said, if our computational resources only allow a small number of trials to run in parallel and we can afford to run many trials in sequence, Bayesian optimization becomes much more attractive despite making our tuning results harder to interpret.
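To show what a randomly shifted low-discrepancy sequence looks like, here is a self-contained sketch based on the Halton sequence. It is an illustration only, not the implementation the playbook authors use. The search space (a log-uniform learning rate and a uniform dropout rate) and all function names are assumptions.

```python
import math
import random

PRIMES = [2, 3, 5, 7, 11, 13]  # one base per search dimension


def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-`base` digits of index about the point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_points(num_points, num_dims, shift=None):
    """First `num_points` Halton points in [0, 1)^num_dims, optionally shifted (mod 1)."""
    if num_dims > len(PRIMES):
        raise ValueError("this sketch supports at most %d dimensions" % len(PRIMES))
    shift = shift or [0.0] * num_dims
    return [
        [(radical_inverse(i, PRIMES[d]) + shift[d]) % 1.0 for d in range(num_dims)]
        for i in range(1, num_points + 1)
    ]


def log_uniform(u, low, high):
    """Map u in [0, 1) onto [low, high) on a log scale."""
    return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))


rng = random.Random(0)
random_shift = [rng.random(), rng.random()]
trials = [
    {"learning_rate": log_uniform(u_lr, 1e-4, 1e-1), "dropout_rate": 0.5 * u_drop}
    for u_lr, u_drop in halton_points(8, 2, shift=random_shift)
]
for trial in trials:
    print(trial)
```

Because the points do not depend on any trial results, all of them can be launched in parallel, and the same set can be re-scored later against a different metric.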

Where can I find an implementation of quasi-random search?

How many trials are needed to get good results with quasi-random search?

Figure 3: A ResNet-50 was tuned on ImageNet with 100 trials. Via bootstrapping, different amounts of tuning budget were simulated. Box plots of the best performances for each trial budget are plotted above.

  • There is no way to answer this question in general, but we can look at specific examples.
  • As Figure 3 shows, the number of trials in a study can have a substantial impact on the results.
    • Notice how large the interquartile ranges are when 6 trials were sampled, versus when 20 trials were sampled.
    • Even with 20 trials, it is likely that the difference between especially lucky and unlucky studies will be larger than the typical variation between re-trains of this model on different random seeds, with fixed hyperparameters, which for this workload might be around +/- 0.1% on a validation error rate of ~23%.
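One plausible reading of the bootstrapping behind Figure 3 is: from a pool of completed trials, repeatedly draw a smaller "study" with replacement and record its best result. The sketch below does this on synthetic error rates. The pool is generated from a made-up distribution and is not the ResNet-50 data, and the function names are my own.

```python
import random
import statistics


def simulate_budget(trial_errors, budget, num_resamples=1000, seed=0):
    """Best (lowest) error in each of `num_resamples` bootstrap studies of size `budget`."""
    rng = random.Random(seed)
    return [min(rng.choices(trial_errors, k=budget)) for _ in range(num_resamples)]


def interquartile_range(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Synthetic pool of 100 trial results (validation error rates); illustrative only.
data_rng = random.Random(42)
trial_errors = [0.23 + 0.3 * data_rng.random() ** 2 for _ in range(100)]

for budget in (6, 20, 50):
    best = simulate_budget(trial_errors, budget)
    print(budget, round(statistics.median(best), 4), round(interquartile_range(best), 4))
```

Plotting the `best` lists per budget as box plots would give the same kind of picture as the figure: how much the outcome of a study depends on luck at each budget.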

How can optimization failures be debugged and mitigated?

Summary: If the model is experiencing optimization difficulties, it’s important to fix them before trying other things. Diagnosing and correcting training failures is an active area of research.

Figure 4: Changing the strides in a single residual block (2x2 -> 1x1) in a WideResnet results in training instability. This does not degrade performance at low learning rates, but high learning rates no longer train well due to the instability. Applying 1000 steps of learning rate warmup resolves this particular instance of instability, allowing stable training at max learning rate of .1.

Identifying unstable workloads

  • Any workload will become unstable if the learning rate is too large. Instability is only an issue when it forces us to use a learning rate that’s too small.
  • There are at least two types of training instability worth distinguishing:
    1. Instability at initialization/early in training.
    2. Sudden instability in the middle of training.
  • We can take a systematic approach to identifying stability issues in our workload.
    1. Do a learning rate sweep and find the best learning rate lr*.
    2. Plot training loss curves for learning rates just above lr*.
    3. If the learning rates > lr* show loss instability (loss goes up not down during periods of training), then it is likely that fixing the instability will result in better training.
  • Log the L2 norm of the full loss gradient during training, since outlier values can result in spurious instability in the middle of training. This can inform how to pick gradient/update clipping.

NOTE: Some models show very early instability followed by a recovery that results in slow but stable training. Common evaluation schedules can miss these issues by not evaluating frequently enough!

To check for this, we can train for an abbreviated run of just ~500 steps using lr = 2 * current best, but evaluate every step.

Figure 5: Illustration of the value of more frequent evaluations at the start of training. Useful if there’s a suspicion that the model suffers from early training instability.

Potential fixes for common instability patterns

  • Apply learning rate warmup
    • Best for early training instability.
  • Apply gradient clipping
    • Good for both early and mid-training instability; may fix some bad inits that warmup cannot.
  • Try a new optimizer
    • Sometimes Adam can handle instabilities that Momentum can’t. This is an active area of research.
  • We can ensure that we’re using best practices/initializations for our model architecture (examples below).
    • Add residual connections and normalization if the model doesn't contain them already.
    • Normalization should be the last operation before the residual, e.g. x + Norm(f(x)).
    • Norm(x + f(x)) is known to cause issues.
    • Try initializing residual branches to 0 (e.g. ReZero init).
  • Lower the learning rate
    • This is a last resort.

Learning rate warmup

Figure 6: An example of instability during a warmup period (note the horizontal axis log scale). 40k steps of warmup was needed for successful training in this case.

When to apply learning rate warmup

Figure 7a: An example of a hyperparameter axis plot for a model exhibiting training instability. The best learning rate is at the edge of what is feasible. An "infeasible" trial is defined as one that either produces NaNs or uncharacteristically high values of the loss.

Figure 7b: The training loss of a model trained with a learning rate where we see instability.

  • Figure 7a shows a hyperparameter axis plot that indicates a model experiencing optimization instabilities, because the best learning rate is right at the edge of instability.
  • Figure 7b shows how this can be double-checked by examining the training loss of a model trained with a learning rate either 5x or 10x larger than this peak. If that plot shows a sudden rise in the loss after a steady decline (e.g. at step ~10k in the figure above), then the model likely suffers from optimization instability.

How to apply learning rate warmup

Figure 8: Beneficial effect of learning rate warmup on addressing training instabilities.

  • Using the section immediately above, we assume that the practitioner has already identified the learning rate at which the model becomes unstable. This is the unstable_base_learning_rate.
  • Warmup involves prepending a learning rate schedule that ramps up the learning rate from 0 to some stable base_learning_rate that is at least one order of magnitude larger than unstable_base_learning_rate. The default would be to try a base_learning_rate that’s 10x unstable_base_learning_rate, although it’d be possible to run this entire procedure again for something like 100x unstable_base_learning_rate. The specific schedule is:
    • Ramp up from 0 to base_learning_rate over warmup_steps.
    • Train at a constant rate for post_warmup_steps.
  • Our goal is to find the smallest number of warmup_steps that allows us to access peak learning rates that are much higher than unstable_base_learning_rate.
  • So for each base_learning_rate, we need to tune warmup_steps and post_warmup_steps. It’s usually fine to set post_warmup_steps to be 2*warmup_steps.
  • Warmup can be tuned independently of an existing decay schedule. warmup_steps should be swept at a few different orders of magnitude. For example, a study could try $[10, 10^3, 10^4, 10^5]$. The largest feasible point shouldn't be more than 10% of max_train_steps.
  • Once a warmup_steps that doesn't blow up training at base_learning_rate has been established, it should be applied to the baseline model. Essentially, we prepend this schedule onto the existing schedule, and use the optimal checkpoint selection discussed above to compare this experiment to the baseline. For example, if we originally had 10,000 max_train_steps and did warmup_steps for 1000 steps, the new training procedure should run for 11,000 steps total.
  • If long warmup_steps are required for stable training (>5% of max_train_steps), max_train_steps may need to be increased to account for this.
  • There isn't really a "typical" value across the full range of workloads. Some models only need 100 steps, while others (particularly transformers) may need 40k+.
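The warmup procedure described above can be sketched as two small schedule functions: one for the warmup-tuning runs (linear ramp, then constant) and one that prepends the ramp to an existing schedule. The linear shape of the ramp and the function names are assumptions; the parameter names follow the text.

```python
def warmup_then_constant(step, base_learning_rate, warmup_steps):
    """Ramp linearly from 0 to base_learning_rate, then hold it constant."""
    if step < warmup_steps:
        return base_learning_rate * step / warmup_steps
    return base_learning_rate


def prepend_warmup(existing_schedule, base_learning_rate, warmup_steps):
    """Return a schedule that warms up first, then replays `existing_schedule` from its step 0."""
    def schedule(step):
        if step < warmup_steps:
            return base_learning_rate * step / warmup_steps
        return existing_schedule(step - warmup_steps)
    return schedule


# Tuning run: 1000 warmup steps followed by post_warmup_steps = 2 * warmup_steps.
warmup_steps = 1000
post_warmup_steps = 2 * warmup_steps
tuning_lrs = [warmup_then_constant(s, 0.1, warmup_steps)
              for s in range(warmup_steps + post_warmup_steps)]

# Baseline: a 10,000-step schedule (constant here for simplicity) with warmup
# prepended, so the new run lasts 11,000 steps in total.
max_train_steps = 10_000
baseline = lambda step: 0.1
with_warmup = prepend_warmup(baseline, 0.1, warmup_steps)
new_total_steps = max_train_steps + warmup_steps
print(new_total_steps, with_warmup(0), with_warmup(500), with_warmup(1000))
```

Shifting the existing schedule by `warmup_steps` rather than overwriting its first steps is what makes the total run longer, matching the 10,000 to 11,000 step example in the text.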

Gradient clipping

Figure 9: Illustration of gradient clipping correcting early training instability.

  • Gradient clipping is most useful when large or outlier gradient issues occur.
  • Clipping can fix either early training instability (large gradient norm early), or mid training instabilities (sudden gradient spikes mid training).
  • Sometimes longer warmup periods can correct instabilities that clipping does not: see this section above.
    • What about clipping during warmup?
  • The ideal clip thresholds are just above the "typical" gradient norm.
  • Here’s an example of how gradient clipping could be done:
    • If the norm of the gradient $\left| g \right|$ is greater than the gradient clipping threshold $\lambda$, then do $g' = \lambda \times \frac{g}{\left| g \right|}$ where $g'$ is the new gradient.
  • Log the unclipped gradient norm during training. By default, generate:
    • A plot of gradient norm vs step
    • A histogram of gradient norms aggregated over all steps
  • Choose a gradient clipping threshold based on the 90th percentile of gradient norms.
    • The threshold will be workload dependent, but 90% is a good starting point. If it doesn't work, this threshold can be tuned.
    • What about some sort of adaptive strategy?
  • If we try gradient clipping and the instability issues remain, we can try it harder (i.e. make the threshold smaller).
  • Extremely aggressive gradient clipping is in essence a strange way of reducing the learning rate. If we find ourselves using extremely aggressive clipping, we probably should just cut the learning rate instead.
  • We would usually consider having >50% of the updates getting clipped somehow as "extremely aggressive".
  • If we need to do extremely aggressive gradient clipping to deal with our instability issues, then we might as well reduce the learning rate.
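The clipping recipe in this section can be put together in a short sketch: rescale the gradient when its norm exceeds the threshold, pick the threshold from the 90th percentile of logged norms, and track what fraction of updates would be clipped. Gradients are plain Python lists here, the percentile uses the nearest-rank definition, and the logged norms are made-up values; all of these are simplifying assumptions.

```python
import math


def global_norm(gradient):
    """L2 norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(g * g for g in gradient))


def clip_by_norm(gradient, threshold):
    """Rescale the gradient to norm `threshold` if its norm exceeds the threshold."""
    norm = global_norm(gradient)
    if norm > threshold:
        return [g * threshold / norm for g in gradient]
    return list(gradient)


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def clipped_fraction(norms, threshold):
    """Fraction of logged (unclipped) gradient norms that exceed the threshold."""
    return sum(n > threshold for n in norms) / len(norms)


# Made-up log of unclipped gradient norms: 1.0, 2.0, ..., 100.0.
logged_norms = [float(n) for n in range(1, 101)]
threshold = percentile(logged_norms, 90)
print("threshold:", threshold)
print("fraction clipped:", clipped_fraction(logged_norms, threshold))
print("clipped gradient:", clip_by_norm([3.0, 4.0], 1.0))
```

If `clipped_fraction` climbs above 0.5 after lowering the threshold, that is the ">50% of the updates" regime the text calls extremely aggressive, where reducing the learning rate is the more direct fix.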

Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.


  • It is true that the term "hyperparameter" has a precise meaning in Bayesian machine learning and referring to the learning rate and most of the other parameters we tune in deep learning as "hyperparameters" is an abuse of terminology.
  • We would prefer to use the term "metaparameter" for learning rates, architectural parameters, and all the other things we tune in deep learning, since it avoids the potential for confusion that comes from misusing the word "hyperparameter" (confusion that is especially likely when discussing Bayesian optimization where the probabilistic response surface models have their own true hyperparameters).
  • Unfortunately, although potentially confusing, the term hyperparameter has become extremely common in the deep learning community.
  • Therefore, for a document, such as this one, intended for a wide audience that includes many people who are unlikely to be aware of this technicality, we made the choice to contribute to one source of confusion in the field in hopes of avoiding another.
  • That said, we might make a different choice when publishing a research paper, and we would encourage others to use "metaparameter" instead in most contexts.

Why shouldn't the batch size be tuned to directly improve validation set performance?


  • Changing the batch size without changing any other details of the training pipeline will often affect the validation set performance.
  • However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.
  • The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
    • Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques.
  • In addition, the number of training steps may need to be adjusted when changing the batch size.
  • Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see Shallue et al. 2018).



