A loss function from Noise Contrastive Estimation. Reference: "From NCE to InfoNCE" (Sanny.Liu-CV&&ML, 博客园).
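As a quick reminder of what that loss looks like, here is a minimal InfoNCE sketch (my own illustration of the usual in-batch-negatives setup, not code from the linked post; the function name and temperature value are arbitrary):

import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    # query, key: (batch, dim) paired embeddings, assumed L2-normalized;
    # key[i] is the positive for query[i], all other keys act as in-batch negatives
    logits = query @ key.t() / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)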
Loss functions: CosineEmbeddingLoss and HingeEmbeddingLoss (ltochange's blog, CSDN).
Note: the target for HingeEmbeddingLoss (and likewise CosineEmbeddingLoss) must take the values -1 and 1.
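A minimal sketch of both losses (illustrative tensors; note the target vector only contains 1 and -1):

import torch
import torch.nn as nn
import torch.nn.functional as F

x1 = torch.randn(4, 128)
x2 = torch.randn(4, 128)
target = torch.tensor([1., -1., 1., -1.])   # 1 = similar pair, -1 = dissimilar pair

cosine_loss = nn.CosineEmbeddingLoss(margin=0.0)(x1, x2, target)

# HingeEmbeddingLoss takes a single distance per pair plus the -1/1 target
dist = F.pairwise_distance(x1, x2)
hinge_loss = nn.HingeEmbeddingLoss(margin=1.0)(dist, target)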
Wasserstein distance: a metric that can compare two distributions even when their supports do not overlap.
Reference: "PyTorch in practice: computing the Wasserstein distance" (AHU-WangXiao, 博客园).
When using PyTorch, you only need to move the relevant tensors in the code shared above onto the GPU with .to(device).
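A minimal sketch (not the post's exact code): for two equal-sized 1-D samples with uniform weights, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples, and moving the tensors onto the GPU is just a .to(device) call:

import torch

def wasserstein_1d(a, b):
    # a, b: 1-D tensors with the same number of samples
    a_sorted, _ = torch.sort(a)
    b_sorted, _ = torch.sort(b)
    return (a_sorted - b_sorted).abs().mean()

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024).to(device)
b = torch.randn(1024).to(device) + 2.0   # shifted distribution; overlap is not required
print(wasserstein_1d(a, b))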
BertAdam warmup schedules
The warmup schedules used in LXMERT:
"""Implements BERT version of Adam algorithm with weight decay fix.
Params:
lr: learning rate
warmup: portion of t_total for the warmup, -1 means no warmup. Default: -1
t_total: total number of training steps for the learning
rate schedule, -1 means constant learning rate. Default: -1
schedule: schedule to use for the warmup (see the schedule functions below). Default: 'warmup_linear'
b1: Adams b1. Default: 0.9
b2: Adams b2. Default: 0.999
e: Adams epsilon. Default: 1e-6
weight_decay: Weight decay. Default: 0.01
max_grad_norm: Maximum norm for the gradients (-1 means no clipping). Default: 1.0
The two parameters that matter here are warmup, the fraction of t_total spent warming up (-1 means no warmup), and t_total, the total number of training steps: the number of epochs multiplied by the number of iterations per epoch, where each iteration is one gradient update computed from one batch.
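For example, the two values would typically be wired up like this (a sketch: num_epochs, train_loader, model and the hyperparameter values are placeholders; the import path is the one from pytorch_pretrained_bert, while LXMERT ships its own copy of this optimizer):

from pytorch_pretrained_bert.optimization import BertAdam

steps_per_epoch = len(train_loader)        # one step = one batch's gradient update
t_total = num_epochs * steps_per_epoch     # total number of optimization steps
optimizer = BertAdam(model.parameters(),
                     lr=5e-5,
                     warmup=0.1,            # first 10% of t_total is linear warmup
                     t_total=t_total,
                     schedule='warmup_linear')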
The functions below return a multiplier on the base learning rate, evaluated on every optimization step (i.e. around each loss.backward() / self.optim.step() call). Their argument x is the fraction of training completed so far, the current step count divided by t_total, so x grows slowly from 0 toward 1. Up to the warmup point every schedule behaves the same and returns x/warmup, so the learning rate ramps up at a constant pace; after that point each schedule decays (or holds) the rate in its own way.
import math

def warmup_cosine(x, warmup=0.002):
    """ Linear warmup, then cosine decay of the learning-rate multiplier. """
    if x < warmup:
        return x / warmup
    return 0.5 * (1.0 + math.cos(math.pi * x))   # math.cos: x is a plain float, not a tensor

def warmup_constant(x, warmup=0.002):
    """ Linearly increases learning rate over `warmup`*`t_total` (as provided to BertAdam) training steps.
    Learning rate is 1. afterwards. """
    if x < warmup:
        return x / warmup
    return 1.0

def warmup_linear(x, warmup=0.002):
    """ Specifies a triangular learning rate schedule where peak is reached at `warmup`*`t_total`-th (as provided to BertAdam) training step.
    After `t_total`-th training step, learning rate is zero. """
    if x < warmup:
        return x / warmup
    return max((x - 1.) / (warmup - 1.), 0)      # linear decay from 1 at x=warmup to 0 at x=1

SCHEDULES = {
    'warmup_cosine': warmup_cosine,
    'warmup_constant': warmup_constant,
    'warmup_linear': warmup_linear,
}
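To make the "multiplier" reading concrete, this is roughly how such a schedule gets applied on every step (an external illustration only; inside BertAdam the same computation happens within optimizer.step(), and base_lr, t_total, optimizer here are placeholders):

schedule_fct = SCHEDULES['warmup_linear']
for step in range(t_total):
    x = step / t_total                               # fraction of training completed
    lr_this_step = base_lr * schedule_fct(x, warmup=0.1)
    for group in optimizer.param_groups:
        group['lr'] = lr_this_step
    # ... loss.backward(); optimizer.step(); optimizer.zero_grad() ...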