A loss function from Noise Contrastive Estimation. Reference: "From NCE to InfoNCE" (Sanny.Liu-CV&&ML, 博客园).
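As a quick reminder of what that loss looks like, here is a minimal InfoNCE sketch (my own illustration of the usual in-batch-negatives setup, not code from the linked post; the function name and temperature value are arbitrary):

import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    # query, key: (batch, dim) paired embeddings, assumed L2-normalized;
    # key[i] is the positive for query[i], all other keys act as in-batch negatives
    logits = query @ key.t() / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)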
Loss functions: CosineEmbeddingLoss and HingeEmbeddingLoss (ltochange's blog, CSDN).
Note: the target for HingeEmbeddingLoss (and likewise CosineEmbeddingLoss) must take the values -1 and 1.
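A minimal sketch of both losses (illustrative tensors; note the target vector only contains 1 and -1):

import torch
import torch.nn as nn
import torch.nn.functional as F

x1 = torch.randn(4, 128)
x2 = torch.randn(4, 128)
target = torch.tensor([1., -1., 1., -1.])   # 1 = similar pair, -1 = dissimilar pair

cosine_loss = nn.CosineEmbeddingLoss(margin=0.0)(x1, x2, target)

# HingeEmbeddingLoss takes a single distance per pair plus the -1/1 target
dist = F.pairwise_distance(x1, x2)
hinge_loss = nn.HingeEmbeddingLoss(margin=1.0)(dist, target)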
Wasserstein distance: a metric that can compare two distributions even when their supports do not overlap.
Reference: "PyTorch in practice: computing the Wasserstein distance" (AHU-WangXiao, 博客园).
When using PyTorch, you only need to move the relevant tensors in the code shared above onto the GPU with .to(device).
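A minimal sketch (not the post's exact code): for two equal-sized 1-D samples with uniform weights, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples, and moving the tensors onto the GPU is just a .to(device) call:

import torch

def wasserstein_1d(a, b):
    # a, b: 1-D tensors with the same number of samples
    a_sorted, _ = torch.sort(a)
    b_sorted, _ = torch.sort(b)
    return (a_sorted - b_sorted).abs().mean()

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024).to(device)
b = torch.randn(1024).to(device) + 2.0   # shifted distribution; overlap is not required
print(wasserstein_1d(a, b))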
BertAdam warmup schedules
The warmup schedules used in LXMERT:
"""Implements BERT version of Adam algorithm with weight decay fix.
Params:
lr: learning rate
warmup: portion of t_total for the warmup, -1 means no warmup. Default: -1
t_total: total number of training steps for the learning
rate schedule, -1 means constant learning rate. Default: -1
schedule: schedule to use for the warmup (see the schedule functions below). Default: 'warmup_linear'
b1: Adams b1. Default: 0.9
b2: Adams b2. Default: 0.999
e: Adams epsilon. Default: 1e-6
weight_decay: Weight decay. Default: 0.01
max_grad_norm: Maximum norm for the gradients (-1 means no clipping). Default: 1.0
The two parameters that matter here are warmup, the fraction of t_total spent warming up (-1 means no warmup), and t_total, the total number of training steps: the number of epochs multiplied by the number of iterations per epoch, where each iteration is one gradient update computed from one batch.
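For example, the two values would typically be wired up like this (a sketch: num_epochs, train_loader, model and the hyperparameter values are placeholders; the import path is the one from pytorch_pretrained_bert, while LXMERT ships its own copy of this optimizer):

from pytorch_pretrained_bert.optimization import BertAdam

steps_per_epoch = len(train_loader)        # one step = one batch's gradient update
t_total = num_epochs * steps_per_epoch     # total number of optimization steps
optimizer = BertAdam(model.parameters(),
                     lr=5e-5,
                     warmup=0.1,            # first 10% of t_total is linear warmup
                     t_total=t_total,
                     schedule='warmup_linear')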
The functions below return a multiplier on the base learning rate, evaluated on every optimization step (i.e. around each loss.backward() / self.optim.step() call). Their argument x is the fraction of training completed so far, the current step count divided by t_total, so x grows slowly from 0 toward 1. Up to the warmup point every schedule behaves the same and returns x/warmup, so the learning rate ramps up at a constant pace; after that point each schedule decays (or holds) the rate in its own way.
import math

def warmup_cosine(x, warmup=0.002):
    """ Linear warmup, then cosine decay of the learning-rate multiplier. """
    if x < warmup:
        return x / warmup
    return 0.5 * (1.0 + math.cos(math.pi * x))   # math.cos: x is a plain float, not a tensor

def warmup_constant(x, warmup=0.002):
    """ Linearly increases learning rate over `warmup`*`t_total` (as provided to BertAdam) training steps.
    Learning rate is 1. afterwards. """
    if x < warmup:
        return x / warmup
    return 1.0

def warmup_linear(x, warmup=0.002):
    """ Specifies a triangular learning rate schedule where peak is reached at `warmup`*`t_total`-th (as provided to BertAdam) training step.
    After `t_total`-th training step, learning rate is zero. """
    if x < warmup:
        return x / warmup
    return max((x - 1.) / (warmup - 1.), 0)      # linear decay from 1 at x=warmup to 0 at x=1

SCHEDULES = {
    'warmup_cosine': warmup_cosine,
    'warmup_constant': warmup_constant,
    'warmup_linear': warmup_linear,
}
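To make the "multiplier" reading concrete, this is roughly how such a schedule gets applied on every step (an external illustration only; inside BertAdam the same computation happens within optimizer.step(), and base_lr, t_total, optimizer here are placeholders):

schedule_fct = SCHEDULES['warmup_linear']
for step in range(t_total):
    x = step / t_total                               # fraction of training completed
    lr_this_step = base_lr * schedule_fct(x, warmup=0.1)
    for group in optimizer.param_groups:
        group['lr'] = lr_this_step
    # ... loss.backward(); optimizer.step(); optimizer.zero_grad() ...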