1. Adam optimizer
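Adam is one of the most commonly used optimizers for model training, but plain Adam does not work well when training BERT: in practice, the model's F1 score fails to improve.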
2. AdamW
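The transformers library implements an optimizer with decoupled weight decay, AdamW. Its constructor takes six parameters. The first, params, can be an iterable of torch Parameter objects or a list of dictionaries defining parameter groups. lr is the learning rate. betas holds Adam's beta parameters, (b1, b2). eps is Adam's epsilon for numerical stability. weight_decay is the decoupled weight decay to apply. correct_bias controls Adam's bias correction; set it to False to match the behavior of the BERT TensorFlow repository.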
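The "grouped parameters" form of params can be illustrated with a small sketch. It splits named parameters into a group that receives weight decay and a group that does not. The helper name build_param_groups and the name fragments in NO_DECAY are assumptions made for illustration (exempting biases and LayerNorm weights is a common convention, not something the original text prescribes); the resulting list of dictionaries is the shape an optimizer such as AdamW accepts as params.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# Assumption: parameters whose names contain these fragments skip weight decay.
NO_DECAY: Tuple[str, ...] = ("bias", "LayerNorm.weight")


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    weight_decay: float = 0.01,
    no_decay: Sequence[str] = NO_DECAY,
) -> List[Dict[str, Any]]:
    """Split (name, parameter) pairs into a decayed and a non-decayed group."""
    named_params = list(named_params)  # allow generators such as model.named_parameters()

    def is_exempt(name: str) -> bool:
        return any(fragment in name for fragment in no_decay)

    decayed = [p for n, p in named_params if not is_exempt(n)]
    exempt = [p for n, p in named_params if is_exempt(n)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": exempt, "weight_decay": 0.0},
    ]
```

With a real model, the list returned by build_param_groups(model.named_parameters()) would be passed as the first argument of the optimizer in place of model.parameters().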
class AdamW(Optimizer):
    """
    Implements Adam algorithm with weight decay fix as introduced in
    `Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__.

    Parameters:
        params (:obj:`Iterable[torch.nn.parameter.Parameter]`):
            Iterable of parameters to optimize or dictionaries defining parameter groups.
        lr (:obj:`float`, `optional`, defaults to 1e-3):
            The learning rate to use.
        betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)):
            Adam's betas parameters (b1, b2).
        eps (:obj:`float`, `optional`, defaults to 1e-6):
            Adam's epsilon for numerical stability.
        weight_decay (:obj:`float`, `optional`, defaults to 0):
            Decoupled weight decay to apply.
        correct_bias (:obj:`bool`, `optional`, defaults to `True`):
            Whether or not to correct bias in Adam (for instance, in Bert TF repository they use :obj:`False`).
    """

    def __init__(
        self,
        params: Iterable[torch.nn.parameter.Parameter],
        lr: float = 1e-3,
        betas: Tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-6,
        weight_decay: float = 0.0,
        correct_bias: bool = True,
    ):
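To make the roles of weight_decay and correct_bias concrete, here is a minimal sketch of one AdamW update on a single scalar parameter, written in plain Python. It is an illustration under simplifying assumptions (one float instead of a tensor, hypothetical helper names new_state and adamw_step), not the library's implementation. The point to notice is that the decay term is applied directly to the parameter rather than being added to the gradient, which is what "decoupled" weight decay refers to.

```python
import math
from typing import Dict, Tuple


def new_state() -> Dict[str, float]:
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"step": 0, "m": 0.0, "v": 0.0}


def adamw_step(
    p: float,
    grad: float,
    state: Dict[str, float],
    lr: float = 1e-3,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-6,
    weight_decay: float = 0.0,
    correct_bias: bool = True,
) -> float:
    """Return the parameter value after one AdamW-style update."""
    beta1, beta2 = betas
    state["step"] += 1

    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["v"]) + eps

    step_size = lr
    if correct_bias:
        # Bias correction compensates for the zero-initialized moving averages.
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        step_size = lr * math.sqrt(bias_correction2) / bias_correction1

    p = p - step_size * state["m"] / denom

    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding an L2 term to the gradient.
    if weight_decay > 0.0:
        p = p - lr * weight_decay * p
    return p
```

On the very first step, correct_bias=True yields an update of roughly lr in magnitude, while correct_bias=False yields a noticeably larger one, which is the BertAdam-style behavior the flag is meant to reproduce.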
3. Differences in usage between BertAdam and AdamW
# Parameters:
lr = 1e-3
max_grad_norm = 1.0
num_training_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

### Previously BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_training_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In Transformers, the optimizer and the schedule are split and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    model.train()
    loss = model(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
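The linear warmup schedule used above can be understood as a multiplier applied to the base learning rate at every step. The following plain-Python sketch shows that multiplier; the function name linear_warmup_factor is hypothetical, and the code is meant as an illustration of the shape of the schedule rather than a copy of the library's scheduler.

```python
def linear_warmup_factor(
    current_step: int,
    num_warmup_steps: int,
    num_training_steps: int,
) -> float:
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if current_step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return float(current_step) / float(max(1, num_warmup_steps))
    # Decay phase: shrink linearly from 1 to 0, never going below 0.
    remaining = float(num_training_steps - current_step)
    decay_span = float(max(1, num_training_steps - num_warmup_steps))
    return max(0.0, remaining / decay_span)


# Effective learning rate at a given step, using the values from the example above.
lr = 1e-3
effective_lr = lr * linear_warmup_factor(50, num_warmup_steps=100, num_training_steps=1000)
```

With num_warmup_steps = 100 and num_training_steps = 1000, the multiplier reaches 1.0 at step 100 and falls back to 0.0 at step 1000, so halfway through warmup (step 50) the effective learning rate is half of lr.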
https://github.com/huggingface/transformers