1. Different learning rates for different layers
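In PyTorch this is done with optimizer parameter groups, each carrying its own learning rate. A minimal sketch (the submodule names `bert` and `classifier` and the learning-rate values are assumptions for illustration; `model` is assumed to be a BERT-based classifier):

import torch.optim as optim

# Smaller learning rate for the pretrained encoder, larger one for the
# freshly initialized task head; the top-level lr is the default that
# each group's own 'lr' overrides.
optimizer = optim.SGD(
    [{'params': model.bert.parameters(), 'lr': 1e-5},        # pretrained encoder
     {'params': model.classifier.parameters(), 'lr': 1e-3}],  # task head
    lr=1e-3, momentum=0.9)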
2. Freezing certain parameters

For example, to freeze every BERT parameter by matching on the parameter name:
# Freeze every parameter whose name contains 'bert'
for n, p in model.named_parameters():
    if 'bert' in n:
        p.requires_grad = False
At the same time, add a requires_grad check to the filter call that collects the trainable non-embedding parameters:
# Keep only parameters that are still trainable and do not belong to
# the embedding layers (identified by their ids)
non_embedding_param = list(filter(
    lambda x: id(x) not in embedding_param_ids
              and id(x) not in bert_embedding_param_ids
              and x.requires_grad,
    model.parameters()))
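The two id lists referenced above are assumed to have been collected earlier from the embedding modules; a typical construction (the attribute paths model.embedding and model.bert.embeddings are assumptions about this particular model) would be:

# ids of the task-specific embedding parameters and of BERT's own
# embedding parameters (attribute paths are illustrative)
embedding_param_ids = list(map(id, model.embedding.parameters()))
bert_embedding_param_ids = list(map(id, model.bert.embeddings.parameters()))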
Set up the optimizer over the filtered parameter list:
optimizer = optim.SGD(non_embedding_param,  # pass only the filtered, trainable parameters
                      lr=args.lr, momentum=args.momentum,
                      weight_decay=args.weight_decay)
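A quick sanity check that the freezing took effect before training starts:

# Count trainable vs. total parameters after freezing
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_total = sum(p.numel() for p in model.parameters())
print(f'trainable parameters: {num_trainable} / {num_total}')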