On the Stability of Few-Sample Transformer Fine-Tuning

发呆的比目鱼

Last modified: 2022-09-10 23:57:37


First published: 2021-07-05 00:29:16

Article link: https://blog.csdn.net/weixin_42486623/article/details/118120643

Translated from https://www.kaggle.com/rhtsingh/on-stability-of-few-sample-transformer-fine-tuning?scriptVersionId=65609052

Introduction

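Fine-tuning Transformer models is often unstable. Even with identical hyperparameter values (learning rate, batch size, etc.), different random seeds can lead to substantially different results. The problem is most pronounced when large Transformer models are fine-tuned on small datasets.
This notebook takes a closer look at several aspects of the few-sample fine-tuning optimization process and the techniques around it. The goal is to better understand the remedies available for the few-sample fine-tuning problem.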

Problem

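The instability of Transformer fine-tuning has been known since BERT was introduced, and various methods have been proposed since then to address it.
In this competition, for example, we have only about 2.8k samples. Once the data is split into folds, each model sees only about 2.2k examples, and the labels are noisy as well. As a result, one stabilizing practice most of us have adopted is to evaluate more frequently within each epoch instead of only once at the end of each epoch.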
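As a rough illustration of evaluating several times per epoch, the sketch below computes the batch indices after which validation would run. The helper name eval_steps, the batch size of 16, and the choice of 10 evaluations per epoch are assumptions made for this example, not values from the original notebook.

```python
def eval_steps(num_batches, evals_per_epoch):
    """Return the 1-based batch indices after which to run validation."""
    interval = max(1, num_batches // evals_per_epoch)
    steps = list(range(interval, num_batches + 1, interval))
    # always evaluate once more at the very end of the epoch
    if not steps or steps[-1] != num_batches:
        steps.append(num_batches)
    return steps

# ~2.2k training examples per fold, assumed batch size 16
num_batches = -(-2200 // 16)   # ceiling division
print(num_batches, eval_steps(num_batches, evals_per_epoch=10))
```

In a training loop, one would run validation whenever the current batch index is in this list and keep the checkpoint with the best validation score.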

Solution

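Many methods for improving the stability of few-sample fine-tuning have been proposed recently, and they show significant performance improvements over plain fine-tuning.

Debiasing Omission in BERTAdam
Re-initializing Transformer Layers
Utilizing Intermediate Layers
Layer-wise Learning Rate Decay (LLRD)
Mixout Regularization
Pre-trained Weight Decay
Stochastic Weight Averaging

Note 1: These methods are independent of one another, and using all of them at once is not recommended. Combining two or more techniques may bring improvements, but this is not always the case.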

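Debiasing Omission in BERTAdam

Introduction

BERTAdam, a modified version of the Adam optimizer, is the optimizer most commonly used for fine-tuning Transformers.
It differs from the original Adam algorithm (Kingma & Ba, 2014) in that it omits the bias-correction step. This change was introduced with BERT and subsequently made its way into common open-source libraries, including the official implementation and HuggingFace Transformers.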

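Adam pseudocode

Require: $\alpha$: learning rate; $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates; $f(\theta)$: stochastic objective function with parameters $\theta$; $\theta_0$: initial parameter vector; $\lambda \in [0, 1)$: decoupled weight decay.

01: $m_0 \leftarrow 0$ (initialize first moment vector)
02: $v_0 \leftarrow 0$ (initialize second moment vector)
03: $t \leftarrow 0$ (initialize timestep)
04: while $\theta_t$ not converged do
05: $t \leftarrow t + 1$
06: $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (get gradients of the stochastic objective at timestep $t$)
07: $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (update biased first moment estimate)
08: $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (update biased second raw moment estimate)
09: $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (compute bias-corrected first moment estimate; omitted in BERTAdam)
10: $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate; omitted in BERTAdam)
11: $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)
12: end while
13: return $\theta_t$ (resulting parameters)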

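The Adam algorithm is shown above; the bias-correction steps (lines 09 and 10) are the ones omitted by the non-standard BERTAdam implementation. Omitting bias correction leads to degenerate runs, and with few samples the fine-tuned model sometimes fails to outperform even a random baseline.
Models trained with BERTAdam also tend to underfit. In short, this correction is crucial when fine-tuning Transformers on small datasets, i.e. with fewer than 10k training samples.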
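To get a feel for what the omitted steps change, here is a minimal scalar sketch of Adam that can skip the bias correction. It is a toy under stated assumptions: a constant gradient of 1.0, the commonly used values beta1 = 0.9 and beta2 = 0.999, and a helper name (adam_step_sizes) invented for this example. It is not the HuggingFace implementation.

```python
import math

def adam_step_sizes(grads, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6, correct_bias=True):
    """Return the update magnitude of scalar Adam at every step."""
    m, v, sizes = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # biased first moment
        v = beta2 * v + (1 - beta2) * g * g    # biased second raw moment
        if correct_bias:
            m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
            v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
        else:
            m_hat, v_hat = m, v                # BERTAdam: no correction
        sizes.append(lr * abs(m_hat) / (math.sqrt(v_hat) + eps))
    return sizes

grads = [1.0] * 5   # toy assumption: constant gradient
with_bc = adam_step_sizes(grads, correct_bias=True)
without_bc = adam_step_sizes(grads, correct_bias=False)
ratios = [w / c for w, c in zip(without_bc, with_bc)]
print([round(r, 3) for r in ratios])
```

In this toy setting the uncorrected first update is larger by a factor of about (1 - beta1) / sqrt(1 - beta2), roughly 3.16, and the ratio keeps growing over the first few steps. Early updates of that size are one plausible reason why runs without bias correction are less stable when there are only a few training steps.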

Implementation

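Here we implement bias-corrected Adam with the HuggingFace Transformers library. This is straightforward: use the HuggingFace AdamW optimizer with its "correct_bias" parameter set to True.
Note: "correct_bias" already defaults to True in HuggingFace Transformers' AdamW, but it is worth being aware of how much this parameter matters.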

import gc  # needed for gc.collect() at the end of each code block
from transformers import (
    AdamW,
    AutoConfig,
    AutoModelForSequenceClassification
)
from transformers import logging
# show only errors from the transformers logger
logging.set_verbosity_error()

_pretrained_model = 'roberta-base'
lr = 2e-5
epsilon = 1e-6
weight_decay = 0.01
use_bertadam = False

config = AutoConfig.from_pretrained(_pretrained_model)
model = AutoModelForSequenceClassification.from_pretrained(
    _pretrained_model, 
    config=config
)

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [{
    "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
    "weight_decay": weight_decay,
    "lr": lr,
},
{
    "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
    "weight_decay": 0.0,
    "lr": lr,
}]

optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=lr,
    eps=epsilon,
    correct_bias=not use_bertadam # bias correction step
)

del model, optimizer_grouped_parameters, optimizer
gc.collect();

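Re-initializing Transformer Layers

Introduction

This is a very interesting technique: instead of using the pre-trained weights for every layer, we re-initialize the pooler layer and the top Transformer blocks with the original Transformer initialization. Re-initializing a layer discards the pre-trained knowledge stored in those specific blocks.

Idea

The idea is motivated by transfer learning results in computer vision, where the lower pre-trained layers are known to learn more general features while the higher layers, closer to the output, specialize in the pre-training task. Existing work on Transformers shows that using the complete pre-trained network is not always the most effective choice, and that it often slows down training and hurts performance.

Implementation

The implementation differs between Transformer types (autoencoding, autoregressive, etc.).
We implement pooler re-initialization and block re-initialization for three architectures: RoBERTa, XLNet, and BART.

Pooler Re-initialization

The model is "pooled" by simply taking the hidden state corresponding to the first token.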

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaConfig
from transformers.models.roberta.modeling_roberta import RobertaClassificationHead

_model_type = 'roberta'
_pretrained_model = 'roberta-base'
config = RobertaConfig.from_pretrained(_pretrained_model)
add_pooler = True
reinit_pooler = True

class Net(nn.Module):
    def __init__(self, config, _pretrained_model, add_pooler):
        super(Net, self).__init__()
        self.roberta = RobertaModel.from_pretrained(_pretrained_model, add_pooling_layer=add_pooler)
        self.classifier = RobertaClassificationHead(config)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
        )
        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)
        return logits
        
model = Net(config, _pretrained_model, add_pooler)

if reinit_pooler:
    print('Reinitializing Pooler Layer ...')
    encoder_temp = getattr(model, _model_type)
    encoder_temp.pooler.dense.weight.data.normal_(mean=0.0, std=encoder_temp.config.initializer_range)
    encoder_temp.pooler.dense.bias.data.zero_()
    for p in encoder_temp.pooler.parameters():
        p.requires_grad = True
    print('Done.!')
    
del model
gc.collect();

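Layer Re-initialization - RoBERTa

RoBERTa builds on BERT and modifies key hyperparameters: it removes the next-sentence pre-training objective and trains with much larger mini-batches and learning rates.
RoBERTa has the same architecture as BERT, but uses byte-level BPE as its tokenizer (the same as GPT-2) and a different pre-training scheme.
RoBERTa has no token_type_ids, so you do not need to indicate which token belongs to which segment. Just separate the segments with the separation token, tokenizer.sep_token.

Note 1: The TF version uses truncated normal initialization.
Note 2: To check whether the weights have been re-initialized, run this code block before and after re-initialization.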

for layer in model.roberta.encoder.layer[-reinit_layers:]:
    for module in layer.modules():
        if isinstance(module, nn.Linear):
            print(module.weight.data)

from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'roberta'
_pretrained_model = 'roberta-base'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    encoder_temp = getattr(model, _model_type)
    for layer in encoder_temp.encoder.layer[-reinit_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.Embedding):
                module.weight.data.normal_(mean=0.0, std=config.initializer_range)
                if module.padding_idx is not None:
                    module.weight.data[module.padding_idx].zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
    print('Done.!')

del model
gc.collect();

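Layer Re-initialization - XLNet

XLNet is one of the few models with no sequence length limit.
XLNet is an extension of the Transformer-XL model, pre-trained with an autoregressive method to learn bidirectional context.

Note: The TF version uses truncated normal initialization.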

from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
from transformers.models.xlnet.modeling_xlnet import XLNetRelativeAttention
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'xlnet'
_pretrained_model = 'xlnet-base-cased'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    for layer in model.transformer.layer[-reinit_layers :]:
        for module in layer.modules():
            if isinstance(module, (nn.Linear, nn.Embedding)):
                module.weight.data.normal_(mean=0.0, std=model.transformer.config.initializer_range)
                if isinstance(module, nn.Linear) and module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)
            elif isinstance(module, XLNetRelativeAttention):
                for param in [
                    module.q,
                    module.k,
                    module.v,
                    module.o,
                    module.r,
                    module.r_r_bias,
                    module.r_s_bias,
                    module.r_w_bias,
                    module.seg_embed,
                ]:
                    param.data.normal_(mean=0.0, std=model.transformer.config.initializer_range)
    print('Done.!')
    
del model
gc.collect();

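Layer Re-initialization - BART

BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
Its pre-training task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, in which spans of text are replaced with a single mask token.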

from transformers import AutoConfig
from transformers import AutoModelForSequenceClassification
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

reinit_layers = 2
_model_type = 'bart'
_pretrained_model = 'facebook/bart-base'
config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

if reinit_layers > 0:
    print(f'Reinitializing Last {reinit_layers} Layers ...')
    for layer in model.model.decoder.layers[-reinit_layers :]:
        for module in layer.modules():
            model.model._init_weights(module)
    print('Done.!')

del model
gc.collect();

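Sensitivity to the Number of Re-initialized Layers

Experiments show that re-initialization makes fine-tuning more robust to unlucky random seeds. An improvement is already visible when only the pooler layer is re-initialized, and re-initializing further layers helps more.
However, re-initializing more than 6 layers is not recommended: further re-initialization destroys pre-trained layers that hold generally important features, so performance plateaus or even drops. The best number of layers to re-initialize varies from dataset to dataset.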
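Since the best number of layers is dataset-dependent, it is usually treated as a hyperparameter to sweep. The toy sketch below mimics that sweep on plain Python lists instead of a real model. The function name reinit_last_layers, the 12 stand-in blocks, and std=0.02 are assumptions for illustration; in the real code above the standard deviation comes from config.initializer_range.

```python
import random

def reinit_last_layers(layers, n, std=0.02, seed=0):
    """Return a copy of `layers` with the last `n` layers redrawn from N(0, std^2)."""
    rng = random.Random(seed)
    new_layers = [list(layer) for layer in layers]
    if n <= 0:
        # guard: new_layers[-0:] would select *every* layer, not none
        return new_layers
    for layer in new_layers[-n:]:
        for i in range(len(layer)):
            layer[i] = rng.gauss(0.0, std)
    return new_layers

pretrained = [[1.0] * 4 for _ in range(12)]   # stand-in for 12 Transformer blocks
for n in (0, 2, 6):
    out = reinit_last_layers(pretrained, n)
    kept = sum(layer == [1.0] * 4 for layer in out)
    print(f"reinit_layers={n}: {kept} blocks keep their pre-trained weights")
```

The explicit guard for n <= 0 plays the same role as the `if reinit_layers > 0` check in the snippets above: slicing with `[-0:]` returns the whole list, which would silently re-initialize the entire encoder.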

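Utilizing Intermediate Layers

Introduction

This is one of the best techniques; it has been studied extensively with probing methods, which show that the pre-trained features from intermediate layers are more transferable.
In HuggingFace Transformers a model returns 2 main outputs, or 3 if configured to do so, given input_ids and attention_mask as input:

last hidden state (batch size, seq len, hidden size): the sequence of hidden states at the output of the last layer.
pooler output (batch size, hidden size): the last-layer hidden state of the first token of the sequence, further processed by a linear layer and a tanh activation.
all hidden states: the hidden states of all layers for all token ids.

Idea

As discussed in the re-initialization section, the output of the last layer may not always be the best representation of the input text when fine-tuning on a downstream task.
For pre-trained language models, including Transformers, the most transferable contextualized representations of the input text tend to occur in the middle layers, while the top layers specialize in language modeling. Relying solely on the last layer's output may therefore restrict the power of the pre-trained representation.

Implementation

There are several application-dependent strategies for obtaining intermediate representations, and not all of them can be shared in this notebook. I will share the most useful one here, which helps with almost any kind of problem.
WeightedLayerPooling: the token embeddings are the weighted mean of their representations across different hidden layers.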
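Before the PyTorch code, the arithmetic behind weighted layer pooling can be sketched on plain lists. This is only a toy with one vector per layer; the function name weighted_layer_pooling and the made-up hidden states are assumptions for illustration.

```python
def weighted_layer_pooling(hidden_states, layer_start, weights=None):
    """Weighted average of hidden_states[layer_start:], one vector per layer."""
    selected = hidden_states[layer_start:]
    if weights is None:
        weights = [1.0] * len(selected)   # uniform weights by default
    total = sum(weights)
    dim = len(selected[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, selected)) / total
        for d in range(dim)
    ]

# 13 "layers" (embeddings + 12 blocks); layer k holds the toy vector [k, 2k]
hidden_states = [[float(k), 2.0 * k] for k in range(13)]
print(weighted_layer_pooling(hidden_states, layer_start=9))   # [10.5, 21.0]
```

With layer_start=9 the average runs over layers 9 to 12, i.e. the last four of the 13 outputs. When the weights are learnable parameters, the model can decide during fine-tuning how much each of these layers contributes.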

import torch
import torch.nn as nn
import pandas as pd
from transformers import (
    AutoConfig, 
    AutoModel, 
    AutoTokenizer
)

_pretrained_model = 'roberta-base'
batch_size = 16
max_seq_length = 256

train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
texts = train['excerpt'][:batch_size].tolist()

config = AutoConfig.from_pretrained(_pretrained_model)
# configure to output all hidden states as well
config.update({'output_hidden_states':True}) 
model = AutoModel.from_pretrained(_pretrained_model, config=config)
tokenizer = AutoTokenizer.from_pretrained(_pretrained_model)

features = tokenizer.batch_encode_plus( 
    texts, 
    max_length=max_seq_length,
    padding='max_length', 
    truncation=True, 
    add_special_tokens=True,
    return_attention_mask=True, 
    return_tensors='pt'
)
print(features['input_ids'].shape)

outputs = model(features['input_ids'], attention_mask=features['attention_mask'])
print("Total number of outputs: ", len(outputs))
print('Shape of 1st output', outputs[0].shape)
print('Shape of 2nd output', outputs[1].shape)
print('Length of 3rd output', len(outputs[2]))

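After setting output_hidden_states to True, we now receive three different outputs.
We get 13 hidden-state outputs even though the model has 12 hidden layers, because the output of the embedding layer is returned as well.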

class WeightedLayerPooling(nn.Module):
    def __init__(self, num_hidden_layers, layer_start: int = 4, layer_weights = None):
        super(WeightedLayerPooling, self).__init__()
        self.layer_start = layer_start
        self.num_hidden_layers = num_hidden_layers
        self.layer_weights = layer_weights if layer_weights is not None \
            else nn.Parameter(
                torch.tensor([1] * (num_hidden_layers+1 - layer_start), dtype=torch.float)
            )

    def forward(self, features):
        ft_all_layers = features['all_layer_embeddings']

        all_layer_embedding = torch.stack(ft_all_layers)
        all_layer_embedding = all_layer_embedding[self.layer_start:, :, :, :]

        weight_factor = self.layer_weights.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).expand(all_layer_embedding.size())
        weighted_average = (weight_factor*all_layer_embedding).sum(dim=0) / self.layer_weights.sum()

        features.update({'token_embeddings': weighted_average})
        return features

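Now we add the hidden-layer outputs to features under the key all_layer_embeddings and pass them to the WeightedLayerPooling operation. We use the hidden states from the last 4 hidden layers (layer_start = 9 selects indices 9 to 12 of the 13 outputs). The weighted layer pooling output is added back to features under the key token_embeddings.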

layer_start = 9
pooler = WeightedLayerPooling(
    config.num_hidden_layers, 
    layer_start=layer_start, layer_weights=None
)
features.update({'all_layer_embeddings':outputs[2]})
features = pooler(features)
print("Weighted Layer Pooling Embeddings Shape: ", features['token_embeddings'].shape)

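We now have a combined representation of the last four layers. From here we can simply take the CLS token output; the standard pooling operations implemented in HuggingFace Transformers for BERT, RoBERTa, etc. can also be applied. Below we simply take the CLS token output and pass it through a linear layer.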

sequence_output = features['token_embeddings'][:, 0]
outputs = nn.Linear(config.hidden_size, 1)(sequence_output)
print("Outputs Shape: ", outputs.shape)

del model, tokenizer
gc.collect();

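Pooling Strategy and Layer Choice

The BERT authors tested word-embedding strategies by feeding different combinations of vectors as input features to a BiLSTM used on a named entity recognition task and observing the resulting F1 scores. Different layers of BERT encode very different kinds of information, so the appropriate pooling strategy depends on the application.
Han Xiao created an open-source project on GitHub named BERT-as-service, which is intended to create word embeddings for your text using BERT. By default it uses the output of the second-to-last layer of the model.

His observations are:

The embeddings start out in the first layer with no contextual information.
As the embeddings move deeper into the network, they pick up more and more contextual information with each layer.
As you approach the final layer, however, you start picking up information that is specific to BERT's pre-training tasks (masked language modeling (MLM) and next sentence prediction (NSP)).
- What we want is embeddings that encode the word meaning well.
- BERT is motivated to do this, but it is also motivated to encode anything else that would help it determine what a missing word is (MLM), or whether the second sentence came after the first (NSP).
The second-to-last layer is what Han settled on as a reasonable sweet spot.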

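LLRD - Layer-wise Learning Rate Decay

Introduction

LLRD is a method that applies higher learning rates to the top layers and lower learning rates to the bottom layers. This is done by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer by layer from top to bottom.

The goal is to modify the lower layers, which encode more general information, less than the top layers, which are more specific to the pre-training task. This method has been used to fine-tune several recent pre-trained models, including XLNet and ELECTRA.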
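To make the multiplicative decay concrete, here is a small sketch that only computes the per-layer learning rates, without any model. The helper name llrd_learning_rates is an assumption for this example; the values (top learning rate 5e-5, decay 0.9, 12 layers) and the convention that the task head keeps the undecayed rate follow the implementation in this section.

```python
def llrd_learning_rates(top_lr, decay, num_layers):
    """Learning rate per group: task head, then blocks top-down, then embeddings."""
    rates = {"head": top_lr}
    lr = top_lr
    for layer in range(num_layers, 0, -1):   # top block first
        lr *= decay
        rates[f"layer_{layer}"] = lr
    rates["embeddings"] = lr * decay
    return rates

rates = llrd_learning_rates(top_lr=5e-5, decay=0.9, num_layers=12)
for name, lr in rates.items():
    print(f"{name:>10}: {lr:.2e}")
```

With a decay of 0.9 the embeddings end up with 0.9^13 of the head's learning rate, roughly a quarter, so the lowest layers still move, just far more slowly than the task head.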

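Implementation

The notebook Guide to HuggingFace Schedulers & Differential LRs covers various differential learning rate strategies, but not this one. Here we implement the official LLRD and visualize how the learning rate of each layer changes.
First we import the necessary modules, define the model, optimizer, and scheduler parameters, and then create the model and config.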

from transformers import (
    AdamW, 
    AutoConfig, 
    AutoModelForSequenceClassification,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup
)
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_model_type = 'roberta'
_pretrained_model = 'roberta-base'
# optimizer params
learning_rate = 5e-5
layerwise_learning_rate_decay = 0.9
weight_decay = 0.01
adam_epsilon = 1e-6
use_bertadam = False
# scheduler params
num_epochs = 20
num_warmup_steps = 0

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

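Below is our LLRD function. It first sets the learning rate for the task-specific head, then repeatedly multiplies the learning rate by the layer-wise learning rate decay and assigns the result to each Transformer block, from top to bottom.
As we will see, the top layers, which are closer to the task-specific head, get higher learning rates than the bottom layers.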

def get_optimizer_grouped_parameters(
    model, model_type, 
    learning_rate, weight_decay, 
    layerwise_learning_rate_decay
):
    no_decay = ["bias", "LayerNorm.weight"]
    # initialize lr for task specific layer
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if "classifier" in n or "pooler" in n],
            "weight_decay": 0.0,
            "lr": learning_rate,
        },
    ]
    # initialize lrs for every layer
    num_layers = model.config.num_hidden_layers
    layers = [getattr(model, model_type).embeddings] + list(getattr(model, model_type).encoder.layer)
    layers.reverse()
    lr = learning_rate
    for layer in layers:
        lr *= layerwise_learning_rate_decay
        optimizer_grouped_parameters += [
            {
                "params": [p for n, p in layer.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": weight_decay,
                "lr": lr,
            },
            {
                "params": [p for n, p in layer.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
                "lr": lr,
            },
        ]
    return optimizer_grouped_parameters

We create the grouped parameters and initialize the optimizer and scheduler.

grouped_optimizer_params = get_optimizer_grouped_parameters(
    model, _model_type, 
    learning_rate, weight_decay, 
    layerwise_learning_rate_decay
)
optimizer = AdamW(
    grouped_optimizer_params,
    lr=learning_rate,
    eps=adam_epsilon,
    correct_bias=not use_bertadam
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_epochs
)

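Visualization

We now call optimizer.step() and scheduler.step() as in any normal training run and collect the learning rate of every layer at each epoch. Then we visualize the learning rates.

Note: The visualization is done with plotly, and its code was hidden in the original notebook.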

(learning_rates1, learning_rates2, learning_rates3, learning_rates4,
learning_rates5, learning_rates6, learning_rates7, learning_rates8,
learning_rates9, learning_rates10, learning_rates11, learning_rates12, 
learning_rates13, learning_rates14) = [[] for i in range(14)]

def collect_lr(optimizer):
    learning_rates1.append(optimizer.param_groups[0]["lr"])
    learning_rates2.append(optimizer.param_groups[2]["lr"])
    learning_rates3.append(optimizer.param_groups[4]["lr"])
    learning_rates4.append(optimizer.param_groups[6]["lr"])
    learning_rates5.append(optimizer.param_groups[8]["lr"])
    learning_rates6.append(optimizer.param_groups[10]["lr"])
    learning_rates7.append(optimizer.param_groups[12]["lr"])
    learning_rates8.append(optimizer.param_groups[14]["lr"])
    learning_rates9.append(optimizer.param_groups[16]["lr"])
    learning_rates10.append(optimizer.param_groups[18]["lr"])
    learning_rates11.append(optimizer.param_groups[20]["lr"])
    learning_rates12.append(optimizer.param_groups[22]["lr"])
    learning_rates13.append(optimizer.param_groups[24]["lr"])
    learning_rates14.append(optimizer.param_groups[26]["lr"])

collect_lr(optimizer)
for epoch in range(num_epochs):
    optimizer.step()
    scheduler.step()
    collect_lr(optimizer)

import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
import plotly.offline as pyo
pio.templates.default='plotly_white'

def get_default_layout(title):
    font_style = 'Courier New'
    layout = {}
    #layout['height'] = 400
    #layout['width'] = 1200
    layout['template'] = 'plotly_white'
    layout['dragmode'] = 'zoom'
    layout['hovermode'] = 'x'
    layout['hoverlabel'] = {
        'font_size': 14,
        'font_family':font_style
    }
    layout['font'] = {
        'size':14,
        'family':font_style,
        'color':'rgb(128, 128, 128)'
    }
    layout['xaxis'] = {
        'title': 'Epochs',
        'showgrid': True,
        'type': 'linear',
        'categoryarray': None,
        'gridwidth': 1,
        'ticks': 'outside',
        'showline': True, 
        'showticklabels': True,
        'tickangle': 0,
        'tickmode': 'array'
    }
    layout['yaxis'] = {
        'title': 'Learning Rate',
        'exponentformat':'none',
        'showgrid': True,
        'type': 'linear',
        'categoryarray': None,
        'gridwidth': 1,
        'ticks': 'outside',
        'showline': True, 
        'showticklabels': True,
        'tickangle': 0,
        'tickmode': 'array'
    }
    layout['title'] = {
        'text':title,
        'x': 0.5,
        'y': 0.95,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {
            'family':font_style,
            'size':14,
            'color':'black'
        }
    }
    layout['showlegend'] = True
    layout['legend'] = {
        'x':0.1,
        'y':1.1,
        'orientation':'h',
        'itemclick': 'toggleothers',
        'font': {
            'family':font_style,
            'size':14,
            'color':'black'
        }
    }
    return go.Layout(layout)

def build_trace(learning_rates, num_epochs, name, color):
    return go.Scatter(
        x=list(range(0, num_epochs, 1)), 
        y=learning_rates, 
        texttemplate="%{y:.6f}",
        mode='markers+lines',
        name=name,
        marker=dict(color=color),
    )

trace1 = build_trace(learning_rates1, num_epochs, name='Regressor', color='#83c8d2')
trace2 = build_trace(learning_rates2, num_epochs, name='Layer 12', color='#82c9d2')
trace3 = build_trace(learning_rates3, num_epochs, name='Layer 11', color='#85c7cf')
trace4 = build_trace(learning_rates4, num_epochs, name='Layer 10', color='#88c4cc')
trace5 = build_trace(learning_rates5, num_epochs, name='Layer 9', color='#8cc1c8')
trace6 = build_trace(learning_rates6, num_epochs, name='Layer 8', color='#8fbfc5')
trace7 = build_trace(learning_rates7, num_epochs, name='Layer 7', color='#92bcc2')
trace8 = build_trace(learning_rates8, num_epochs, name='Layer 6', color='#96babe')
trace9 = build_trace(learning_rates9, num_epochs, name='Layer 5', color='#99b7bb')
trace10 = build_trace(learning_rates10, num_epochs, name='Layer 4', color='#9cb4b8')
trace11 = build_trace(learning_rates11, num_epochs, name='Layer 3', color='#a0b2b4')
trace12 = build_trace(learning_rates12, num_epochs, name='Layer 2', color='#a3afb1')
trace13 = build_trace(learning_rates13, num_epochs, name='Layer 1', color='#a7adad')
trace14 = build_trace(learning_rates14, num_epochs, name='Embeddings', color='#aaa')

layout=get_default_layout('Layer Wise Learning Rate Decay')
fig = go.Figure(
    data=[
        trace1, trace2, trace3, trace4, trace5, trace6, 
        trace7, trace8, trace9, trace10, trace11, trace12, 
        trace13, trace14
    ], 
    layout=layout.update({'showlegend':False})
)

fig.show()

del model, grouped_optimizer_params, optimizer, scheduler
gc.collect();

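Mixout Regularization

Introduction

Mixout is a stochastic regularization technique motivated by Dropout and DropConnect. At each training iteration, each model parameter is replaced with its pre-trained value with probability p. The goal is to prevent catastrophic forgetting, and it has been shown to keep the fine-tuned model from deviating too far from the pre-trained initialization.

Idea

Suppose u is the target model parameter and w is the current model parameter.

(a) We first memorize the parameters of the vanilla network at u.
(b) In the dropout network, we randomly choose an input neuron to be dropped (a dotted neuron) with probability p. That is, all outgoing parameters from the dropped neuron are eliminated (dotted connections).
(c) In the mixout(u) network, the parameters eliminated in (b) are replaced by the corresponding parameters in (a). In other words, the mixout(u) network at w is a mixture of the vanilla network at u and the dropout network at w, with probability p.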
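The mixing in (c) can be sketched on plain lists of numbers. This toy version samples an independent mask per weight, which is a simplification of the tensor implementation in the next section, and the name mixout_weights and the example values are assumptions for illustration. The rescaling at the end follows the same formula as the full implementation, so that the mixed weights equal w in expectation.

```python
import random

def mixout_weights(w, u, p, rng):
    """Toy mixout: mix current weights `w` with pre-trained weights `u`.

    Each entry keeps `w` with probability 1 - p and falls back to `u` with
    probability p; the result is rescaled so its expectation equals `w`.
    """
    if p == 0:
        return list(w)
    if p == 1:
        return list(u)
    out = []
    for w_i, u_i in zip(w, u):
        keep = 1.0 if rng.random() < 1 - p else 0.0
        mixed = keep * w_i + (1 - keep) * u_i
        out.append((mixed - p * u_i) / (1 - p))
    return out

rng = random.Random(0)
u = [0.5, -1.0, 2.0]    # pre-trained weights
w = [0.7, -0.4, 1.0]    # weights after some fine-tuning
samples = [mixout_weights(w, u, p=0.7, rng=rng) for _ in range(20000)]
mean = [sum(col) / len(col) for col in zip(*samples)]
print([round(m, 2) for m in mean])   # close to w on average
```

Any single draw replaces about 70% of the entries with their pre-trained values, which is what pulls the model back toward u, while the average over many draws stays at w.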

Implementation

Here we implement Mixout. The code is taken from https://github.com/bloodwass/mixout

import math  # used by MixLinear.reset_parameters
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from torch.nn import Parameter
from torch.autograd.function import InplaceFunction

class Mixout(InplaceFunction):
    @staticmethod
    def _make_noise(input):
        return input.new().resize_as_(input)

    @classmethod
    def forward(cls, ctx, input, target=None, p=0.0, training=False, inplace=False):
        if p < 0 or p > 1:
            raise ValueError("A mix probability of mixout has to be between 0 and 1," " but got {}".format(p))
        if target is not None and input.size() != target.size():
            raise ValueError(
                "A target tensor size must match with a input tensor size {},"
                " but got {}".format(input.size(), target.size())
            )
        ctx.p = p
        ctx.training = training

        if ctx.p == 0 or not ctx.training:
            return input

        if target is None:
            target = cls._make_noise(input)
            target.fill_(0)
        target = target.to(input.device)

        if inplace:
            ctx.mark_dirty(input)
            output = input
        else:
            output = input.clone()

        ctx.noise = cls._make_noise(input)
        if len(ctx.noise.size()) == 1:
            ctx.noise.bernoulli_(1 - ctx.p)
        else:
            ctx.noise[0].bernoulli_(1 - ctx.p)
            ctx.noise = ctx.noise[0].repeat(input.size()[0], 1)
        ctx.noise.expand_as(input)

        if ctx.p == 1:
            output = target
        else:
            output = ((1 - ctx.noise) * target + ctx.noise * output - ctx.p * target) / (1 - ctx.p)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        if ctx.p > 0 and ctx.training:
            return grad_output * ctx.noise, None, None, None, None
        else:
            return grad_output, None, None, None, None


def mixout(input, target=None, p=0.0, training=False, inplace=False):
    return Mixout.apply(input, target, p, training, inplace)


class MixLinear(torch.nn.Module):
    __constants__ = ["bias", "in_features", "out_features"]
    def __init__(self, in_features, out_features, bias=True, target=None, p=0.0):
        super(MixLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter("bias", None)
        self.reset_parameters()
        self.target = target
        self.p = p

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input):
        return F.linear(input, mixout(self.weight, self.target, self.p, self.training), self.bias)

    def extra_repr(self):
        type = "drop" if self.target is None else "mix"
        return "{}={}, in_features={}, out_features={}, bias={}".format(
            type + "out", self.p, self.in_features, self.out_features, self.bias is not None
        )

We have defined Mixout regularization above. Now we add it to the model.

import math
from transformers import AutoModelForSequenceClassification, AutoConfig
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_pretrained_model = 'roberta-base'
# named mixout_prob so that it does not shadow the mixout() function defined above
mixout_prob = 0.7

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

if mixout_prob > 0:
    print('Initializing Mixout Regularization')
    for sup_module in model.modules():
        for name, module in sup_module.named_children():
            if isinstance(module, nn.Dropout):
                module.p = 0.0
            if isinstance(module, nn.Linear):
                target_state_dict = module.state_dict()
                bias = True if module.bias is not None else False
                new_module = MixLinear(
                    module.in_features, module.out_features, bias, target_state_dict["weight"], mixout_prob
                )
                new_module.load_state_dict(target_state_dict)
                setattr(sup_module, name, new_module)
    print('Done.!')

del model
gc.collect();

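That's it. The model can now be used for downstream fine-tuning, and Mixout will do its job.

Conclusion

Mixout acts as an adaptive L2 regularizer along the optimization trajectory: its regularization coefficient adapts along the optimization path. Mixout can improve the stability of fine-tuning a large pre-trained language model even when only a few training examples of the target task are available. It is a well-known technique for improving the stability of Transformer fine-tuning.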

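Pre-trained Weight Decay

Introduction

Weight decay (WD) is a common regularization technique. At each optimization iteration, $\lambda w$ is subtracted from the model parameters, where $\lambda$ is a hyperparameter for the regularization strength and $w$ denotes the model parameters. Pre-trained weight decay adapts this method to fine-tuning pre-trained models by subtracting $\lambda (w - w^{'})$ instead, where $w^{'}$ denotes the pre-trained parameters. It has been shown that pre-trained weight decay works better than conventional weight decay for Transformer fine-tuning and can stabilize it.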
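The difference between the two decay rules is easiest to see on a single scalar weight with the gradient switched off. The sketch below makes that simplification, and the learning rate and decay strength are deliberately exaggerated so the effect shows up quickly; the function name decay_step and all numbers are assumptions for illustration.

```python
def decay_step(w, lr, wd, prior=None):
    """One weight-decay-only update (the gradient is assumed to be zero)."""
    anchor = 0.0 if prior is None else prior
    return w - lr * wd * (w - anchor)

w_pre = 0.8                  # pre-trained value w'
w_std = w_prior = 1.0        # current value after some fine-tuning
for _ in range(1000):
    w_std = decay_step(w_std, lr=0.1, wd=0.1)                   # pulls toward 0
    w_prior = decay_step(w_prior, lr=0.1, wd=0.1, prior=w_pre)  # pulls toward w'
print(round(w_std, 4), round(w_prior, 4))
```

Conventional weight decay shrinks the weight toward zero and so erases the pre-trained value, whereas pre-trained weight decay shrinks it toward w', which keeps the fine-tuned model close to its initialization.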

Implementation

Here we implement pre-trained weight decay.

import torch
import torch.nn as nn
from torch.optim import Optimizer
from transformers import (
    AdamW, 
    AutoConfig, 
    AutoModelForSequenceClassification,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup
)
from transformers import logging
logging.set_verbosity_warning()
logging.set_verbosity_error()

_model_type = 'roberta'
_pretrained_model = 'roberta-base'

# optimizer params
learning_rate = 5e-5
weight_decay = 0.01
adam_epsilon = 1e-6
use_bertadam = False
use_prior_wd = True

config = AutoConfig.from_pretrained(_pretrained_model)
config.update({'num_labels':1})
# pass the updated config so that num_labels=1 takes effect
model = AutoModelForSequenceClassification.from_pretrained(_pretrained_model, config=config)

class PriorWD(Optimizer):
    def __init__(self, optim, use_prior_wd=False, exclude_last_group=True):
        super(PriorWD, self).__init__(optim.param_groups, optim.defaults)
        self.param_groups = optim.param_groups
        self.optim = optim
        self.use_prior_wd = use_prior_wd
        self.exclude_last_group = exclude_last_group
        self.weight_decay_by_group = []
        for i, group in enumerate(self.param_groups):
            self.weight_decay_by_group.append(group["weight_decay"])
            group["weight_decay"] = 0

        self.prior_params = {}
        for i, group in enumerate(self.param_groups):
            for p in group["params"]:
                self.prior_params[id(p)] = p.detach().clone()

    def step(self, closure=None):
        if self.use_prior_wd:
            last_group = len(self.param_groups) - 1
            for i, group in enumerate(self.param_groups):
                # decay factor for this group: -(lr * original weight decay)
                alpha = -group["lr"] * self.weight_decay_by_group[i]
                for p in group["params"]:
                    if self.exclude_last_group and i == last_group:
                        # last group: standard weight decay toward zero
                        p.data.add_(p.data, alpha=alpha)
                    else:
                        # other groups: decay toward the pre-trained weights
                        p.data.add_(p.data - self.prior_params[id(p)], alpha=alpha)
        loss = self.optim.step(closure)

        return loss

    def compute_distance_to_prior(self, param):
        assert id(param) in self.prior_params, "parameter not in PriorWD optimizer"
        return (param.data - self.prior_params[id(param)]).pow(2).sum().sqrt()

Now we create our optimizer using simple grouped parameters and the optimizer parameters defined above.

def get_optimizer_grouped_parameters(model, learning_rate, weight_decay):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": weight_decay,
            "lr": learning_rate,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
            "lr": learning_rate,
        },
    ]
    return optimizer_grouped_parameters

optimizer_grouped_parameters = get_optimizer_grouped_parameters(model, learning_rate, weight_decay)
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=learning_rate,
    eps=adam_epsilon,
    correct_bias=not use_bertadam
)

optimizer = PriorWD(optimizer, use_prior_wd=use_prior_wd)

The optimizer can now be used directly for training, and prior weight decay will do its job.

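Stochastic Weight Averaging

Introduction

Snapshot ensembling is a technique in which we take snapshots of the weights while training a single network and then, after training, build an ensemble of networks that share the same architecture but have different weights. This improves test performance, and it is also very cheap, because you train one model once and just save its weights from time to time.
In SWA (Stochastic Weight Averaging), the authors propose a new kind of ensemble in weight space. The method combines the weights of the same network at different stages of training and then makes predictions with a model that uses the combined weights. This approach has two benefits:

When the weights are combined, we still end up with a single model, which speeds up prediction.
It can be applied to any architecture and dataset and has shown good results.

Idea

The intuition for SWA comes from the empirical observation that the local minima reached at the end of each learning rate cycle tend to accumulate at the border of regions of the loss surface where the loss is low (points W1, W2, and W3 lie on the border of the red low-loss region in the left panel of the accompanying figure).
By averaging several such points, it is possible to reach a wide, generalizable solution with an even lower loss (Wswa in the left panel of the figure).

Here is how it works. Instead of an ensemble of many models, you need only two:

The first model stores the running average of the model weights ($w_{swa}$ in the formula). This is the final model used for prediction once training is over.
The second model (w in the formula) traverses weight space, exploring it with a cyclical learning rate schedule.

At the end of each learning rate cycle, the current weights of the second model are used to update the running-average model by taking a weighted mean of the old running-average weights and the new set of weights from the second model (the formula is given on the left of the figure).
With this approach you train only one model and keep only two models in memory during training. For prediction you need only the running-average model, and predicting with it is much faster than with the ensemble described above, where many models make predictions that are then averaged.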
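The running-average update is commonly written as $w_{swa} \leftarrow (w_{swa} \cdot n + w) / (n + 1)$, where $n$ is the number of snapshots averaged so far. The sketch below applies this rule to toy weight vectors; the helper name update_swa and the snapshot values are assumptions for illustration.

```python
def update_swa(w_swa, w, n_averaged):
    """Fold a new weight snapshot `w` into the running average `w_swa`."""
    return [(a * n_averaged + b) / (n_averaged + 1) for a, b in zip(w_swa, w)]

# toy weights at the end of three learning rate cycles
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
w_swa, n = snapshots[0], 1
for w in snapshots[1:]:
    w_swa = update_swa(w_swa, w, n)
    n += 1
print(w_swa)   # [2.0, 5.0]
```

The result is simply the mean of all snapshots, but the incremental form never needs more than the running average and the current weights in memory, which is why two models are enough.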

Implementation

I will not implement it here, since it deserves a separate kernel of its own. The main code looks like this:

import torch  # needed for torch.optim.swa_utils.update_bn below
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

loader, optimizer, model, loss_fn = ...
swa_start = 5
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
      for input, target in loader:
          optimizer.zero_grad()
          loss_fn(model(input), target).backward()
          optimizer.step()
      if epoch > swa_start:
          swa_model.update_parameters(model)
          swa_scheduler.step()
      else:
          scheduler.step()

# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
# Use swa_model to make predictions on test data 
preds = swa_model(test_input)