Paper notes: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

凯子要面包

Last modified: 2022-04-28 17:37:54

First published: 2020-08-17 17:21:35

Article link: https://blog.csdn.net/weixin_44815943/article/details/105686849

Introduction

《Semi-supervised Sequence Learning》

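The paper "Semi-supervised Sequence Learning" proposes two pre-training methods to improve the generalization of LSTM models. The first, which the authors call a "sequence autoencoder", is essentially a seq2seq model: a recurrent encoder turns the input into a vector, and an LSTM decoder uses that vector as its initial state. The decoder is trained as a language model whose target is to reconstruct the original input. Note that the encoder and the decoder receive the same input sequence. The second method, the "recurrent language model", is essentially the sequence autoencoder without the encoder.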
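The description above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the class name `SequenceAutoencoder`, the layer sizes, and the choice to feed the decoder the same tokens and predict the next one are all assumptions made for this example.

```python
import torch
from torch import nn


class SequenceAutoencoder(nn.Module):
    """Hypothetical minimal sequence autoencoder: encode the tokens, then reconstruct them."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        embedded = self.embedding(tokens)            # (batch, seq, embed_dim)
        _, state = self.encoder(embedded)            # final (h, c) summarizes the input
        decoded, _ = self.decoder(embedded, state)   # decoder starts from the encoder state
        return self.out(decoded)                     # (batch, seq, vocab_size)


torch.manual_seed(0)
vocab_size = 100
model = SequenceAutoencoder(vocab_size)
tokens = torch.randint(1, vocab_size, (4, 10))
logits = model(tokens)

# Language-model style objective: position t predicts token t + 1 of the same sequence.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
```

Dropping `self.encoder` and starting the decoder from a zero state would give the "recurrent language model" variant.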

Figure: structure of the sequence autoencoder

《Deep contextualized word representations》

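This paper proposes ELMo (Embeddings from Language Models), a pre-training method that tries to address the fact that ordinary word embeddings are "context independent". It trains a forward and a backward multi-layer LSTM on the language-modeling task; the two LSTMs share the parameters of the embedding layer and the softmax layer, and all other parameters are separate. The hidden vectors of the two directions are then concatenated, so that the encoding of each token takes both the preceding and the following context into account. The approach is crude, but it offers one way to combine left and right context.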
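The "concatenate the hidden vectors" idea can be illustrated with a small sketch. It assumes single-layer LSTMs and made-up sizes, and it leaves out the language-model training itself; only the shared embedding and the concatenation step are shown.

```python
import torch
from torch import nn

torch.manual_seed(0)
embed_dim, hidden_dim = 16, 32

embedding = nn.Embedding(50, embed_dim)   # shared by both directions
forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, 50, (2, 7))     # (batch, seq)
embedded = embedding(tokens)

# Forward direction reads left to right.
fwd, _ = forward_lstm(embedded)

# Backward direction reads the reversed sequence; flip the output back
# so that position t lines up with position t of the forward output.
bwd_reversed, _ = backward_lstm(torch.flip(embedded, dims=[1]))
bwd = torch.flip(bwd_reversed, dims=[1])

# Each token now has a vector that depends on both its left and right context.
contextual = torch.cat([fwd, bwd], dim=-1)   # (batch, seq, 2 * hidden_dim)
```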

Similarities and differences between BERT, GPT, and ELMo

All three were published in 2018; this section briefly records how they compare.

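In the horizontal direction of the architecture diagrams, each 'Lstm' in ELMo is one cell of an LSTM layer; likewise, each 'Trm' in BERT and GPT is one Transformer unit from "Attention Is All You Need". BERT uses only the encoder part, and GPT uses only the decoder part.
ELMo and GPT are essentially unidirectional language models (ELMo concatenates a left-to-right model and a right-to-left model), whereas BERT is a bidirectional language model.
In the fine-tuning approach, all model parameters, including the embedding layer, are updated on the target task. The feature-based approach, as I understand it, extracts features from the input sequence and then combines them linearly in different ways depending on the downstream task to produce the final result; that is, the pre-trained parameters stay frozen on the downstream task and only the task-specific parameters are trained. I have not used the feature-based approach much, so this is only a record of my understanding from reading.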
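In PyTorch terms, the difference between the two approaches comes down to which parameters have `requires_grad` set. The sketch below uses a tiny stand-in network instead of a real pre-trained model; `build_model` and `count_trainable` are hypothetical helpers written for this illustration.

```python
import torch
from torch import nn


def build_model(freeze_encoder):
    """Stand-in 'pre-trained' encoder plus a new task head (both hypothetical)."""
    encoder = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
    head = nn.Linear(16, 3)
    if freeze_encoder:
        # Feature-based: the pre-trained parameters are not updated.
        for param in encoder.parameters():
            param.requires_grad = False
    return nn.Sequential(encoder, head)


def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


fine_tuned = build_model(freeze_encoder=False)     # everything is trainable
feature_based = build_model(freeze_encoder=True)   # only the head is trainable

fine_tuned_count = count_trainable(fine_tuned)
feature_based_count = count_trainable(feature_based)
```

Here the encoder has 16 * 16 + 16 = 272 parameters and the head has 16 * 3 + 3 = 51, so the fine-tuning setup trains 323 parameters and the feature-based setup trains only the 51 in the head.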

BERT

pretraining & fine-tuning

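BERT is a pre-training & fine-tuning method. In the pre-training stage it is trained on the MLM (Masked Language Model) and NSP (Next Sentence Prediction) tasks; other objectives are possible too, such as a causal language model like the one used by GPT-2. In the fine-tuning stage the model is initialized with the pre-trained parameters, each downstream task adds its own task-specific layers, and all parameters of the shared layers and the task-specific layers are updated together.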

BERT Input

Figure: BERT input representation

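The input for each token in the sequence combines three pieces of information: the token embedding, the position embedding, and the segment embedding (which sentence the token belongs to). The input representation is the sum of the three. All three embedding tables are randomly initialized and then learned by the model.
The special token '[CLS]' stands for classification. After passing through the encoder layers, its hidden vector carries sentence-level information that can be used for classification tasks.
The special token '[SEP]' marks the boundary between two sentences and is encoded like any other token.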
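The sum of the three embeddings can be sketched as below. The class name `BertInputEmbedding`, the sizes, and the token ids for '[CLS]' and '[SEP]' are placeholders chosen for this example, not values from any released vocabulary.

```python
import torch
from torch import nn


class BertInputEmbedding(nn.Module):
    """Token + position + segment embeddings, all learned from random initialization."""

    def __init__(self, vocab_size, max_len=128, hidden=32, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(num_segments, hidden)

    def forward(self, token_ids, segment_ids):
        # Position indices 0..seq_len-1, broadcast over the batch: (1, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)


CLS, SEP = 1, 2   # hypothetical ids for the special tokens

# [CLS] sentence A [SEP] sentence B [SEP]
token_ids = torch.tensor([[CLS, 11, 12, SEP, 21, 22, 23, SEP]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])

torch.manual_seed(0)
input_embedding = BertInputEmbedding(vocab_size=30)
input_repr = input_embedding(token_ids, segment_ids)   # (batch, seq_len, hidden)
```

Unlike the sinusoidal table used in the Transformer listing later in these notes, the position embedding here is an ordinary learned `nn.Embedding`, matching the statement that all three tables are learned.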

Two pre-training tasks

The pre-training loss is the sum of the losses of the two tasks.

Task 1: a cloze test ("fill in the blank") with a Masked Language Model.

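15% of the tokens in the original sentence are randomly selected for masking.
Of the selected tokens, 80% are replaced with the special token '[MASK]', 10% are replaced with a random token, and the remaining 10% keep the original token.
The Masked Language Model achieves truly bidirectional encoding, but it has an obvious drawback: it creates a mismatch between pre-training and fine-tuning, because downstream fine-tuning inputs never contain '[MASK]'. To reduce this mismatch, the selected tokens are not always replaced with '[MASK]': 1.5% of all tokens (15% × 10%) are replaced with a random token, which adds noise, and another 1.5% are left unchanged. A second drawback of MLM is that the masked positions are predicted as if they were independent of each other.
Only the selected positions are predicted, not the whole sequence.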
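The 15% selection and the 80/10/10 rule can be sketched as follows. The function `mask_tokens` and its parameters are made up for this illustration; the label value -100 is used only because it is the default `ignore_index` of PyTorch's cross-entropy loss, so unselected positions drop out of the loss.

```python
import torch


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), select_prob=0.15, generator=None):
    """Return (inputs, labels) for MLM; labels are -100 wherever no prediction is made."""
    inputs = token_ids.clone()
    labels = token_ids.clone()

    # Special tokens such as [CLS] and [SEP] are never selected.
    candidates = torch.ones_like(token_ids, dtype=torch.bool)
    for special_id in special_ids:
        candidates &= token_ids != special_id

    # Select roughly 15% of the candidate positions.
    selected = (torch.rand(token_ids.shape, generator=generator) < select_prob) & candidates
    labels[~selected] = -100

    # Split the selected positions 80% / 10% / 10%.
    action = torch.rand(token_ids.shape, generator=generator)
    use_mask = selected & (action < 0.8)
    use_random = selected & (action >= 0.8) & (action < 0.9)
    # The remaining selected positions keep their original token.

    inputs[use_mask] = mask_id
    random_tokens = torch.randint(0, vocab_size, token_ids.shape, generator=generator)
    inputs[use_random] = random_tokens[use_random]
    return inputs, labels


generator = torch.Generator().manual_seed(0)
token_ids = torch.randint(5, 100, (8, 64), generator=generator)   # ids 0..4 reserved (assumption)
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=100, generator=generator)
```

For simplicity the random replacement may draw any id in the vocabulary, including reserved ones; a real pipeline would likely restrict that range.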

Task 2: given a sentence pair A and B, decide whether B is the sentence that follows A.

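With 50% probability the label is IsNext, and with 50% probability it is NotNext.
This resembles the negative sampling idea in Skip-Gram.
NSP is added because a language model alone does not capture relationships between sentences; it is meant to serve sentence-pair downstream tasks such as semantic similarity.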
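Building NSP training pairs might look like the sketch below. The helper `make_nsp_pairs` is hypothetical, and drawing the negative sentence from a different document is an assumption made here so that a negative can never accidentally be the true next sentence.

```python
import random


def make_nsp_pairs(documents, rng):
    """documents: list of documents, each a list of sentences. Returns (A, B, label) triples."""
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B is a random sentence from another document.
                other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
                pairs.append((doc[i], rng.choice(documents[other_idx]), "NotNext"))
    return pairs


documents = [
    ["a1", "a2", "a3"],
    ["b1", "b2", "b3"],
    ["c1", "c2"],
]
pairs = make_nsp_pairs(documents, random.Random(0))
```

Each pair would then be packed as '[CLS] A [SEP] B [SEP]' and the '[CLS]' hidden vector fed to a two-way classifier.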

Fine-tuning stage: downstream tasks

Figure: fine-tuning BERT on different downstream tasks, panels a) to d)

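a) and b) are classification tasks: a) decides, for example, whether two sentences have similar meaning, and b) is single-sentence classification, for example on movie reviews. For these tasks, take the top-layer hidden vector of the special token '[CLS]', feed it to a fully connected layer that maps it to K dimensions (K is the number of classes), and apply softmax.
c) Given a question and a paragraph, mark the position of the answer in the paragraph. The model learns a start vector S with the same dimension as the output hidden vectors, takes its dot product with every hidden vector, and picks the token with the largest score as the start position. It learns an end vector E in the same way to get the end position. One extra constraint applies: the end position must not come before the start position.
d) Add a classification layer and classify every output hidden vector.
All of these tasks add only a small number of new parameters, which are then trained on the task-specific dataset. Judging by the experimental results, good performance is achievable even with very small datasets.
The BERT paper points out that a pre-trained model can be used in two ways, Fine-Tuning and Feature-Based. The main difference: under Fine-Tuning, all parameters of the pre-trained model are trainable and are tuned on the downstream task, together with the (relatively few) parameters of the layers added for that task; under Feature-Based, all parameters of the pre-trained model are frozen, i.e. non-trainable.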
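The task heads for a), b), and c) can be sketched as below. The class names and sizes are invented for this example, and random tensors stand in for the encoder output. The function `best_span` shows one way to enforce the start/end constraint, by scoring every (start, end) pair jointly; that choice is an assumption of this sketch rather than something spelled out in these notes.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Tasks a) and b): classify from the [CLS] hidden vector (position 0)."""

    def __init__(self, hidden, num_classes):
        super().__init__()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, hidden_states):
        cls_vector = hidden_states[:, 0]                     # (batch, hidden)
        return torch.softmax(self.fc(cls_vector), dim=-1)    # (batch, num_classes)


class SpanHead(nn.Module):
    """Task c): a learned start vector S and end vector E scored against every token."""

    def __init__(self, hidden):
        super().__init__()
        self.start = nn.Parameter(torch.randn(hidden))
        self.end = nn.Parameter(torch.randn(hidden))

    def forward(self, hidden_states):
        # Dot product of S (or E) with each hidden vector: (batch, seq)
        return hidden_states @ self.start, hidden_states @ self.end


def best_span(start_scores, end_scores):
    """For one sequence, return (start, end) with the highest summed score and end >= start."""
    pair = start_scores.unsqueeze(1) + end_scores.unsqueeze(0)   # pair[i, j] = start i, end j
    invalid = torch.tril(torch.ones_like(pair), diagonal=-1).bool()   # entries with j < i
    pair = pair.masked_fill(invalid, float("-inf"))
    flat_index = torch.argmax(pair).item()
    seq_len = pair.size(1)
    return flat_index // seq_len, flat_index % seq_len


torch.manual_seed(0)
hidden_states = torch.randn(2, 6, 32)   # stand-in for the encoder's last-layer output
probs = ClassificationHead(32, 3)(hidden_states)
start_scores, end_scores = SpanHead(32)(hidden_states)
start, end = best_span(start_scores[0], end_scores[0])

# Taking each argmax separately would give start=1, end=0 here; the constraint gives (1, 2).
toy_span = best_span(torch.tensor([0.0, 5.0, 0.0]), torch.tensor([3.0, 0.0, 1.0]))
```

For task d), the same `nn.Linear` would simply be applied to every position of `hidden_states` instead of only position 0. During training one would usually keep the raw logits and use a cross-entropy loss rather than applying softmax inside the head.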

BERT in practice

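BERT mainly uses the encoder of the Transformer, followed by a custom head that is determined by the downstream task. The input to the custom head is the hidden vectors output by the last encoder layer. Below is an annotated PyTorch implementation of the Transformer: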

import numpy as np
import torch
from torch import nn


# hyper parameter
d_model = 512
num_heads = 8
num_layers = 6
depth = d_model // num_heads
d_ff = 2048


def get_sinusoid_encoding_table(num_positions, embedding_dim):
    """
    create position embedding table
    :param num_positions: the number of positions
    :param embedding_dim: the dimension for each position
    :return: tensor.shape == (num_positions, embedding_dim)
    """
    def cal_angle(position, d_idx):
        """ calculate angle value of each position in single dimension pos. """
        return position / np.power(10000, 2 * (d_idx // 2) / embedding_dim)

    def get_pos_angle_vec(position):
        """ calculate all angle value of each position in every dimension. """
        return [cal_angle(position, d_idx) for d_idx in range(embedding_dim)]

    sinusoid_table = np.array([get_pos_angle_vec(position) for position in range(num_positions)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    return torch.FloatTensor(sinusoid_table)


def get_attn_pad_mask(seq_q, seq_k):
    """
    Mask the positions of seq_k that hold padding.
    In the encoder, seq_q and seq_k are the same sequence, so len_q == len_k.
    In decoder self-attention, seq_q and seq_k are also the same sequence. In the attention between the
    decoder's first sublayer output and the encoder output, seq_q is the decoder input,
    while seq_k (and seq_v) is the encoder input.
    """
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()

    # PAD index is 0
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)
    return pad_attn_mask.expand(batch_size, len_q, len_k)


def get_attn_subsequence_mask(seq):
    """ when calculate decoder self attention, the current position can only look before. """
    attn_shape = (seq.size(0), seq.size(1), seq.size(1))
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)
    return torch.from_numpy(subsequence_mask).byte()


class ScaledDotProductAttention(nn.Module):
    """
    the core of MultiHeadAttention. firstly calculating scores of K and Q, then scaled scores; secondly masking
    some score based on attention mask; finally softmax score, multiply score and V to create context vectors.
    """
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    @staticmethod
    def forward(Q, K, V, attn_mask):
        """
        :param Q: (batch_size, num_heads, seq_q, depth_q)
        :param K: (batch_size, num_heads, seq_k, depth_k == depth_q)
        :param V: (batch_size, num_heads, seq_v==seq_k, depth_v)
        :param attn_mask: (batch_size, num_heads, seq_q, seq_k)
        :return: context vector: (batch_size, num_heads, seq_q, depth_v)
        """
        # (batch_size, num_heads, seq_q, seq_k)
        score = torch.matmul(Q, K.transpose(-1, -2))
        score = score / np.sqrt(K.size(-1))

        if attn_mask is not None:
            score.masked_fill_(attn_mask, -1e9)

        attn = torch.softmax(score, dim=-1)

        # (batch_size, num_heads, seq_q, depth_v)
        # weight V by the softmax-normalized attention, not by the raw scores
        context = torch.matmul(attn, V)
        return context, attn


class MultiHeadAttention(nn.Module):
    """
    All heads are computed in parallel. In the transformer, MultiHeadAttention is used in three places:
    the encoder's self attention, the decoder's self attention, and the attention between encoder and decoder.
    """
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.WQ = nn.Linear(d_model, d_model, bias=False)  # project inputs into Query space
        self.WK = nn.Linear(d_model, d_model, bias=False)  # project inputs into Key space
        self.WV = nn.Linear(d_model, d_model, bias=False)  # project inputs into Value space
        self.fc = nn.Linear(d_model, d_model, bias=False)  # project concatenated heads into new space
        self.layer_norm = nn.LayerNorm(d_model)  # created once so its parameters are registered and trained

    def forward(self, input_q, input_k, input_v, attn_mask):
        batch_size = input_q.size(0)

        # project inputs -> (B, S, D_new) -> (B, S, H, Depth) -> (B, H, S, Depth)
        Q = self.WQ(input_q).view(batch_size, -1, num_heads, depth).transpose(1, 2)
        K = self.WK(input_k).view(batch_size, -1, num_heads, depth).transpose(1, 2)
        V = self.WV(input_v).view(batch_size, -1, num_heads, depth).transpose(1, 2)  # values use WV, not WK

        # (B, seq_len, seq_len) -> (B, 1, seq_len, seq_len) -> (B, H, seq_len, seq_len)
        attn_mask = attn_mask.unsqueeze(1).repeat(1, num_heads, 1, 1)

        # context.shape: (batch_size, num_heads, seq_q, depth_v)
        # attn.shape: (batch_size, num_heads, seq_q, seq_k)
        context, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)

        # (batch_size, num_heads, seq_q, depth_v) -> (B, seq_q, H, depth) -> (B, seq_q, d_model)
        context = context.transpose(1, 2).reshape(batch_size, -1, d_model)

        # (B, seq_q, d_model) -> (B, seq_q, d_model)
        output = self.fc(context)

        # residual connection followed by layer normalization
        return self.layer_norm(output + input_q), attn


class PosWiseFeedForward(nn.Module):
    """ position-wise feed forward network. """
    def __init__(self):
        super(PosWiseFeedForward, self).__init__()
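        self.proj = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=True),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, bias=True)
        )
        self.layer_norm = nn.LayerNorm(d_model)  # created once so its parameters are registered and trained

    def forward(self, x):
        # residual connection followed by layer normalization
        return self.layer_norm(x + self.proj(x))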


class EncoderLayer(nn.Module):
    """ consist of MultiHeadAttention and PosWiseFeedForward. """
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.attn_layer = MultiHeadAttention()
        self.ff_layer = PosWiseFeedForward()

    def forward(self, x, attn_mask):
        """
        encoder has only self attention, so inputs for multi-head attention is all same.
        :param x: (batch_size, src_len, d_model)
        :param attn_mask: (batch_size, src_len, src_len)
        """
        # outputs.shape: (batch_size, src_len ,d_model)
        # attn.shape: (batch_size, num_heads, src_len, src_len)
        outputs, attn = self.attn_layer(x, x, x, attn_mask)
        outputs = self.ff_layer(outputs)
        return outputs, attn


class Encoder(nn.Module):
    """ consist of stacked encoder layer and word_embedding layer and position embedding layer. """
    def __init__(self, source_vocab_size, max_length=256):
        super(Encoder, self).__init__()
        self.position_embedding = nn.Embedding.from_pretrained(
            get_sinusoid_encoding_table(max_length, d_model), freeze=True)

        self.word_embedding = nn.Embedding(source_vocab_size, d_model)
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(num_layers)])

    def forward(self, x):
        """
        x -> embedding(position + word) -> each_encoder_layer
        :param x: (batch_size, src_len)
        :return: (batch_size, src_len, d_model)
        """
        word_embedding = self.word_embedding(x)  # (batch_size, src_len, d_model)

        # position indices 0..src_len-1 for every sequence in the batch: (batch_size, src_len)
        pos = torch.arange(x.size(1), dtype=torch.long, device=x.device).unsqueeze(0).expand_as(x)
        position_embedding = self.position_embedding(pos)  # (batch_size, src_len, d_model)
        outputs = word_embedding + position_embedding

        # attn_mask.shape: (batch_size, src_len, src_len)
        attn_mask = get_attn_pad_mask(x, x)

        encoder_attentions = list()
        for layer in self.layers:
            outputs, attn = layer(outputs, attn_mask)
            encoder_attentions.append(attn)

        return outputs, encoder_attentions


class DecoderLayer(nn.Module):
    """
    consist of MultiHeadAttention which is used in calculating self attention and cross attention
    and PosWiseFeedForward.
    """
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.self_attn_layer = MultiHeadAttention()
        self.cross_attn_layer = MultiHeadAttention()
        self.ff_layer = PosWiseFeedForward()

    def forward(self, decoder_input, encoder_outputs, decoder_attn_mask, decoder_encoder_attn_mask):
        """
        inputs -> cal self_attention -> cal cross_attention -> project previous outputs
        :param decoder_input: (batch_size, tgt_len, d_model)
        :param encoder_outputs: (batch_size, src_len, d_model)
        :param decoder_attn_mask: (batch_size, tgt_len, tgt_len)
        :param decoder_encoder_attn_mask: (batch_size, tgt_len, src_len)
        """
        # decoder_output.shape: (batch_size, tgt_len ,d_model)
        # decoder_attn.shape: (batch_size, num_heads, tgt_len, tgt_len)
        decoder_outputs, decoder_self_attn\
            = self.self_attn_layer(decoder_input, decoder_input, decoder_input, decoder_attn_mask)

        # decoder_outputs.shape: (batch_size, tgt_len, d_model)
        # decoder_cross_attn.shape: (batch_size, num_heads, tgt_len, src_len)
        decoder_outputs, decoder_cross_attn = \
            self.cross_attn_layer(decoder_outputs, encoder_outputs, encoder_outputs, decoder_encoder_attn_mask)

        decoder_outputs = self.ff_layer(decoder_outputs)
        return decoder_outputs, decoder_self_attn, decoder_cross_attn


class Decoder(nn.Module):
    """ consist of stacked decoder layer and embedding layer(word + position). """
    def __init__(self, target_vocab_size, max_length=256):
        super(Decoder, self).__init__()
        self.word_embedding = nn.Embedding(target_vocab_size, d_model)
        self.position_embedding = nn.Embedding.from_pretrained(
            get_sinusoid_encoding_table(max_length, d_model), freeze=True)

        self.layers = nn.ModuleList([DecoderLayer() for _ in range(num_layers)])

    def forward(self, decoder_inputs, encoder_inputs, encoder_outputs):
        """
        x -> embedding -> each decoder layer
        :param decoder_inputs: (batch_size, tgt_len)
        :param encoder_inputs: (batch_size, src_len)
        :param encoder_outputs: (batch_size, src_len, d_model)
        """
        word_embedding = self.word_embedding(decoder_inputs)

        # position indices 0..tgt_len-1 for every sequence in the batch: (batch_size, tgt_len)
        pos = torch.arange(decoder_inputs.size(1), dtype=torch.long,
                           device=decoder_inputs.device).unsqueeze(0).expand_as(decoder_inputs)
        position_embedding = self.position_embedding(pos)
        decoder_outputs = word_embedding + position_embedding  # (batch_size, tgt_len, d_model)

        decoder_self_attn = get_attn_pad_mask(decoder_inputs, decoder_inputs)  # (batch_size, tgt_len, tgt_len)
        decoder_subsequence_attn = get_attn_subsequence_mask(decoder_inputs)  # (batch_size, tgt_len, tgt_len)

        # combine self attention and subsequence attention in first sublayer of decoder layer
        decoder_self_attn = torch.gt((decoder_self_attn + decoder_subsequence_attn), 0)

        decoder_cross_attn = get_attn_pad_mask(decoder_inputs, encoder_inputs)  # (batch_size, tgt_len, src_len)

        decoder_self_attentions, decoder_cross_attentions = list(), list()
        for layer in self.layers:
            # decoder_outputs.shape: (batch_size, tgt_len, d_model)
            # self_attn.shape: (batch_size, num_heads, tgt_len, tgt_len)
            # cross_attn.shape: (batch_size, num_heads, tgt_len, src_len)
            decoder_outputs, self_attn, cross_attn =\
                layer(decoder_outputs, encoder_outputs, decoder_self_attn, decoder_cross_attn)

            decoder_self_attentions.append(self_attn)
            decoder_cross_attentions.append(cross_attn)

        return decoder_outputs, decoder_self_attentions, decoder_cross_attentions


class Transformer(nn.Module):
    """ https://blog.csdn.net/qq_37236745/article/details/107352273 """
    def __init__(self, source_vocab_size, target_vocab_size):
        super(Transformer, self).__init__()
        self.encoder = Encoder(source_vocab_size=source_vocab_size)
        self.decoder = Decoder(target_vocab_size=target_vocab_size)
        self.proj_layer = nn.Linear(d_model, target_vocab_size, bias=False)

    def forward(self, encoder_inputs, decoder_inputs):
        """
        :param encoder_inputs: (batch_size, src_len)
        :param decoder_inputs: (batch_size, tgt_len)
        """
        # encoder_outputs.shape: (batch_size, src_len ,d_model)
        # encoder_attentions.shape: (batch_size, num_heads, src_len, src_len)
        encoder_outputs, encoder_attentions = self.encoder(encoder_inputs)

        # decoder_outputs.shape: (batch_size, tgt_len, d_model)
        decoder_outputs, decoder_self_attentions, decoder_cross_attentions =\
            self.decoder(decoder_inputs, encoder_inputs, encoder_outputs)

        decoder_outputs = self.proj_layer(decoder_outputs)
        return decoder_outputs, encoder_attentions, decoder_self_attentions, decoder_cross_attentions

References:

1. Understanding and testing BERT and the Transformer
2. Transformer and BERT in natural language processing
3. BERT series
4. The difference between the Fine-Tuning and Feature-Based strategies
5. https://blog.csdn.net/qq_37236745/article/details/107352273
