Paper Notes: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

Introduction

"Semi-supervised Sequence Learning"

The paper "Semi-supervised Sequence Learning" proposes two pre-training methods to improve the generalization of LSTM models. The first, which the authors call the "sequence autoencoder", is essentially a seq2seq structure: an RNN encoder turns the input into a single vector, and an LSTM decoder, initialized with that vector, performs a language-modeling task whose objective is to reconstruct the original input. Note that the decoder reads the same sequence as the encoder. The second method, the "recurrent language model", is simply the sequence autoencoder with the encoder part removed.

Figure: sequence autoencoder structure
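A minimal sketch of the sequence-autoencoder idea (my own illustrative code, not the paper's; all class and variable names are assumptions): an LSTM encoder compresses the input into its final hidden state, which then initializes an LSTM decoder trained to reproduce the same input sequence.

import torch
from torch import nn


class SequenceAutoencoder(nn.Module):
    """ sketch: encode the input sequence, then decode it back (reconstruction objective). """
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len); the decoder reads the same sequence as the encoder
        emb = self.embedding(tokens)
        _, (h, c) = self.encoder(emb)             # final state summarizes the whole input
        dec_out, _ = self.decoder(emb, (h, c))    # decoder initialized with the encoder state
        return self.output(dec_out)               # predict the original tokens (LM-style loss)


model = SequenceAutoencoder(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 12)))   # (2, 12, 1000)

Dropping the encoder and keeping only the decoder-as-language-model gives the "recurrent language model" variant.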

"Deep contextualized word representations"

This paper proposes ELMo (Embeddings from Language Models), a pre-training method that tackles the problem that ordinary word embeddings are context-independent. It trains a forward and a backward multi-layer LSTM on the language-modeling task; the two directions share the embedding layer and the softmax layer while keeping all other parameters separate. Each token is then represented by concatenating the hidden vectors of the two directions, so the encoding takes both left and right context into account. The approach is fairly blunt, but it provides one way of combining the preceding and following context.
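A rough sketch of the "concatenate hidden vectors" idea (illustrative names, not the ELMo code; the real ELMo additionally mixes the layers with learned task-specific weights): two independent multi-layer LSTMs read the sequence left-to-right and right-to-left over a shared embedding, and each token is represented by concatenating the two directions' hidden states.

import torch
from torch import nn


class TinyBiLM(nn.Module):
    """ sketch of ELMo-style context encoding: shared embedding, separate forward/backward LSTMs. """
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # shared by both directions
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, tokens):
        emb = self.embedding(tokens)                        # (batch, seq, embed_dim)
        fwd_out, _ = self.fwd_lstm(emb)                     # left-to-right states
        bwd_out, _ = self.bwd_lstm(emb.flip(dims=[1]))      # right-to-left: reverse, encode,
        bwd_out = bwd_out.flip(dims=[1])                    # then flip back to the original order
        return torch.cat([fwd_out, bwd_out], dim=-1)        # (batch, seq, 2 * hidden_dim)


reps = TinyBiLM(vocab_size=1000)(torch.randint(0, 1000, (2, 10)))   # (2, 10, 512)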
 

Similarities and Differences of BERT, GPT, and ELMo

All three were published in 2018; here is a brief note on how they compare.

  • Along the horizontal direction (of the comparison figure in the BERT paper), each 'Lstm' in ELMo refers to a single unit of an LSTM layer; likewise, each 'Trm' in BERT and GPT refers to a single Transformer unit from "Attention Is All You Need". BERT uses only the encoder part, while GPT uses only the decoder part.
  • ELMo and GPT are essentially unidirectional language models (ELMo merely concatenates two separately trained directions), whereas BERT trains a bidirectional language model.
  • The fine-tuning approach adjusts all of the model's parameters on the target task, including the embedding layer. The feature-based approach, by contrast, extracts features from the input sequence and, for each downstream task, learns a different (e.g. linear) combination of those features to produce the final result; the pre-trained parameters stay frozen during the downstream task and only the task-specific parameters are trained. I have not used the feature-based approach much; this is just a record of my understanding from written material. A minimal code sketch of the two setups follows this list.
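A tiny sketch of that difference in code terms (the encoder and head below are hypothetical stand-ins, and in practice you would pick one of the two setups): fine-tuning keeps every parameter trainable, while a feature-based setup freezes the pre-trained part and trains only the task head.

import torch
from torch import nn

# stand-ins for a pre-trained encoder and a newly added task head (illustrative only)
pretrained_encoder = nn.Sequential(nn.Embedding(30522, 768), nn.Linear(768, 768))
task_head = nn.Linear(768, 2)

# fine-tuning approach: every parameter, including the embedding layer, stays trainable
finetune_params = list(pretrained_encoder.parameters()) + list(task_head.parameters())
finetune_optimizer = torch.optim.Adam(finetune_params, lr=2e-5)

# feature-based approach: freeze the pre-trained parameters, train only the task head
for p in pretrained_encoder.parameters():
    p.requires_grad = False
feature_based_optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)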
     

BERT

pretraining & fine-tuning

BERT follows the pretraining & fine-tuning paradigm. In the pre-training stage it performs the MLM (Masked Language Model) and NSP (Next Sentence Prediction) tasks; other objectives are also possible, for example a causal language model like GPT-2's. In the fine-tuning stage, the model is initialized with the pre-trained parameters; each downstream task adds its own task-specific layers, and during training all parameters of both the shared layers and the task-specific layers are updated together.

BERT Input

Figure: BERT input

  • The input representation of each token in the sequence combines three parts: a token embedding, a position embedding, and a segment (sentence) embedding; the sum of the three is the token's input. All three embeddings are randomly initialized and learned by the model (a sketch follows this list).
  • The special token '[CLS]' stands for classification. After passing through the stacked encoder layers, the hidden state of '[CLS]' carries sentence-level information that can be used for classification tasks.
  • The special token '[SEP]' marks the boundary between two sentences and is encoded like any other (special) token.
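A minimal sketch of how the three embeddings could be combined (dimensions and names are illustrative, not the official BERT code; the real model also applies LayerNorm and dropout on top of the sum):

import torch
from torch import nn


class BertInputEmbedding(nn.Module):
    """ token + position + segment embeddings, summed, as described above. """
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, d_model=768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)    # learned, not sinusoidal
        self.segment_embedding = nn.Embedding(num_segments, d_model)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        return (self.token_embedding(token_ids)
                + self.position_embedding(positions)
                + self.segment_embedding(segment_ids))


emb = BertInputEmbedding()(torch.randint(0, 30522, (2, 16)), torch.zeros(2, 16, dtype=torch.long))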
     

1.3 The Two Pre-training Tasks

The pre-training loss is the sum of the two tasks' losses.
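In code this is just two cross-entropy terms added together; the tensors below are dummy stand-ins for what the model and the data pipeline would produce.

import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 8, 1000

mlm_logits = torch.randn(batch, seq_len, vocab_size)                  # MLM predictions per position
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)     # -100 = not masked, ignored
mlm_labels[:, 3] = 42                                                 # pretend position 3 was masked
nsp_logits = torch.randn(batch, 2)                                    # IsNext vs NotNext from [CLS]
nsp_labels = torch.tensor([0, 1])

mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = mlm_loss + nsp_loss                                      # pre-training loss = MLM + NSP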

Task 1: use the Masked Language Model as a cloze ("fill in the blank") task.

  • 15% of the tokens in the original sentence are randomly chosen to be masked.
  • Of the chosen tokens, 80% are replaced with the special token '[MASK]', 10% are replaced with a random token, and the remaining 10% keep the original token (see the sketch after this list).
  • Although the Masked Language Model achieves genuinely bidirectional encoding, it has an obvious drawback: it introduces a mismatch between pre-training and fine-tuning, because the inputs of downstream fine-tuning tasks contain no '[MASK]' token. Randomly replacing 1.5% of the tokens (15% × 10%) with other words injects noise that reduces the impact of this mismatch. Another drawback of MLM is that at prediction time the '[MASK]' positions are assumed to be independent of one another.
  • Only the masked positions are predicted, not the whole sequence.
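A sketch of the 15% / 80-10-10 masking procedure on a list of token ids (the special id and vocabulary size below are made-up values for illustration):

import random

MASK_ID = 103          # illustrative id for '[MASK]'
VOCAB_SIZE = 30522     # illustrative vocabulary size


def mask_tokens(token_ids, mask_prob=0.15):
    """ return (masked_inputs, labels); labels are -100 except at the masked positions. """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                                   # ~85% of tokens are left untouched
        labels[i] = tok                                # always predict the original token
        dice = random.random()
        if dice < 0.8:
            inputs[i] = MASK_ID                        # 80%: replace with '[MASK]'
        elif dice < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


masked, labels = mask_tokens([7592, 2088, 2003, 2307, 999])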

Task 2: given a sentence pair A and B, decide whether sentence B is the next sentence after sentence A.

  • With 50% probability the label is IsNext, and with 50% probability it is NotNext (a construction sketch follows this list).
  • This is similar in spirit to negative sampling in Skip-Gram.
  • NSP is added to compensate for the language model's inability to capture relations between sentences; it is meant to serve sentence-pair downstream tasks such as semantic similarity.
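A sketch of how NSP training pairs could be built from a list of consecutive sentences (function and variable names are illustrative):

import random


def make_nsp_pair(sentences, idx):
    """ build one (sentence_a, sentence_b, label) example; label 0 = IsNext, 1 = NotNext. """
    sentence_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return sentence_a, sentences[idx + 1], 0       # 50%: the true next sentence
    rand_idx = random.randrange(len(sentences))        # 50%: a random sentence from the corpus
    return sentence_a, sentences[rand_idx], 1          # (a real pipeline would avoid picking idx + 1)


corpus = ["the cat sat on the mat", "it fell asleep", "stock prices rose today"]
pair = make_nsp_pair(corpus, 0)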
     

1.4 The Fine-tuning Stage: Downstream Tasks


  • Tasks a) and b) (in the BERT paper's fine-tuning figure) are classification tasks: a) handles sentence pairs, for example deciding whether two sentences have similar meanings, while b) is single-sentence classification, for example movie-review sentiment. For a downstream classification task it is enough to take the top-layer hidden vector of the special token '[CLS]', pass it through a fully connected layer that maps it to K dimensions (K being the number of classes), and apply a softmax (a sketch of this head follows the list).
  • Task c): given a question and a paragraph, mark the exact span of the answer inside the paragraph. A start vector S with the same dimension as the output hidden vectors is learned and dotted with every hidden vector; the token with the largest score is taken as the start position. An end vector E is learned and used in the same way to obtain the end position, with the additional constraint that the end position must come after the start position.
  • Task d): add a classification layer and make a prediction for every output hidden vector.
  • All of these tasks only add a small number of new parameters and then train on the task-specific data set; the experiments show that even small data sets reach decent performance.
    The BERT paper points out two strategies for using a pre-trained model: Fine-Tuning and Feature-Based. The main difference is that under the Fine-Tuning strategy all parameters of the pre-trained model are adjusted for the downstream task, i.e. every pre-trained parameter is trainable, and the (relatively few) parameters of the newly added task-specific layers are trainable as well; under the Feature-Based strategy all parameters of the pre-trained model are frozen, i.e. non-trainable.
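A sketch of the sentence-classification head from a) and b): take the top-layer hidden vector of '[CLS]', map it to K classes with a fully connected layer, then apply softmax (module names and dimensions are illustrative):

import torch
from torch import nn


class BertClassificationHead(nn.Module):
    """ map the [CLS] hidden vector to K classes. """
    def __init__(self, d_model=768, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, encoder_outputs):
        # encoder_outputs: (batch, seq_len, d_model); '[CLS]' is the first position
        cls_vector = encoder_outputs[:, 0, :]
        return torch.softmax(self.classifier(cls_vector), dim=-1)


probs = BertClassificationHead()(torch.randn(4, 128, 768))   # (4, 2)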
     

Part 2: BERT in Practice

BERT mainly uses the encoder of the Transformer, followed by a custom head determined by the downstream task. The custom head takes as input the hidden vectors output by the last encoder layer. Below is an annotated PyTorch implementation of the Transformer:

import numpy as np
import torch
from torch import nn


# hyper parameter
d_model = 512
num_heads = 8
num_layers = 6
depth = d_model // num_heads
d_ff = 2048


def get_sinusoid_encoding_table(num_positions, embedding_dim):
    """
    create position embedding table
    :param num_positions: the number of positions
    :param embedding_dim: the dimension for each position
    :return: tensor.shape == (num_position, embedding_dim)
    """
    def cal_angle(position, d_idx):
        """ calculate angle value of each position in single dimension pos. """
        return position / np.power(10000, 2 * (d_idx // 2) / embedding_dim)

    def get_pos_angle_vec(position):
        """ calculate all angle value of each position in every dimension. """
        return [cal_angle(position, d_idx) for d_idx in range(embedding_dim)]

    sinusoid_table = np.array([get_pos_angle_vec(position) for position in range(num_positions)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    return torch.FloatTensor(sinusoid_table)


def get_attn_pad_mask(seq_q, seq_k):
    """
    mask the positions whose value is the padding index (assumed to be 0 here).
    in the encoder, seq_q and seq_k are the same, so len_q equals len_k;
    in the decoder self-attention, seq_q and seq_k are also the same; in the
    encoder-decoder (cross) attention, seq_q comes from the decoder side while
    seq_k / seq_v come from the encoder side.
    """
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()

    # PAD index is 0
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)
    return pad_attn_mask.expand(batch_size, len_q, len_k)


def get_attn_subsequence_mask(seq):
    """ when calculate decoder self attention, the current position can only look before. """
    attn_shape = (seq.size(0), seq.size(1), seq.size(1))
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)
    return torch.from_numpy(subsequence_mask).byte()


class ScaledDotProductAttention(nn.Module):
    """
    the core of MultiHeadAttention. first compute the attention scores between Q and K and scale
    them; then mask some scores using the attention mask; finally apply softmax and multiply the
    attention weights with V to create the context vectors.
    """
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    @staticmethod
    def forward(Q, K, V, attn_mask):
        """
        :param Q: (batch_size, num_heads, seq_q, depth_q)
        :param K: (batch_size, num_heads, seq_k, depth_k == depth_q)
        :param V: (batch_size, num_heads, seq_v==seq_k, depth_v)
        :param attn_mask: (batch_size, num_heads, seq_q, seq_k)
        :return: context vector: (batch_size, num_heads, seq_q, depth_v)
        """
        # (batch_size, num_heads, seq_q, seq_k)
        score = torch.matmul(Q, K.transpose(-1, -2))
        score = score / np.sqrt(K.size(-1))

        if attn_mask is not None:
            score.masked_fill_(attn_mask, -1e9)

        attn = torch.softmax(score, dim=-1)

        # (batch_size, num_heads, seq_q, depth_v)
        context = torch.matmul(attn, V)  # use the softmax-ed attention weights, not the raw scores
        return context, attn


class MultiHeadAttention(nn.Module):
    """"
    MultiHeadAttention is running parallel. In the transformer, the MultiHeadAttention will be called three times.
    including calculate encoder's self attention, decoder's self attention, attention between encoder and decoder.
    """
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.WQ = nn.Linear(d_model, d_model, bias=False)  # project inputs into Query space
        self.WK = nn.Linear(d_model, d_model, bias=False)  # project inputs into Key space
        self.WV = nn.Linear(d_model, d_model, bias=False)  # project inputs into Value space
        self.fc = nn.Linear(d_model, d_model, bias=False)  # project concatenated heads into new space
        self.layer_norm = nn.LayerNorm(d_model)            # for the residual connection + LayerNorm

    def forward(self, input_q, input_k, input_v, attn_mask):
        batch_size = input_q.size(0)

        # project inputs -> (B, S, D_new) -> (B, S, H, Depth) -> (B, H, S, Depth)
        Q = self.WQ(input_q).view(batch_size, -1, num_heads, depth).transpose(1, 2)
        K = self.WK(input_k).view(batch_size, -1, num_heads, depth).transpose(1, 2)
        V = self.WV(input_v).view(batch_size, -1, num_heads, depth).transpose(1, 2)

        # (B, seq_len, seq_len) -> (B, 1, seq_len, seq_len) -> (B, H, seq_len, seq_len)
        attn_mask = attn_mask.unsqueeze(1).repeat(1, num_heads, 1, 1)

        # context.shape: (batch_size, num_heads, seq_q, depth_v)
        # attn.shape: (batch_size, num_heads, seq_q, seq_k)
        context, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)

        # (batch_size, num_heads, seq_q, depth_v) -> (B, seq_q, H, depth) -> (B, seq_q, d_model)
        context = context.transpose(1, 2).reshape(batch_size, -1, d_model)

        # (B, seq_q, d_model) -> (B, seq_len, d_model)
        output = self.fc(context)

        return self.layer_norm(output + input_q), attn  # residual connection + LayerNorm


class PosWiseFeedForward(nn.Module):
    """ position-wise feed forward network. """
    def __init__(self):
        super(PosWiseFeedForward, self).__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=True),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, bias=True)
        )
        self.layer_norm = nn.LayerNorm(d_model)  # for the residual connection + LayerNorm

    def forward(self, x):
        return self.layer_norm(x + self.proj(x))  # residual connection + LayerNorm


class EncoderLayer(nn.Module):
    """ consist of MultiHeadAttention and PosWiseFeedFroward. """
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.attn_layer = MultiHeadAttention()
        self.ff_layer = PosWiseFeedForward()

    def forward(self, x, attn_mask):
        """
        encoder has only self attention, so inputs for multi-head attention is all same.
        :param x: (batch_size, src_len, d_model)
        :param attn_mask: (batch_size, src_len, src_len)
        """
        # outputs.shape: (batch_size, src_len ,d_model)
        # attn.shape: (batch_size, num_heads, src_len, src_len)
        outputs, attn = self.attn_layer(x, x, x, attn_mask)
        outputs = self.ff_layer(outputs)
        return outputs, attn


class Encoder(nn.Module):
    """ consist of stacked encoder layer and word_embedding layer and position embedding layer. """
    def __init__(self, source_vocab_size, max_length=256):
        super(Encoder, self).__init__()
        self.position_embedding = nn.Embedding.from_pretrained(
            get_sinusoid_encoding_table(max_length, d_model), freeze=True)

        self.word_embedding = nn.Embedding(source_vocab_size, d_model)
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(num_layers)])

    def forward(self, x):
        """
        x -> embedding(position + word) -> each_encoder_layer
        :param x: (batch_size, src_len)
        :return: (batch_size, src_len, d_model)
        """
        word_embedding = self.word_embedding(x)  # (batch_size, src_len, d_model)

        # position indices shared across the batch: (batch_size, src_len)
        pos = torch.arange(x.size(1), dtype=torch.long, device=x.device).unsqueeze(0).expand(x.size(0), -1)
        position_embedding = self.position_embedding(pos)  # (batch_size, src_len, d_model)
        outputs = word_embedding + position_embedding

        # attn_mask.shape: (batch_size, src_len, src_len)
        attn_mask = get_attn_pad_mask(x, x)

        encoder_attentions = list()
        for layer in self.layers:
            outputs, attn = layer(outputs, attn_mask)
            encoder_attentions.append(attn)

        return outputs, encoder_attentions


class DecoderLayer(nn.Module):
    """
    consist of MultiHeadAttention which is used in calculating self attention and cross attention
    and PosWiseFeedForward.
    """
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.self_attn_layer = MultiHeadAttention()
        self.cross_attn_layer = MultiHeadAttention()
        self.ff_layer = PosWiseFeedForward()

    def forward(self, decoder_input, encoder_outputs, decoder_attn_mask, decoder_encoder_attn_mask):
        """
        inputs -> self attention -> cross attention -> position-wise feed forward
        :param decoder_input: (batch_size, tgt_len, d_model)
        :param encoder_outputs: (batch_size, src_len, d_model)
        :param decoder_attn_mask: (batch_size, tgt_len, tgt_len)
        :param decoder_encoder_attn_mask: (batch_size, tgt_len, src_len)
        """
        # decoder_output.shape: (batch_size, tgt_len ,d_model)
        # decoder_attn.shape: (batch_size, num_heads, tgt_len, tgt_len)
        decoder_outputs, decoder_self_attn\
            = self.self_attn_layer(decoder_input, decoder_input, decoder_input, decoder_attn_mask)

        # decoder_outputs.shape: (batch_size, tgt_len, d_model)
        # decoder_cross_attn.shape: (batch_size, num_heads, tgt_len, src_len)
        decoder_outputs, decoder_cross_attn = \
            self.cross_attn_layer(decoder_outputs, encoder_outputs, encoder_outputs, decoder_encoder_attn_mask)

        decoder_outputs = self.ff_layer(decoder_outputs)
        return decoder_outputs, decoder_self_attn, decoder_cross_attn


class Decoder(nn.Module):
    """ consist of stacked decoder layer and embedding layer(word + position). """
    def __init__(self, target_vocab_size, max_length=256):
        super(Decoder, self).__init__()
        self.word_embedding = nn.Embedding(target_vocab_size, d_model)
        self.position_embedding = nn.Embedding.from_pretrained(
            get_sinusoid_encoding_table(max_length, d_model), freeze=True)

        self.layers = nn.ModuleList([DecoderLayer() for _ in range(num_layers)])

    def forward(self, decoder_inputs, encoder_inputs, encoder_outputs):
        """
        x -> embedding -> each decoder layer
        :param decoder_inputs: (batch_size, tgt_len)
        :param encoder_inputs: (batch_size, src_len)
        :param encoder_outputs: (batch_size, src_len, d_model)
        """
        word_embedding = self.word_embedding(decoder_inputs)

        # position indices shared across the batch: (batch_size, tgt_len)
        pos = torch.arange(decoder_inputs.size(1), dtype=torch.long,
                           device=decoder_inputs.device).unsqueeze(0).expand(decoder_inputs.size(0), -1)
        position_embedding = self.position_embedding(pos)
        decoder_outputs = word_embedding + position_embedding  # (batch_size, tgt_len, d_model)

        decoder_self_attn = get_attn_pad_mask(decoder_inputs, decoder_inputs)  # (batch_size, tgt_len, tgt_len)
        decoder_subsequence_attn = get_attn_subsequence_mask(decoder_inputs)  # (batch_size, tgt_len, tgt_len)

        # combine self attention and subsequence attention in first sublayer of decoder layer
        decoder_self_attn = torch.gt((decoder_self_attn + decoder_subsequence_attn), 0)

        decoder_cross_attn = get_attn_pad_mask(decoder_inputs, encoder_inputs)  # (batch_size, tgt_len, src_len)

        decoder_self_attentions, decoder_cross_attentions = list(), list()
        for layer in self.layers:
            # decoder_outputs.shape: (batch_size, tgt_len, d_model)
            # self_attn.shape: (batch_size, num_heads, tgt_len, tgt_len)
            # cross_attn.shape: (batch_size, num_heads, tgt_len, src_len)
            decoder_outputs, self_attn, cross_attn =\
                layer(decoder_outputs, encoder_outputs, decoder_self_attn, decoder_cross_attn)

            decoder_self_attentions.append(self_attn)
            decoder_cross_attentions.append(cross_attn)

        return decoder_outputs, decoder_self_attentions, decoder_cross_attentions


class Transformer(nn.Module):
    """ https://blog.csdn.net/qq_37236745/article/details/107352273 """
    def __init__(self, source_vocab_size, target_vocab_size):
        super(Transformer, self).__init__()
        self.encoder = Encoder(source_vocab_size=source_vocab_size)
        self.decoder = Decoder(target_vocab_size=target_vocab_size)
        self.proj_layer = nn.Linear(d_model, target_vocab_size, bias=False)

    def forward(self, encoder_inputs, decoder_inputs):
        """
        :param encoder_inputs: (batch_size, src_len)
        :param decoder_inputs: (batch_size, tgt_len)
        """
        # encoder_outputs.shape: (batch_size, src_len ,d_model)
        # encoder_attentions: list of per-layer tensors of shape (batch_size, num_heads, src_len, src_len)
        encoder_outputs, encoder_attentions = self.encoder(encoder_inputs)

        # decoder_outputs.shape: (batch_size, tgt_len, d_model)
        decoder_outputs, decoder_self_attentions, decoder_cross_attentions =\
            self.decoder(decoder_inputs, encoder_inputs, encoder_outputs)

        decoder_outputs = self.proj_layer(decoder_outputs)
        return decoder_outputs, encoder_attentions, decoder_self_attentions, decoder_cross_attentions

 

References:

1. Understanding and testing BERT and Transformer
2. Transformer and BERT in natural language processing
3. BERT series
4. The difference between the Fine-Tuning and Feature-Based strategies
5. https://blog.csdn.net/qq_37236745/article/details/107352273
