BERT Code Walkthrough and Model Overview

小樊努力努力再努力

Tags: python nlp bert

Article link: https://blog.csdn.net/weixin_51130521/article/details/124230861

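This article walks through the PyTorch implementation of BERT, focusing on the forward pass of BertModel: the word embeddings, the Transformer encoder, and the pooling layer. It explains the meaning of each parameter, such as input_ids, attention_mask, and token_type_ids, and discusses the model's return values. It then works through the key components (BertEmbeddings, BertEncoder, BertLayer, and others) to show step by step how BERT works.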

Preface

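A note up front: I only half understand many of these things myself. This is a record of my learning process and my personal views, and I still need to read a lot more code. You have to be patient and read the code; there is no other way. Read enough of it and it will come naturally.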

Since I don't use TensorFlow, this covers the PyTorch version of BERT, available at https://github.com/huggingface/pytorch-pretrained-BERT . The main content is in the pytorch_pretrained_bert/modeling file.

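BertModel is mainly a Transformer encoder structure, made up of three parts:

embeddings, an instance of the BertEmbeddings class, corresponding to the word embeddings;

encoder, an instance of the BertEncoder class;

pooler, an instance of the BertPooler class; this part is optional.

Addendum: note that BertModel can also be configured as a decoder.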

The parameters of BertModel's forward pass and its return values are described below:

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_values=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ): ...

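input_ids: the list of vocabulary indices of the subwords produced by the tokenizer;

attention_mask: used during self-attention to distinguish real subword positions from padding; real tokens are marked 1 and padding positions 0;

token_type_ids: marks which sentence each subword belongs to (first sentence / second sentence / padding);

position_ids: the position index of each token within the input sequence;

head_mask: used to disable specific attention heads in specific layers;

inputs_embeds: if provided, input_ids is not needed; the embedding lookup is skipped and this tensor goes into the encoder computation directly as the embedding;

encoder_hidden_states: takes effect when BertModel is configured as a decoder; the decoder layers then run cross-attention over these states in addition to self-attention;

encoder_attention_mask: as above; used in cross-attention to mark the padding of the encoder-side input;

past_key_values: passes in the key and value tensors cached from earlier decoding steps (the separate K and V states, not their product), so that they are not recomputed at every step;

use_cache: if true, the key/value states are saved and returned so they can be passed back in as past_key_values, which speeds up decoding;

output_attentions: whether to return the attention weights of every layer;

output_hidden_states: whether to return the output of every layer;

return_dict: whether to return the output as key-value pairs (a ModelOutput object, which can also be used like a tuple); defaults to true.

Addendum: note that disabling attention through head_mask is different from attention-head pruning: head_mask only multiplies the attention results of the selected heads by the given coefficient.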
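
To make the first four parameters concrete, here is a minimal sketch of what they might look like for one padded sentence pair. The token ids are made up for illustration (the layout assumes `[CLS]`=101, `[SEP]`=102 and `[PAD]`=0, which is an assumption about the vocabulary, not something stated above); only the shapes and the 0/1 patterns matter.

```python
import torch

# Hypothetical ids for "[CLS] a b [SEP] c d [SEP] [PAD]" (values are illustrative only).
input_ids = torch.tensor([[101, 7, 8, 102, 9, 10, 102, 0]])

# 1 for real tokens, 0 for padding (assumes the padding id is 0).
attention_mask = (input_ids != 0).long()

# 0 for the first sentence (with [CLS] and its [SEP]), 1 for the second sentence.
# The padding position is simply left at 0 here.
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 0]])

# Position index of every token: 0, 1, ..., seq_length - 1.
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

print(attention_mask)
print(position_ids)
```

All four tensors share the shape (batch_size, seq_length), which is what the embedding layer described later expects.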

The return section looks like this:

    # Return section of BertModel's forward pass
        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )

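As you can see, the return value contains not only the outputs of the encoder and the pooler, but also the other requested outputs (hidden_states, attentions, and so on; these are in encoder_outputs[1:]) for convenient access:

        # Return section of BertEncoder's forward pass, i.e. the encoder_outputs above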
        if not return_dict:
            return tuple(
                v
                for v in [
                    hidden_states,
                    next_decoder_cache,
                    all_hidden_states,
                    all_self_attentions,
                    all_cross_attentions,
                ]
                if v is not None
            )
        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=hidden_states,
            past_key_values=next_decoder_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
            cross_attentions=all_cross_attentions,
        )
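
One practical consequence of the tuple form is that entries equal to None are dropped, so the position of a given output depends on which outputs were requested. The sketch below reproduces only that filtering logic with placeholder strings; `encoder_tuple` is a hypothetical helper written for this illustration, not a function from the library.

```python
def encoder_tuple(hidden_states, next_decoder_cache=None, all_hidden_states=None,
                  all_self_attentions=None, all_cross_attentions=None):
    """Mimic the `if not return_dict` branch: keep only the non-None entries."""
    return tuple(
        v
        for v in [hidden_states, next_decoder_cache, all_hidden_states,
                  all_self_attentions, all_cross_attentions]
        if v is not None
    )

# Placeholder strings stand in for tensors.
only_last = encoder_tuple("last")
with_attn = encoder_tuple("last", all_self_attentions="attn")
with_both = encoder_tuple("last", all_hidden_states="hs", all_self_attentions="attn")

print(only_last)   # ('last',)
print(with_attn)   # ('last', 'attn')
print(with_both)   # ('last', 'hs', 'attn')
```

The attentions sit at index 1 in one call and at index 2 in another, which is one reason the keyed ModelOutput form is easier to use.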

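BertModel walkthrough

We start from BertModel's forward function. The code from here on follows the older pytorch_pretrained_bert version, so its signatures (for example output_all_encoded_layers) differ from the forward signature listed above.

Step 1: Prepare the inputs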

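# Reshape attention_mask to (batch_size, 1, 1, to_seq_length) so that it broadcasts over the heads and the query positions
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
 
# In the original mask, 1 marks useful tokens and 0 marks padding. The line below changes this to: 0 for useful tokens, -10000 for padding.
# (Why? The mask is added to the attention scores right before the softmax, and exp(-10000) is effectively 0.)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

# Feed input_ids and token_type_ids into the embeddings layer to build the input of the next layer
embedding_output = self.embeddings(input_ids, token_type_ids)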
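
The mask transformation is easy to check by hand. The following is a small sketch with a single four-token sequence whose last token is padding; the raw scores are invented numbers, used only to show what the softmax does with the -10000 entry.

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0]])             # (batch_size=1, seq_length=4)

# Same steps as above: add two broadcast dimensions, cast, then map 1 -> 0 and 0 -> -10000.
extended = attention_mask.unsqueeze(1).unsqueeze(2)        # (1, 1, 1, 4)
extended = extended.to(dtype=torch.float32)
extended = (1.0 - extended) * -10000.0

# Made-up raw attention scores of one query against the four keys.
scores = torch.tensor([3.0, 4.0, -10.0, 3.0])
probs = torch.softmax(scores + extended[0, 0, 0], dim=-1)

print(extended.shape)
print(probs)   # the last entry (the padding position) is effectively 0
```

Because the mask has shape (batch_size, 1, 1, seq_length), the same row is broadcast to every head and every query position when it is added to the score tensor.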

BertEmbeddings in detail

# The inputs are input_ids and token_type_ids, both of shape (batch_size, seq_length)
#.....................................................................
# Generate position_ids
# If a sequence has length seq_length, the generated position_ids are [0, 1, 2, ..., seq_length - 1]
# position_ids exist to build the Position Embeddings described in the paper
seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
# Expand to the same shape as input_ids
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
#.....................................................................
 
 
#....................................................................
# If token_type_ids is None, the whole input is treated as sentence A by default. (For the meaning of sentence A, see the paper.)
if token_type_ids is None:
    token_type_ids = torch.zeros_like(input_ids)
 
#....................................................................
 
 
#....................................................................
# These three lines look up the three kinds of embeddings for the inputs (the three embeddings in the paper)
# word_embeddings is an nn.Embedding module: a lookup table that maps every index in the input to its embedding vector.
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
#....................................................................
 
 
#.....................................................................
# An important step in the paper: the three embeddings are added together to represent the token.
embeddings = words_embeddings + position_embeddings + token_type_embeddings
#.....................................................................
 
#.........................................................................
# Pass the result through the LayerNorm layer and the dropout layer to get the final output, then return it
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
#.........................................................................
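
The lookup-and-sum can be tried outside the class. This is a minimal standalone sketch with toy sizes (BERT-base would use a hidden size of 768); the variable names mirror the snippet above, but the sizes and input ids are assumptions chosen for illustration, and dropout is left out so the result is deterministic.

```python
import torch
from torch import nn

# Toy sizes, chosen only to keep the example small.
vocab_size, max_positions, type_vocab_size, hidden_size = 100, 16, 2, 8

word_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_positions, hidden_size)
token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)

input_ids = torch.tensor([[5, 6, 7, 0]])                   # (batch_size=1, seq_length=4)
token_type_ids = torch.zeros_like(input_ids)                # everything is sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0).expand_as(input_ids)

# Three lookups of identical shape, summed element-wise, then normalized.
embeddings = (word_embeddings(input_ids)
              + position_embeddings(position_ids)
              + token_type_embeddings(token_type_ids))
embeddings = layer_norm(embeddings)

print(embeddings.shape)   # (batch_size, seq_length, hidden_size)
```

The extra last dimension (hidden_size) is the one that every later layer keeps as its input and output width.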

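BertLayerNorm in detail

# The differences between LayerNorm and BatchNorm, and what each is for, are explained in many places online
u = x.mean(-1, keepdim=True)
s = (x - u).pow(2).mean(-1, keepdim=True)
x = (x - u) / torch.sqrt(s + self.variance_epsilon)
return self.weight * x + self.bias
# The code above implements y = weight * (x - u) / sqrt(s + variance_epsilon) + bias, where x is a vector, u is a scalar (the mean of x) and s is the variance of x, so the denominator is the standard deviation.
# variance_epsilon is a very small number, added so that the denominator cannot be 0 when the variance is 0.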
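
As a sanity check, the hand-written normalization can be compared with PyTorch's built-in layer norm. The sketch below assumes a toy hidden size of 8 and uses `torch.nn.functional.layer_norm` as the reference; `manual_layer_norm` is a name introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, weight, bias, eps=1e-12):
    """Same arithmetic as the snippet above, written as a free function."""
    u = x.mean(-1, keepdim=True)                    # mean over the hidden dimension
    s = (x - u).pow(2).mean(-1, keepdim=True)       # (biased) variance over the hidden dimension
    return weight * (x - u) / torch.sqrt(s + eps) + bias

torch.manual_seed(0)
x = torch.randn(2, 3, 8)                            # (batch_size, seq_length, hidden_size)
weight, bias = torch.ones(8), torch.zeros(8)

expected = F.layer_norm(x, (8,), weight, bias, eps=1e-12)
same = torch.allclose(manual_layer_norm(x, weight, bias), expected, atol=1e-5)
print(same)
```

Each token vector is normalized on its own, over its hidden dimension, which is the key difference from batch normalization.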

Step 2: Enter the Transformer layers, the most important part

Once the embedding output has been produced, it goes straight into the encoder

# Take the output of the embeddings layer and feed it into the encoder to get the final output, encoded_layers
embedding_output = self.embeddings(input_ids, token_type_ids)
encoded_layers = self.encoder(embedding_output, extended_attention_mask,output_all_encoded_layers=output_all_encoded_layers)

BertEncoder in detail

# BertEncoder builds the whole Transformer architecture
# Transformer architecture reference: https://zhuanlan.zhihu.com/p/39034683        （BE CAUTIOUS!）
# From here on I assume you know this architecture; I reuse some of the terminology from the Zhihu article above
 
#........................................................................
# The Transformer contains several encoder layers (12 for base and 24 for large in the paper); in the code each encoder layer is one BertLayer.
# So the code below first creates a single layer, then builds num_hidden_layers (12 or 24) copies of it in a list, which is self.layer
layer = BertLayer(config)
self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])
#........................................................................
 
#........................................................................
# Now look at its forward function
def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True):
# Its inputs:
# hidden_states: as explained above, this is embedding_output, of shape [batch_size, seq_length, word_dimension]; the embedding step added one dimension
# attention_mask: shape [batch_size, 1, 1, seq_length]
# (0 for useful tokens, -10000 for padding, as prepared in Step 1)
# output_all_encoded_layers: the output mode of this function, explained in detail below
 
# What does this function actually do? It is simple: a loop that feeds the output of each encoder layer into the next one, until all 12 (or 24) layers are done
    all_encoder_layers = []
    # Iterate over all encoder layers, 12 or 24 in total
    for layer_module in self.layer:
        # The output hidden_states of each layer is the input of the next layer_module (BertLayer), which chains the encoder layers together. The input of the first layer is embedding_output
        hidden_states = layer_module(hidden_states, attention_mask)
        # If output_all_encoded_layers == True, append the result of every layer to all_encoder_layers
        if output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
    # If output_all_encoded_layers == False, append only the output of the last layer to all_encoder_layers
    if not output_all_encoded_layers:
        all_encoder_layers.append(hidden_states)
    return all_encoder_layers
# So output_all_encoded_layers controls the output mode.
# That completes the overall Transformer framework. Next we look at how each encoder layer (i.e. BertLayer) is built
#........................................................................
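
The loop can be exercised on its own with stand-in layers. In the sketch below, `ToyLayer` is a hypothetical replacement for BertLayer (it ignores the mask and just applies a linear map plus tanh), and `run_encoder` is the loop above written as a free function; both names are introduced here for illustration only.

```python
import torch
from torch import nn

class ToyLayer(nn.Module):
    """Hypothetical stand-in for BertLayer: same call signature, same output shape."""
    def __init__(self, hidden_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask):
        return torch.tanh(self.dense(hidden_states))   # the mask is unused in this toy

def run_encoder(layers, hidden_states, attention_mask, output_all_encoded_layers=True):
    all_encoder_layers = []
    for layer_module in layers:
        # The output of one layer becomes the input of the next.
        hidden_states = layer_module(hidden_states, attention_mask)
        if output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
    if not output_all_encoded_layers:
        all_encoder_layers.append(hidden_states)
    return all_encoder_layers

torch.manual_seed(0)
layers = nn.ModuleList([ToyLayer(8) for _ in range(3)])
x = torch.randn(2, 4, 8)                    # (batch_size, seq_length, hidden_size)
mask = torch.zeros(2, 1, 1, 4)

all_outputs = run_encoder(layers, x, mask, output_all_encoded_layers=True)
last_only = run_encoder(layers, x, mask, output_all_encoded_layers=False)
print(len(all_outputs), len(last_only))
```

Either way the function returns a list, so `[-1]` always gives the output of the last layer.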

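BertLayer in detail

BertLayer is the most tedious layer, because it repeatedly calls other layers; sorting it out takes patience

# Each BertLayer is one encoder layer. As explained above, its inputs are hidden_states and attention_mask, and it produces the input needed by the next layer: a new hidden_states.
# So what happens to hidden_states inside BertLayer? There are three parts:
# 1. It goes through the attention layer (the famous self-attention)
attention_output = self.attention(hidden_states, attention_mask)
# 2. An intermediate layer
# (a fully connected layer plus an activation; see BertIntermediate below)
intermediate_output = self.intermediate(attention_output)
# 3. An output layer, whose result is returned
# (a fully connected layer, dropout, and LayerNorm over the sum with attention_output; see BertOutput below)
layer_output = self.output(intermediate_output, attention_output)
 
# These three layers are explained in detail below, starting with the famous attention layer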

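BertAttention in detail

Unfortunately, this attention layer again refers to other layers, one nested inside another. For ease of reading, I explain those referenced layers together in this one jupyter cell, instead of giving each module its own jupyter cell as before.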

#............................................
# BertAttention takes two inputs: one is input_tensor (the earlier hidden_states; embedding_output for the first layer), of shape [batch_size, seq_length, word_dimension]
# The other is attention_mask, of shape (batch_size, 1, 1, seq_length)
# Inside BertAttention, the input first goes through BertSelfAttention, then through BertSelfOutput, which gives the output
def forward(self, input_tensor, attention_mask):
    self_output = self.self(input_tensor, attention_mask)     # BertSelfAttention layer
    attention_output = self.output(self_output, input_tensor)  # BertSelfOutput layer
    return attention_output
#............................................
 
 
 
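#.....................................................................
# Now for the exciting self-attention layer; not giving it a jupyter cell of its own hardly does it justice...
# The attention layer is complicated enough that I have to explain its __init__ method first, since it involves many parameters
 
        # num_attention_heads: Number of attention heads for each attention layer in the Transformer encoder
        # The number of heads, 12 in the config. My understanding: the 12 heads are 12 attention computations run in parallel (not 12 encoders), each on its own 64-dimensional slice, and their results are concatenated to give the final result
        self.num_attention_heads = config.num_attention_heads
        
        # attention_head_size: the size of each head, the total size (hidden_size, 768) divided by the number of heads, i.e. 768/12=64
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        # all_head_size is the same as hidden_size here (both are the total size, 768); it is recomputed from the per-head size so the reshaping code below can refer to it directly
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        
        # Each of these declares a weight matrix of size hidden_size * all_head_size, i.e. 768*768 (plus a bias)
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)
        # Dropout layer
        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)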
 
# The forward function below is the heart of it all: how attention is actually computed!
 
    # First the inputs. hidden_states: (batch_size, seq_length, word_dimension = hidden_size = 768) (look closely at the embedding code: its output dimension is indeed hidden_size)
    # The other input: attention_mask (batch_size, 1, 1, seq_length)
    def forward(self, hidden_states, attention_mask):
        
        # A quick word on query, key and value. Put simply, query and key determine the weights, and the weights are then applied to value to get the attention output
        # For a detailed explanation, see the article linked above. Now the code
        
        # First, a simple matrix multiplication (these matrices are what gets trained)
        # Each of the three lines below is [batch_size, seq_length, hidden_size] * [hidden_size, all_head_size]
        # The result is [batch_size, seq_length, all_head_size = 768]
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        
        
        # The same operation is applied to all three below: transpose_for_scores
        # What does it do? It turns a [batch_size, seq_length, all_head_size = 768] tensor into
        # [batch_size, num_attention_heads=12, seq_length, attention_head_size=64]
        # I won't go through its code here; it takes up space (and I'm lazy)
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)
 
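        # The next four lines of code (note: four lines) compute the weights.
        # First query is multiplied by key, giving a tensor of shape [batch_size, num_attention_heads, seq_length, seq_length]
        # Does this shape remind you of anything? Let me explain:
        # Look at the last two dimensions, A[seq_length, seq_length]. Self-attention is a sequence attending to itself. If a sentence has length seq_length, what does this 2D matrix represent?
        # A[0][0] can be read as the influence (attention) weight of word 0 on word 0, and A[0][1] as the influence (attention) weight of word 1 on word 0
        # So A[i][j] is the influence (attention) weight of word j on word i. If that is still unclear, take "I am so handsome" as an example (the numbers are made up):
        #                                   I    am    so    handsome
        #                               I   3    4     -10    3
        #                               am  4    6     9      1
        #                               so  2    4     1      2
        #                         handsome  3    12    1      0
        # From this table, the influence weight of "am" on "so" is 4. (A[2][1])
        # That covers the last two dimensions; the first two are easier. num_attention_heads means there are num_attention_heads heads, and therefore num_attention_heads
        # such weight matrices. batch_size is the number of sentences.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        
        # Scale the scores by 1/sqrt(attention_head_size). Without this, the dot products grow with the head dimension and push the softmax into regions with very small gradients (the reason given in the original Transformer paper)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)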
        
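        # Another key point! attention_mask is now added to the scores. Recall the properties of attention_mask
        # Shape: [batch_size,1,1,seq_length]
        # 0 marks useful tokens, -10000 marks useless or padding tokens.
        # Add them, then apply softmax to get the final weights.
        # So why -10000?
        # Here is my own humble take
        # Suppose "handsome" in the example above is padding. Then it carries no information, and attention_mask = [0,0,0,-10000]
        # After adding this to the matrix above, the last column becomes very small: -9997, -9999, -9998, -10000. The other three columns have 0 added, so they are unchanged.
        # Softmax is then applied to the sums (look up softmax if it is unfamiliar). Take the first row, which is (3, 4, -10, -9997)
        # After the softmax, e to the power of -9997 is essentially 0, so the influence of "handsome" on "I" is essentially 0. Since "handsome" is padding, it should
        # have no influence on the other words in the first place. So the point of -10000 is to remove the influence of padding tokens on the other tokens!
        attention_scores = attention_scores + attention_mask
        attention_probs = nn.Softmax(dim=-1)(attention_scores)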
        
        
        # The next comment is a remark from the original source code...
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)
        
        
        # The result (the weights) is multiplied by value, which gives the final attention output!
        # The shape becomes: [batch_size, num_attention_heads, seq_length, attention_head_size]
        context_layer = torch.matmul(attention_probs, value_layer)
        # The next three lines turn the [batch_size, num_attention_heads, seq_length, attention_head_size] layout back into
        # [batch_size, seq_length, all_head_size], right back where we started...
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        return context_layer
#.....................................................................
 
 
 
#.....................................................................
# If you have read this far, congratulations: you have seen the core code of BERT. What follows is some additional processing.
# After self-attention comes a layer called BertSelfOutput. It is very simple and does 3 things:
# 1. a fully connected layer 2. a dropout layer 3. a LayerNorm layer
# Note that LayerNorm is applied to the sum of the result and the layer's input (a residual connection).
# The final output has shape [batch_size, seq_length, hidden_size=768]
def forward(self, hidden_states, input_tensor):
    hidden_states = self.dense(hidden_states)
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.LayerNorm(hidden_states + input_tensor)
    return hidden_states
# over!!!!!!!!!!
#..................................................................
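
The shape bookkeeping in the self-attention walkthrough, including the transpose_for_scores step that was skipped, can be sketched as a pair of free functions. This is a minimal illustration under stated assumptions: toy sizes (hidden size 8, 2 heads), q/k/v standing in for the already-projected query, key and value tensors, and no dropout. The function names follow the text, but the code is a reconstruction of what the step does, not a copy of the library source.

```python
import math
import torch

def transpose_for_scores(x, num_heads):
    """(batch, seq, all_head_size) -> (batch, num_heads, seq, head_size)."""
    batch_size, seq_length, all_head_size = x.shape
    x = x.view(batch_size, seq_length, num_heads, all_head_size // num_heads)
    return x.permute(0, 2, 1, 3)

def self_attention(q, k, v, attention_mask, num_heads):
    q, k, v = (transpose_for_scores(t, num_heads) for t in (q, k, v))
    # Raw scores, scaled by 1/sqrt(head_size): (batch, heads, seq, seq)
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    # Additive mask: 0 keeps a key position, -10000 removes it after the softmax.
    probs = torch.softmax(scores + attention_mask, dim=-1)
    context = torch.matmul(probs, v)                        # (batch, heads, seq, head_size)
    # Merge the heads back: (batch, seq, all_head_size)
    context = context.permute(0, 2, 1, 3).contiguous()
    context = context.view(context.size(0), context.size(1), -1)
    return context, probs

torch.manual_seed(0)
batch_size, seq_length, hidden_size, num_heads = 1, 4, 8, 2
q = torch.randn(batch_size, seq_length, hidden_size)
k = torch.randn(batch_size, seq_length, hidden_size)
v = torch.randn(batch_size, seq_length, hidden_size)
# The last token plays the role of padding, as "handsome" does in the example.
mask = torch.tensor([0.0, 0.0, 0.0, -10000.0]).view(1, 1, 1, seq_length)

context, probs = self_attention(q, k, v, mask, num_heads)
print(context.shape, probs.shape)
print(probs[0, 0])   # the last column is effectively 0 in every row
```

Every row of `probs` still sums to 1; the weight that the padding position would have received is redistributed over the real tokens.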

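Note that we are still inside BertLayer at this point; we have only made it out of BertLayer's attention layer. Next comes the second part of BertLayer: intermediate

This layer, of course, is very simple~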

#.......................................................................
# After attention comes a layer called BertIntermediate. It is very simple and does two things:
# 1. a fully connected layer 2. an activation layer
# Specifically, the input is [batch_size, seq_length, hidden_size = 768]
    
    def forward(self, hidden_states):
        # [batch_size, seq_length, hidden_size = 768] * [hidden_size, intermediate_size = 4*768] (this is the setting in both the paper and the code)
        hidden_states = self.dense(hidden_states)
        # Next is the activation function; which one is used is decided in this class's __init__ method.
        hidden_states = self.intermediate_act_fn(hidden_states)
        # Then return; the shape is now [batch_size, seq_length, intermediate_size=4*768]
        return hidden_states
#.....................................................................

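After the so-called intermediate layer comes the last part of BertLayer, the BertOutput layer. It is almost identical to the earlier BertSelfOutput layer, only with different parameters, so I won't explain it in detail

# Input shape: [batch_size, seq_length, intermediate_size=4*768]
# Output shape: [batch_size, seq_length, hidden_size=768]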
class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
 
    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
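
Put together, the intermediate and output layers form the feed-forward half of a BertLayer. Here is a minimal sketch of the shapes involved, with toy sizes (8 and 32 instead of 768 and 3072). Using GELU as the activation is an assumption made for this example (the text only says the choice is made in __init__), and dropout is omitted.

```python
import torch
from torch import nn
import torch.nn.functional as F

hidden_size, intermediate_size = 8, 32      # toy sizes; BERT-base uses 768 and 4*768

intermediate_dense = nn.Linear(hidden_size, intermediate_size)
output_dense = nn.Linear(intermediate_size, hidden_size)
layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)

torch.manual_seed(0)
attention_output = torch.randn(2, 4, hidden_size)           # (batch_size, seq_length, hidden_size)

# BertIntermediate: expand to intermediate_size, then apply the activation (GELU assumed here).
intermediate_output = F.gelu(intermediate_dense(attention_output))

# BertOutput: project back to hidden_size, add the residual input, then LayerNorm.
layer_output = layer_norm(output_dense(intermediate_output) + attention_output)

print(intermediate_output.shape, layer_output.shape)
```

The residual input to BertOutput is attention_output, not intermediate_output, which is why BertLayer passes both tensors to it.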

Step 3: Prepare the outputs

# At this point we can finally turn our attention back to the BertModel module.
# Recall this line: the output we get depends on output_all_encoded_layers
# If output_all_encoded_layers == True, we get the outputs of all encoder layers
# If output_all_encoded_layers == False, we get only the output of the last encoder layer
encoded_layers = self.encoder(embedding_output, extended_attention_mask,output_all_encoded_layers=output_all_encoded_layers)
#
 
# Take the output of the last layer
sequence_output = encoded_layers[-1]
# The last layer's output goes through the pooler layer to give pooled_output. What is pooled_output for? That is explained in the pooler section below
pooled_output = self.pooler(sequence_output)
if not output_all_encoded_layers:
    encoded_layers = encoded_layers[-1]
return encoded_layers, pooled_output

BertPooler in detail

# As explained above, the input of the pooler layer is the output of the last Transformer layer, [batch_size, seq_length, hidden_size]
def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        
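        # Take the first token of every sequence (the [CLS] token), then apply a fully connected layer and an activation. The output can be used for classification and other downstream tasks (i.e. the first token's representation stands in for the whole sequence)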
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
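
The pooling step is small enough to reproduce in a few lines. The sketch below uses a toy hidden size of 8 and assumes tanh as the pooler's activation; the tensors are random and serve only to show the shapes.

```python
import torch
from torch import nn

hidden_size = 8                                     # toy size; BERT-base uses 768
dense = nn.Linear(hidden_size, hidden_size)

torch.manual_seed(0)
sequence_output = torch.randn(2, 4, hidden_size)    # (batch_size, seq_length, hidden_size)

# Keep only position 0 of every sequence: (batch_size, hidden_size)
first_token_tensor = sequence_output[:, 0]
pooled_output = torch.tanh(dense(first_token_tensor))   # tanh assumed as the activation

print(pooled_output.shape)
```

The sequence dimension disappears, so each input sequence ends up with a single fixed-size vector that a classifier head can consume.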

The End
