[Project Training 11] BERT in PyTorch

SophoraeT_t

Last modified 2024-06-24 11:20:54

Category: Court Judgment Documents Project. Tags: pytorch, bert, artificial intelligence

First published 2024-06-22 17:00:00

Article link: https://blog.csdn.net/SophoraeT_t/article/details/139874057

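BERT (Bidirectional Encoder Representations from Transformers) is one of the major breakthroughs in natural language processing (NLP) in recent years. By combining pre-training with a bidirectional Transformer architecture, it achieves significant performance improvements. Understanding how BERT is implemented under the hood is essential for further research and application. This article walks through a PyTorch implementation of BERT and explains its core components and implementation details.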

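BERT Model Overview

The core ideas behind BERT are bidirectionality and pre-training. By pre-training on large amounts of unlabeled text, BERT learns general-purpose language representations; the model is then fine-tuned on a specific downstream task, where it achieves strong performance.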

Basic Components of BERT

BERT consists of the following main components:

Embedding layer
Multi-layer bidirectional Transformer encoder
Pre-training tasks

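Embedding Layer

The embedding layer converts the input token sequence into embedding vectors that the Transformer can process. BERT's embedding layer combines three embeddings: word (token) embeddings, position embeddings, and segment embeddings.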

import torch
import torch.nn as nn


class BertEmbeddings(nn.Module):
    """Sums word, position, and segment embeddings, then applies LayerNorm and dropout."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        # input_ids: (batch_size, seq_length)
        seq_length = input_ids.size(1)
        # Positions 0..seq_length-1, repeated for every sequence in the batch.
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

        # With no segment ids given, treat every token as belonging to segment 0.
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        words_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        # Element-wise sum: (batch_size, seq_length, hidden_size)
        embeddings = words_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

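In this implementation:

word_embeddings: the word embedding layer, which maps each token ID to its embedding vector.
position_embeddings: the position embedding layer, which injects information about each token's position in the sequence.
token_type_embeddings: the segment embedding layer, which distinguishes the segments (for example, sentence A and sentence B) of the input.
forward method: computes the embedding of the input as the sum of the three embeddings, then applies layer normalization and dropout.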
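
To make the tensor shapes concrete, the following is a minimal sketch of the same sum of three embeddings, written with standalone nn.Embedding layers instead of the BertEmbeddings class. The toy configuration values and the token IDs are made up for illustration and are far smaller than those of a real BERT model.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

# Hypothetical toy configuration; real BERT models use much larger values.
config = SimpleNamespace(vocab_size=100, hidden_size=16,
                         max_position_embeddings=32, type_vocab_size=2)

torch.manual_seed(0)
word_emb = nn.Embedding(config.vocab_size, config.hidden_size)
pos_emb = nn.Embedding(config.max_position_embeddings, config.hidden_size)
type_emb = nn.Embedding(config.type_vocab_size, config.hidden_size)
layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)

# A batch of 2 sequences with 5 tokens each (made-up token IDs).
input_ids = torch.tensor([[1, 7, 9, 2, 0], [1, 4, 2, 8, 2]])
# The first three tokens belong to segment 0, the last two to segment 1.
token_type_ids = torch.tensor([[0, 0, 0, 1, 1], [0, 0, 0, 1, 1]])

# Positions 0..4, repeated for each sequence in the batch.
seq_length = input_ids.size(1)
position_ids = torch.arange(seq_length).unsqueeze(0).expand_as(input_ids)

# Sum the three embeddings, then normalize over the hidden dimension.
embeddings = layer_norm(
    word_emb(input_ids) + pos_emb(position_ids) + type_emb(token_type_ids)
)
print(embeddings.shape)  # torch.Size([2, 5, 16])
```

All three lookups produce tensors of shape (batch_size, seq_length, hidden_size), which is why they can simply be added together.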

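Multi-Layer Bidirectional Transformer Encoder

The heart of BERT is a stack of bidirectional Transformer encoder layers. Each layer consists of a multi-head self-attention mechanism followed by a feed-forward network.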

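class BertLayer(nn.Module):
    """One Transformer encoder layer: self-attention followed by a feed-forward network."""

    def __init__(self, config):
        super().__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask):
        # Multi-head self-attention over the whole sequence.
        attention_output = self.attention(hidden_states, attention_mask)
        # Feed-forward network: expand, then project back to hidden_size.
        intermediate_output = self.intermediate(attention_output)
        # attention_output is passed again so the output layer can use it alongside the projection.
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output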

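In this implementation:

attention: the multi-head self-attention mechanism.
intermediate: the intermediate (expansion) layer of the feed-forward network.
output: the output layer of the feed-forward network; it also receives attention_output as a second input.
forward method: passes the input through the attention mechanism and then through the feed-forward network.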
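
BertIntermediate and BertOutput are used by BertLayer but are not shown in this article. The sketch below is one plausible way to write them, following the usual layout of BERT implementations: a linear expansion with GELU, then a linear projection back to hidden_size with dropout, a residual connection, and LayerNorm. The attribute names and the toy configuration are assumptions made for illustration, not the article's own code.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace


class BertIntermediate(nn.Module):
    """Assumed sketch: expands hidden_size to intermediate_size and applies GELU."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return self.act(self.dense(hidden_states))


class BertOutput(nn.Module):
    """Assumed sketch: projects back to hidden_size, then residual + LayerNorm."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dropout(self.dense(hidden_states))
        # Residual connection with the attention output, then normalization.
        return self.LayerNorm(hidden_states + input_tensor)


# Hypothetical toy configuration for a quick shape check.
config = SimpleNamespace(hidden_size=16, intermediate_size=64, hidden_dropout_prob=0.1)
intermediate, output = BertIntermediate(config), BertOutput(config)

attention_output = torch.randn(2, 5, config.hidden_size)
intermediate_output = intermediate(attention_output)
layer_output = output(intermediate_output, attention_output)
print(intermediate_output.shape, layer_output.shape)
```

The second argument of BertOutput.forward is what lets BertLayer pass attention_output in for the residual connection.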

Multi-Head Self-Attention

import math

import torch
import torch.nn as nn


class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        # BertSelfOutput is defined elsewhere in the implementation.
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask):
        self_output = self.self(input_tensor, attention_mask)
        # input_tensor is passed again alongside the attention result.
        attention_output = self.output(self_output, input_tensor)
        return attention_output

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.hidden_size // config.num_attention_heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        # (batch, seq_len, all_head_size) -> (batch, num_heads, seq_len, head_size)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len)
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # attention_mask is additive: 0 for visible positions, a large negative value for masked ones.
        attention_scores = attention_scores + attention_mask

        attention_probs = torch.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)

        # Weighted sum of the values, then merge the heads back into one dimension.
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        return context_layer

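In this implementation:

query, key, and value: three linear layers that produce the queries, keys, and values.
transpose_for_scores: reshapes a tensor from (batch, seq_len, all_head_size) to (batch, num_heads, seq_len, head_size) so that attention scores can be computed per head.
forward method: computes the scaled attention scores, adds the attention mask, normalizes with softmax, and uses the resulting weights to take a weighted sum of the values, producing the context layer.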
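
Because the forward method adds attention_mask to the scores rather than multiplying by it, the mask has to be in additive form. The following sketch shows one common way to derive such a mask from a 0/1 padding mask; the helper name make_additive_mask and the constant -10000.0 are assumptions for illustration, since the article does not show where the mask is built.

```python
import torch


def make_additive_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    """Convert a (batch, seq_len) mask with 1 = real token, 0 = padding into an
    additive mask of shape (batch, 1, 1, seq_len) that broadcasts over the
    head and query dimensions of the attention scores."""
    extended = padding_mask[:, None, None, :].to(torch.float32)
    # Real tokens get 0.0, padded positions get a large negative value.
    return (1.0 - extended) * -10000.0


# One sequence with three real tokens followed by two padding tokens.
padding_mask = torch.tensor([[1, 1, 1, 0, 0]])
additive_mask = make_additive_mask(padding_mask)

# Toy scores of shape (batch=1, heads=2, query_len=5, key_len=5), all equal.
scores = torch.zeros(1, 2, 5, 5)
probs = torch.softmax(scores + additive_mask, dim=-1)
print(additive_mask.shape)
print(probs[0, 0, 0])
```

After the softmax, the padded key positions receive practically zero weight, and the attention is spread evenly over the three real tokens in this toy case.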

Pre-training Tasks

BERT is pre-trained on two key tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Masked Language Model

The MLM task randomly masks some of the tokens in the input sequence and asks the model to predict the masked tokens.

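class BertForMaskedLM(nn.Module):
    # BertModel and BertOnlyMLMHead are defined elsewhere in the implementation.
    def __init__(self, config):
        super().__init__()
        self.bert = BertModel(config)
        self.cls = BertOnlyMLMHead(config)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None):
        outputs = self.bert(input_ids, token_type_ids, attention_mask)
        # outputs[0] is the per-token sequence output.
        sequence_output = outputs[0]
        # One score per vocabulary entry for every position.
        prediction_scores = self.cls(sequence_output)
        # masked_lm_labels is accepted but unused: this simplified version returns only the scores, not a loss.
        return prediction_scores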
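
The article does not show how the masked inputs and labels are produced. As a rough sketch, the function below follows the masking recipe from the original BERT paper: about 15% of the tokens are selected, and of those 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The token IDs, the vocabulary size, and the function name mask_tokens are assumptions for illustration; the label value -100 is chosen because it is the default ignore_index of torch.nn.CrossEntropyLoss.

```python
import random

MASK_ID = 103                # assumed id of [MASK]
VOCAB_SIZE = 30522           # assumed vocabulary size
SPECIAL_IDS = {0, 101, 102}  # assumed ids of [PAD], [CLS], [SEP]
IGNORE_INDEX = -100          # positions with this label are skipped by the loss


def mask_tokens(input_ids, mask_prob=0.15, rng=random):
    """Return (masked_ids, labels) for one sequence of token ids."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select the others with probability mask_prob.
        if tok in SPECIAL_IDS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


rng = random.Random(0)
ids = [101, 2023, 2003, 1037, 7099, 6251, 102]
masked_ids, labels = mask_tokens(ids, rng=rng)
print(masked_ids)
print(labels)
```

A loss would then be computed only at positions whose label is not -100, by comparing prediction_scores at those positions with the original token ids.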

Next Sentence Prediction

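The NSP task asks the model to decide whether the second of two sentences actually follows the first in the original text.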

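class BertForNextSentencePrediction(nn.Module):
    # BertModel and BertOnlyNSPHead are defined elsewhere in the implementation.
    def __init__(self, config):
        super().__init__()
        self.bert = BertModel(config)
        self.cls = BertOnlyNSPHead(config)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, next_sentence_label=None):
        outputs = self.bert(input_ids, token_type_ids, attention_mask)
        # outputs[1] is the pooled, sequence-level output.
        pooled_output = outputs[1]
        # Scores for the sentence-pair relationship.
        seq_relationship_score = self.cls(pooled_output)
        # next_sentence_label is accepted but unused: this simplified version returns only the scores, not a loss.
        return seq_relationship_score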
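
For NSP, the two sentences are packed into a single input, and token_type_ids tells the model which tokens belong to which sentence. The sketch below shows one way to build input_ids, token_type_ids, and a 0/1 attention_mask for a sentence pair. The special-token IDs and the helper name encode_pair are assumptions for illustration, and the word IDs are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids of [CLS], [SEP], [PAD]


def encode_pair(tokens_a, tokens_b, max_len):
    """Pack two token-id lists as [CLS] A [SEP] B [SEP], padded to max_len."""
    input_ids = [CLS_ID] + tokens_a + [SEP_ID] + tokens_b + [SEP_ID]
    if len(input_ids) > max_len:
        raise ValueError("sentence pair is longer than max_len")
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)
    return (
        input_ids + [PAD_ID] * pad,
        token_type_ids + [0] * pad,
        attention_mask + [0] * pad,
    )


input_ids, token_type_ids, attention_mask = encode_pair([7, 8], [9, 10, 11], max_len=10)
print(input_ids)       # [101, 7, 8, 102, 9, 10, 11, 102, 0, 0]
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Note that this attention_mask is a plain 0/1 mask; it would still have to be converted to the additive form that BertSelfAttention adds to the attention scores.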