Biomacromolecule Platform (12)
2021SC@SDUSC
0 This Week's Work
This week I studied the code of the transformers library, focusing on the BERT components.
1 Code Walkthrough
The Transformers library provides thousands of pretrained models that support text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages. It offers APIs for quickly downloading and using these models, so you can apply a pretrained model to a given text, fine-tune it on your own dataset, and then share it with the community through the model hub. It supports PyTorch, which makes it convenient for our use.
1.1 Using the model with a pipeline
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
1.2 Calling the model from PyTorch
Import the relevant classes and build the tokenizer. The tokenizer is responsible for preprocessing the text: it first splits the text into tokens (words, or finer-grained subwords, punctuation marks, etc.), then converts each token into an id through a look-up table.
Then load the pretrained model, pass in the input text, embed it, and run the model to obtain the output:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
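To make the look-up-table step concrete, it helps to inspect what the tokenizer and the model actually return: encoded_input is a dict of tensors, and output exposes last_hidden_state and pooler_output (the shapes below assume the snippet above):

print(encoded_input["input_ids"])  # ids from the look-up table, incl. [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(encoded_input["input_ids"][0].tolist()))  # the tokens
print(output.last_hidden_state.shape)  # per-token encodings: [1, seq_len, 768]
print(output.pooler_output.shape)      # pooled [CLS] encoding: [1, 768]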
1.3 The BertForMaskedLM class
BERT is pretrained with two unsupervised prediction tasks: masked language modeling (Masked LM) and next sentence prediction.
BertForMaskedLM takes the sequence encoding produced by BertModel and pretrains with the Masked LM task: a portion of the words in a sentence are randomly covered with [MASK], and the model predicts the covered words. The forward pass runs as follows: BertModel first yields sequence_output, the encoding of every token in the sentence; prediction_scores then holds a predicted score for every word in the vocabulary. A cross-entropy loss loss_fct is defined; prediction_scores.view(-1, self.config.vocab_size) reshapes prediction_scores into a 2-D matrix whose second dimension is the vocabulary size, labels.view(-1) flattens the labels into a 1-D tensor, and the loss is then computed over these (a sketch of this computation follows).
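Here is a minimal sketch of that loss computation with toy tensors of my own; only the shapes and the -100 ignore-index convention come from the real code:

import torch
from torch.nn import CrossEntropyLoss

vocab_size = 30522                                  # bert-base-uncased vocabulary size
prediction_scores = torch.randn(2, 8, vocab_size)   # [batch, seq_len, vocab_size]
labels = torch.randint(0, vocab_size, (2, 8))       # true ids at masked positions
labels[0, :4] = -100                                # -100 = position ignored by the loss

loss_fct = CrossEntropyLoss()                       # ignore_index defaults to -100
masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size), labels.view(-1))
print(masked_lm_loss)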
1.4 Walking through the BERT forward pass
- The input to transformer_model is embedding_output ([batch_size, max_sequence_length, hidden_size]).
- To avoid converting the representation back and forth between 2-D and 3-D, it is kept 2-D throughout processing.
- embedding_output is therefore reshaped into prev_output ([batch_size * max_sequence_length, hidden_size]).
- It then passes through num_hidden_layers transformer blocks, yielding all_encoder_layers (num_hidden_layers * [batch_size, max_sequence_length, hidden_size]).
- Each block consists of two sublayers, and the previous block's output prev_output ([batch_size * max_sequence_length, hidden_size]) serves as layer_input ([batch_size * max_sequence_length, hidden_size]).
- The input first goes through attention_layer to produce attention_head; if there are several attention_heads they are concatenated into attention_output (in BERT's case there appears to be only self-attention, but in general there could be attention over other sequences).
- attention_output is projected to hidden_size with a fully connected layer; after dropout it is added to layer_input ([batch_size * max_sequence_length, hidden_size]) (a residual shortcut),
- followed by layer_norm. The second sublayer is a fully connected layer with a gelu activation, producing intermediate_output ([batch_size * max_sequence_length, intermediate_size]),
- which is projected back to layer_output ([batch_size * max_sequence_length, hidden_size]); after dropout it is added to attention_output ([batch_size * max_sequence_length, hidden_size]) (another shortcut), and a final layer_norm gives the block's output (see the sketch after this list).
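Below is a minimal PyTorch sketch of one such block. It is my own code, not the library's: it keeps the 3-D shape instead of the 2-D reshape described above, and it uses nn.MultiheadAttention (batch_first requires torch >= 1.9), whose internal output projection plays the role of the fully connected projection back to hidden_size:

import torch
from torch import nn

hidden_size, intermediate_size, num_heads, p_drop = 768, 3072, 12, 0.1  # bert-base sizes

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        # first sublayer: multi-head self-attention
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.drop1 = nn.Dropout(p_drop)
        self.norm1 = nn.LayerNorm(hidden_size)
        # second sublayer: gelu feed-forward up to intermediate_size and back down
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.drop2 = nn.Dropout(p_drop)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):                               # x: [batch, seq_len, hidden_size]
        attn_output, _ = self.attn(x, x, x)              # self-attention: q = k = v = x
        x = self.norm1(x + self.drop1(attn_output))      # shortcut + layer_norm
        return self.norm2(x + self.drop2(self.ffn(x)))   # shortcut + layer_norm

x = torch.randn(2, 16, hidden_size)
print(Block()(x).shape)                                  # torch.Size([2, 16, 768])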
1.5 BertEmbeddings and BertSelfAttention code
- adam_v and adam_m are variables that AdamWeightDecayOptimizer uses to compute m and v; they are not needed when loading a pretrained model.
- The embeddings are built from the word, position and token_type embeddings.
- self.LayerNorm is not snake-cased on purpose: it keeps the TensorFlow model variable name so that any TensorFlow checkpoint file can be loaded.
- position_ids (1, len position emb) is contiguous in memory and exported when the model is serialized.
- token_type_ids is set to the buffer registered in the constructor, where it is all zeros; this registered buffer lets users trace the model without passing token_type_ids.
- If the module is instantiated as a cross-attention module, the keys and values come from the encoder; the attention mask then has to ensure that the encoder's padding tokens are not attended to.
import math

import torch
from packaging import version
from torch import nn


class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased so the TensorFlow variable name is kept
        # and any TensorFlow checkpoint file can be loaded
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
        if version.parse(torch.__version__) > version.parse("1.6.0"):
            self.register_buffer(
                "token_type_ids",
                torch.zeros(self.position_ids.size(), dtype=torch.long),
                persistent=False,
            )

    def forward(
        self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
    ):
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        # fall back to the all-zero buffered token_type_ids when none are passed
        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        # sum the embeddings, then normalize and apply dropout
        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
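from transformers import BertConfig

# Sanity check (my own sketch, not part of the library source): run the embedding
# layer with the default BertConfig (bert-base-uncased sizes) on a toy batch.
config = BertConfig()
embeddings_layer = BertEmbeddings(config)
token_ids = torch.randint(0, config.vocab_size, (2, 16))   # [batch=2, seq_len=16]
print(embeddings_layer(input_ids=token_ids).shape)         # torch.Size([2, 16, 768])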
class BertSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x):
        # [batch, seq_len, all_head_size] -> [batch, num_heads, seq_len, head_size]
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # if cross-attention, the keys and values come from the encoder and the
        # attention mask must hide the encoder's padding tokens
        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:
            # reuse k, v, cross_attentions
            key_layer = past_key_value[0]
            value_layer = past_key_value[1]
            attention_mask = encoder_attention_mask
        elif is_cross_attention:
            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
            attention_mask = encoder_attention_mask
        elif past_key_value is not None:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))
            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        else:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        if self.is_decoder:
            past_key_value = (key_layer, value_layer)

        # raw attention scores: dot product between "query" and "key"
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            seq_length = hidden_states.size()[1]
            position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r
            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # attention_mask is additive: masked positions hold large negative values
            attention_scores = attention_scores + attention_mask

        # normalize the scores to probabilities, then apply dropout
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)

        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)
        # [batch, num_heads, seq_len, head_size] -> [batch, seq_len, all_head_size]
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs
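To check the module end to end, here is a small sketch of my own: with the default BertConfig (absolute position embeddings, not a decoder), feeding a random [batch, seq_len, hidden_size] tensor through BertSelfAttention returns a context tensor of the same shape in a 1-tuple:

config = BertConfig()                               # hidden_size=768, 12 heads
self_attention = BertSelfAttention(config)
hidden_states = torch.randn(2, 16, config.hidden_size)
(context_layer,) = self_attention(hidden_states)    # output_attentions=False -> 1-tuple
print(context_layer.shape)                          # torch.Size([2, 16, 768])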