Prerequisites: word2vec, RNN models, an understanding of how word vectors are built.
Focus: the Transformer architecture, how BERT is trained, and practical application.
The basic structure is still the Seq2Seq setup familiar from machine translation models.
Problems with traditional RNNs:
Each step needs the previous step's output, so computation cannot be parallelized.
Transformer:
Uses the self-attention mechanism to compute in parallel; the input and output sequence lengths are the same, and all output positions are computed simultaneously. It has largely replaced RNNs.
The idea is to fold each word's surrounding context into its word vector.
Two words x1 and x2:
Step 1: vector initialization; each word is turned into an encoding (e.g. a 4-dimensional vector, four features).
Step 2: the Q, K, and V matrices are obtained with three learned weight matrices.
Compute a score between the current word and every word in the sequence.
softmax then gives: how much each position influences the word currently being encoded.
Higher-dimensional vectors give larger dot products, but a larger value does not mean a larger influence, so the effect of dimensionality is divided out (the score is scaled by √dk).
Each word's query is scored against every key in the sequence, features are then re-weighted by those scores, and the result is the attention value; see the sketch below.
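The whole step can be condensed into a small sketch (a minimal numpy illustration, not BERT's actual code; the toy 4-dimensional vectors and weight shapes are made up for the example):
import numpy as np
def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Toy self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # step 2: build Q, K, V with three learned matrices
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # score of every word against every word, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: how much each position influences the current word
    return weights @ V                              # features re-weighted by the scores = attention output
X = np.random.randn(2, 4)                           # two toy words x1, x2 with four features each
Wq, Wk, Wv = (np.random.randn(4, 4) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (2, 4)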
Overall process:
Multi-head attention:
One set of Q/K/V extracts one feature representation for the current word; multiple sets extract multiple representations.
After self-attention, layer normalization is applied.
Decoder: the input and the output are both sequences.
Model: BERT_BASE_DIR
Data: glue_data
Task: MRPC, deciding whether two sentences describe the same thing.
Files:
BERT_BASE_DIR/uncased…/
bert_config.json: configuration parameters
ckpt: the pretrained model checkpoint saved by Google
vocab.txt: the vocabulary, i.e. all words in the corpus
run_classifier.py
Set up the run configuration.
Arguments:
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=…/GLUE/glue_data/MRPC \
--vocab_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=1 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=…/GLUE/output
task_name selects the task; do_train/do_eval toggle training and evaluation; on Windows, avoid absolute paths and avoid Chinese characters in paths.
run_classifier.py:
E.g. lines 177-192: reading the data is something you implement yourself.
842:train_examples = processor.get_train_examples(FLAGS.data_dir)
Reads the training data (jump to line 299).
num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
Line 844: there are 3668 train_examples; with a batch_size of 100 that is roughly 3700/100 ≈ 37 steps per epoch, and with 3 epochs about 111 steps in total.
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
Line 845: num_warmup_steps: keep the learning rate small at the very start of training and restore it after the warmup phase (with warmup_proportion set to 0.1, the learning rate ramps up over 111 * 0.1 ≈ 11 steps); a quick check of this arithmetic is sketched below.
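A quick sanity check of those two lines in plain Python (using the example numbers from above; in the real script they come from FLAGS):
num_examples = 3668          # len(train_examples) for MRPC
train_batch_size = 100       # the batch size used in the example above
num_train_epochs = 3.0
warmup_proportion = 0.1
num_train_steps = int(num_examples / train_batch_size * num_train_epochs)  # about 110 steps in total
num_warmup_steps = int(num_train_steps * warmup_proportion)                # about 11 warmup steps
print(num_train_steps, num_warmup_steps)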
Line 869: data loading:
file_based_convert_examples_to_features(train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
Inside file_based_convert_examples_to_features:
writer = tf.python_io.TFRecordWriter(output_file)
Line 483: the features are written out in TFRecord format.
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
Line 485: log progress every 10000 examples.
feature = convert_single_example(ex_index, example, label_list,max_seq_length, tokenizer)
Line 489: the core function; jump to line 377.
label_map = {}
for (i, label) in enumerate(label_list):  # build the label map
    label_map[label] = i
Line 389: build the label map; two classes, 0 and 1.
tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence
Line 393: tokenization; jump to tokenization.py, line 170:
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
    for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)
WordPiece tokenization.
Example:
<class 'list'>: ['am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.']
Chinese text is basically split into individual characters; the overall idea is always to split into finer-grained units. A minimal usage sketch of the tokenizer follows.
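(A sketch assuming the repo's tokenization.py is importable and that the vocab path below points at the downloaded model; the path is a placeholder.)
import tokenization  # from the BERT repo
tokenizer = tokenization.FullTokenizer(
    vocab_file="BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=True)
tokens = tokenizer.tokenize("Amrozi accused his brother of deliberately distorting his evidence.")
print(tokens)                                   # WordPiece pieces such as 'am', '##ro', '##zi', ...
print(tokenizer.convert_tokens_to_ids(tokens))  # the ids looked up in vocab.txt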
Tokenization done, we return to run_classifier; if a second sentence exists, it is tokenized as well.
Line 398 checks:
if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"   # reserve room for 3 special tokens
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # truncate if the pair is too long
else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0:(max_seq_length - 2)]
1. Sequences that are too long are truncated.
2. With tokens_b, room is reserved for three special tokens; without it, for two. The truncation helper is sketched below.
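Roughly (following _truncate_seq_pair in run_classifier.py), the helper keeps popping tokens from whichever sentence is currently longer until the pair fits:
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a token pair in place so that len(a) + len(b) <= max_length."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Always trim the longer sequence so both keep as much content as possible.
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()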
Encoding starts at line 408; the code's own comments explain:
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1   # indicates which sentence each token comes from
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
type_id=0 marks tokens from the first sentence, 1 marks tokens from the second.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
Line 426: build the encoding. The first token is always [CLS], with segment id 0; then every WordPiece token from sentence A is appended, each with segment id 0; after all of A's tokens, the [SEP] separator is appended, again with segment id 0.
if tokens_b:
    for token in tokens_b:
        tokens.append(token)
        segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)
Line 436: append tokens_b (if present), with segment id 1; tokens are located by their index in vocab.txt.
Line 443: convert tokens to IDs (via vocab.txt):
<class ‘list’>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]
max_seq_length=128; anything shorter is padded with zeros.
while len(input_ids) < max_seq_length:  # padding length depends on the configured max length
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)
Line 450: self-attention needs an extra mask to tell real tokens from padding: real tokens get input_mask=1 and take part in the attention computation; padded positions get input_mask=0 and are ignored.
input_ids: <class 'list'>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_masks: <class 'list'>: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Line 459: log the resulting features.
Line 470: InputFeatures -> line 161, a plain container class that just assigns its constructor arguments to itself.
Line 485: loop over every example.
Line 496 onward: process each example.
Lines 496-502: convert the fields into TFRecord features.
Lines 504-505:
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
Serialize one tf.train.Example and write it into the writer; the surrounding helper is sketched below.
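The conversion in lines 496-505 boils down to wrapping each integer list in an Int64List. A sketch following the repo's create_int_feature helper (TF 1.x API; feature and writer come from the enclosing loop):
import collections
import tensorflow as tf
def create_int_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())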
Embedding layer:
Line 574: the BERT model is created:
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    """Creates a classification model."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,         # (8, 128)
        input_mask=input_mask,       # (8, 128)
        token_type_ids=segment_ids,  # (8, 128)
        use_one_hot_embeddings=use_one_hot_embeddings)
config: the configuration file
is_training: whether we are training
input_ids: batch_size is 8, 128 is the length of each sentence
input_mask: 0 or 1, marking padded vs. real content
segment_ids: which sentence each token belongs to
modeling.py:
Line 165:
if input_mask is None:  # if no mask is given, it naturally becomes all ones
    input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
    token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
mask: if no mask is provided, an all-ones mask is created automatically (not great for self-attention when there is padding).
token_type_ids: if the number of sentences is not given, a single sentence is assumed and everything is set to 0.
The embedding layer is built first, starting at line 171.
All 128 positions are turned into vectors; the three encodings (word, segment, position) must share the same dimensionality so they can be added.
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
Line 171: the embedding lookup. input_ids is 8x128; vocab_size is roughly 30k (from the pretrained model); embedding_size is the mapped dimension (768 in the official model); initializer_range is the initialization range (0.02); use_one_hot_embeddings defaults to False. When using a pretrained model, do not change these parameters.
Additional encoding features:
Input has two dimensions, (batch_size x max_length) = 8 x 128.
Output: batch_size x max_length x 768.
modeling.py
Lines 171-180: word embedding.
Starting at line 409:
if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
embedding_table = tf.get_variable(  # word embedding table, (30522, 768)
    name=word_embedding_name,
    shape=[vocab_size, embedding_size],
    initializer=create_initializer(initializer_range))
flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
else:
    output = tf.gather(embedding_table, flat_input_ids)  # CPU/GPU path: (1024, 768), the lookups for one whole batch
First an extra dimension is added to the input: 8 x 128 x 1.
Flatten: flat_input_ids reshapes input_ids (8 x 128 x 1) into 1024 ids.
output is 1024 x 768.
input_shape = get_shape_list(input_ids)
output = tf.reshape(output,input_shape[0:-1] + [input_shape[-1] * embedding_size]) #(8, 128, 768)
return (output, embedding_table)
Line 421: output has three dimensions, 8 x 128 x 768 = batch_size x tokens per sentence x vector per token.
Words have become vectors; the lookup itself is sketched below.
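The tf.gather lookup is just row selection from the embedding table; a tiny numpy equivalent (toy sizes instead of the real 30522 x 768 table):
import numpy as np
vocab_size, embedding_size = 10, 4        # toy sizes
embedding_table = np.random.randn(vocab_size, embedding_size)
flat_input_ids = np.array([1, 5, 5, 0])   # flattened token ids (in BERT there are 1024 of them)
output = embedding_table[flat_input_ids]  # same effect as tf.gather(embedding_table, flat_input_ids)
print(output.shape)                       # (4, 4); in BERT (1024, 768), later reshaped to (8, 128, 768)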
Lines 184-194: the position (and token-type) embeddings are added.
This only folds extra information into the embeddings; the shape does not change.
Jump to line 472:
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(#(2, 768)
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
flat_token_type_ids = tf.reshape(token_type_ids, [-1])#(1024)
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width]) #8, 128, 768
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
Because at most two sentences are assumed, the token-type table is (2, 768): the first dimension has only the two values 0 and 1.
For each token we decide whether its type is 0 or 1.
The one-hot here is purely for speed (the vocabulary is tiny).
The multiplication token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
is (1024 x 2) x (2 x 768): 1024 tokens, each with 2 possible types, and the table holds a 768-dimensional vector per type, so the result is again 1024 x 768.
It is then reshaped to 8 x 128 x 768.
full_position_embeddings: 512 x 768.
Lines 505-507:
position_embeddings = tf.slice(full_position_embeddings, [0, 0], [seq_length, -1])  # the position table is generously sized; for speed, only the useful part is taken: (128, 768)
num_dims = len(output.shape.as_list())
A slice is taken from the 512 x 768 table; the returned position_embeddings only covers the 128 x 768 part actually used.
Lines 512-518:
position_broadcast_shape = []
for _ in range(num_dims - 2):
    position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])  # [1, 128, 768]: the position encoding does not depend on the input data, but the original embedding has batch_size as its first dimension, so a leading 1 is added for the computation
position_embeddings = tf.reshape(position_embeddings,
                                 position_broadcast_shape)
output += position_embeddings
The position embeddings are now 128 x 768, and the same values are added to every example in the batch. They do not depend on which words are passed in; an extra dimension gives [1, 128, 768] so they broadcast over the batch.
Two extra encodings overall: 1. token type (2 possibilities), 2. position (128 positions).
output = layer_norm_and_dropout(output, dropout_prob)
return output
Layer normalization and dropout are applied; output is the sum of the three embeddings (word + token type + position).
Mask mechanism:
modeling.py, line 200:
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)
For each token, this determines which tokens it should attend to (attend to positions marked 1, ignore the 0s).
The input is 8 x 128, the output is 8 x 128 x 128; the final 128 says, for each word, which words it is allowed to see. A small sketch of this expansion follows.
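A small numpy sketch of that expansion (the real create_attention_mask_from_input_mask does the same broadcast in TF):
import numpy as np
batch_size, seq_length = 2, 4                 # toy sizes instead of 8 x 128
input_mask = np.array([[1, 1, 1, 0],
                       [1, 1, 0, 0]])         # 1 = real token, 0 = padding
# [batch, seq, 1] broadcast against [batch, 1, seq] -> [batch, seq, seq]
attention_mask = np.ones((batch_size, seq_length, 1)) * input_mask[:, None, :]
print(attention_mask.shape)                   # (2, 4, 4): for every word, which words it may attend to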
Transformer
modeling.py, line 205:
self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,              # number of Transformer layers
    num_attention_heads=config.num_attention_heads,          # number of attention heads
    intermediate_size=config.intermediate_size,              # size of the feed-forward layer
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)                                # whether to return every layer's output
input_tensor: the embedding output from the previous step.
attention_mask: 0 or 1 per position, marking whether that token should be used.
Fine-tuning continues from the pretrained weights, so most of these parameters must not be changed.
Line 802:
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
hidden_size=768
num_attention_heads=12
768 / 12 = 64 features per head; the per-head vectors are concatenated back together afterwards. If the division is not exact, later computations get messy; a small shape sketch follows.
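The split into heads is just a reshape plus a transpose; a minimal shape check of what transpose_for_scores does (numpy, with zeros as stand-in data):
import numpy as np
batch_size, seq_length = 8, 128
num_heads, head_size = 12, 64                                     # 12 * 64 = 768 = hidden_size
x = np.zeros((batch_size * seq_length, num_heads * head_size))    # [1024, 768]
x = x.reshape(batch_size, seq_length, num_heads, head_size)       # [8, 128, 12, 64]
x = x.transpose(0, 2, 1, 3)                                       # [8, 12, 128, 64]: one 64-dim slice per head
print(x.shape)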
Line 807:
attention_head_size = int(hidden_size / num_attention_heads)  # 768 output features in total, split evenly across the heads
input_shape = get_shape_list(input_tensor, expected_rank=3)   # [8, 128, 768]
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]
Line 815:
if input_width != hidden_size:  # note the residual connection: the two must have the same dimension to be added
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))
This is addition, not concatenation.
Residual connection: a 768-dimensional input must produce a 768-dimensional output so the two can be added, hence this check.
reshape: 8 x 128 is flattened to 1024 (probably for speed).
Line 819:
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
prev_output = reshape_to_matrix(input_tensor)  # the reshape is probably for speed
The input becomes 1024 x 768.
Line 825:
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
        layer_input = prev_output
Each layer's output becomes the next layer's input; layer_input and prev_output are both 1024 x 768.
Attention starts at line 830:
with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
attention_head = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
from_tensor and to_tensor are both layer_input: attention over itself, i.e. self-attention.
attention_mask: the 1/0 mask from before.
A 2D tensor is returned; both sequence lengths are 128.
Line 558: the attention_layer function:
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `
Line 637:
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])#[1024, 768]
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])#[1024, 768]
# Scalar dimensions referenced here:
# B = batch size (number of sequences) 8
# F = `from_tensor` sequence length 128
# T = `to_tensor` sequence length 128
# N = `num_attention_heads` 12
# H = `size_per_head` 64
Building the Q, K, V matrices:
Query: line 666:
# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))
The Query matrix is built from from_tensor; num_attention_heads=12 heads, size_per_head=64, so query_layer is 1024 x 768 (B*F, N*H).
8 examples with 128 words each gives 1024 words; every word computes dot products with the others, and for each word 12 heads of 64 features are produced, 12 x 64 = 768.
# `key_layer` = [B*T, N*H]
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act,
name="key",
kernel_initializer=create_initializer(initializer_range))
The Key matrix naturally has the same dimensions as the Query matrix; apart from being built from to_tensor, its parameters are identical.
# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act,
name="value",
kernel_initializer=create_initializer(initializer_range))
V is what actually supplies the features (see the earlier Q/K/V picture), so V has the same dimensions as K.
Dot-product computation:
# `query_layer` = [B, N, F, H]   # transposed so the dot products are cheap to compute
query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads, from_seq_length, size_per_head)
# `key_layer` = [B, N, T, H]     # transposed so the dot products are cheap to compute
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, to_seq_length, size_per_head)
# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)  # result is (8, 12, 128, 128)
attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head)))  # remove the effect of dimensionality on the scores
√dk = √64 = 8; dividing by it removes the effect of the dimensionality.
if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])
    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0  # 0 where mask is 1, a very large negative number where mask is 0
    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder  # adding this to the raw scores leaves mask=1 positions unchanged and pushes mask=0 positions to a huge negative value
Positions we want to attend to have mask=1.
When mask=1, adder = (1 - 1) * -10000 = 0.
When mask=0, adder = (1 - 0) * -10000 = -10000.
In the softmax, a score of -10000 gets essentially zero probability, so no attention weight lands on that position; a quick numeric check follows.
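A small numpy check of why adding -10000 effectively removes a position:
import numpy as np
scores = np.array([2.0, 1.0, 3.0 - 10000.0])  # the last position is masked (adder = -10000)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs)                                  # roughly [0.73, 0.27, 0.0]: the masked position gets no weight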
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)  # softmax over the scores; the huge negative entries come out as ~0, i.e. those positions are ignored
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
Matrix computation:
# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
    value_layer,
    [batch_size, to_seq_length, num_attention_heads, size_per_head])  # (8, 128, 12, 64)
# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])  # (8, 12, 128, 64)
# `context_layer` = [B, N, F, H]
context_layer = tf.matmul(attention_probs, value_layer)  # the final attended features, (8, 12, 128, 64)
# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])  # back to [8, 128, 12, 64]
Line 857: residual connection:
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"): #1024, 768 残差连接
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
attention_output = layer_norm(attention_output + layer_input)
After the fully connected layers, a check at line 884:
if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
        final_output = reshape_from_matrix(layer_output, input_shape)
        final_outputs.append(final_output)
    return final_outputs
else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
Return either every layer's output or only the final layer's.
Creating the model
run_classifier.py, line 577:
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,#(8,128)
input_mask=input_mask,#(8,128)
token_type_ids=segment_ids,#(8,128)
use_one_hot_embeddings=use_one_hot_embeddings)
Line 590 defines the output:
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value  # 768
output_weights = tf.get_variable(  # the fully connected layer attached on top
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(  # bias parameters, fine-tuned for the two classes 0 and 1
    "output_bias", [num_labels], initializer=tf.zeros_initializer())
get_pooled_output: the first position is [CLS], which summarizes the whole sequence.
hidden_size: 768
output_weights: (2, 768)
num_labels=2: binary classification
modeling.py line 205, the final result:
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,              # number of Transformer layers
    num_attention_heads=config.num_attention_heads,
    intermediate_size=config.intermediate_size,              # size of the feed-forward layer
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)                                # whether to return every layer's output
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
first_token_tensor: the first token's vector, i.e. [CLS].
BertModel gives you the encoder; attach whatever fully connected head the downstream result requires.
run_classifier.py, line 601:
with tf.variable_scope("loss"):
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, logits, probabilities)
logits = output_layer * weights + bias, followed by softmax and a cross-entropy loss; a small numeric sketch follows.
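The shapes and the loss at this point can be checked with a small numpy sketch (toy sizes; in the real run output_layer is (8, 768) and output_weights is (2, 768)):
import numpy as np
output_layer = np.random.randn(8, 4)      # toy (batch, hidden)
output_weights = np.random.randn(2, 4)    # (num_labels, hidden)
output_bias = np.zeros(2)
labels = np.array([0, 1, 0, 1, 1, 0, 0, 1])
logits = output_layer @ output_weights.T + output_bias                     # (8, 2)
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log_softmax
one_hot_labels = np.eye(2)[labels]
per_example_loss = -(one_hot_labels * log_probs).sum(axis=-1)              # cross-entropy per example
loss = per_example_loss.mean()
print(logits.shape, loss)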
To adapt to your own data, you basically only need to change the data reading and preprocessing, around line 177 of run_classifier.py:
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
Reading your own dataset: run_classifier.py, around line 199.
To avoid touching the rest of the source, text_b is simply not used (passed as None).
InputExample is a plain container that does nothing but carry each example into the examples list.
class MyDataProcessor(DataProcessor):  # our own processor, inheriting from DataProcessor
    """Data converter for our own sequence classification data set."""
    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\train_sentiment.txt')
        f = open(file_path, 'r', encoding='utf-8')
        train_data = []
        index = 0
        for line in f.readlines():
            guid = "train-%d" % (index)  # assign an id
            line = line.replace('\n', '').split('\t')  # strip the newline and split on tab
            text_a = tokenization.convert_to_unicode(str(line[1]))
            label = str(line[2])
            train_data.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            index += 1
        return train_data
    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test_sentiment.txt')
        f = open(file_path, 'r', encoding='utf-8')
        dev_data = []
        index = 0
        for line in f.readlines():
            guid = "dev-%d" % (index)  # assign an id
            line = line.replace('\n', '').split('\t')  # strip the newline and split on tab
            text_a = tokenization.convert_to_unicode(str(line[1]))
            label = str(line[2])
            dev_data.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
            index += 1
        return dev_data
    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for prediction."""
        file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test.csv')
        # Assumes pandas is imported as pd and the csv holds the text in the
        # first column and the label in the second; a plain open() has no .values.
        test_df = pd.read_csv(file_path, encoding='utf-8')
        test_data = []
        for index, test in enumerate(test_df.values):
            guid = "test-%d" % (index)
            text_a = tokenization.convert_to_unicode(str(test[0]))
            label = str(test[1])
            test_data.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return test_data
    def get_labels(self):
        """Gets the list of labels for this data set."""
        return ['0', '1', '2']
Register the preprocessor you wrote as a task:
In the processors dict, add the entry "mydata": MyDataProcessor, as shown below.
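Concretely, the processors dict in main() of run_classifier.py ends up looking roughly like this after the new entry is added (the existing keys are from the repo; "mydata" is the one we add):
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "mydata": MyDataProcessor,  # the custom processor defined above
}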
Run arguments:
--data_dir=data \
--task_name=mydata \
--vocab_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json \
--output_dir=…/mydata_model \
--do_train=true \
--do_eval=true \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=70 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=3.0