pre-train涉及三个模块
- tokenization.py
- create_pretraining_data.py
- run_pretraining.py
其中tokenization是对原始句子内容的解析,分为BasicTokenizer和WordpieceTokenizer和Fulltokenizer三种,其中FullTokenizer是前两种方法的结合,不只在预训练中,在fine-tune和推断过程同样要用到它;create_pretraining_data顾名思义就是将原始语料转换成适合模型预训练的输入数据;run_pretraining就是预训练的执行代码。
一、tokenization.py
Tokenization模块中,除了BasicTokenizer和WordpieceTokenizer和Fulltokenizer三种主要的类外,还包括下面这些函数
- convert_to_unicode:将文本转换为Unicode(如果尚未转换),假定为utf-8输入。
- load_vocab:将单词文本转化为字典。
- convert_by_vocab:将通过load_vocab函数得到的vocab转化为[Token|ids]的序列。
- whitespace_tokenize:对一段文本进行基本的空白和拆分处理。
- _is_whitespace:检查字符是否是空白字符
- _is_control:检查字符是否是控制字符
- _is_punctuation:检查字符是否是标点符号
- _clean_text:移除文本中无效字符和空白字符
下面让我们看看Tokenization的三个主要方法:
1、BasicTokenizer
BasicTokenizer的主要做的就是进行unicode转换、标点符号分割、小写转换、中文字符分割、去除重音,无效字符等操作,返回的是关于词的数组(中文是字的数组)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
self.do_lower_case = do_lower_case #是否进行小写转化
#tokenize是主要的方法函数
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text) #编码转化(英文文本通常不需要处理)
text = self._clean_text(text) #清洗无效字符和空白字符
text = self._tokenize_chinese_chars(text) #对中文字符进行处理,分割
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token) #去除重音
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""去除重音"""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
#通过标点符号将文本分开
#如:“I love you, china”分割成['I love you',',','china']
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or
(cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
2、WordpieceTokenizer
将一段文本分成单词,利用所给的词汇表,使用贪婪的最长匹配优先算法执行Tokenization.
For example:
input = “unaffable”
output = [“un”, “##aff”, “##able”]
这么做的目的是防止因为词的过于生僻没有被收录进词典最后只能以[UNK]代替的局面,因为英语当中这样的合成词非常多,词典不可能全部收录。
输入的文本:单个标记或由空格分隔的标记,且应该通过BasicTkenizer调用过。
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
3、FullTokenizer
FullTokenizer的作用就很显而易见了,对一个文本段进行以上两种解析,最后返回词(字)的数组,同时还提供token到id的索引以及id到token的索引。这里的token可以理解为文本段处理过后的最小单元。
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
二、create_pretraining_data.py
1、配置
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string("input_file", None,
"Input raw text file (or comma-separated list of files).")
flags.DEFINE_string(
"output_file", None,
"Output TF example file (or comma-separated list of files).")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_whole_word_mask", False,
"Whether to use whole word masking rather than per-WordPiece masking.")
flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")
flags.DEFINE_integer("max_predictions_per_seq", 20,
"Maximum number of masked LM predictions per sequence.")
flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
flags.DEFINE_integer(
"dupe_factor", 10,
"Number of times to duplicate the input data (with different masks).")
flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")
flags.DEFINE_float(
"short_seq_prob", 0.1,
"Probability of creating sequences which are shorter than the "
"maximum length.")
参数:
- input_file:指定输入文档路径
- output_file:指定输出路径
- voca_file:指定词典路径(谷歌在预训练模型中提供)
- do_lower_case:为True则忽略大小写
- do_whole_word_mask:为False则是使用遮蔽整个单词,而不是wordpiece,wordpiece就是我们之前利用WordpieceTokenizer得到的词。
- max_seq_length:每一条训练数据(两句话)相加后的最大长度限制
- max_predictions_per_seq:每条训练数据遮蔽的最大数量
- random_seed:随机种子
- dupe_factor:对文档多次重复随机产生训练集,随机的次数
- maxked_lm_prob:一条训练数据产生mask的概率,即每条训练数据随机产生
- short_seq_prob:为了缩小预训练和微调过程的差距,以此概率产生小于max_seq_length的训练数据
2、main入口
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
input_files = []
for input_pattern in FLAGS.input_file.split(","):
input_files.extend(tf.gfile.Glob(input_pattern))
tf.logging.info("*** Reading from input files ***")
for input_file in input_files:
tf.logging.info(" %s", input_file)
rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
rng)
output_files = FLAGS.output_file.split(",")
tf.logging.info("*** Writing to output files ***")
for output_file in output_files:
tf.logging.info(" %s", output_file)
从入口看步骤很简单:1:构造Tokennizer;2:构造instances;3:保存instance
构造Tokenizer在上一节说过了,下面我们看看如何构造instances;
3、构造instances
这一步是从原始文本中读取数据;输入的文本可以是一个文件也可以是用逗号分隔的若干文件;文件里用换行表示句子的边界,即一句一行,同理段落之间用空一行来段落的边界,一个段落表示成一个document;每个段落具体的构造方法在create_instances_from_document函数里面。
def create_training_instances(input_files, tokenizer, max_seq_length,
dupe_factor, short_seq_prob, masked_lm_prob,
max_predictions_per_seq, rng):
"""Create `TrainingInstance`s from raw text."""
all_documents = [[]]
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
# entire paragraphs or arbitrary spans of text. (Because we use the
# sentence boundaries for the "next sentence prediction" task).
# (2) Blank lines between documents. Document boundaries are needed so
# that the "next sentence prediction" task doesn't span between documents.
for input_file in input_files:
with tf.gfile.GFile(input_file, "r") as reader:
while True:
line = tokenization.convert_to_unicode(reader.readline())
if not line:
break
line = line.strip()
# Empty lines are used as document delimiters
#空行作为段落定界符,遇到空行就在all_documents里面增加一个空列表;
if not line:
all_documents.append([])
tokens = tokenizer.tokenize(line)
if tokens:
all_documents[-1].append(tokens)
# Remove empty documents
all_documents = [x for x in all_documents if x]
rng.shuffle(all_documents)
vocab_words = list(tokenizer.vocab.keys())
instances = []
for _ in range(dupe_factor):
for document_index in range(len(all_documents)):
instances.extend(
create_instances_from_document(#************核心函数*************
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
rng.shuffle(instances)
return instances
下面是函数create_instances_from_document是整个模块的核心,为单个document(段落)的文本构造训练实例,转换成我们模型所需要的输入形式。
我们的目标就是在我们之前划分出的每个段落中构造训练instances
我们拿英文举例:原始段落中的语句:[the man went to the store, he bought a gallon milk,…]
构造成:
Input = [CLS] the man went to [mask] store **[SEP]**he bought a gallon milk [SEP],
label = IsNext
Input = [CLS] the man went to [mask] store [SEP] he like you**[SEP]**
label = NotNext
其中,由于我们有达到指定的序列长度,构造实例不一定是两句话,有可能是连续的几句话,我们的目的就是为了后面两个训练,1)Masked LM 让模型预测我们遮蔽的单词;2)Next Sentence Prediction 让模型预测训练实例是否是两个连续一段话;服务。前者就是一个多分类,后者是一个二分类。
其中还有一些细节,比如将构造的一对序列截断为我们指定的最大实例;如何随机构造一个序列的假的下一句话;如何将序列中的单词替换成**[mask]**等,具体看源码吧
def create_instances_from_document(
all_documents, document_index, max_seq_length, short_seq_prob,
masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
"""Creates `TrainingInstance`s for a single document."""
document = all_documents[document_index]
# Account for [CLS], [SEP], [SEP]
max_num_tokens = max_seq_length - 3
target_seq_length = max_num_tokens
if rng.random() < short_seq_prob:
#产生的随机数小于short_seq_prob,则目标序列的长度为(2,max_num_token)之间一个随机数
target_seq_length = rng.randint(2, max_num_tokens)
instances = []
current_chunk = []
current_length = 0
i = 0
while i < len(document):
segment = document[i]
current_chunk.append(segment)
current_length += len(segment)
if i == len(document) - 1 or current_length >= target_seq_length:
if current_chunk:
# `a_end` is how many segments from `current_chunk` go into the `A`
# (first) sentence.
a_end = 1
if len(current_chunk) >= 2:
a_end = rng.randint(1, len(current_chunk) - 1)
tokens_a = []
for j in range(a_end):
tokens_a.extend(current_chunk[j])
tokens_b = []
# Random next
is_random_next = False
if len(current_chunk) == 1 or rng.random() < 0.5:
is_random_next = True
target_b_length = target_seq_length - len(tokens_a)
# This should rarely go for more than one iteration for large
# corpora. However, just to be careful, we try to make sure that
# the random document is not the same as the document
# we're processing.
for _ in range(10):
random_document_index = rng.randint(0, len(all_documents) - 1)
if random_document_index != document_index:
break
random_document = all_documents[random_document_index]
random_start = rng.randint(0, len(random_document) - 1)
for j in range(random_start, len(random_document)):
tokens_b.extend(random_document[j])
if len(tokens_b) >= target_b_length:
break
# We didn't actually use these segments so we "put them back" so
# they don't go to waste.
num_unused_segments = len(current_chunk) - a_end
i -= num_unused_segments
# Actual next
else:
is_random_next = False
for j in range(a_end, len(current_chunk)):
tokens_b.extend(current_chunk[j])
truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
assert len(tokens_a) >= 1
assert len(tokens_b) >= 1
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
(tokens, masked_lm_positions,
masked_lm_labels) = create_masked_lm_predictions(
tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance(
tokens=tokens,
segment_ids=segment_ids,
is_random_next=is_random_next,
masked_lm_positions=masked_lm_positions,
masked_lm_labels=masked_lm_labels)
instances.append(instance)
current_chunk = []
current_length = 0
i += 1
return instances
这个函数是为Masked LM提供训练的实例,
这部分对token进行随机mask。这部分是BERT的创新点之二,随机遮蔽。为了防止双向模型在多层之后“看到自己”。这里对一部分词进行随机遮蔽,并在预训练中进行预测。遮蔽方案:
1.以80%的概率直接变成[MASK]
2.以10%的概率保留原词
3.以10%的概率在词典中随机找一个词替代
返回值:经过随机遮蔽后的(词,遮蔽位置,遮蔽前原词)
def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""
cand_indexes = []
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
rng.shuffle(cand_indexes)
output_tokens = list(tokens)
num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))
masked_lms = []
covered_indexes = set()
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
output_tokens[index] = masked_token
masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)
masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)
return (output_tokens, masked_lm_positions, masked_lm_labels)
最终的一个instance包含一个tokens,实际就是输入的词序列;该序列表现形式为:
[CLS] A [SEP] B [SEP]
A=[token_0, token_1, …,token_i]
B=[token_i+1, token_i+2, …,token_n-1]
其中:
2<= n < max_seq_length - 3 (in short_seq_prob)
n=max_seq_length - 3 (in 1-short_seq_prob)
Token最后的表现形式如下图所示:
segment_ids 指的形式为[0,0,0…1,1,111] 0的个数为i+1个,1的个数为max_seq_length - (i+1)
对应到模型输入就是token_type
is_random_next:其实就是上图的Label,0.5的概率为True(和当只有一个segment的时候),如果为True则B和A不属于同一document。剩下的情况为False,则B为A同一document的后续句子。
masked_lm_positions:序列里被[MASK]的位置;
masked_lm_labels:序列里被[MASK]的token
4、保存instance
def write_instance_to_example_files(instances, tokenizer, max_seq_length,
max_predictions_per_seq, output_files):
"""Create TF example files from `TrainingInstance`s."""
writers = []
for output_file in output_files:
writers.append(tf.python_io.TFRecordWriter(output_file))
writer_index = 0
total_written = 0
for (inst_index, instance) in enumerate(instances):
input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
input_mask = [1] * len(input_ids)
segment_ids = list(instance.segment_ids)
assert len(input_ids) <= max_seq_length
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
masked_lm_positions = list(instance.masked_lm_positions)
masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
masked_lm_weights = [1.0] * len(masked_lm_ids)
while len(masked_lm_positions) < max_predictions_per_seq:
masked_lm_positions.append(0)
masked_lm_ids.append(0)
masked_lm_weights.append(0.0)
next_sentence_label = 1 if instance.is_random_next else 0
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(input_ids)
features["input_mask"] = create_int_feature(input_mask)
features["segment_ids"] = create_int_feature(segment_ids)
features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
features["next_sentence_labels"] = create_int_feature([next_sentence_label])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writers[writer_index].write(tf_example.SerializeToString())
writer_index = (writer_index + 1) % len(writers)
total_written += 1
if inst_index < 20:
tf.logging.info("*** Example ***")
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in instance.tokens]))
for feature_name in features.keys():
feature = features[feature_name]
values = []
if feature.int64_list.value:
values = feature.int64_list.value
elif feature.float_list.value:
values = feature.float_list.value
tf.logging.info(
"%s: %s" % (feature_name, " ".join([str(x) for x in values])))
for writer in writers:
writer.close()
tf.logging.info("Wrote %d total instances", total_written)
保存实例里面有两点需要注意以外,其他没什么特别的东西
1、之前不是有short_seq_prob的概率导致样本的长度小于max_predictions_per_seq吗,这里把这些样本补齐,padding为0,同样的还有input_mask和segment_ids;
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
2、把instance的is_random_next转化成变量next_sentence_label保存。
三、run_pretraining.py
预训练的执行模块,里面大部分是tensorflow训练的常规代码,
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
masked_lm_positions = features["masked_lm_positions"]
masked_lm_ids = features["masked_lm_ids"]
masked_lm_weights = features["masked_lm_weights"]
next_sentence_labels = features["next_sentence_labels"]
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
其中input_ids、input_mask 、segment_ids 作为X,剩下的masked_lm_positions、masked_lm_ids 、masked_lm_weights 、next_sentence_labels 共同作为Y
(masked_lm_loss,
masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
bert_config, model.get_sequence_output(), model.get_embedding_table(),
masked_lm_positions, masked_lm_ids, masked_lm_weights)
(next_sentence_loss, next_sentence_example_loss,
next_sentence_log_probs) = get_next_sentence_output(
bert_config, model.get_pooled_output(), next_sentence_labels)
total_loss = masked_lm_loss + next_sentence_loss
可以看到loss 分别由masked_lm_loss和next_sentence_loss组成,masked_lm_loss针对的是语言模型对MASK起来的标签的预测,即上下文语境预测当前词;而next_sentence_loss是对于句子关系的预测。前者在迁移学习中可以用于标注类任务(分词、NER等),后者可以用于句子关系任务(QA、自然语言推理等)。
需要多说一句的是,masked_lm_loss,用到了模型的sequence_output和embedding_table,这是因为对多个MASK的标签进行预测是一个标注问题,所以需要获取最后一层的整个sequence,而embedding_table用来反embedding,这样就映射到token的学习了。而next_sentence_loss用到的是pooled_output,对应的是第一个token [CLS],它一般用于分类任务的学习。
总结
本文介绍了以下几个内容:
1、tokenization模块:我把它叫做对原始文本段的解析,只有解析过后才能标准化输入;
2、create_pretraining_data模块:对原始数据进行转换,原始数据本是无标签的数据,通过句子的拼接可以产生句子关系的标签,通过MASK可以产生标注的标签,其本质是语言模型的应用;
3、run_pretraining模块:在执行预训练的时候针对以上两种标签分别利用bert模型的不同输出部件,计算loss,然后进行梯度下降优化。
Reference
1.https://github.com/google-research/bert
2.BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
3.西溪雷神https://www.jianshu.com/p/22e462f01d8c