Pretraining Process: Training Data Analysis (1)

Source code link
Link to the original author's pretraining code
The model architecture was analyzed in an earlier post, so this one starts from the training process.

training_args = TrainingArguments(
    output_dir='record',
    num_train_epochs=num_train_epochs,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    save_steps=save_steps,
    logging_steps=500,
    save_total_limit=5,
    prediction_loss_only=True,
    seed=seed
)

This yields the following training_args:

training_args = 
output_dir:record
overwrite_output_dir:False
do_train:False
do_eval:False
do_predict:False
evaluation_strategy:IntervalStrategy.NO
prediction_loss_only:True
per_device_train_batch_size:32
per_device_eval_batch_size:8
per_gpu_train_batch_size:None
per_gpu_eval_batch_size:None
gradient_accumulation_steps:1

Next, the Trainer is defined:

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

Then trainer.train() is called to start the training process.
The first thing exercised here is the mask_tokens logic in data_collator.py.
Let's start by analyzing the DataCollatorForLanguageModeling class:

@dataclass
class DataCollatorForLanguageModeling:
    tokenizer: PreTrainedTokenizerBase
    mlm: bool = True
    mlm_probability: float = 0.15
    pad_to_multiple_of: Optional[int] = None

    def __post_init__(self):
        if self.mlm and self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. "
                "You should pass `mlm=False` to train on causal language modeling instead."
            )

These are the parameters set during initialization; since initialization was covered earlier, we go straight to DataCollatorForLanguageModeling's __call__ function:

    def __call__(
        self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        if isinstance(examples[0], (dict, BatchEncoding)):
            batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
        else:
            batch = {"input_ids": _collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)}
        # If special token mask has been preprocessed, pop it from the dict.
        special_tokens_mask = batch.pop("special_tokens_mask", None)
        #special_tokens_mask = None
        if self.mlm:
            #special_tokens_mask = None
            batch["input_ids"], batch["labels"] = self.mask_tokens(
                batch["input_ids"], special_tokens_mask=special_tokens_mask
            )
        else:
            labels = batch["input_ids"].clone()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            batch["labels"] = labels
        return batch
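
Before looking at the actual data, it helps to see how this __call__ ever gets invoked: Trainer wraps train_dataset in a PyTorch DataLoader and passes the collator as collate_fn, so every batch drawn during trainer.train() goes through it. A minimal sketch (my own illustration with a toy dataset, assuming the data_collator built earlier in the script is in scope):

import torch
from torch.utils.data import DataLoader

# Two toy pre-tokenized examples, shaped like LineByLineTextDataset items.
toy_dataset = [
    {"input_ids": torch.tensor([101, 169, 107, 102])},
    {"input_ids": torch.tensor([101, 169, 100, 100, 102])},
]

# Roughly what Trainer does internally:
loader = DataLoader(toy_dataset, batch_size=2, collate_fn=data_collator)
for batch in loader:
    # each iteration runs DataCollatorForLanguageModeling.__call__:
    # pad to the batch maximum, then apply MLM masking
    print(batch["input_ids"].shape, batch["labels"].shape)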

In the actual run, the examples passed into __call__ look like this:

[{'input_ids': tensor([  101,   169,   107, 10539,   142,  8231,   107,   131,   107,   146,
                       8189,  8168,  9402,  8156,  8177,  8660,  8154,  8408,  8921,  8148,
                       ............
                        100,   100,   107,   171,   117,   169,   107, 10539,   107,   102])},
                       ............
{'input_ids': tensor([  101,   169,   107, 10539,   142,  8231,   107,   131,   107,  8424,
                       8161, 12540, 12675,  8154, 10696,  9647,  8168,  8849,  8139,  9355,
                       ............
                        100,  8123,   118,  8143,   100,   100,   100,   100,   100,   102])},
                       ............]

Next, the pad function is called:

batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)

Notice that the data above contains a lot of 100s, which suggests something happened back when the data was tokenized.
Looking back at the tokenization step inside LineByLineTextDataset:

batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
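
For reference, the core of LineByLineTextDataset (paraphrased and simplified here, so treat it as a sketch rather than the exact transformers source) reads the file line by line, tokenizes every line with the call above, and stores one {'input_ids': tensor} dict per line, which is exactly the shape of the examples printed earlier:

import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

class LineByLineTextDatasetSketch(Dataset):
    # Simplified paraphrase of transformers' LineByLineTextDataset.
    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if line.strip()]
        batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
        self.examples = [{"input_ids": torch.tensor(ids, dtype=torch.long)} for ids in batch_encoding["input_ids"]]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]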

In the corresponding initialization the annotation is

tokenizer:PreTrainedTokenizer

so self.tokenizer.pad has to be looked up in PreTrainedTokenizer's pad function.
Stepping into PreTrainedTokenizer, this also means the tokenization itself is performed by calling the PreTrainedTokenizer(...) object directly.

The actual tokenizer is defined in main():

tokenizer = BertTokenizer.from_pretrained(vocab_file)

Printing it out shows that the object is still a PreTrainedTokenizer:

tokenizer = 
PreTrainedTokenizer(name_or_path=
'/home/xiaoguzai/数据/nezha-chinese-base/vocab.txt', vocab_size=21128, 
model_max_len=1000000000000000019884624838656, is_fast=False, 
padding_side='right', special_tokens={'unk_token': '[UNK]', 
'sep_token': '[SEP]', 'pad_token': '[PAD]', 
'cls_token': '[CLS]', 'mask_token': '[MASK]'})

So what gets invoked here are the corresponding functions behind DataCollatorForWholeWordMask.
The sequence of function calls observed is as follows:

PreTrainedTokenizer tokenize
PreTrainedTokenizer split_on_tokens
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer convert_tokens_to_ids
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
......

The next pass then enters the same sequence of calls again:

PreTrainedTokenizer tokenize
PreTrainedTokenizer split_on_tokens
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer split_on_token
PreTrainedTokenizer convert_tokens_to_ids
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
PreTrainedTokenizer _convert_token_to_id_with_added_voc
......

One puzzling point: why does calling the tokenizer object directly end up inside PreTrainedTokenizer's tokenize function? My guess is that the tokenize function gets bound during initialization, so that calling the object invokes the corresponding tokenize logic.
The mechanism should be similar to building a model and then calling model(input_ids) directly to get the output.
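
That guess is essentially right: just as nn.Module defines __call__ so that model(input_ids) dispatches to forward, PreTrainedTokenizerBase defines __call__, which goes through encode_plus and eventually reaches tokenize. A stripped-down illustration of the same dispatch pattern (illustration only, not the transformers source):

class MiniTokenizer:
    # Same dispatch pattern as PreTrainedTokenizerBase: calling the object
    # funnels through __call__ down to tokenize / convert_tokens_to_ids.
    def __call__(self, text):
        tokens = self.tokenize(text)
        return {"input_ids": self.convert_tokens_to_ids(tokens)}

    def tokenize(self, text):
        print("MiniTokenizer tokenize")
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        print("MiniTokenizer convert_tokens_to_ids")
        return [hash(tok) % 21128 for tok in tokens]

tok = MiniTokenizer()
print(tok('{"text_id": "e2..."}'))   # __call__ -> tokenize -> convert_tokens_to_ids
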
Stepping into PreTrainedTokenizer's tokenize function to look at the call:

def tokenize(self, text: TextInput, **kwargs) -> List[str]:

Here the input text =

{"text_id": "e225b9fd36b8914f42c188fc92e8918f", "query": "河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元", "candidate": [{"text": "巩义市桐和街", "label": "不匹配"}, 
{"text": "桐和街依家小店", "label": "不匹配"}, {"text": "桐和街CHANG六LIULIU", "label": "不匹配"}, {"text": "桐和街佳乐钢琴", "label": "不匹配"}, 
{"text": "世博领秀城南门桐和街囍饭食堂", "label": "不匹配"}]}

Next, all_special_tokens_extended is computed:

all_special_tokens_extended = dict(
            (str(t), t) for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
        )

Printing all_special_tokens_extended gives:

all_special_tokens_extended = {}

Next, prepare_for_tokenization is called:

text, kwargs = self.prepare_for_tokenization(text,**kwargs)

The returned text and kwargs are unchanged; kwargs is still {}.
Next:

if hasattr(self,"do_lower_case") and self.do_lower_case:
	......

Since the text here is all Chinese, this branch is not executed.
Next come the split_on_tokens and split_on_token functions:

no_split_token = self.unique_no_split_tokens
tokenized_text = split_on_tokens(no_split_token,text)

Stepping into split_on_tokens.
Here is my instrumented version of the split_on_tokens function (a print statement added):

def split_on_tokens(tok_list, text):
    print('PreTrainedTokenizer split_on_tokens')
    if not text.strip():
        return []
    if not tok_list:
        return self._tokenize(text)

    tokenized_text = []
    text_list = [text]
    for tok in tok_list:
        tokenized_text = []
        for sub_text in text_list:
            if sub_text not in self.unique_no_split_tokens:
                tokenized_text.extend(split_on_token(tok, sub_text))
            else:
                tokenized_text.append(sub_text)
        text_list = tokenized_text

    return list(
        itertools.chain.from_iterable(
            (
                self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
                for token in tokenized_text
            )
        )
    )

Nothing in the middle changes the behaviour; the key part is the final return:

return list(
    itertools.chain.from_iterable(
        (
            self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
            for token in tokenized_text
        )
    )
)

Here,

self.unique_no_split_tokens = ['[CLS]','[MASK]','[PAD]','[SEP]','[UNK]']
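
The effect of that return expression is just to flatten a list of per-piece WordPiece lists while leaving the no-split special tokens untouched. A small standalone illustration (fake_tokenize is my stand-in for self._tokenize):

import itertools

unique_no_split_tokens = ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
tokenized_text = ['[CLS]', 'text_id', '[SEP]']

def fake_tokenize(token):
    # stand-in for self._tokenize: WordPiece-split ordinary strings
    return ['text', '_', 'id'] if token == 'text_id' else [token.lower()]

flat = list(
    itertools.chain.from_iterable(
        fake_tokenize(token) if token not in unique_no_split_tokens else [token]
        for token in tokenized_text
    )
)
print(flat)   # ['[CLS]', 'text', '_', 'id', '[SEP]']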

After processing, the resulting token list is:

['{', '"', 'text', '_', 'id', '"', ':', '"', 'e2', '##25', '##b', '##9', '##f', '##d', '##36', '##b', '##89', '##14', '##f', '##42', '##c', '##18', '##8', '##fc', '##92', '##e', '##89', '##18', '##f', '"', ',', '"', 'q', '##ue', '##ry', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '6', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '3', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 
'"', ',', '"', 'can', '##di', '##da', '##te', '"', ':', '[', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', 
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', 'chang', '[UNK]', 'liu', '##li', '##u', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', 
'{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',',
'{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ']', '}']

As you can see, every Chinese character has been replaced by the '[UNK]' marker.
For example, given the input string

['{"text_id": "e225b9fd36b8914f42c188fc92e8918f", 
   "query": "河南省巩义市新华路街道办事处桐和街6号钢苑新区3号楼一单元", 
   "candidate": [{"text": "巩义市桐和街", "label": "不匹配"}, {"text": "桐和街依家小店", "label": "不匹配"}, 
   {"text": "桐和街chang六liuliu", "label": "不匹配"}, {"text": "桐和街佳乐钢琴", "label": "不匹配"}, 
   {"text": "世博领秀城南门桐和街囍饭食堂", "label": "不匹配"}]}']

the resulting list is:

PreTrainedTokenizer split_on_tokens
['{', '"', 'text', '_', 'id', '"', ':', '"', 'e2', '##25', '##b', '##9', '##f', '##d', '##36', '##b', '##89', '##14', '##f', '##42', '##c', '##18', '##8', '##fc', '##92', '##e', '##89', '##18', '##f', '"', ',', '"', 'q', '##ue', '##ry', '"', ':', '"', '[UNK]',
'[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '6', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '3', '[UNK]', '[UNK]', 
'[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'can', '##di', '##da', '##te', '"', ':', '[', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 
'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', 'chang', '[UNK]', 'liu', 
'##li', '##u', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', 
'"', '}', ',', '{', '"', 'text', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '"', ',', '"', 'lab', '##el', '"', ':', '"', '[UNK]', '[UNK]', '[UNK]', '"', '}', ']', '}']
ids = 
[169, 107, 10539, 142, 8231, 107, 131, 107, 12357, 8743, 8204, 8160, 8189, 8168, 9159, 8204, 9402, 8717, 8189, 9240, 8177, 8662, 8156, 9717, 9595, 8154, 9402, 8662, 8189, 107, 117, 
107, 159, 8803, 8449, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 127, 100, 100, 100, 100, 100, 124, 100, 100, 100, 100, 100, 107, 117, 
107, 9109, 9172, 8521, 8299, 107, 131, 138, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 
10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 11680, 100, 
12306, 8636, 8207, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 
131, 107, 100, 100, 100, 107, 171, 117, 169, 107, 10539, 107, 131, 107, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 107, 117, 107, 11441, 8472, 107, 131, 107, 100, 100, 100, 107, 171, 140, 171]

From the output above, all of the Chinese content has been masked out as [UNK], while the English content is kept.
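
A quick way to check why the id 100 keeps showing up is to query the tokenizer directly; 100 should be the [UNK] id, and any character that maps to it is simply not being resolved against the loaded vocabulary in this setup (a small check, assuming the same tokenizer object as above):

print(tokenizer.unk_token, tokenizer.unk_token_id)    # expected: [UNK] 100
# if this prints 100, the character is falling back to [UNK] with this vocab/setup
print(tokenizer.convert_tokens_to_ids('河'))
# ordinary WordPiece tokens keep their own ids, e.g. 'text' -> 10539 in the dump above
print(tokenizer.convert_tokens_to_ids(['text', 'lab', '##el']))
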
Next, let's look at how the training data is processed during training.

trainer = Trainer(
	model=model,
	args=training_args,
	data_collator=data_collator,
	train_dataset=dataset
)

The first thing invoked here is DataCollatorForLanguageModeling's __call__ method:

def __call__(
    self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
) -> Dict[str, torch.Tensor]:
    print('data/data_collator.py __call__')
    # Handle dict or lists with proper padding and conversion to tensor.
    print('data_collator examples = ')
    print(examples)
    print('#########################')
    if isinstance(examples[0], (dict, BatchEncoding)):
        # the first branch (dict / BatchEncoding) is the one taken here
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
        # the tokenizer used here is a PreTrainedTokenizerBase
        print('|||self.tokenizer = |||')
        print(self.tokenizer)
        print('---self.pad_to_multiple_of---')
        print(self.pad_to_multiple_of)
        r"""
        self.tokenizer = PreTrainedTokenizer(name_or_path='/home/...vocab.txt',
        special_tokens={'unk_token':'[UNK]','sep_token':'[SEP]',...'mask_token':'[MASK]'}
        """
    else:
        print('situation2')
        batch = {"input_ids": _collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)}
    print('999batch = 999')
    print(batch)
    r"""
    batch = 
    {'input_ids':tensor(
    [[101,169,...102],
     ................
     [101,169,...102]]),
     'attention_mask':tensor(
    [[1,1,...1,1]
    """
    #batch['input_ids'].shape = ([32,90])
    #batch['attention_mask'].shape = ([32,90])
    print('99999999999999')
    r"""
    batch = 
    {'input_ids': tensor(
    [[  101,   169,   107,  ..., 10539,   107,   102],
    [  101,   169,   107,  ...,   100,   100,   102],
    [  101,   169,   107,  ...,   100,   100,   102],
    ...,
    [  101,   169,   107,  ...,   100,   100,   102],
    [  101,   169,   107,  ...,   100,   100,   102],
    [  101,   169,   107,  ...,   117,   169,   102]]), 
    'attention_mask': tensor(
    [[1, 1, 1,  ..., 1, 1, 1],
    [1, 1, 1,  ..., 1, 1, 1],
    [1, 1, 1,  ..., 1, 1, 1],
    ...,
    [1, 1, 1,  ..., 1, 1, 1],
    [1, 1, 1,  ..., 1, 1, 1],
    [1, 1, 1,  ..., 1, 1, 1]])}
    """
    # If special token mask has been preprocessed, pop it from the dict.
    special_tokens_mask = batch.pop("special_tokens_mask", None)
    #special_tokens_mask = None
    r"""
    special_tokens_mask = 
    [[1,0,0,...0,0,1],
     [1,0,0,...1,1,1],
     ...............
     [1,0,0,...0,0,1]]
    """
    if self.mlm:
        #special_tokens_mask = None
        batch["input_ids"], batch["labels"] = self.mask_tokens(
            batch["input_ids"], special_tokens_mask=special_tokens_mask
        )
        r"""
        ***batch = ***
        {'input_ids': tensor(
        [[  101,   169,   107,  ..., 10539,   107,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        ...,
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   103,   169,   102]]), 
        'attention_mask': tensor(
        [[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 
        'labels': tensor(
        [[-100, -100, -100,  ..., -100,  107, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        ...,
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ...,  117, -100, -100]])}
        """
    else:
        labels = batch["input_ids"].clone()
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
    return batch

The first output is:

data_collator examples =
[{'input_ids': tensor([  101,   169,   107, 10539,   142,  8231,   107,   131,   107,  8360,
         8717,  8139,  9099,  8168,  9267,  8157,  8177,  9446,  8177,  9419,
         8510, 10340, 10696,  8129, 11008,  8160,  8204,  8849,  8152,  8139,
         ...........
         107, 10539,   107,   131,   107,   100,   100,   100,   100,   102])},
         ......
 {'input_ids': tensor([  101,   169,   107, 10539,   142,  8231,   107,   131,   107,  9226,
                          9102,  9039,  8854,  8748,  8159,  9717,  8204,  8189,  9242,  8168,
                          8189, 11219, 11414,  8148,  8154,  9102,  9410,  8157,  8139,   107,
                           117,   107,   159,  8803,  8449,   107,   131,   107,   100,   100,
                           100,   100,   100,   100,   100,   100,   100,   100,   100,   100,
                           100,   100,   100,   100,   100,   100,   100,   100,   100,   127,
                           100,   100,   100,   100,   100,   107,   117,   107,  9109,  9172,
                          8521,  8299,   107,   131,   138,   169,   107, 10539,   107,   131,
                           107,   100,   100,   100,   100,   100,   100,   107,   171,   102])}]

Next, the following is printed:

self.pad_to_multiple_of = None

And here is the resulting batch, now with an attention_mask:

batch = 
{'input_ids': tensor(
	   [[  101,   169,   107,  ..., 10539,   107,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        ...,
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   117,   169,   102]]), 
 'attention_mask': tensor(
       [[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}
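
The attention_mask is produced by tokenizer.pad itself: every example is padded up to the longest sequence in the batch, real tokens get 1 and padding gets 0 (here all sequences happen to have the same length, so the mask is all ones). A minimal sketch with two toy examples of different lengths (same tokenizer object as above):

import torch

toy_examples = [
    {"input_ids": torch.tensor([101, 169, 107, 102])},
    {"input_ids": torch.tensor([101, 169, 100, 100, 100, 102])},
]
padded = tokenizer.pad(toy_examples, return_tensors="pt")
print(padded["input_ids"])       # shorter row padded with pad_token_id (0 for [PAD])
print(padded["attention_mask"])  # expected: [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]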

Next, this statement runs:

special_tokens_mask = batch.pop("special_tokens_mask",None)

Since the batch above has no special_tokens_mask key, the popped value is None:

special_tokens_mask = None

Then we reach:

if self.mlm:
    #special_tokens_mask = None
    batch["input_ids"], batch["labels"] = self.mask_tokens(
        batch["input_ids"], special_tokens_mask=special_tokens_mask
    )

which requires stepping into self.mask_tokens:

    def mask_tokens(
        self, inputs: torch.Tensor, special_tokens_mask: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        print('data/data_collator.py mask_tokens')
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        r"""
        labels = tensor(
       [[  101,   169,   107,  ..., 10539,   107,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        ...,
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   117,   169,   102]])
        """
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        r"""
        probability_matrix = 
        tensor([[0.1500,0.1500,...],
                [0.1500,0.1500,...],
                ..................
            
        """
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()

        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels
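
Before walking through it line by line, here is a self-contained toy run of the same 15% / 80-10-10 recipe (my own sketch; the [MASK] id 103 and vocab size 21128 are taken from the vocab dump above):

import torch

torch.manual_seed(0)
mlm_probability = 0.15
vocab_size, mask_token_id = 21128, 103

inputs = torch.randint(200, 1000, (2, 12))           # toy "input_ids"
labels = inputs.clone()

probability_matrix = torch.full(labels.shape, mlm_probability)
masked_indices = torch.bernoulli(probability_matrix).bool()
labels[~masked_indices] = -100                        # loss only on masked positions

# 80% of the masked positions -> [MASK]
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
inputs[indices_replaced] = mask_token_id

# half of the remaining 20% (i.e. 10% overall) -> random token, the rest stays unchanged
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)
inputs[indices_random] = random_words[indices_random]

print(inputs)
print(labels)   # -100 everywhere except the ~15% of positions chosen for MLM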

Walking through it step by step: first, labels is cloned from inputs:

labels = inputs.clone()
labels = tensor(
	   [[  101,   169,   107,  ..., 10539,   107,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        ...,
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   100,   100,   102],
        [  101,   169,   107,  ...,   117,   169,   102]]
)

Next, the probability_matrix is built:

probability_matrix = torch.full(labels.shape,self.mlm_probability)

which gives:

probability_matrix = 
tensor([[0.1500,0.1500,...],
        [0.1500,0.1500,...],
        ..................
        [0.1500,0.1500,...]])

Next, look at how special_tokens_mask (and then masked_indices) is computed:

if special_tokens_mask is None:
    special_tokens_mask = [
        self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    print('special_tokens_mask1 = ')
    print(special_tokens_mask)
    special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
    print('special_tokens_mask2 = ')
    print(special_tokens_mask)
else:
    special_tokens_mask = special_tokens_mask.bool()

The printed values are:

special_tokens_mask1 = 
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
.....................
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1]]

and special_tokens_mask2, after conversion to bool, is:

special_tokens_mask2 = 
tensor([[ True, False, False,  ..., False, False,  True],
        ...,
        [ True, False, False,  ..., False, False,  True]])

Computing special_tokens_mask1 requires calling get_special_tokens_mask:

special_tokens_mask = [
    self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
]
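
With already_has_special_tokens=True this flags every position whose id is in tokenizer.all_special_ids, which includes not only [CLS]/[SEP]/[PAD] but also [UNK] (id 100); that is presumably why the long runs of 1s above line up with the [UNK]-heavy Chinese spans, and those positions then get probability 0 and are never masked. A quick check (same tokenizer object, ids taken from the dumps above):

ids = [101, 169, 107, 100, 100, 127, 100, 102]   # [CLS] { " [UNK] [UNK] 6 [UNK] [SEP]
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(tokenizer.all_special_ids)   # expected to contain 0, 100, 101, 102, 103
print(mask)                        # expected: [1, 0, 0, 1, 1, 0, 1, 1]
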

The next post will continue walking through this code.
