How do BertTokenizerFast and BertTokenizer differ?

勤奋的懒猫

Tags: machine learning, natural language processing

Article link: https://blog.csdn.net/xhw205/article/details/129578988

When loading BERT with transformers, two tokenizers are available: BertTokenizerFast and BertTokenizer. How do they differ?

from transformers import BertTokenizerFast, BertTokenizer

fast_tokenizer = BertTokenizerFast.from_pretrained('./bert_base/')
tokenizer = BertTokenizer.from_pretrained('./bert_base/')
input = "cvpr的论文"  # note: this name shadows Python's built-in input()

First, BertTokenizerFast:

# Calling the tokenizer object directly returns the three BERT inputs: input_ids, token_type_ids, attention_mask
# The two special tokens ([CLS] and [SEP]) are added by default; truncation controls whether the input is cut to max_length
fast_sample = fast_tokenizer(input, max_length=256, truncation=True, add_special_tokens=True, return_offsets_mapping=True)

Output: {'input_ids': [101, 10718, 11426, 4638, 6389, 3152, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 2), (2, 4), (4, 5), (5, 6), (6, 7), (0, 0)]}
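To make the offset_mapping field concrete, here is a minimal sketch that slices the original string with the (start, end) pairs from the output above. It uses only the values already printed in this post, so no model is needed; the variable names are illustrative.

```python
# Values copied from the BertTokenizerFast output shown above.
text = "cvpr的论文"
offset_mapping = [(0, 0), (0, 2), (2, 4), (4, 5), (5, 6), (6, 7), (0, 0)]

# Each pair is a half-open character span [start, end) in the original text.
# The special tokens [CLS] and [SEP] have no source text, hence (0, 0).
spans = [text[start:end] for start, end in offset_mapping]
print(spans)  # ['', 'cv', 'pr', '的', '论', '文', '']
```

This is what makes the fast tokenizer convenient for tasks such as NER or span extraction, where token-level predictions have to be mapped back to character positions.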

# The tokenize method returns only the list of tokens
fast_sample = fast_tokenizer.tokenize(input, max_length=256, truncation=True)

Output: ['cv', '##pr', '的', '论', '文']
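Why does "cvpr" come out as 'cv' plus '##pr'? BERT uses WordPiece, which greedily matches the longest vocabulary entry from left to right and marks word-internal pieces with "##". The following is a simplified sketch of that idea, not the library's implementation: split_words, wordpiece and toy_vocab are hypothetical names, the vocabulary is a tiny stand-in for the real one, and the pre-tokenization step only handles whitespace and common CJK characters.

```python
def split_words(text):
    """Rough pre-tokenization: every CJK character becomes its own word."""
    words, buffer = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            if buffer:
                words.append(buffer)
                buffer = ""
            words.append(ch)
        elif ch.isspace():
            if buffer:
                words.append(buffer)
                buffer = ""
        else:
            buffer += ch
    if buffer:
        words.append(buffer)
    return words


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {"cv", "##pr", "的", "论", "文"}

tokens = [p for w in split_words("cvpr的论文") for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['cv', '##pr', '的', '论', '文']
```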

# The encode method returns only each token's index in the vocabulary, i.e. input_ids
fast_sample = fast_tokenizer.encode(input, max_length=256, truncation=True)

Output: [101, 10718, 11426, 4638, 6389, 3152, 102]
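Conceptually, encode is tokenize followed by a vocabulary lookup, with [CLS] and [SEP] wrapped around the sequence. The sketch below reproduces the ids above with a plain dictionary; toy_vocab is a hypothetical excerpt whose token/id pairs are read off the outputs in this post (101 and 102 are the [CLS] and [SEP] ids), not the full vocabulary file.

```python
# Hypothetical excerpt of the vocabulary, reconstructed from the outputs above.
toy_vocab = {
    "[CLS]": 101,
    "[SEP]": 102,
    "cv": 10718,
    "##pr": 11426,
    "的": 4638,
    "论": 6389,
    "文": 3152,
}

tokens = ["cv", "##pr", "的", "论", "文"]  # result of tokenize()

# Wrap with the special tokens, then map every token to its id.
input_ids = [toy_vocab[t] for t in ["[CLS]"] + tokens + ["[SEP]"]]
print(input_ids)  # [101, 10718, 11426, 4638, 6389, 3152, 102]
```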

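# encode_plus returns the same result as calling the tokenizer directly
# (the transformers documentation marks encode_plus as deprecated in favor of the direct call)
fast_sample = fast_tokenizer.encode_plus(input, max_length=256, truncation=True, return_offsets_mapping=True, add_special_tokens=True)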

Now BertTokenizer:

# Unlike the fast version, return_offsets_mapping is not supported, so the start/end position of each token in the sentence cannot be returned directly
sample = tokenizer(input, max_length=256, truncation=True, add_special_tokens=True)

Output: {'input_ids': [101, 10718, 11426, 4638, 6389, 3152, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
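If you are limited to the slow tokenizer and still need character positions, one workaround is to re-align the tokens against the text yourself. The helper below, approximate_offsets, is a hypothetical name and only a rough sketch: it assumes lowercasing is the only normalization applied, and it will be wrong or return (0, 0) for [UNK] tokens, stripped accents, or other cases where a token's text does not appear literally in the input.

```python
def approximate_offsets(text, tokens):
    """Best-effort (start, end) character spans for WordPiece tokens."""
    lowered = text.lower()
    offsets, cursor = [], 0
    for token in tokens:
        # Drop the "##" continuation marker before searching.
        piece = token[2:] if token.startswith("##") else token
        start = lowered.find(piece, cursor)
        if start == -1:
            offsets.append((0, 0))  # could not align, e.g. [UNK]
            continue
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end  # keep searching to the right of the last match
    return offsets


tokens = ["cv", "##pr", "的", "论", "文"]
print(approximate_offsets("cvpr的论文", tokens))
# [(0, 2), (2, 4), (4, 5), (5, 6), (6, 7)]
```

For this input the spans match the offset_mapping from BertTokenizerFast (minus the two special-token entries), but the fast tokenizer remains the more reliable choice when offsets matter.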

# Unlike the fast version, tokenize here does not support the max_length and truncation parameters
sample = tokenizer.tokenize(input)

Output: ['cv', '##pr', '的', '论', '文']

# Same as encode in the fast version: returns only each token's index in the vocabulary, i.e. input_ids
sample = tokenizer.encode(input, max_length=256, truncation=True)

Output: [101, 10718, 11426, 4638, 6389, 3152, 102]

# Unlike the fast version, return_offsets_mapping is not supported, so the start/end position of each token in the sentence cannot be returned directly
sample = tokenizer.encode_plus(input, max_length=256, truncation=True, add_special_tokens=True)

Output: {'input_ids': [101, 10718, 11426, 4638, 6389, 3152, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

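In addition, the two tokenizers add new tokens in the same way. For example, the "pr" in "cvpr" is split off as ##pr; you can register "pr" as a standalone token instead.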

special_tokens_dict = {'additional_special_tokens': ["pr"]}
tokenizer.add_special_tokens(special_tokens_dict)
fast_tokenizer.add_special_tokens(special_tokens_dict)
print(fast_tokenizer.tokenize(input)) #['cv', 'pr', '的', '论', '文']
print(tokenizer.tokenize(input)) #['cv', 'pr', '的', '论', '文']
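One way to picture the effect of an added token is that the text is first cut wherever that token occurs, and only the remaining segments go through normal WordPiece tokenization. The sketch below imitates just the cutting step with a regular expression; split_on_added_tokens is a hypothetical helper for illustration, and the real tokenizers use their own matching logic, so treat this as an approximation.

```python
import re


def split_on_added_tokens(text, added_tokens):
    """Cut text at every occurrence of an added token, keeping the token itself."""
    # Longest tokens first so that overlapping candidates prefer the longer match.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    return [segment for segment in re.split(pattern, text) if segment]


print(split_on_added_tokens("cvpr的论文", ["pr"]))  # ['cv', 'pr', '的论文']
print(split_on_added_tokens("print", ["pr"]))       # ['pr', 'int']
```

The second call hints at a side effect worth checking with the real tokenizers: a short added token such as "pr" can also match inside unrelated words. Also note that when the tokenizer is used with a model, adding tokens grows the vocabulary, so the model's embedding matrix has to be resized to match (in transformers, via model.resize_token_embeddings(len(tokenizer))).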