While using GPT2Tokenizer from transformers, I came across this note in its documentation:
GPT-2 BPE tokenizer. Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer encode and decode method will not conserve the absence of a space at the beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
I did not really understand this at the time. Later, while looking through the vocabulary, I noticed these two entries:
"negative": 31591,
"Ġnegative": 4633,
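They are the same word, except that one has a leading Ġ. My first guess was that this marked something like a suffix.
In fact, Ġ is how the byte-level vocabulary displays a space. When negative is preceded by a space it is encoded as 4633, and when it is not it is encoded as 31591: in "Y negative" the word negative is encoded as 4633, while in "Ynegative" it is encoded as 31591. The entry without Ġ can be thought of as a word-internal piece (a suffix or continuation), and the entry with Ġ as the start of a new word.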
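To see where the Ġ comes from, here is a minimal sketch of the byte-to-character table used by GPT-2 style byte-level BPE. It is rewritten from memory of the reference implementation, so treat it as illustrative: the name bytes_to_unicode follows that implementation, while show is a made-up helper. Printable bytes keep their own character, and every other byte (including the space, byte 32) is shifted to a code point from 256 upward, which is why a space is displayed as Ġ (U+0120).

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every remaining byte (control characters, space, ...) is assigned
    # the next unused code point starting at 256.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


def show(text):
    """Hypothetical helper: render text the way it appears in vocab.json."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(show(" negative"))  # Ġnegative
print(show("negative"))   # negative
```

The space is byte 32, the 33rd byte without a printable character of its own, so it lands on code point 256 + 32 = 288, which is Ġ.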
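add_prefix_space controls how the very first word of the input is handled. Without the flag, the tokenizer assumes there is no space before the first character, which is usually not what you want. Take the sentence "Attention is all you need": its first word should be treated as the start of a word, but a plain tokenizer.encode("Attention is all you need") treats Attention as a continuation piece. That is why add_prefix_space=True should be passed.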
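The effect of the flag can be mimicked with a toy encoder. This is only a sketch under loud assumptions: TOY_VOCAB holds a handful of entries copied from the outputs reported in this post, toy_encode and toy_decode are hypothetical helpers, and a greedy longest-match lookup stands in for the real BPE merge procedure. Prepending the space only when the text does not already start with one is also an assumption of this toy.

```python
# A few vocabulary entries taken from the outputs shown in this post.
TOY_VOCAB = {
    "negative": 31591,
    "Ġnegative": 4633,
    "Ġ": 220,
    "K": 42,
    "ĠK": 509,
    "you": 5832,
    "Ġyou": 345,
}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def toy_encode(text, add_prefix_space=False):
    """Toy stand-in for encode(): greedy longest match, NOT real BPE."""
    # Assumption: the space is only prepended when none is there yet.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    symbols = text.replace(" ", "Ġ")  # spaces are displayed as Ġ
    ids, pos = [], 0
    while pos < len(symbols):
        # Try the longest remaining piece first, then shorter ones.
        for end in range(len(symbols), pos, -1):
            piece = symbols[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise KeyError(f"no toy vocabulary entry covers {symbols[pos:]!r}")
    return ids


def toy_decode(ids):
    """Toy stand-in for decode(): join the pieces and turn Ġ back into a space."""
    return "".join(ID_TO_TOKEN[i] for i in ids).replace("Ġ", " ")


print(toy_encode("negative"))                         # [31591]
print(toy_encode("negative", add_prefix_space=True))  # [4633]
print(toy_encode("you negative"))                     # [5832, 4633]
print(repr(toy_decode(toy_encode("negative", add_prefix_space=True))))  # ' negative'
```

The last line shows the round trip described in the docstring: once the prefix space has been added, decoding returns the text with a leading space.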
Experiments:
Method 1
Pass the flag on every call, e.g.:
tokenizer.encode("negative", add_prefix_space=True)
import warnings
warnings.filterwarnings("ignore")
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')  # writes vocab.json and merges.txt to ./config
text = "I love you"
print(text)
print(tokenizer.decode(tokenizer.encode(text)))  # I love you (no space before "I")
print(tokenizer.all_special_tokens)  # ['<|endoftext|>']
# Hypothesis: a word preceded by a space maps to "Ġnegative": 4633,
# otherwise it maps to "negative": 31591.
print(tokenizer.encode("negative"))  # [31591]  vocab entry: "negative": 31591
print(tokenizer.encode("negativeY"))  # [31591, 56]  vocab entry: "negative": 31591
print(tokenizer.encode(" negative"))  # [4633]  vocab entry: "Ġnegative": 4633
print(tokenizer.encode("you negative"))  # [5832, 4633]
print(tokenizer.encode("Knegative"))  # [42, 31591]
# Add two special tokens and check that the behaviour is unchanged.
special_tokens_dict = {'cls_token': '<CLS>', 'bos_token': '<s>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')  # We have added 2 tokens
print(tokenizer.encode("negative", add_special_tokens=True))  # [31591]  vocab entry: "negative": 31591
print(tokenizer.encode("negative", add_prefix_space=True))  # [4633]  vocab entry: "Ġnegative": 4633
print(tokenizer.encode("<s> negative", add_special_tokens=True))  # [50258, 4633]  <s> is 50258, "Ġnegative": 4633
print(tokenizer.encode("<s> negative", add_special_tokens=True, add_prefix_space=True))  # [50258, 4633]  <s> is 50258, "Ġnegative": 4633
Method 2
Use tokenizers.ByteLevelBPETokenizer, which takes the flag once at construction time,
e.g.:
import warnings
warnings.filterwarnings("ignore")
from transformers import GPT2Tokenizer
import tokenizers
# Save vocab.json and merges.txt so that ByteLevelBPETokenizer can load them.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')
text = "I love you"
PATH = './config/'
# Note: newer tokenizers releases name these two arguments vocab= and merges=.
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file=PATH + 'vocab.json',
    merges_file=PATH + 'merges.txt',
    lowercase=False,
    add_prefix_space=True
)
print(text)
print(tokenizer.decode(tokenizer.encode(text).ids))  # " I love you" (with a space before "I")
print(tokenizer.encode("negative").ids)  # [4633]  vocab entry: "Ġnegative": 4633
print(tokenizer.encode(" negative").ids)  # [4633]  one leading space, vocab entry: "Ġnegative": 4633
print(tokenizer.encode("  negative").ids)  # [220, 4633]  two leading spaces, vocab entry: "Ġ": 220
print(tokenizer.encode("you negative").ids)  # [345, 4633]
print(tokenizer.encode("Knegative").ids)  # [509, 31591]