MiniBPE: Exploring the Simplest BPE Implementation on GitHub
mergeable_ranks
gpt4.py contains the following snippet. My guess is that mergeable_ranks is the merge-rank dictionary that OpenAI's official tiktoken produced when the vocabulary was trained, i.e. the record of which merges were learned.
class GPT4Tokenizer(RegexTokenizer):
    """Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

    def __init__(self):
        super().__init__(pattern=GPT4_SPLIT_PATTERN)
        # get the official tokenizer and its merges
        enc = tiktoken.get_encoding("cl100k_base")
        mergeable_ranks = enc._mergeable_ranks
        # the merges are those of gpt4, but we have to recover them
        self.merges = recover_merges(mergeable_ranks)
        # reconstruct the vocab from the merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        self.vocab = vocab
        # now here is another tricky part.
        # for some reason, the tokens corresponding to individual bytes
        # are permuted in a different order. This is completely non-sensical
        # and probably historical, but therefore we have to deal with it here.
        self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
        self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
        # finally register the special tokens
        self.register_special_tokens(GPT4_SPECIAL_TOKENS)
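To see what the vocab reconstruction loop does, here is a minimal standalone sketch with a hypothetical merges dict (in the real class, merges comes from recover_merges, not hand-written values like these):

```python
# Toy merges dict (hypothetical ranks, NOT from cl100k_base):
# (104, 105) -> 256 means bytes b'h' + b'i' were merged into token 256.
merges = {(104, 105): 256, (256, 33): 257}

# Same reconstruction as in GPT4Tokenizer.__init__:
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

print(vocab[256])  # b'hi'
print(vocab[257])  # b'hi!'
```

This works because dicts preserve insertion order (Python 3.7+) and each merge only references tokens created before it.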
Why this shuffling is needed is not obvious to me; the comment only offers "for some reason … and probably historical". The code shuffles the 256 base byte tokens. Since mergeable_ranks is just a dict, I printed its first 300 entries so you can compare them with a normally ordered vocab:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
print(type(mergeable_ranks))

i = 0
for k, v in mergeable_ranks.items():
    print(k, v)
    i += 1
    if i >= 300:
        break

print("-------------------")
vocab = {idx: bytes([idx]) for idx in range(256)}
for i in vocab:
    print(i, vocab[i])
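The practical consequence of the shuffle is that encoding must push every raw byte through byte_shuffle before BPE runs, and decoding must map back through inverse_byte_shuffle at the end. A toy sketch with a made-up permutation (the real byte_shuffle comes from cl100k_base's single-byte ranks, not from a formula like this one):

```python
# Hypothetical byte permutation; in GPT4Tokenizer the real one is
# {i: mergeable_ranks[bytes([i])] for i in range(256)}.
shuffle = {i: (i * 7) % 256 for i in range(256)}  # a bijection since gcd(7, 256) == 1
inverse = {v: k for k, v in shuffle.items()}

raw = "hello".encode("utf-8")
shuffled = bytes(shuffle[b] for b in raw)       # what encode() feeds into BPE
restored = bytes(inverse[b] for b in shuffled)  # what decode() undoes at the end
assert restored == raw
```

As long as the mapping is a bijection over 0..255, the round trip is lossless, which is why the shuffle is merely annoying rather than harmful.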
Two helper functions
def bpe(mergeable_ranks, token, max_rank):
    # helper function used in get_gpt4_merges() to reconstruct the merge forest
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts
def recover_merges(mergeable_ranks):
    # the `merges` are already the byte sequences in their merged state.
    # so we have to recover the original pairings. We can do this by doing
    # a small BPE training run on all the tokens, in their order.
    # also see https://github.com/openai/tiktoken/issues/60
    # also see https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # skip raw bytes
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        assert len(pair) == 2
        # recover the integer ranks of the pair
        ix0 = mergeable_ranks[pair[0]]
        ix1 = mergeable_ranks[pair[1]]
        merges[(ix0, ix1)] = rank
    return merges
The recover_merges function recovers the original merge pairs from the mergeable_ranks dictionary. That dictionary stores byte sequences already in their merged state, each with its merge rank; the function works out how those merged byte sequences were paired up during the original BPE training. Internally it calls bpe, whose job is to split each merged byte sequence back into the two pieces it was built from.

Put plainly: mergeable_ranks gives you the byte sequences after merging, and these two functions undo the merging so you can see exactly which two byte sequences were combined. You can try it out:
python -c "from minbpe import GPT4Tokenizer; GPT4Tokenizer().save_vocab('gpt4.vocab')"
This produces a gpt4.vocab file; compare it with mergeable_ranks and the relationship becomes obvious.
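To make the recovery concrete without downloading cl100k_base, here is a run on a tiny hand-made ranks dict; the two functions are repeated verbatim (minus comments) so the block runs standalone:

```python
def bpe(mergeable_ranks, token, max_rank):
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts

def recover_merges(mergeable_ranks):
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # skip raw bytes
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        assert len(pair) == 2
        merges[(mergeable_ranks[pair[0]], mergeable_ranks[pair[1]])] = rank
    return merges

# Toy ranks (made up): single bytes first, then merged tokens in rank order.
ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}

# Stopping BPE just below a token's own rank exposes its final merge:
print(bpe(ranks, b"abc", max_rank=4))  # [b'ab', b'c']
print(recover_merges(ranks))           # {(0, 1): 3, (3, 2): 4}
```

Reading the result: token 3 (b"ab") was built by merging ranks 0 and 1, and token 4 (b"abc") by merging ranks 3 and 2.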
basic and regex
basic.py and regex.py implement two tokenization strategies. basic turns the whole text into a single byte sequence and performs num_merges merges directly on it; regex merges within each text chunk separately and never merges across chunk boundaries.

Because regex pre-splits the text by a pattern, some merges are suppressed; my feeling is that this should make the tokens more precise.
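As a rough sketch of what the pre-splitting looks like: the pattern below is a hypothetical, much cruder stand-in for the real GPT4_SPLIT_PATTERN (which uses the third-party regex module and \p character classes), but it shows how merges get confined to chunks:

```python
import re

# Simplified stand-in for a GPT-style split pattern (hypothetical):
# words with optional leading space, digit runs, punctuation, whitespace.
pat = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

chunks = pat.findall("Hello world 123!")
print(chunks)  # ['Hello', ' world', ' 123', '!']
# BPE merges then run inside each chunk, so for example the bytes of
# 'world' and '123' can never merge across their shared boundary.
```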
In both cases, each merge step picks the most frequent adjacent pair and merges it, repeating until nothing can be merged or the merge budget is used up.
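That greedy loop can be sketched as follows, in the spirit of the get_stats and merge helpers in base.py:

```python
def get_stats(ids):
    # count occurrences of each adjacent pair of token ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` with the new token id `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aababab".encode("utf-8"))  # [97, 97, 98, 97, 98, 97, 98]
stats = get_stats(ids)
top = max(stats, key=stats.get)        # most frequent pair: (97, 98)
ids = merge(ids, top, 256)             # one merge step with new token 256
print(top, ids)                        # (97, 98) [97, 256, 256, 256]
```

One such step runs per iteration of the training loop, with the new token id incrementing from 256.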
There is not much to add about this code: both classes inherit from the Tokenizer base class in base.py, and stepping through it once in a debugger makes everything clear.
On debugging
It is worth noting that the tests directory contains a test file. With the following command, a PDB session is started at the beginning of every test case:

pytest --trace

This makes it very convenient to debug each test function!