minbpe Explained

minbpe: a look at the simplest BPE implementation on GitHub

mergeable_ranks

There is the following piece of code in gpt4.py. My guess is that mergeable_ranks is the merge-rule dictionary produced when OpenAI trained the official vocabulary that tiktoken ships: it maps each token's byte sequence to its integer rank (token id), and the rank order records which merges happened first.

class GPT4Tokenizer(RegexTokenizer):
    """Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

    def __init__(self):
        super().__init__(pattern=GPT4_SPLIT_PATTERN)
        # get the official tokenizer and its merges
        enc = tiktoken.get_encoding("cl100k_base")
        mergeable_ranks = enc._mergeable_ranks
        # the merges are those of gpt4, but we have to recover them
        self.merges = recover_merges(mergeable_ranks)
        # reconstruct the vocab from the merges
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        self.vocab = vocab
        # now here is another tricky part.
        # for some reason, the tokens corresponding to individual bytes
        # are permuted in a different order. This is completely non-sensical
        # and probably historical, but therefore we have to deal with it here.
        self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
        self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
        # finally register the special tokens
        self.register_special_tokens(GPT4_SPECIAL_TOKENS)

What I do not quite understand is why the base bytes are shuffled at all; the comment only offers "for some reason … and probably historical". The tokenizer permutes the 256 single-byte tokens, and since mergeable_ranks is just a dict, you can print its first 300 entries and compare them against a plain byte-value vocab:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
print(type(mergeable_ranks))

# first 300 entries: token byte sequence -> rank (token id)
for i, (k, v) in enumerate(mergeable_ranks.items()):
    print(k, v)
    if i >= 299:
        break

print("-------------------")

# a "plain" byte vocab for comparison: id i maps straight to the byte with value i
vocab = {idx: bytes([idx]) for idx in range(256)}
for idx in vocab:
    print(idx, vocab[idx])
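
The practical consequence of the permutation is that GPT4Tokenizer has to translate raw bytes into their shuffled ids before running the merges, and translate them back when decoding. Below is a minimal sketch of that mapping; the helper names shuffle_bytes and unshuffle_bytes are made up purely for illustration (minbpe does the same translation inside its overridden encode/decode path), but the two dictionaries are built exactly as in the constructor above.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks

# same construction as in GPT4Tokenizer.__init__
byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
inverse_byte_shuffle = {v: k for k, v in byte_shuffle.items()}

def shuffle_bytes(text_bytes):
    # replace every raw byte with the id cl100k_base assigns to that single byte
    return bytes(byte_shuffle[b] for b in text_bytes)

def unshuffle_bytes(shuffled):
    # undo the permutation so the result is valid UTF-8 again
    return bytes(inverse_byte_shuffle[b] for b in shuffled)

raw = "hello world".encode("utf-8")
assert unshuffle_bytes(shuffle_bytes(raw)) == raw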

Two helper functions

def bpe(mergeable_ranks, token, max_rank):
    # helper function used in get_gpt4_merges() to reconstruct the merge forest
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        assert min_idx is not None
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(mergeable_ranks):
    # the `merges` are already the byte sequences in their merged state.
    # so we have to recover the original pairings. We can do this by doing
    # a small BPE training run on all the tokens, in their order.
    # also see https://github.com/openai/tiktoken/issues/60
    # also see https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue # skip raw bytes
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        assert len(pair) == 2
        # recover the integer ranks of the pair
        ix0 = mergeable_ranks[pair[0]]
        ix1 = mergeable_ranks[pair[1]]
        merges[(ix0, ix1)] = rank

    return merges

The job of recover_merges is to reconstruct the original merge pairs from the mergeable_ranks dictionary, which only stores the already-merged byte sequences together with their merge ranks. The function has to figure out how each merged byte sequence was paired during the original BPE training. Internally it calls bpe, which re-runs the merges on a token's raw bytes but stops just before that token's own rank, so the sequence falls apart into exactly the two parent tokens it was built from.

Put simply: mergeable_ranks gives you the byte sequences after merging, and these two functions undo that, so you can see exactly which two byte sequences were merged to form each token. You can try it out:

python -c "from minbpe import GPT4Tokenizer; GPT4Tokenizer().save_vocab('gpt4.vocab')"

This writes out a gpt4.vocab file; compare it against mergeable_ranks and the relationship becomes obvious.
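
To watch the recovery happen on a single token, you can also run bpe by hand on one multi-byte entry. A small illustrative snippet (it assumes the bpe function from the listing above is pasted into, or imported by, the same script; the variable names are mine):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks

# pick the first token that is longer than a single byte
token, rank = next((t, r) for t, r in mergeable_ranks.items() if len(t) > 1)

# max_rank=rank stops the merging one step before this token's own merge,
# so exactly its two parent tokens remain
left, right = bpe(mergeable_ranks, token, max_rank=rank)
print(token, "=", left, "+", right)
print("parent ranks:", mergeable_ranks[left], mergeable_ranks[right])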

basic and regex

basic.py and regex.py implement two tokenization strategies. The basic tokenizer turns the whole text into one byte sequence and runs num_merges merges directly on it; the regex tokenizer first splits the text into chunks with a regex pattern and merges only within each chunk, so pairs never merge across chunk boundaries.

Because the regex tokenizer pre-splits the text by a pattern, some cross-boundary merges are ruled out up front; my feeling is that this should make the resulting tokens more precise.

When merging, both strategies always pick the most frequent pair and merge it, repeating until nothing can be merged anymore or the allotted number of merges has been used.
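
As a quick way to see both strategies in action, the repo's README walks through a tiny training run; with the repo root on your Python path, something like the following should work:

from minbpe import BasicTokenizer, RegexTokenizer

text = "aaabdaaabac"

# basic: merges run over the raw byte stream of the whole text
basic = BasicTokenizer()
basic.train(text, vocab_size=256 + 3)  # 256 byte tokens + 3 merges
ids = basic.encode(text)
print(ids)                             # the README shows [258, 100, 258, 97, 99]
assert basic.decode(ids) == text

# regex: the text is pre-split by a pattern, merges never cross chunk boundaries
regex = RegexTokenizer()
regex.train(text, vocab_size=256 + 3)
assert regex.decode(regex.encode(text)) == text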

There is not much else to say about this part of the code: both classes inherit from the Tokenizer base class in base.py, and stepping through them in a debugger makes everything clear.

On debugging

It is worth noting that the tests directory contains a test file. With the following command, pytest drops into the PDB debugger at the start of every test case:

pytest --trace

This makes it very convenient to step through each test function!
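
If you only care about a subset of tests, the standard pytest selection flags combine with --trace; for example (assuming the GPT-4 related tests have "gpt4" in their names):

pytest tests/ --trace -k gpt4

Here -k filters by test name, so PDB does not open for every single test case.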
