Counting tokens in Python with tiktoken

爱学习的小道长

Last modified: 2024-10-17 22:39:20

First published: 2024-10-17 22:25:11

Article link: https://blog.csdn.net/weixin_40378209/article/details/143025586

Based on:
How to count tokens with Tiktoken

0. Background

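tiktoken is a BPE (byte pair encoding) tokenizer developed by OpenAI.
Given a text string (for example, "tiktoken is great!") and an encoding (for example, "cl100k_base"), the tokenizer splits the string into a list of tokens (for example, ["t", "ik", "token", " is", " great", "!"]).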
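
To make the idea of byte pair encoding concrete, here is a minimal toy sketch. It is not how tiktoken is implemented: tiktoken applies a merge table learned in advance from a large corpus, whereas this sketch learns its merges from the input string itself. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are made up for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def most_frequent_pair(tokens: List[bytes]) -> Optional[Tuple[bytes, bytes]]:
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: List[bytes], pair: Tuple[bytes, bytes]) -> List[bytes]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text: str, num_merges: int) -> List[bytes]:
    """Start from single UTF-8 bytes and apply up to `num_merges` greedy merges."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


# Two merges are enough to turn the repeated bytes l, o, w into one token b'low'.
print(toy_bpe("low lower lowest", 2))
```

Concatenating the resulting byte tokens always gives back the original UTF-8 bytes, which is the property that lets a real BPE tokenizer round-trip text.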

1. Installing tiktoken

Install

$ pip install tiktoken

Upgrade

$ pip install --upgrade tiktoken
...
Installing collected packages: tiktoken
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.7.0
    Uninstalling tiktoken-0.7.0:
      Successfully uninstalled tiktoken-0.7.0
Successfully installed tiktoken-0.8.0

2. Usage

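import tiktoken
import os


# The first run downloads the encoding data, so it needs internet access; here it goes through a local proxy.
# Later runs do not need internet access. Adjust or remove these two lines to match your own network.
os.environ["http_proxy"] = "socks5://127.0.0.1:1080"
os.environ["https_proxy"] = "socks5://127.0.0.1:1080"

# Load an encoding by name
encoding = tiktoken.get_encoding("cl100k_base")
print(encoding)
# Load the encoding used by a given model name
encoding = tiktoken.encoding_for_model("gpt-4")
print(encoding)
# .encode() converts a string into a list of integer tokens that represent the text
encode = encoding.encode("China is great!")
print(encode)
# .decode() converts a list of integer tokens back into a string
print(encoding.decode(encode))

Source_numbers =[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
print(encoding.decode(Source_numbers))
# .decode() can be applied to a single token, but it can be lossy for tokens that do not fall on UTF-8 boundaries.
# For a single token, .decode_single_token_bytes() safely converts one integer token into the bytes it represents.
print([encoding.decode_single_token_bytes(token) for token in Source_numbers])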

Output:

<Encoding 'cl100k_base'>
<Encoding 'cl100k_base'>
[23078, 374, 2294, 0]
China is great!
!"#$%&'()*+,-./012345
[b'!', b'"', b'#', b'$', b'%', b'&', b"'", b'(', b')', b'*', b'+', b',', b'-', b'.', b'/', b'0', b'1', b'2', b'3', b'4', b'5']
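
The warning about tokens that do not sit on UTF-8 boundaries can be illustrated without tiktoken at all. The sketch below uses plain Python bytes: "俄" is the three bytes E4 BF 84 in UTF-8, and a byte-level tokenizer may split those bytes across two tokens. The particular split and the variable names are assumptions for illustration; the `errors="replace"` handling is meant to mirror what a lossy decode does with an incomplete sequence.

```python
# UTF-8 bytes of "俄" (U+4FC4), split across two hypothetical tokens.
token_bytes = [b"\xe4\xbf", b"\x84"]

# Decoding each piece on its own is lossy: an incomplete UTF-8 sequence
# is replaced by U+FFFD, so the original character cannot be recovered.
decoded_separately = [
    piece.decode("utf-8", errors="replace") for piece in token_bytes
]
print(decoded_separately)

# Joining the raw bytes first and decoding once recovers the character.
decoded_together = b"".join(token_bytes).decode("utf-8")
print(decoded_together)
```

This is why inspecting individual tokens as bytes is the safer choice, while decoding to a string is best done on the whole token list at once.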

Comparison with the OpenAI Tokenizer:
[Screenshot: the OpenAI Tokenizer and how it splits the text into tokens]

3. Helper functions

3.1 Counting tokens

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens = num_tokens_from_string("China is great!", "cl100k_base")
print(num_tokens)

Output:

4

3.2 Comparing how different strings are tokenized under different encodings

# Encodings to compare
Encoding_method = ["r50k_base", "p50k_base", "cl100k_base", "o200k_base"]

def compare_encodings(example_string: str) -> None:
    """Print the token count, token integers, and token bytes for each encoding."""
    print(f'\nExample string: "{example_string}"')
    for encoding_name in Encoding_method:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        # Show each token as raw bytes, because a single token may not be valid UTF-8 on its own
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")


# The same kind of sentence in arithmetic, Chinese, Japanese, and Russian
compare_encodings("3 * 12 = 36")
print("**"*30)
compare_encodings("俄罗斯的首都是莫斯科")
print("**"*30)
compare_encodings("ロシアの首都はモスクワ")
print("**"*30)
compare_encodings("Столицей России является Москва")

Output:

Example string: "3 * 12 = 36"

r50k_base: 5 tokens
token integers: [18, 1635, 1105, 796, 4570]
token bytes: [b'3', b' *', b' 12', b' =', b' 36']

p50k_base: 5 tokens
token integers: [18, 1635, 1105, 796, 4570]
token bytes: [b'3', b' *', b' 12', b' =', b' 36']

cl100k_base: 7 tokens
token integers: [18, 353, 220, 717, 284, 220, 1927]
token bytes: [b'3', b' *', b' ', b'12', b' =', b' ', b'36']

o200k_base: 7 tokens
token integers: [18, 425, 220, 899, 314, 220, 2636]
token bytes: [b'3', b' *', b' ', b'12', b' =', b' ', b'36']
************************************************************

Example string: "俄罗斯的首都是莫斯科"

r50k_base: 22 tokens
token integers: [46479, 226, 163, 121, 245, 23877, 107, 21410, 165, 99, 244, 32849, 121, 42468, 164, 236, 104, 23877, 107, 163, 100, 239]
token bytes: [b'\xe4\xbf', b'\x84', b'\xe7', b'\xbd', b'\x97', b'\xe6\x96', b'\xaf', b'\xe7\x9a\x84', b'\xe9', b'\xa6', b'\x96', b'\xe9\x83', b'\xbd', b'\xe6\x98\xaf', b'\xe8', b'\x8e', b'\xab', b'\xe6\x96', b'\xaf', b'\xe7', b'\xa7', b'\x91']

p50k_base: 22 tokens
token integers: [46479, 226, 163, 121, 245, 23877, 107, 21410, 165, 99, 244, 32849, 121, 42468, 164, 236, 104, 23877, 107, 163, 100, 239]
token bytes: [b'\xe4\xbf', b'\x84', b'\xe7', b'\xbd', b'\x97', b'\xe6\x96', b'\xaf', b'\xe7\x9a\x84', b'\xe9', b'\xa6', b'\x96', b'\xe9\x83', b'\xbd', b'\xe6\x98\xaf', b'\xe8', b'\x8e', b'\xab', b'\xe6\x96', b'\xaf', b'\xe7', b'\xa7', b'\x91']

cl100k_base: 16 tokens
token integers: [11743, 226, 15581, 245,
