How tiktoken works and how to use it in an offline environment

程序锅锅

Last modified 2024-04-04 19:32:12

First published 2024-03-28 23:28:44

Article link: https://blog.csdn.net/qq_35054222/article/details/137127660

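This article introduces tiktoken, a fast tokenizer developed by OpenAI and used in particular to prepare input for large models such as GPT. It explains how to use the cl100k_base encoding and how to work around network restrictions in mainland China by reading the encoding file from a local cache.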

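How tiktoken works

tiktoken is a fast, open-source tokenizer developed by OpenAI.

The first thing to understand is that large models such as GPT do not take a raw string as input. The first step is always to split the text into tokens and encode them as integers.

For example, given the text string 'tiktoken is great!' and the encoding 'cl100k_base', the string is split into ["t", "ik", "token", " is", " great", "!"], and the corresponding token IDs are [83, 1609, 5963, 374, 2294, 0].

The code is as follows: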

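import tiktoken

# Named "text" rather than "input" so the built-in input() is not shadowed
text = "tiktoken is great!"
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)
print("Input text: " + text)
print("Encoded tokens: " + str(token_ids))
for token in token_ids:
    # decode_single_token_bytes returns the raw bytes that a single token stands for
    print("token " + str(token) + " -> text " + str(enc.decode_single_token_bytes(token)))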

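Tokenization is very useful in practice, because GPT models read text as tokens. Knowing how many tokens a string contains tells you whether it is too long for the model to process.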
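To make the splitting step a little more concrete, here is a minimal sketch of greedy byte-pair merging, the general idea behind this kind of tokenizer. The rank table `TOY_RANKS` is made up for illustration and has nothing to do with the real cl100k_base table, so the IDs it produces are not real token IDs. The function name `toy_bpe_encode` is also my own.

```python
# Toy byte-pair merging over a hypothetical rank table (NOT the cl100k_base data).
# Lower rank means "merge earlier". Every single byte we use must have an entry.
TOY_RANKS = {
    b"t": 0, b"i": 1, b"k": 2, b"o": 3, b"e": 4, b"n": 5,
    b"ik": 6, b"to": 7, b"en": 8, b"tok": 9, b"token": 10,
}


def toy_bpe_encode(text, ranks):
    """Return the token IDs for text by repeatedly merging the best adjacent pair."""
    # Start from individual bytes.
    parts = [bytes([b]) for b in text.encode("utf-8")]
    while len(parts) > 1:
        best_index, best_rank = None, None
        # Look for the adjacent pair whose merged form has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # nothing left to merge
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return [ranks[p] for p in parts]


print(toy_bpe_encode("tiktoken", TOY_RANKS))
```

With this toy table, "tiktoken" ends up as the three pieces "t", "ik", "token". The real library does the same kind of merging in optimized Rust code, over a table of roughly a hundred thousand entries.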

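cl100k_base cannot be downloaded over the network

cl100k_base is one of the encodings provided by tiktoken. gpt-4, gpt-3.5-turbo and text-embedding-ada-002 all use it.

When cl100k_base is requested from inside mainland China, the download host often cannot be reached, so the file has to be downloaded separately and then read from a local cache.

The error looks like this: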

Exception type: <class 'requests.exceptions.ConnectionError'>
Exception value: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fdf29eef1f0>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -3] Temporary failure in name resolution)"))

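Reading the relevant source code: how is the file loaded locally?

The source shows that tiktoken reads the cache directory from the TIKTOKEN_CACHE_DIR environment variable, computes the SHA-1 hash of the download URL (blobpath), and then looks in the cache directory for a file whose name is that hash. This is why the downloaded cl100k_base.tiktoken file has to be renamed to the hash value.

The source code is as follows: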

def read_file_cached(blobpath: str, expected_hash: Optional[str] = None) -> bytes:
    user_specified_cache = True
    if "TIKTOKEN_CACHE_DIR" in os.environ:
        cache_dir = os.environ["TIKTOKEN_CACHE_DIR"]
    elif "DATA_GYM_CACHE_DIR" in os.environ:
        cache_dir = os.environ["DATA_GYM_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
        user_specified_cache = False

    if cache_dir == "":
        # disable caching
        return read_file(blobpath)

    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()

    cache_path = os.path.join(cache_dir, cache_key)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            data = f.read()
        if expected_hash is None or check_hash(data, expected_hash):
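            return data

    # (Excerpt ends here. On a cache miss or hash mismatch the function goes on to
    # fetch blobpath over the network, which is the step that fails offline.)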
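The lookup in the excerpt above can be reproduced with the standard library alone, which is handy for checking where tiktoken will look without importing it. This is a sketch that mirrors the excerpt, not part of the library; the function name `resolve_cache_path` is my own.

```python
import hashlib
import os
import tempfile


def resolve_cache_path(blobpath, environ):
    """Return the file path the excerpt would check, or None if caching is disabled."""
    # Same precedence as the excerpt: TIKTOKEN_CACHE_DIR, then DATA_GYM_CACHE_DIR,
    # then a "data-gym-cache" folder under the system temp directory.
    if "TIKTOKEN_CACHE_DIR" in environ:
        cache_dir = environ["TIKTOKEN_CACHE_DIR"]
    elif "DATA_GYM_CACHE_DIR" in environ:
        cache_dir = environ["DATA_GYM_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")

    if cache_dir == "":
        return None  # an empty value disables caching

    # The file name is the SHA-1 of the URL string, not of the file contents.
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    return os.path.join(cache_dir, cache_key)


url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
print(resolve_cache_path(url, {"TIKTOKEN_CACHE_DIR": "/home/workspace"}))
print(resolve_cache_path(url, os.environ))
```

The second call shows the path for the current environment, so running it on the offline machine tells you exactly which file name and folder are expected.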

1. Get the blobpath value (on a machine with network access). blobpath holds the URL of cl100k_base.tiktoken.

import tiktoken_ext.openai_public
import inspect

print(dir(tiktoken_ext.openai_public))
# The encoder we want is cl100k_base, we see this as a possible function

print(inspect.getsource(tiktoken_ext.openai_public.cl100k_base))
# The URL should be in the 'load_tiktoken_bpe function call'

My blobpath is: https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

2. Download cl100k_base.tiktoken from that URL (on a machine with network access).

3. Compute cache_key and rename cl100k_base.tiktoken to the value of cache_key.

import hashlib

# My blobpath; replace it if yours differs
blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
print(cache_key)
# Output: 9b5ad71b2ce5302211f9c61530b329a4922fc6a4

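I renamed cl100k_base.tiktoken to 9b5ad71b2ce5302211f9c61530b329a4922fc6a4.

At this point we have both the file and the cache key. Next I describe how to change the code so that it reads from the cache.

1. Add the following code, set tiktoken_cache_dir yourself, and put the renamed tiktoken file in that directory.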
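The rename can also be scripted, which avoids typos in the 40-character name. This is a small sketch under my own assumptions: the function name `install_into_cache` and the example paths are hypothetical, and it copies the file rather than renaming it so the original download is kept.

```python
import hashlib
import os
import shutil


def install_into_cache(downloaded_file, blobpath, cache_dir):
    """Copy the downloaded file into cache_dir under the SHA-1 of its URL."""
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, cache_key)
    shutil.copyfile(downloaded_file, target)
    return target


# Example call (paths are placeholders, adjust them to your machine):
# install_into_cache(
#     "cl100k_base.tiktoken",
#     "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
#     "/home/workspace",
# )
```

The returned path is the file that the cache lookup will find once TIKTOKEN_CACHE_DIR points at the same folder.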

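import hashlib
import os
import tiktoken

# In my case tiktoken_cache_dir is '/home/workspace'
tiktoken_cache_dir = "path_to_folder_containing_tiktoken_file"
# Set this before the first call to tiktoken.get_encoding
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# Same cache_key as computed earlier from the blobpath URL
blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()

# The renamed cl100k_base.tiktoken file must already be in '/home/workspace'
assert os.path.exists(os.path.join(tiktoken_cache_dir, cache_key))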

2. Run the encoding.

encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("Hello, world")

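3. Finally, if anything is still unclear, read the source of read_file_cached(blobpath: str, expected_hash: Optional[str] = None). The function is hard to find by stepping through with a debugger, so I suggest locating it with PyCharm's Search Everywhere.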
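If get_encoding still tries to download even though the renamed file is in place, one thing worth ruling out is a damaged copy: the excerpt from read_file_cached only returns the cached data when check_hash passes. As far as I know, check_hash compares a SHA-256 digest of the file contents, but treat that as an assumption and confirm it in your installed version. The sketch below simply computes the digest on both machines so the two values can be compared; the helper names are my own.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_digest(digest_online, digest_offline):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return digest_online.strip().lower() == digest_offline.strip().lower()


# Example (file names are placeholders):
# print(sha256_of_file("cl100k_base.tiktoken"))  # on the machine with network access
# print(sha256_of_file("/home/workspace/9b5ad71b2ce5302211f9c61530b329a4922fc6a4"))  # offline machine
```

If the two digests differ, the file was altered in transit (for example by a text-mode transfer) and should be copied again.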