This article is an abridged translation of the official documentation:
https://huggingface.co/docs/tokenizers/index
https://huggingface.co/docs/tokenizers/quicktour
About tokenizers
Installation
Option 1: pip
pip install tokenizers
Option 2: build from source
This requires a Rust toolchain to be installed.
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install setuptools_rust
python setup.py install
Building a tokenizer from scratch
Download and unzip the data
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
Training
Here we train a Byte-Pair Encoding (BPE) tokenizer.
For the different types of tokenizer, see: https://huggingface.co/transformers/tokenizer_summary.html
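Before the library code, it may help to see what BPE training does conceptually: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. The following is a toy sketch in pure Python, not the library's implementation. The function name `learn_bpe_merges`, the tiny corpus, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word becomes a tuple of symbols (initially single characters),
    # weighted by how often the word occurs in the corpus.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the weighted corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties fall back to lexicographic order,
        # an arbitrary choice made only to keep this sketch deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical miniature corpus: word frequencies chosen so that no ties occur.
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
# [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges the word "hug" has become a single symbol, while "pug" is still split into "p" and "ug". A real trainer runs the same loop until the target vocabulary size is reached.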
from tokenizers import Tokenizer
from tokenizers.models import BPE
# Initialize the tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
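from tokenizers.trainers import BpeTrainer
# Special tokens are assigned ids in list order, so "[UNK]" gets id 0
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
from tokenizers.pre_tokenizers import Whitespace
# Split the input into words first, so that no learned token spans more than one word
tokenizer.pre_tokenizer = Whitespace()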
# These paths assume the archive was unzipped into data/; adjust them if it was unzipped elsewhere
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
# Train
tokenizer.train(files, trainer)
# Save
tokenizer.save("data/tokenizer-wiki.json")
# Load
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
# Encode a sentence
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
print(output.ids)
# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
# Offsets give each token's (start, end) span in the original text
print(output.offsets[9])
# (26, 27)
sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"
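The offset lookup above can be reproduced without a trained tokenizer. The sketch below splits the same sentence with the regular expression that the library documents for its Whitespace pre-tokenizer (`\w+|[^\w\s]+`) and records each piece's span. It uses Python's `re` module, whose Unicode handling may differ from the library's in edge cases, and the helper name `pre_tokenize_with_offsets` is hypothetical.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize_with_offsets(text):
    """Return (piece, (start, end)) pairs, with offsets into the original text."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]


sentence = "Hello, y'all! How are you 😁 ?"
pieces = pre_tokenize_with_offsets(sentence)

# The piece at index 9 is the emoji, at the same span reported above.
piece, (start, end) = pieces[9]
print(piece, (start, end))
# 😁 (26, 27)

# Every span slices the original sentence back to its piece.
print(all(sentence[s:e] == p for p, (s, e) in pieces))
# True
```

Because the offsets always refer to the original string, they still point at the emoji even though the trained tokenizer replaced it with "[UNK]".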
Using a pretrained tokenizer
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
If you have a vocab file, you can also load the tokenizer directly from it:
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
For testing, you can download a vocab file from the address below:
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
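A BERT-style vocab file is plain text with one token per line, and a token's id is its zero-based line number. The sketch below shows how such a file can be read and how a single word might be split greedily into WordPiece units, with "##" marking pieces that continue a word. It is a simplified illustration and not the library's code: the helpers `load_vocab` and `wordpiece` and the six-entry toy vocab are assumptions, and normalization (such as lowercasing) and pre-tokenization are left out.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token to its id, where the id is the zero-based line number."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}


def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into vocab pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocab written to a temporary file, standing in for a real vocab.txt.
toy_tokens = ["[PAD]", "[UNK]", "token", "##izer", "##s", "un"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    vocab = load_vocab(path)

print(vocab["token"])
# 2
print(wordpiece("tokenizers", vocab))
# ['token', '##izer', '##s']
print(wordpiece("xyz", vocab))
# ['[UNK]']
```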
To be continued
伊织 2023-02-26