HuggingFace - tokenizer


This post is a translated, abridged repost of the official documentation:
https://huggingface.co/docs/tokenizers/index
https://huggingface.co/docs/tokenizers/quicktour


About the tokenizer


Installation

Option 1: pip

pip install tokenizers

Option 2: build from source
This requires a Rust toolchain to be installed.

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install setuptools_rust
python setup.py install


Building a tokenizer from scratch

Download the data and unzip it into data/, the directory that the training code reads from

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip -d data

Training
Here we train a Byte-Pair Encoding (BPE) tokenizer.

For the other types of tokenizer, see: https://huggingface.co/transformers/tokenizer_summary.html
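To give a rough feel for what BPE training does, here is a minimal pure-Python sketch. It is not the library's implementation: it simply starts from single characters and repeatedly merges the most frequent adjacent pair of symbols. The toy corpus and the function name `learn_bpe_merges` are made up for illustration.

```python
from collections import Counter


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dict (toy sketch)."""
    # Represent every word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = max(sorted(pair_counts), key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the pair by one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Hypothetical word frequencies standing in for a real corpus.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe_merges(toy_corpus, num_merges=3)
print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges, "newest" is stored as the symbols n, e, w, est. A real trainer such as BpeTrainer runs the same kind of loop until the vocabulary reaches its target size.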


from tokenizers import Tokenizer
from tokenizers.models import BPE

# Initialize the tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()


files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]

# Train
tokenizer.train(files, trainer)

# Save
tokenizer.save("data/tokenizer-wiki.json")

# Load
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")

# Encode a sentence
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

print(output.ids)
# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]

print(output.offsets[9])
# (26, 27)

sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"
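The offsets above are character positions in the original string. As a rough illustration in pure Python (not the library's code), the splitting done by the `Whitespace` pre-tokenizer can be approximated with the regular expression `\w+|[^\w\s]+`, keeping the span of each match. The helper name `split_with_offsets` is hypothetical, and Python's `\w` may not agree with the library's regex engine on every Unicode character.

```python
import re

# Approximation of the Whitespace pre-tokenizer pattern: runs of word
# characters, or runs of characters that are neither word chars nor spaces.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def split_with_offsets(text):
    """Return (piece, (start, end)) pairs; the offsets index into `text`."""
    return [(m.group(), m.span()) for m in PATTERN.finditer(text)]


sentence = "Hello, y'all! How are you 😁 ?"
pieces = split_with_offsets(sentence)
for piece, (start, end) in pieces:
    # Slicing the original string with the offsets gives back the piece.
    assert sentence[start:end] == piece

print(pieces[9])
# ('😁', (26, 27))
```

In this sketch the emoji comes back as its own piece. In the trained tokenizer above the same position is reported as [UNK] because the emoji is not in the vocabulary, but the offsets still point at the original character.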



Using a pretrained tokenizer

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

If you have a vocab file, you can also load the tokenizer directly from it:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

For testing, you can download the vocab file from this address:

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
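A BERT-style vocab file is plain text with one token per line, and a token's id is its line number, starting at 0. The sketch below is an illustration rather than the library's implementation: it loads such a file and applies greedy longest-match-first WordPiece splitting to a single word. The tiny vocab and the helper names `load_vocab` and `wordpiece` are made up for the example.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token to its id, where the id is the 0-based line number."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}


def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocab pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# A hypothetical five-line vocab file, written to a temp dir for the demo.
toy_tokens = ["[UNK]", "token", "##izer", "##s", "hello"]
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "toy-vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    vocab = load_vocab(vocab_path)

print(wordpiece("tokenizers", vocab))
# ['token', '##izer', '##s']
print(wordpiece("xyz", vocab))
# ['[UNK]']
```

The real BertWordPieceTokenizer also normalizes the text (for example lowercasing when lowercase=True) and splits it into words before this step, which the sketch leaves out.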

To be continued

伊织 2023-02-26
