Tokenizers and Vocabularies

北落师门XY

Last modified 2023-10-12 18:59:14

Categories: ML, DL. Tags: python

First published 2022-06-29 13:44:01

Link to this article: https://blog.csdn.net/weixin_41819299/article/details/125498667

Two vocabulary file formats: TXT and JSON

Vocabulary files come in several formats. The two covered here are vocab.txt and tokenizer.json.

vocab.txt

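Special tokens

[CLS]: start of the sequence
[SEP]: end of the sequence; also used to separate two sentences
[UNK]: unknown character (not in the vocabulary)
[MASK]: marks a masked position
##able: a subword suffix; the ## prefix marks a piece that continues the preceding piece of the same word
[unused10]: a reserved slot that makes the vocabulary easy to extend; new tokens can be added without changing the vocabulary size. See the Jianshu post "How BERT uses the reserved [unused*] tokens" (Bert如何使用预留的[unused*]).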
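
To make the ## convention concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece-style vocabularies. The function name `wordpiece` and the toy vocabulary are made up for illustration; the real BertTokenizer also normalizes text and splits on punctuation and CJK characters first, which this sketch leaves out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "read", "##able", "un", "##read"}

tokens = ["[CLS]"] + wordpiece("readable", toy_vocab) + wordpiece("xyz", toy_vocab) + ["[SEP]"]
print(tokens)
```

A word with no matching pieces collapses to a single [UNK], which is why out-of-vocabulary characters cannot be recovered by decode later on.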

tokenizer.json

Special tokens

<s>: start of the sequence
</s>: end of the sequence
<pad>: padding

Three ways to load a vocabulary with the transformers library

BertTokenizer

from transformers import BertTokenizer, BertTokenizerFast, AutoTokenizer

string = '2022年06.28，今天  天气真好'

tokenizer = BertTokenizer.from_pretrained('./tmp/vocab.txt')
res = tokenizer(string)
print(tokenizer.decode(res['input_ids']))  # [CLS] 2022 年 06. 28 ， [UNK] 天 天 [UNK] 真 [UNK] [SEP]

BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('./tmp')
res = tokenizer(string)
print(tokenizer.decode(res['input_ids']))  # <s> 2022年06.28,今天 天气真好</s>

AutoTokenizer

import sys

v2_path = './code'
# AutoTokenizer needs some model-specific scripts to be importable
sys.path = [v2_path, v2_path + '/layoutlmv2_xlm/models'] + sys.path
print(sys.path)
import layoutlmv2_xlm  # runs the package's __init__ initialization

tokenizer = AutoTokenizer.from_pretrained('./tmp')
res = tokenizer(
    string,
    return_offsets_mapping=True,
    # max_length=5,     # truncation length
    # truncation=True,
)
print(res)
print(tokenizer.decode(res['input_ids']))
# {'input_ids': [0, 72392, 470, 20773, 3882, 4, 7461, 6, 70871, 5364, 1322, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (4, 5), (5, 8), (8, 10), (10, 11), (11, 13), (15, 16), (15, 17), (17, 18), (18, 19), (0, 0)]}
# <s> 2022年06.28,今天 天气真好</s>

Differences between the three approaches

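BertTokenizerFast is faster than BertTokenizer. BertTokenizerFast is backed by the tokenizers library, which is written in Rust and performs well with multiple threads; BertTokenizer is implemented in Python.

BertTokenizerFast and AutoTokenizer can load tokenizer.json. BertTokenizer could not in my tests (I am not sure whether this is a special case).

In the examples above, BertTokenizerFast and AutoTokenizer are given the directory that contains the vocabulary, while BertTokenizer is given the path to the vocabulary file itself. (BertTokenizer.from_pretrained also accepts a directory that contains vocab.txt.)

AutoTokenizer works from the pretrained model: it picks the tokenizer class automatically and, to do so, parses some of the model's files (model_type and so on). The other two are BERT-specific and can be used on their own.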

Related functions

decode

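decode converts token ids back into text. Positions that were mapped to mask, unk and similar tokens cannot be restored. It is the counterpart of encode, but encode is rarely used because it returns less information (only the input ids).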

tokenizer = BertTokenizerFast.from_pretrained('./tmp')
res = tokenizer.encode(string)
print(res)
print(tokenizer.decode(res))
# [0, 72392, 470, 20773, 3882, 4, 7461, 6, 70871, 5364, 1322, 2]
# <s> 2022年06.28,今天 天气真好</s>

add_tokens

Adds words to the vocabulary. Note that unless add_tokens is called with special_tokens=True, 2022 is still tokenized as a single token.

string = '2022年06.28厷厸厹厺厼厽厾叀叁参叄叅叆叇亝34'
tokenizer = BertTokenizerFast.from_pretrained('./tmp')
res = tokenizer(string)
print(len(tokenizer), 'before')
print(tokenizer.decode(res['input_ids']))
print(res)
# tokenizer.add_special_tokens({'additional_special_tokens': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '厷']})
tokenizer.add_tokens(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '厷'])
res = tokenizer(string)
print(len(tokenizer), 'after')
print(tokenizer.decode(res['input_ids']))
print(res)
# 250007 before
# <s> 2022年06.28<unk>叁参<unk>34</s>
# {'input_ids': [0, 72392, 470, 20773, 3882, 3, 246487, 27644, 3, 10289, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# 250008 after
# <s> 2022年06.28厷 <unk>叁参<unk>34</s>
# {'input_ids': [0, 72392, 470, 20773, 3882, 250007, 6, 3, 246487, 27644, 3, 10289, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

add_special_tokens

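Adds words to the vocabulary as special tokens, which gives them higher priority during tokenization. In the example below, 2022 is split into its individual digits.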

string = '2022年06.28厷厸厹厺厼厽厾叀叁参叄叅叆叇亝34'
tokenizer = BertTokenizerFast.from_pretrained('./tmp')
res = tokenizer(string)
print(len(tokenizer), 'before')
print(tokenizer.decode(res['input_ids']))
print(res)
tokenizer.add_special_tokens({'additional_special_tokens': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '厷']})
res = tokenizer(string)
print(len(tokenizer), 'after')
print(tokenizer.decode(res['input_ids']))
print(res)
# 250007 before
# <s> 2022年06.28<unk>叁参<unk>34</s>
# {'input_ids': [0, 72392, 470, 20773, 3882, 3, 246487, 27644, 3, 10289, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# 250008 after
# <s>2022 年06.28厷 <unk>叁参<unk>34</s>
# {'input_ids': [0, 304, 2389, 304, 304, 6, 470, 2389, 910, 6, 5, 304, 1019, 250007, 6, 3, 246487, 27644, 3, 363, 617, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# The following two forms have the same effect
tokenizer.add_special_tokens({'additional_special_tokens': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '厷']})
# tokenizer.add_tokens(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '厷'], special_tokens=True)
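
One way to picture why special tokens "win" is to imagine the text being cut at every special token before the ordinary tokenization model sees it. The sketch below is a simplified mental model in plain Python, not the library's actual implementation; `split_on_special` is a hypothetical helper name.

```python
import re


def split_on_special(text, special_tokens):
    """Cut text into (chunk, is_special) pairs, matching special tokens first."""
    if not special_tokens:
        return [(text, False)]
    # Longer tokens first, so a longer special token wins over its own prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    chunks = []
    for part in re.split(pattern, text):
        if part:  # re.split yields empty strings between adjacent matches
            chunks.append((part, part in special_tokens))
    return chunks


digits = [str(d) for d in range(10)]

# Without special tokens the text reaches the tokenizer model in one piece.
print(split_on_special("2022年06.28", []))
# With the digits registered as special tokens, "2022" is cut apart up front.
print(split_on_special("2022年06.28", digits))
```

Only the non-special chunks would then go through the normal subword model, which matches the digit-by-digit ids seen in the output above.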

Input as a batch, as a sentence pair (context), and as a batch of sentence pairs

tokenizer = BertTokenizer.from_pretrained('./tmp/vocab.txt')
res = tokenizer(['今天', '西瓜美味'])  # batch
# {'input_ids': [[101, 100, 1811, 102], [101, 1947, 100, 1935, 100, 102]], 'token_type_ids': [[0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
res = tokenizer('今天', '西瓜美味')  # sentence pair
# {'input_ids': [101, 100, 1811, 102, 1947, 100, 1935, 100, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
res = tokenizer(['今天', '晴天'], ['西瓜美味', '雪糕美味'], padding=True)  # batch of sentence pairs
# {'input_ids': [[101, 100, 1811, 102, 1947, 100, 1935, 100, 102], [101, 100, 1811, 102, 100, 100, 1935, 100, 102]], 'token_type_ids': [[0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}

print(res)
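
The layout of the pair output above can be reproduced by hand. The sketch below assumes the BERT-style layout [CLS] A [SEP] B [SEP] and takes the ids 101 and 102 from the printed results; `build_inputs` is a hypothetical helper, not a transformers function.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as seen in the outputs above


def build_inputs(ids_a, ids_b=None):
    """Assemble input_ids, token_type_ids and attention_mask for one example."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0: [CLS] A [SEP]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # segment 1: B [SEP]
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": [1] * len(input_ids),
    }


single = build_inputs([100, 1811])
pair = build_inputs([100, 1811], [1947, 100, 1935, 100])
print(single)
print(pair)
```

The token_type_ids are what tells the model which tokens belong to the first sentence and which to the second.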

resize_token_embeddings

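This method is regarded as Hugging Face's simple, brute-force way of resizing: it deletes or appends entries at the end of the embedding matrix.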

https://github.com/JifeiTune/Huggingface_Bert_change_vocab

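With vocab.txt, to add words to the vocabulary you can simply overwrite the reserved [unused*] entries. See the Zhihu post "BERT's reserved [unused*] tokens" (Bert预留[unused*]).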
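
As a rough sketch of that edit, the function below overwrites [unused*] entries in a list of vocabulary lines (one token per line, as in vocab.txt) and leaves the list length unchanged. The helper name `fill_unused_slots` and the toy vocabulary are assumptions for illustration; reading and writing the actual file is left out.

```python
def fill_unused_slots(vocab_lines, new_words):
    """Return a copy of the vocabulary with [unused*] entries overwritten by new_words."""
    lines = list(vocab_lines)
    existing = set(lines)
    # Words that are already in the vocabulary do not need a slot.
    pending = [w for w in new_words if w not in existing]
    for i, token in enumerate(lines):
        if not pending:
            break
        if token.startswith("[unused") and token.endswith("]"):
            lines[i] = pending.pop(0)  # the token id (line index) stays the same
    if pending:
        raise ValueError(f"not enough [unused*] slots for: {pending}")
    return lines


# Toy vocabulary (hypothetical); a real vocab.txt has thousands of lines.
toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"]
updated = fill_unused_slots(toy_vocab, ["tokenizer", "the"])
print(updated)
```

Because the number of lines does not change, the model's embedding matrix keeps its shape and no resize is needed.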

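With a JSON vocabulary (tokenizer.json), adding words goes through add_tokens or add_special_tokens, which changes the vocabulary size. The model's embedding matrix must then be resized to match the new vocabulary size.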

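from transformers import AutoModelForTokenClassification, AutoTokenizer

# *** is a placeholder for the checkpoint path, elided in the original
model = AutoModelForTokenClassification.from_pretrained(***)
tokenizer = AutoTokenizer.from_pretrained(***)
tokenizer.add_special_tokens({'additional_special_tokens': ['*', 'x']})
model.resize_token_embeddings(len(tokenizer))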
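
Conceptually, resizing keeps the existing rows of the embedding matrix and either drops rows from the end or appends new ones. The sketch below shows that idea on plain Python lists. It is not the library's implementation, and the Gaussian initialization with standard deviation 0.02 is an assumption; the real initialization depends on the model.

```python
import random


def resize_embeddings(weights, new_size, seed=0):
    """Return a copy of `weights` (one row per token id) with exactly new_size rows."""
    rng = random.Random(seed)
    dim = len(weights[0])
    resized = [row[:] for row in weights[:new_size]]  # shrinking drops the last rows
    while len(resized) < new_size:
        # Growing appends freshly initialized rows; existing rows are untouched.
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


old = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
grown = resize_embeddings(old, 5)
shrunk = resize_embeddings(old, 2)
print(len(grown), len(shrunk))
```

Since the new rows start out untrained, newly added tokens only become useful after fine-tuning.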

Another way to extend the model weights after enlarging a JSON vocabulary:

"[Repost] How to add your own vocabulary to a BERT model (PyTorch version)", S大幕's blog on CSDN

convert_ids_to_tokens

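Converts input_ids back into token strings. Characters that are not in the vocabulary show up as <unk>. If you need to recover the original text, use offset_mapping.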

tokenizer.convert_ids_to_tokens(res['input_ids'])
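
Recovering the original text from offset_mapping needs only string slicing. The sketch below reuses the input string and the offset_mapping printed in the AutoTokenizer example earlier; `spans_from_offsets` is a hypothetical helper name.

```python
# Same input string as in the AutoTokenizer example (two spaces after 今天).
text = "2022年06.28，今天" + " " * 2 + "天气真好"

# offset_mapping copied from the AutoTokenizer output shown earlier.
offsets = [(0, 0), (0, 4), (4, 5), (5, 8), (8, 10), (10, 11), (11, 13),
           (15, 16), (15, 17), (17, 18), (18, 19), (0, 0)]


def spans_from_offsets(text, offsets):
    """Return the original substring for each token; (0, 0) entries give ''."""
    return [text[start:end] for start, end in offsets]


pieces = spans_from_offsets(text, offsets)
print(pieces)
```

The (0, 0) entries belong to the special tokens <s> and </s>, which have no counterpart in the input. Slicing the input works even where decode would print <unk>.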

----------

Optional parameters: padding, truncation, max_length

is_split_into_words: indicates that the input has already been split into words
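
What padding and truncation do to a batch can be sketched on plain id lists. This is a simplified illustration, not the library's code: `pad_and_truncate` is a hypothetical helper, pad_id=0 is an assumption, and the naive cut drops the trailing [SEP], whereas the real tokenizers truncate before adding special tokens.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Truncate each id list to max_length, then pad all lists to the longest one."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]  # naive cut from the end
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 100, 1811, 102], [101, 1947, 100, 1935, 100, 102]]
padded = pad_and_truncate(batch)
clipped = pad_and_truncate(batch, max_length=5)
print(padded)
print(clipped)
```

The attention_mask is what lets the model tell padding apart from real tokens once every row has the same length.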
