Using a Tokenizer (with BertTokenizer as the Example)

三千院本院

Last modified: 2023-06-29 16:17:50

Tags: python

First published: 2023-06-29 16:13:49

Article link: https://blog.csdn.net/weixin_42225889/article/details/131458722

'''
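Tokenizer quick start
When a neural network is used for a natural language processing task, the data must first be preprocessed: raw strings are converted into a format the network can accept. This usually takes the following steps:

1. Tokenization: split the text with a tokenizer (into characters, or into characters and words);

2. Vocabulary construction: build a token-to-ID mapping from the tokenized dataset (this step is not always required; with pretrained word vectors, the mapping has to follow the word-vector file instead);

3. Data conversion: map the tokenized data through the vocabulary, turning each text sequence into a sequence of numbers;

4. Padding and truncation: when data is fed to the model in batches, sequences that are too short are padded and sequences that are too long are truncated, so that every length stays within the range the model accepts and all sequences in a batch have the same dimensions.

If tokenization produces a result like the one below, the vocabulary in use has no entry for those Chinese characters; [UNK] is the placeholder token that stands in for them.
Original sentence: 我爱中华大地
After tokenization: 我 [UNK] 中 [UNK] 大 地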
'''
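
The four steps listed above can be sketched without any library. This is only a minimal illustration, not how BertTokenizer works internally: it assumes character-level tokenization, and the helper names (`tokenize`, `build_vocab`, `encode`) and the toy corpus are hypothetical. It also reproduces the [UNK] behavior for characters missing from the vocabulary.

```python
# Toy character-level pipeline: tokenize -> build vocab -> convert -> pad/truncate.
# All names here are illustrative; BertTokenizer loads its vocabulary from vocab.txt instead.

PAD, UNK = "[PAD]", "[UNK]"


def tokenize(text):
    """Step 1: split the text into single characters."""
    return list(text)


def build_vocab(corpus):
    """Step 2: assign an ID to every distinct token, reserving 0 and 1."""
    vocab = {PAD: 0, UNK: 1}
    for text in corpus:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Steps 3 and 4: map tokens to IDs, then truncate or pad to max_length."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokenize(text)]
    ids = ids[:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


# The toy corpus does not contain 爱 or 华, so both map to [UNK] (ID 1).
vocab = build_vocab(["我是中国人", "大地"])
encoded = encode("我爱中华大地", vocab, max_length=8)
print(encoded)
# [2, 1, 4, 1, 7, 8, 0, 0]
```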

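# Processing a single piece of data
from transformers import BertTokenizer
# Place config.json and vocab.txt in the bert-base-uncased folder.
# Note: the IDs printed below appear to correspond to a Chinese vocabulary (for example bert-base-chinese);
# with a vocab.txt that lacks these characters, some tokens would come out as [UNK].
tokenizer = BertTokenizer.from_pretrained(r"/bert-base-uncased")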

# ====================Single-sentence processing====================
# Sentence text
sentence = '我爱中华大地'

# Tokenize the sentence
tokens_one = tokenizer.tokenize(sentence)
print(tokens_one)
# ['我', '爱', '中', '华', '大', '地']

# Token sequence -> ID sequence (look up each token's ID in the vocabulary)
ids = tokenizer.convert_tokens_to_ids(tokens_one)
print(ids)
# [2769, 4263, 704, 1290, 1920, 1765]

# ID sequence -> text (decode the IDs back into a string; the tokens come back separated by spaces)
sentence_one = tokenizer.decode(ids)
print(sentence_one)
# 我 爱 中 华 大 地

# Padding: pad up to max_length with the [PAD] ID
ids_ids = tokenizer.encode(sentence, padding='max_length', max_length=15)
print('ids_ids', ids_ids)
# ids_ids [101, 2769, 4263, 704, 1290, 1920, 1765, 102, 0, 0, 0, 0, 0, 0, 0]

# Truncation: requires truncation=True
ids_ids2 = tokenizer.encode(sentence, max_length=3, truncation=True)
print('ids_ids2', ids_ids2)
# ids_ids2 [101, 2769, 102]
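
The result above shows that max_length counts the special tokens too: with max_length=3, only one real token survives between [CLS] and [SEP]. The sketch below mimics that behavior in plain Python, assuming truncation from the right; `truncate_with_specials` is a hypothetical helper, and the IDs 101 and 102 are taken from the outputs above.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the outputs above


def truncate_with_specials(token_ids, max_length):
    """Keep at most max_length - 2 tokens, then wrap them in [CLS] ... [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    kept = token_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]


ids_no_specials = [2769, 4263, 704, 1290, 1920, 1765]  # IDs of 我爱中华大地
print(truncate_with_specials(ids_no_specials, max_length=3))
# [101, 2769, 102]
```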

# Output format when the tokenizer is called directly
tokenizer_one = tokenizer('我爱中华大地')
print(tokenizer_one)
# {'input_ids': [101, 2769, 4263, 704, 1290, 1920, 1765, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

# Build attention_mask and token_type_ids by hand
ids = tokenizer.encode(sentence, padding="max_length", max_length=15)
# Compare against the tokenizer's own pad ID (0 in this vocabulary) instead of a hard-coded 0
attention_mask = [1 if idx != tokenizer.pad_token_id else 0 for idx in ids]
# A single sentence belongs entirely to segment 0
token_type_ids = [0] * len(ids)
print('attention_mask', attention_mask, '\ntoken_type_ids', token_type_ids)
# attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
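
For a single sentence every token_type_id is 0. With a sentence pair, BERT-style inputs mark the second segment with 1. The sketch below lays that out by hand from already-converted IDs; `build_pair_inputs` is a hypothetical helper, and the special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed from the outputs shown in this article.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # values seen in this article's outputs


def build_pair_inputs(ids_a, ids_b, max_length):
    """Lay out [CLS] A [SEP] B [SEP], pad to max_length, and derive both masks."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    pad_len = max_length - len(input_ids)
    if pad_len < 0:
        raise ValueError("max_length is shorter than the sequence pair")
    return {
        "input_ids": input_ids + [PAD_ID] * pad_len,
        "token_type_ids": token_type_ids + [0] * pad_len,
        "attention_mask": attention_mask + [0] * pad_len,
    }


# IDs of 我爱中华大地 and 我是中国人, copied from this article's outputs
pair = build_pair_inputs(
    [2769, 4263, 704, 1290, 1920, 1765],
    [2769, 3221, 704, 1744, 782],
    max_length=16,
)
print(pair["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
print(pair["attention_mask"])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In practice, passing both sentences to the tokenizer, as in tokenizer(sentence_a, sentence_b), should produce the same layout without any manual bookkeeping.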


# Shortcut: encode_plus returns attention_mask and token_type_ids along with input_ids; calling tokenizer() directly gives the same result
inputs = tokenizer.encode_plus(sentence, padding="max_length", max_length=15)
print(inputs)
# {'input_ids': [101, 2769, 4263, 704, 1290, 1920, 1765, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}


# ====================Multi-sentence processing====================
sentences = ["我爱中华大地", "我是中国人"]

tokens_many = tokenizer(sentences)
print(tokens_many)
# {'input_ids': [[101, 2769, 4263, 704, 1290, 1920, 1765, 102], [101, 2769, 3221, 704, 1744, 782, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}

# Pad every sentence to a fixed length
tokens_many_new = tokenizer(sentences, padding='max_length', max_length=12)
print(tokens_many_new)
# {'input_ids': [[101, 2769, 4263, 704, 1290, 1920, 1765, 102, 0, 0, 0, 0], [101, 2769, 3221, 704, 1744, 782, 102, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]}

# Get only the input IDs of the batch
tokens_many_ids = tokenizer(sentences)
ids = tokens_many_ids['input_ids']
print(ids)
# [[101, 2769, 4263, 704, 1290, 1920, 1765, 102], [101, 2769, 3221, 704, 1744, 782, 102]]
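
Padding every batch to a fixed max_length wastes positions when the sentences are short. A common alternative is to pad only up to the longest sequence in the batch, which is what padding=True does in transformers. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, and the pad ID 0 is assumed from the outputs above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the length of the longest one and build matching masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        pad_len = longest - len(ids)
        input_ids.append(ids + [pad_id] * pad_len)
        attention_mask.append([1] * len(ids) + [0] * pad_len)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The two unpadded sequences printed above
batch = pad_batch([
    [101, 2769, 4263, 704, 1290, 1920, 1765, 102],
    [101, 2769, 3221, 704, 1744, 782, 102],
])
print(batch["input_ids"][1])
# [101, 2769, 3221, 704, 1744, 782, 102, 0]
print(batch["attention_mask"][1])
# [1, 1, 1, 1, 1, 1, 1, 0]
```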