transformers-tokenizer notes

ox180x

Article link: https://blog.csdn.net/ox180x/article/details/124095845

Notes on less frequently used parts of the transformers library, kept for later reference.

Key points

1. What "fast" means

For example, BertTokenizerFast or the use_fast argument, as in:

`AutoTokenizer.from_pretrained('hfl/chinese-electra-180g-small-discriminator', use_fast=True)`

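It means the tokenizer uses the Rust-backed implementation (the tokenizers library) for speed.

Hehe, Rust is now on its way into the Linux kernel. Congratulations!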

2. tokenizer

Common methods include convert_ids_to_tokens, encode, encode_plus, and so on. Below is one way to use the tokenizer on a sentence pair.

For a complete example, see ne_bert_mrc.py.

# -*- coding: utf8 -*-
#

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('hfl/chinese-electra-180g-small-discriminator', use_fast=True)
question = '南京天气怎么样'  # 7 characters
context = '我今天早上站在阳台看天空，今天南京天气很好！'  # 22 characters

tokenized_examples = tokenizer(
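    question,  # question text
    context,  # passage text
    truncation="only_second",  # truncate only the second sequence, i.e. the passage
    max_length=20,  # maximum length of each chunk, special tokens included
    # stride=5,  # number of tokens that consecutive passage chunks overlap by
    return_overflowing_tokens=True,  # return the overflowing tokens, so the passage is split into several chunks
    return_offsets_mapping=True,  # return character offsets, used to align answer positions
    padding="max_length",  # pad every chunk to max_length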
)

print(tokenized_examples)
input_ids = tokenized_examples['input_ids']
token_type_ids = tokenized_examples['token_type_ids']
attention_masks = tokenized_examples['attention_mask']
offset_mappings = tokenized_examples['offset_mapping']
overflow_to_sample_mapping = tokenized_examples['overflow_to_sample_mapping']
for index, _input_ids in enumerate(input_ids):
    print('input_ids -> ', tokenizer.convert_ids_to_tokens(_input_ids))
    print('token_type_ids -> ', token_type_ids[index])
    print('attention_masks -> ', attention_masks[index])
    print('offset_mappings -> ', offset_mappings[index])
    print('overflow_to_sample_mapping -> ', overflow_to_sample_mapping[index])
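
The offset_mapping entries are (start, end) character positions, so a token span can be mapped back to the original text. Here is a minimal sketch in plain Python, without calling the library. The offsets are built by hand under the assumption that every Chinese character becomes exactly one token and that special tokens get the offset (0, 0); span_to_text is a hypothetical helper, not part of transformers.

```python
# Hand-built stand-in for one chunk of offset_mapping (assumption: one token
# per character, special tokens mapped to (0, 0)).
context = '我今天早上站在阳台看天空，今天南京天气很好！'

offsets = (
    [(0, 0)]                                # [CLS]
    + [(i, i + 1) for i in range(7)]        # 7 question tokens (offsets into the question)
    + [(0, 0)]                              # [SEP]
    + [(i, i + 1) for i in range(10, 20)]   # passage characters 10..19 (offsets into the passage)
    + [(0, 0)]                              # [SEP]
)


def span_to_text(text, offset_mapping, token_start, token_end):
    """Return the text covered by tokens token_start..token_end (inclusive)."""
    char_start = offset_mapping[token_start][0]
    char_end = offset_mapping[token_end][1]
    return text[char_start:char_end]


# Tokens 14..18 of this chunk are passage tokens in the hand-built layout.
answer = span_to_text(context, offsets, 14, 18)
print(answer)
```

In real output the question tokens and passage tokens share one offset list, so token_type_ids (or the chunk's sequence ids) are needed to tell which string an offset refers to.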

Feel free to modify the tokenizer call above. The stride argument is commented out there, so it takes its default value of 0.
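
To get a feel for how max_length and stride decide the number of chunks, here is a rough sketch in plain Python. It is a simplified model of the windowing, not the library's own code. split_context is a hypothetical helper, and the window size of 10 assumes one token per Chinese character plus three special tokens ([CLS] and two [SEP]) around a 7-token question.

```python
def split_context(num_context_tokens, window, stride=0):
    """Return (start, stop) token spans of the passage, one per chunk.

    window: passage tokens that fit in one chunk.
    stride: tokens that consecutive chunks overlap by (assumed < window).
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    spans = []
    for start in range(0, num_context_tokens, step):
        stop = min(start + window, num_context_tokens)
        spans.append((start, stop))
        if stop == num_context_tokens:  # this chunk reached the end of the passage
            break
    return spans


max_length = 20
question_tokens = 7
special_tokens = 3  # assumption: [CLS] question [SEP] passage [SEP]
window = max_length - question_tokens - special_tokens  # 10 passage tokens per chunk

print(split_context(22, window))            # stride=0
print(split_context(22, window, stride=5))  # stride=5
```

Under these assumptions the sketch gives three chunks for the 22-token passage with stride 0 and four chunks with stride 5, since each new chunk then starts only 5 tokens after the previous one.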
