Since I hadn't read the official documentation and the paper carefully before, I was handling truncation and padding for the BERT tokenizer by manipulating Python lists and appending 0s directly.
In fact, tokenizer.encode already has built-in padding and truncation options.
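For contrast, here is a rough, hypothetical reconstruction of the manual approach described above (it assumes sentence and tokenizer are already defined, as in the import section at the end):
# Hypothetical sketch of the old manual approach: slice the id list and pad with 0
# (0 happens to be the [PAD] id for bert-base-uncased; the crude slice can even cut off the final [SEP])
ids = tokenizer.encode(sentence, add_special_tokens=True)
max_len = 100
ids = ids[:max_len] + [0] * (max_len - len(ids))
The examples below do the same thing with encode's own padding and truncation arguments.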
encode() parameters
encode(text: Union[str, List[str], List[int]],
text_pair: Union[str, List[str], List[int], NoneType] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False,
truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False,
max_length: Union[int, NoneType] = None,
stride: int = 0,
return_tensors: Union[str, transformers.file_utils.TensorType, NoneType] = None, **kwargs)
-> List[int] method of transformers.models.bert.tokenization_bert.BertTokenizer instance
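This signature comes straight from the interactive help; once the tokenizer is loaded (see the import section at the end), you can reproduce it with:
# Show the full signature and docstring of encode()
help(tokenizer.encode)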
Suppose we input a sentence like this:
sentence = "I like learning natural language processing! I like learning natural language processing! I like learning natural language processing! I like learning natural language processing!"
Pad it to a length of 100:
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True, max_length=100, padding='max_length', truncation=True))
Print the result:
print(input_ids.shape, '\n', input_ids)
We can check the corresponding tokenization by decoding the ids back:
# Decode back to text; the padded positions appear as [PAD] tokens
encode = tokenizer.decode(input_ids)
print(encode)
Let's look at the truncation effect with the same method.
Truncate to 10.
For whatever length you need, just change max_length; the maximum for BERT is 512.
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True, max_length=10, padding='max_length', truncation=True))
print(input_ids.shape, '\n', input_ids)
encode = tokenizer.decode(input_ids)
print(encode)
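In practice you usually also want an attention mask so the model can ignore the [PAD] positions. As a minimal sketch (assuming the same tokenizer as above), calling the tokenizer object directly instead of encode returns both input_ids and attention_mask:
# Calling the tokenizer itself returns a dict; padded positions get mask value 0
encoded = tokenizer(sentence, max_length=10, padding='max_length', truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape, encoded['attention_mask'])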
Looking back, using a for loop to manipulate the lists myself was a pretty clumsy way to do it. The lesson: read the paper and the official documentation before diving into practice, and make more use of help().
import
from transformers import BertModel, BertTokenizer
# Here we use the bert-base model; its vocabulary is lowercased
model_name = 'bert-base-uncased'
# Load the tokenizer that matches the model
tokenizer = BertTokenizer.from_pretrained(model_name, cache_dir='./transformers/bert-base-uncased/')
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__, device)
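The imports above bring in BertModel but never run it; as a rough sketch (not part of the original notebook), the padded input_ids from the earlier example could be pushed through the model like this:
# Load the model with the same cache_dir and move it to the selected device
model = BertModel.from_pretrained(model_name, cache_dir='./transformers/bert-base-uncased/').to(device)
model.eval()
# encode() returns a 1-D tensor, so add a batch dimension before the forward pass
with torch.no_grad():
    outputs = model(input_ids.unsqueeze(0).to(device))
# The first element is the last hidden state: (batch_size, seq_len, hidden_size)
print(outputs[0].shape)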