Since I hadn't read the official documentation and the paper carefully before, I was handling truncation and padding for the BERT tokenizer by manipulating Python lists and appending 0s directly.
In fact, tokenizer.encode already has built-in padding and truncation options.
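For contrast, here is a rough, hypothetical reconstruction of the manual approach described above (it assumes sentence and tokenizer are already defined, as in the import section at the end):
# Hypothetical sketch of the old manual approach: slice the id list and pad with 0
# (0 happens to be the [PAD] id for bert-base-uncased; the crude slice can even cut off the final [SEP])
ids = tokenizer.encode(sentence, add_special_tokens=True)
max_len = 100
ids = ids[:max_len] + [0] * (max_len - len(ids))
The examples below do the same thing with encode's own padding and truncation arguments.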
encode() parameters
encode(text: Union[str, List[str], List[int]],
text_pair: Union[str, List[str], List[int], NoneType] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False,
truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False,
max_length: Union[int, NoneType] = None,
stride: int = 0,
return_tensors: Union[str, transformers.file_utils.TensorType, NoneType] = None, **kwargs)
-> List[int] method of transformers.models.bert.tokenization_bert.BertTokenizer instance
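This signature comes straight from the interactive help; once the tokenizer is loaded (see the import section at the end), you can reproduce it with:
# Show the full signature and docstring of encode()
help(tokenizer.encode)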
Suppose we input a sentence like this:
sentence = "I like learning natural language processing! I like learning natural language processing! I like learning natural language processing! I like learning natural language processing!"
Pad it to a length of 100:
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True, max_length=100, padding='max_length', truncation=True))
Print the result:
print(input_ids.shape, '\n', input_ids)
We can check the corresponding tokenization by decoding the ids back:
# Decode back to text; the padded positions appear as [PAD] tokens
encode = tokenizer.decode(input_ids)
print(encode)
Let's look at the truncation effect with the same method.
Truncate to 10.
For whatever length you need, just change max_length; the maximum for BERT is 512.
input_ids = torch.tensor(tokenizer.encode(sentence, add_special_tokens=True, max_length=10, padding='max_length', truncation=True))
print(input_ids.shape, '\n', input_ids)
encode = tokenizer.decode(input_ids)
print(encode)
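In practice you usually also want an attention mask so the model can ignore the [PAD] positions. As a minimal sketch (assuming the same tokenizer as above), calling the tokenizer object directly instead of encode returns both input_ids and attention_mask:
# Calling the tokenizer itself returns a dict; padded positions get mask value 0
encoded = tokenizer(sentence, max_length=10, padding='max_length', truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape, encoded['attention_mask'])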
Looking back, using a for loop to manipulate the lists myself was a pretty clumsy way to do it. The lesson: read the paper and the official documentation before diving into practice, and make more use of help().
import
from transformers import BertModel, BertTokenizer
# Here we use the bert-base model; its vocabulary is lowercased
model_name = 'bert-base-uncased'
# Load the tokenizer that matches the model
tokenizer = BertTokenizer.from_pretrained(model_name, cache_dir='./transformers/bert-base-uncased/')
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__, device)
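The imports above bring in BertModel but never run it; as a rough sketch (not part of the original notebook), the padded input_ids from the earlier example could be pushed through the model like this:
# Load the model with the same cache_dir and move it to the selected device
model = BertModel.from_pretrained(model_name, cache_dir='./transformers/bert-base-uncased/').to(device)
model.eval()
# encode() returns a 1-D tensor, so add a batch dimension before the forward pass
with torch.no_grad():
    outputs = model(input_ids.unsqueeze(0).to(device))
# The first element is the last hidden state: (batch_size, seq_len, hidden_size)
print(outputs[0].shape)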