Sample tokens too long: how to keep the overflow
This solves the problem of a single sample whose input length exceeds the maximum input length the pretrained model can accept.
# Handle a single sample whose input length exceeds the maximum the pretrained model accepts
from transformers import AutoTokenizer

# Read the raw samples, one per line
with open('test.txt', encoding='utf-8') as f:
    text = f.readlines()
print(text[1])

# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # English-only variant
C_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
# return_overflowing_tokens keeps the tokens cut off by truncation as extra chunks;
# stride=20 makes adjacent chunks overlap by 20 tokens so boundary context is preserved
token = C_tokenizer(text[1], truncation=True, max_length=40, padding=True,
                    return_overflowing_tokens=True, stride=20, return_tensors="pt")
for i, ipt in enumerate(token["input_ids"]):
    print(C_tokenizer.decode(ipt))
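Each decoded chunk is at most 40 tokens long, and consecutive chunks share a 20-token overlap (the stride), so text at a chunk boundary is never lost outright.

When several samples are tokenized this way in one batch, the chunks of all samples come back in a single flat list and must be traced back to their source sample. The sketch below shows one way to do this, assuming a fast tokenizer (AutoTokenizer loads one for "bert-base-chinese" by default), which exposes the mapping under the "overflow_to_sample_mapping" key; the sample texts here are made up for illustration.

# Minimal sketch: map overflow chunks back to the samples that produced them
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
samples = ["长文本样本。" * 30, "短样本"]  # hypothetical inputs; the first overflows max_length
enc = tokenizer(samples, truncation=True, max_length=40, padding=True,
                return_overflowing_tokens=True, stride=20, return_tensors="pt")
# overflow_to_sample_mapping[i] is the index of the original sample that produced
# chunk i, so chunk-level model outputs can later be aggregated per sample
for chunk_ids, sample_idx in zip(enc["input_ids"], enc["overflow_to_sample_mapping"]):
    print(int(sample_idx), tokenizer.decode(chunk_ids, skip_special_tokens=True))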