tokenizer.batch_encode_plus

Latest recommended article published on 2024-08-30 15:03:59

鹰立如睡

Category: Natural Language Processing. Tags: Natural Language Processing

Article link: https://blog.csdn.net/lvgaoyanh/article/details/119778214

The comment at the end of each line shows that line's output.

from transformers import BertTokenizer

# Load the tokenizer from a local copy of bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('C:\\Users\\lgy\\Desktop\\fsdownload\\bert-base-uncased')
print(tokenizer.mask_token)  # [MASK]
print(tokenizer.convert_tokens_to_ids('a'))  # 1037
print(tokenizer.convert_ids_to_tokens(1037))  # a

string = "test batch encode plus"
strings = [string, string]
tokens = tokenizer.tokenize(string)
print(tokens)  # ['test', 'batch', 'en', '##code', 'plus']
# Truncate sequences longer than max_length and pad shorter ones up to max_length
out = tokenizer.batch_encode_plus(strings, max_length=10, padding='max_length', truncation='longest_first')
print(out)  # {'input_ids': [[101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0], [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
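
To make the output above easier to read, here is a minimal pure-Python sketch of what `padding='max_length'` and truncation do to a single sequence. It is an illustration only, not the library's implementation: the helper name `encode_fixed_length` is hypothetical, and it assumes the bert-base-uncased special-token ids visible in the output above (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0), a single-sentence input, and padding on the right.

```python
from typing import Dict, List

# Special-token ids as they appear in the printed output above
# (assumed here to be fixed; a real tokenizer reads them from its vocab).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def encode_fixed_length(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Hypothetical helper: mimic truncation + max_length padding for one sentence."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")

    # Truncate: keep at most max_length - 2 tokens so the specials still fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]

    # Pad on the right up to max_length; the mask marks real tokens with 1.
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    input_ids = input_ids + [PAD_ID] * n_pad

    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * max_length,  # single sentence: all segment 0
        "attention_mask": attention_mask,
    }


# Ids of ['test', 'batch', 'en', '##code', 'plus'] taken from the output above.
ids = [3231, 14108, 4372, 16044, 4606]

short = encode_fixed_length(ids, max_length=10)  # shorter than max_length: padded
long = encode_fixed_length(ids, max_length=5)    # longer than max_length: truncated

print(short["input_ids"])       # [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(long["input_ids"])        # [101, 3231, 14108, 4372, 102]
```

With `max_length=10` the five word pieces plus two special tokens occupy seven positions, so three `[PAD]` ids are appended and the attention mask ends in three zeros, matching the printed result. With `max_length=5` only three word pieces survive, because two slots are reserved for `[CLS]` and `[SEP]`.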