BertTokenizer

lei_qi

Categories: Tools, python3

Article link: https://blog.csdn.net/lei_qi/article/details/115311913

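# The top-level import works across transformers versions; the old
# transformers.tokenization_bert module path was removed in v4.
from transformers import BertTokenizer

# Load the vocabulary of the uncased English BERT model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("Vocabulary size:", tokenizer.vocab_size)

# English: text is lowercased and rare words are split into WordPiece
# sub-tokens ("##" marks a piece that continues the previous one)
text = "the game has gone!unaffable  I have a new GPU!"
tokens = tokenizer.tokenize(text)
print("English tokenization:", tokens)

# Chinese: text is split into single characters; characters missing from
# this English vocabulary become [UNK]
text = "我爱北京天安门，吢吣"
tokens = tokenizer.tokenize(text)
print("Chinese tokenization:", tokens)

# Map tokens to vocabulary ids ([UNK] maps to 100)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token-to-id conversion:", input_ids)

# Encode a sentence pair: adds [CLS]/[SEP] and returns segment ids and mask
sen_code = tokenizer.encode_plus("i like  you  much", "but not him")
print("Sentence-pair encode:", sen_code)

print("decode:", tokenizer.decode(sen_code['input_ids']))

Output:

Vocabulary size: 30522
English tokenization: ['the', 'game', 'has', 'gone', '!', 'una', '##ffa', '##ble', 'i', 'have', 'a', 'new', 'gp', '##u', '!']
Chinese tokenization: ['我', '[UNK]', '北', '京', '天', '安', '[UNK]', '，', '[UNK]', '[UNK]']
Token-to-id conversion: [1855, 100, 1781, 1755, 1811, 1820, 100, 1989, 100, 100]
Sentence-pair encode: {'input_ids': [101, 1045, 2066, 2017, 2172, 102, 2021, 2025, 2032, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
decode: [CLS] i like you much [SEP] but not him [SEP]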
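
The '##' pieces and the [UNK] entries in the output above come from WordPiece's greedy longest-match-first lookup. Below is a simplified sketch of that lookup for a single word. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and exist only for illustration. The real tokenizer also lowercases, splits on punctuation, and uses the full 30522-entry vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary:
            # the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the output above.
toy_vocab = {"una", "##ffa", "##ble", "gp", "##u", "game", "我"}

print(wordpiece("unaffable", toy_vocab))  # ['una', '##ffa', '##ble']
print(wordpiece("gpu", toy_vocab))        # ['gp', '##u']
print(wordpiece("爱", toy_vocab))          # ['[UNK]']
```

Because the match is greedy, a word is only split as far as needed, and a single unmatched remainder turns the whole word into [UNK]. That is why several Chinese characters in the output above each show up as [UNK].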
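
The sentence-pair result above has a fixed layout: [CLS], the first sentence, [SEP], the second sentence, [SEP]. As a rough illustration, the three lists can be rebuilt by hand from the token ids. The helper `build_pair` is a hypothetical name, not a transformers API, and the ids 101 and 102 for [CLS] and [SEP] are taken from the printed output.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], as seen in the output above


def build_pair(ids_a, ids_b):
    """Assemble BERT-style inputs for a sentence pair (illustrative helper)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS] + sentence A + first [SEP];
    # segment 1 covers sentence B + final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Ids for "i like you much" and "but not him", copied from the output above.
pair = build_pair([1045, 2066, 2017, 2172], [2021, 2025, 2032])
print(pair)
```

With these ids the result matches the dictionary printed by `encode_plus` above. When sequences are padded to a common length, the padded positions would get 0 in `attention_mask`.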