Basic usage of the RoBERTa tokenizer

from transformers import AutoTokenizer

# Assumption: `datasets` is an already-loaded Hugging Face DatasetDict whose
# "train" split has a 'tokens' column (the sample sentence matches CoNLL-2003).
# Note: the checkpoint loaded here is DistilBERT (WordPiece), not RoBERTa.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokens = datasets["train"][4]["tokens"]
token_strings = " ".join(tokens)
encoding = tokenizer(token_strings)  # encode once and reuse the result
print("sentence:", token_strings, end="\n\n")
print("after tokenizer:", encoding, end="\n\n")
print("word ids:", encoding.word_ids(), end="\n\n")
print("tokenize result:", tokenizer.tokenize(token_strings), end="\n\n")
# These ids differ from input_ids only by the special tokens at the start and end
print("tokenize result converted to ids:", tokenizer.convert_tokens_to_ids(tokenizer.tokenize(token_strings)))
Output

sentence: Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .

after tokenizer: {'input_ids': [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

word ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 16, 17, 18, 19, 20, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, None]

tokenize result: ['germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
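
The word ids and the token list line up one to one once the two None entries (the special tokens) are dropped, so repeated word ids mark a word that was split into several sub-tokens. The following is only a small sketch of that reading: it hard-codes the two lists from the output above instead of calling the tokenizer, and the variable names (`pieces_by_word`, `split_words`) are made up for illustration.

```python
from itertools import groupby

# Both lists are copied from the printed output above.
tokens = ['germany', "'", 's', 'representative', 'to', 'the', 'european',
          'union', "'", 's', 'veterinary', 'committee', 'werner', 'z',
          '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers',
          'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries',
          'other', 'than', 'britain', 'until', 'the', 'scientific',
          'advice', 'was', 'clearer', '.']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13,
            14, 15, 16, 17, 18, 19, 20, 20, 20, 21, 22, 23, 24, 25, 26,
            27, 28, 29, 30, 31, 32, None]

# Drop the None entries, which belong to the special tokens at both ends.
inner_ids = [w for w in word_ids if w is not None]
assert len(inner_ids) == len(tokens)

# Consecutive tokens that share a word id are pieces of one original word.
pieces_by_word = {
    word_id: [tok for _, tok in group]
    for word_id, group in groupby(zip(inner_ids, tokens), key=lambda pair: pair[0])
}

# Keep only the words that were split into more than one sub-token.
split_words = {w: p for w, p in pieces_by_word.items() if len(p) > 1}
print(split_words)
# {13: ['z', '##wing', '##mann'], 20: ['sheep', '##me', '##at']}
```

Here word 13 is "Zwingmann" and word 20 is "sheepmeat" in the original sentence; every other word maps to exactly one token.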

tokenize result converted to ids: [2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012]
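
The code comment's claim that these ids differ from input_ids only by the start and end markers can be checked directly against the printed values. This is a minimal sketch that hard-codes both lists from the output above rather than re-running the tokenizer; `CLS_ID` and `SEP_ID` are illustrative names for the ids 101 and 102, which are [CLS] and [SEP] in the uncased BERT vocabulary that this checkpoint uses.

```python
# Both lists are copied from the printed output above.
input_ids = [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005,
             1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317,
             10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084,
             3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]
token_ids = [2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055,
             15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390,
             2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725,
             2127, 1996, 4045, 6040, 2001, 24509, 1012]

CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] as seen in the output

# input_ids is the tokenize() result wrapped in the two special tokens.
wrapped = [CLS_ID] + token_ids + [SEP_ID]
print(wrapped == input_ids)              # True
print(len(input_ids) - len(token_ids))   # 2
```

With a loaded tokenizer, the same two ids should be available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`, which avoids hard-coding them.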
