1. To find out which word each piece belongs to:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)  # use the fast (Rust-based) tokenizer
piece2word = tokenizer(input_text).words()  # first and last positions are special tokens ([CLS]/[SEP]), mapped to None
Remember: .words() requires a fast tokenizer; otherwise it raises: ValueError: words() is not available when using Python-based tokenizers. (In newer transformers versions the same mapping is also exposed as .word_ids().)
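Once you have the piece-to-word list, inverting it gives all piece indices for each word. A minimal sketch, using a hand-written word-id list standing in for the actual tokenizer(input_text).words() output (so it runs without downloading a model):

```python
# Toy word-id list as .words() would return it; None marks special tokens ([CLS]/[SEP])
piece2word = [None, 0, 0, 1, 2, 2, 2, None]

# Invert the mapping: word index -> list of piece indices
word2pieces = {}
for piece_idx, word_idx in enumerate(piece2word):
    if word_idx is None:  # skip special tokens
        continue
    word2pieces.setdefault(word_idx, []).append(piece_idx)

print(word2pieces)  # {0: [1, 2], 1: [3], 2: [4, 5, 6]}
```

This is handy for word-level tasks (e.g. NER label alignment), where you typically take the first piece of each word or pool over all of its pieces.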