Tokenizing a text means splitting it into words or subwords, which are then converted to ids.
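For example, with a pretrained tokenizer from 🤗 Transformers (a minimal sketch; `bert-base-uncased` is just one checkpoint chosen for illustration, and other checkpoints will produce different subwords and ids):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Split the text into (sub)words, then map each token to its vocabulary id
tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```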
Splitting text is not as simple as it might seem. For example, suppose we want to split the sentence “Don’t you love 🤗 Transformers? We sure do.”
1) The simplest approach is to split on spaces:
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
2) Take punctuation into account as well:
["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
3) Split "Don't" into "Do" and "n't" instead of "Don", "'", "t", as rule-based tokenizers do (see the code sketch after this list):
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
Reference:
https://huggingface.co/transformers/tokenizer_summary.html