1 tokenization
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-chinese")
# Return the tokens BERT splits the text into
tz.tokenize("今天的天气怎么样")
['今', '天', '的', '天', '气', '怎', '么', '样']
# Convert tokens to their ids; a token missing from the vocabulary maps to the unknown-token ([UNK]) id, 100
tz.convert_tokens_to_ids(tz.tokenize("今天的天气怎么样"))
[791, 1921, 4638, 1921, 3698, 2582, 720, 3416]
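The unknown-token fallback described above can be sketched without downloading the model. This is only a toy illustration, not how BertTokenizer is implemented: TOY_VOCAB, toy_tokenize and toy_convert_tokens_to_ids are hypothetical names, the vocabulary holds just the ids shown in the output above plus [UNK] = 100, and the per-character split is an assumption that matches the Chinese example here.

```python
# Toy vocabulary (hypothetical): ids copied from the output above,
# plus the unknown-token id 100 mentioned in the note.
TOY_VOCAB = {
    "[UNK]": 100,
    "今": 791, "天": 1921, "的": 4638, "气": 3698,
    "怎": 2582, "么": 720, "样": 3416,
}

def toy_tokenize(text):
    """Split text into single characters, skipping whitespace (assumption for Chinese text)."""
    return [ch for ch in text if not ch.isspace()]

def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Look up each token; anything missing from the vocabulary gets the [UNK] id."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]

tokens = toy_tokenize("今天的天气怎么样")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
# A character missing from the toy vocabulary falls back to the [UNK] id.
print(toy_convert_tokens_to_ids(["龘"]))
```

With the real tokenizer the same lookup runs against the full bert-base-chinese vocabulary, so far fewer characters fall back to [UNK] than in this toy version.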
BERT - Tokenization and Encoding | Albert Au Yeung
2 BERT improvements
http://fancyerii.github.io/
http://fancyerii.github.io/2019/08/02/bert-pretrain-imp/