BERT
1.vocab
PRETRAINED_VOCAB_ARCHIVE_MAP = {
"bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
"bert-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
"bert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
"bert-large-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
"bert-base-multilingual-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
"bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
"bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt"
}
2.model_bin
PRETRAINED_MODEL_ARCHIVE_MAP = {
"bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
"bert-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
"bert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
"bert-large-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
"bert-base-multilingual-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
"bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
"bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz"
}
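Assuming the archive maps above are plain Python dicts, resolving a shortcut name to a download URL is just a dictionary lookup with a pass-through fallback for local paths. The sketch below is illustrative (the `resolve_archive` helper is a made-up name, and these legacy S3 URLs may no longer be live); it copies only two entries from the full map listed above:

```python
# Hypothetical sketch of shortcut-name -> URL resolution for the maps above.
# Only two entries are copied here; the full maps are listed in the text.
PRETRAINED_MODEL_ARCHIVE_MAP = {
    "bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    "bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}

def resolve_archive(name_or_path, archive_map=PRETRAINED_MODEL_ARCHIVE_MAP):
    """Map a shortcut name to its archive URL; pass anything else through
    unchanged (e.g. a local path or an explicit URL)."""
    return archive_map.get(name_or_path, name_or_path)
```

This mirrors the lookup `from_pretrained` performs before downloading and caching the file.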
RoBERTa
1.config
ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json",
"roberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-config.json",
"roberta-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-config.json",
"distilroberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-config.json",
"roberta-base-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-openai-detector-config.json",
"roberta-large-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-openai-detector-config.json",
}
2.vocab
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
"roberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
"roberta-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-vocab.json",
"distilroberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-vocab.json",
"roberta-base-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
"roberta-large-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json",
},
"merges_file": {
"roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
"roberta-large": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
"roberta-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-merges.txt",
"distilroberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/distilroberta-base-merges.txt",
"roberta-base-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
"roberta-large-openai-detector": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt",
},
}
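Note that unlike BERT's single vocab.txt, RoBERTa's BPE tokenizer needs two files per checkpoint: a vocab.json and a merges.txt. A minimal sketch of collecting both from the nested map above (the `tokenizer_files` helper is hypothetical, and the map is trimmed to one model per file kind):

```python
# Collect all tokenizer files for one checkpoint from a nested map shaped
# like PRETRAINED_VOCAB_FILES_MAP above (trimmed for brevity).
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json",
    },
    "merges_file": {
        "roberta-base": "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt",
    },
}

def tokenizer_files(name, files_map=PRETRAINED_VOCAB_FILES_MAP):
    """Return {file_kind: url} for every file kind that lists `name`."""
    return {kind: urls[name] for kind, urls in files_map.items() if name in urls}
```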
3.model_bin
ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
"roberta-base": "https://cdn.huggingface.co/roberta-base-pytorch_model.bin",
"roberta-large": "https://cdn.huggingface.co/roberta-large-pytorch_model.bin",
"roberta-large-mnli": "https://cdn.huggingface.co/roberta-large-mnli-pytorch_model.bin",
"distilroberta-base": "https://cdn.huggingface.co/distilroberta-base-pytorch_model.bin",
"roberta-base-openai-detector": "https://cdn.huggingface.co/roberta-base-openai-detector-pytorch_model.bin",
"roberta-large-openai-detector": "https://cdn.huggingface.co/roberta-large-openai-detector-pytorch_model.bin",
}
Other notes
Download URLs for all transformers models:
huggingface-transformers: viewing model download URLs
Note that the maximum input length (max_len) is 512 tokens for both BERT and RoBERTa.
BERT must decide whether to lowercase input words based on whether the checkpoint is cased or uncased. RoBERTa tokenizes with the BPE algorithm and its pretraining did not fold case; casing does not change how BPE tokenizes the text, so no lowercasing is needed.
References:
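The casing rule can be encoded as a small predicate: lowercase input only for BERT "uncased" checkpoints, never for RoBERTa. A sketch under that assumption (the helper name is made up for illustration):

```python
def should_lower_case(model_name: str) -> bool:
    """Decide whether to lowercase input text for a given checkpoint name.

    BERT: lowercase only for "uncased" checkpoints, which were pretrained
    on lowercased text. RoBERTa: its BPE tokenizer is case-preserving and
    pretraining kept the original case, so never lowercase.
    """
    if "roberta" in model_name:
        return False
    return "uncased" in model_name
```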
https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/150007
https://blog.csdn.net/weixin_44287209/article/details/108792846
https://zhuanlan.zhihu.com/p/147534390