Problem
- After fine-tuning a base model, the checkpoint may be saved to a directory that differs from the original pretrained model's location, and that checkpoint directory contains no tokenizer files. Attempting to use SentenceTransformer to load the model from the checkpoint directory while loading the tokenizer from the original pretrained location fails with:

```
If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './models/checkpoint-22600' is the correct path to a directory containing all relevant files for a BertTokenizerFast tokenizer.
```
- The apparent cause: the model path and the tokenizer path need to be specified separately, like so:

```python
model = SentenceTransformer(
    model_dir,
    device='cuda:7',
    tokenizer_kwargs={
        'trust_remote_code': True,
        'tokenizer_name_or_path': tokenizer_dir,
        'local_files_only': False,
    },
)
```
Under the hood, SentenceTransformer calls AutoModel.from_pretrained and AutoTokenizer.from_pretrained to load the model and the tokenizer:

```python
# See line 59 of sentence_transformers' Transformer.py source:
self.tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path,
    cache_dir=cache_dir,
    **tokenizer_args,
)
```
However, this still raised the original error.
Problem analysis
- Inspection shows that unpacking `**tokenizer_args` does not override the positional argument `tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path`: the `tokenizer_name_or_path` key inside `tokenizer_args` arrives as a keyword argument, while the path actually used is the positional expression, which falls back to the model path because the module-level `tokenizer_name_or_path` parameter is still None. The tokenizer is therefore loaded from the model directory and fails. (If the checkpoint directory did contain tokenizer files, there would be no problem.)
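That a key inside a `**`-unpacked dict cannot replace a value already bound to a positional parameter can be reproduced without transformers at all. A minimal sketch with a stand-in for `AutoTokenizer.from_pretrained` (the base-model path is a hypothetical placeholder):

```python
def from_pretrained_stub(pretrained_name_or_path, **tokenizer_args):
    # Stand-in for AutoTokenizer.from_pretrained: report the path it would load from.
    return pretrained_name_or_path

model_name_or_path = './models/checkpoint-22600'  # fine-tuned weights, no tokenizer files
tokenizer_name_or_path = None                     # the module-level parameter stays None
tokenizer_args = {'tokenizer_name_or_path': './models/base-model'}  # user-supplied kwargs

# Same call shape as Transformer.py line 59: the positional expression wins, and the
# 'tokenizer_name_or_path' key inside tokenizer_args is swallowed by **tokenizer_args.
used = from_pretrained_stub(
    tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path,
    **tokenizer_args,
)
print(used)  # → ./models/checkpoint-22600 (the model directory, hence the error)
```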
- Solution: before the line-59 call quoted above, assign `tokenizer_name_or_path` explicitly instead of relying on the unpacking to override it:

```python
# Before the from_pretrained call at line 59, read the path out of
# tokenizer_args directly rather than relying on **-unpacking:
tokenizer_name_or_path = tokenizer_args.get('tokenizer_name_or_path')
```
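One caveat about this patch: `dict.get` reads the value but leaves the key in `tokenizer_args`, so the subsequent `**tokenizer_args` unpacking still forwards `tokenizer_name_or_path` to `AutoTokenizer.from_pretrained` as a stray keyword argument. If that turns out to matter, `dict.pop` reads and removes the key in one step. A small sketch (the paths mirror the hypothetical placeholders used earlier):

```python
tokenizer_args = {'tokenizer_name_or_path': './models/base-model',
                  'trust_remote_code': True}

# .get keeps the key, so it would still be forwarded via **tokenizer_args afterwards.
path = tokenizer_args.get('tokenizer_name_or_path')
assert 'tokenizer_name_or_path' in tokenizer_args

# .pop returns the same value and removes the key, leaving only the real kwargs.
path = tokenizer_args.pop('tokenizer_name_or_path')
print(path)            # → ./models/base-model
print(tokenizer_args)  # → {'trust_remote_code': True}
```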