1. AutoTokenizer.from_pretrained(model_path, use_fast=False)
1.1. First, read the config file model_path/tokenizer_config.json.
The slow-tokenizer file names are defined in transformers/tokenization_utils_base.py:
# Slow tokenizers used to be saved in three separated files
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
ADDED_TOKENS_FILE = "added_tokens.json"
TOKENIZER_CONFIG_FILE = "tokenizer_config.json"
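A minimal sketch, assuming model_path is a hypothetical local checkout, to check which of these three files are actually present:

import os
from transformers.tokenization_utils_base import (
    ADDED_TOKENS_FILE,
    SPECIAL_TOKENS_MAP_FILE,
    TOKENIZER_CONFIG_FILE,
)

model_path = "model_path"  # placeholder local directory
for name in (SPECIAL_TOKENS_MAP_FILE, ADDED_TOKENS_FILE, TOKENIZER_CONFIG_FILE):
    print(name, os.path.isfile(os.path.join(model_path, name)))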
Then transformers/models/auto/tokenization_auto.py reads it:
commit_hash = kwargs.get("_commit_hash", None)
resolved_config_file = cached_file(
    pretrained_model_name_or_path,
    TOKENIZER_CONFIG_FILE,
    cache_dir=cache_dir,
    force_download=force_download,
    resume_download=resume_download,
    proxies=proxies,
    use_auth_token=use_auth_token,
    revision=revision,
    local_files_only=local_files_only,
    subfolder=subfolder,
    _raise_exceptions_for_missing_entries=False,
    _raise_exceptions_for_connection_errors=False,
    _commit_hash=commit_hash,
)
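The same lookup is exposed through the module's get_tokenizer_config helper; a quick sketch (model_path is a placeholder):

from transformers.models.auto.tokenization_auto import get_tokenizer_config

config = get_tokenizer_config("model_path")
print(config.get("tokenizer_class"))  # e.g. "LlamaTokenizer"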
cached_file is defined in transformers/utils/hub.py:
def cached_file(
    path_or_repo_id: Union[str, os.PathLike],
    filename: str,
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    resume_download: bool = False,
    proxies: Optional[Dict[str, str]] = None,
    use_auth_token: Optional[Union[bool, str]] = None,
    revision: Optional[str] = None,
    local_files_only: bool = False,
    subfolder: str = "",
    repo_type: Optional[str] = None,
    user_agent: Optional[Union[str, Dict[str, str]]] = None,
    _raise_exceptions_for_missing_entries: bool = True,
    _raise_exceptions_for_connection_errors: bool = True,
    _commit_hash: Optional[str] = None,
):
    if is_offline_mode() and not local_files_only:
        logger.info("Offline mode: forcing local_files_only=True")
        local_files_only = True
    if subfolder is None:
        subfolder = ""

    path_or_repo_id = str(path_or_repo_id)
    full_filename = os.path.join(subfolder, filename)
    if os.path.isdir(path_or_repo_id):
        resolved_file = os.path.join(os.path.join(path_or_repo_id, subfolder), filename)
        if not os.path.isfile(resolved_file):
            if _raise_exceptions_for_missing_entries:
                raise EnvironmentError(
                    f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
                    f"'https://huggingface.co/{path_or_repo_id}/{revision}' for available files."
                )
            else:
                return None
        return resolved_file
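For a local model directory, the branch above reduces to a path join plus an existence check; a minimal sketch of just that behavior (not the full hub download logic):

import os

def resolve_local_file(path_or_repo_id: str, filename: str, subfolder: str = ""):
    # Mirrors cached_file's local-directory branch: join the path, then check existence.
    resolved = os.path.join(path_or_repo_id, subfolder, filename)
    return resolved if os.path.isfile(resolved) else None

print(resolve_local_file("model_path", "tokenizer_config.json"))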
So the AutoTokenizer configuration comes from model_path/tokenizer_config.json. One of the values defined there is:
"tokenizer_class": "LlamaTokenizer",
transformers/models/auto/tokenization_auto.py then resolves this name to a class:
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
def tokenizer_class_from_name(class_name: str):
    if class_name == "PreTrainedTokenizerFast":
        return PreTrainedTokenizerFast

    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers:
            module_name = model_type_to_module_name(module_name)
            module = importlib.import_module(f".{module_name}", "transformers.models")
            try:
                return getattr(module, class_name)
            except AttributeError:
                continue
    return None
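The dynamic import it performs can be reproduced directly; a small sketch (assumes transformers with sentencepiece installed):

import importlib

module = importlib.import_module(".llama", "transformers.models")
tokenizer_cls = getattr(module, "LlamaTokenizer")
print(tokenizer_cls)  # <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>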
The class returned is <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>.
transformers/models/auto/tokenization_auto.py then calls:
if tokenizer_class is None:
    tokenizer_class_candidate = config_tokenizer_class
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
Next, in PreTrainedTokenizerBase.from_pretrained (transformers/tokenization_utils_base.py), the file names to fetch are assembled:
# At this point pretrained_model_name_or_path is either a directory or a model identifier name
additional_files_names = {
    "added_tokens_file": ADDED_TOKENS_FILE,
    "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
    "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
}
vocab_files = {**cls.vocab_files_names, **additional_files_names}
where cls is <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>, so:
vocab_files = {
    'vocab_file': 'tokenizer.model',
    'added_tokens_file': 'added_tokens.json',
    'special_tokens_map_file': 'special_tokens_map.json',
    'tokenizer_config_file': 'tokenizer_config.json'
}
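The 'vocab_file': 'tokenizer.model' entry comes from the class attribute, which can be inspected directly:

from transformers import LlamaTokenizer

print(LlamaTokenizer.vocab_files_names)  # {'vocab_file': 'tokenizer.model'}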
Then the absolute path of each of these files is resolved:
resolved_vocab_files = {
    'vocab_file': 'model_path/tokenizer.model',
    'added_tokens_file': None,
    'special_tokens_map_file': 'model_path/special_tokens_map.json',
    'tokenizer_config_file': 'model_path/tokenizer_config.json'
}
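A minimal sketch of this resolution step, simplified to the local-directory case (the real code also handles hub downloads and caching); missing files resolve to None, as added_tokens.json does here:

import os

model_path = "model_path"  # placeholder local directory
vocab_files = {
    "vocab_file": "tokenizer.model",
    "added_tokens_file": "added_tokens.json",
    "special_tokens_map_file": "special_tokens_map.json",
    "tokenizer_config_file": "tokenizer_config.json",
}
resolved_vocab_files = {}
for name, fn in vocab_files.items():
    path = os.path.join(model_path, fn)
    resolved_vocab_files[name] = path if os.path.isfile(path) else None
print(resolved_vocab_files)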
These are then parsed via cls._from_pretrained:
return cls._from_pretrained(
    resolved_vocab_files,
    pretrained_model_name_or_path,
    init_configuration,
    *init_inputs,
    use_auth_token=token,
    cache_dir=cache_dir,
    local_files_only=local_files_only,
    _commit_hash=commit_hash,
    _is_local=is_local,
    **kwargs,
)
where cls is <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>, which is then initialized with the following parameters:
{
    "add_bos_token": true,
    "add_eos_token": false,
    "bos_token": "<s>",
    "clean_up_tokenization_spaces": false,
    "eos_token": "</s>",
    "legacy": false,
    "model_max_length": 2048,
    "pad_token": null,
    "padding_side": "right",
    "sp_model_kwargs": {},
    "unk_token": "<unk>",
    "vocab_file": "./work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1/tokenizer.model",
    "special_tokens_map_file": "./work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1/special_tokens_map.json",
    "name_or_path": "./work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1"
}
try:
    tokenizer = cls(*init_inputs, **init_kwargs)
except OSError:
    raise OSError(
        "Unable to load vocabulary from file. "
        "Please check that the provided vocabulary is accessible and not corrupted."
    )
During initialization, transformers/models/llama/tokenization_llama.py loads the vocabulary with sentencepiece:
import sentencepiece as spm
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
where vocab_file is the resolved 'model_path/tokenizer.model'.
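What happens inside can be reproduced with sentencepiece alone; a sketch assuming model_path/tokenizer.model exists:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("model_path/tokenizer.model")
print(sp.GetPieceSize())            # vocab size, 32000 for LLaMA
ids = sp.EncodeAsIds("Hello world")
print(ids, sp.DecodeIds(ids))       # token ids and the round-tripped text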
Finally, a tokenizer object is returned:
LlamaTokenizer(
    name_or_path='./work_dirs/llama-vid/llama-vid-7b-full-224-video-fps-1',
    vocab_size=32000,
    model_max_length=2048,
    is_fast=False,
    padding_side='right',
    truncation_side='right',
    special_tokens={
        'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False),
        'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False),
        'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False),
        'pad_token': '<unk>'
    },
    clean_up_tokenization_spaces=False
)
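A quick usage check of the returned tokenizer (the path is a placeholder; with add_bos_token=True every encoding starts with <s>, id 1 for LLaMA):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_path", use_fast=False)
ids = tokenizer("Hello world").input_ids   # starts with bos id 1 (<s>)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))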
2. LlavaLlamaAttForCausalLM.from_pretrained
First, the config is loaded in PreTrainedModel.from_pretrained (transformers/modeling_utils.py):
if not isinstance(config, PretrainedConfig):
    config_path = config if config is not None else pretrained_model_name_or_path
    config, model_kwargs = cls.config_class.from_pretrained(
        config_path,
        cache_dir=cache_dir,
        return_unused_kwargs=True,
        force_download=force_download,
        resume_download=resume_download,
        proxies=proxies,
        local_files_only=local_files_only,
        token=token,
        revision=revision,
        subfolder=subfolder,
        _from_auto=from_auto_class,
        _from_pipeline=from_pipeline,
        **kwargs,
    )
else:
    model_kwargs = kwargs
cls: <class 'llamavid.model.language_model.llava_llama_vid.LlavaLlamaAttForCausalLM'>
cls.config_class: <class 'llamavid.model.language_model.llava_llama_vid.LlavaConfig'>
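For reference, such a config_class usually follows the standard custom-config pattern; a sketch of how it is typically declared (assumption: LLaMA-VID subclasses LlamaConfig like upstream LLaVA, and the exact model_type string may differ):

from transformers import LlamaConfig

class LlavaConfig(LlamaConfig):
    # A distinct model_type lets AutoConfig route to this class once registered.
    model_type = "llava"  # assumed value; LLaMA-VID may use its own tag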
Then transformers/configuration_utils.py fetches model_path/config.json:
else:
    configuration_file = kwargs.pop("_configuration_file", CONFIG_NAME)

try:
    # Load from local folder or from cache or download from model Hub and cache
    resolved_config_file = cached_file(
        pretrained_model_name_or_path, configuration_file, cache_dir=cache_dir, ...
    )
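Once resolved, config.json is just JSON; a sketch of inspecting it directly (model_path is a placeholder):

import json, os

with open(os.path.join("model_path", "config.json")) as f:
    config_dict = json.load(f)
print(config_dict.get("model_type"), config_dict.get("architectures"))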