How the LMDeploy Project Determines Whether a Model Is Supported by the TurboMind Engine

Core Logic

The project uses the is_supported() function in lmdeploy/turbomind/supported_models.py to determine whether a model is supported by the TurboMind engine. 1
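For reference, a minimal usage sketch of this check is shown below (it assumes lmdeploy is installed; the model id is only illustrative):

from lmdeploy.turbomind.supported_models import is_supported

# The model id below is illustrative; any local path or HF repo id works.
if is_supported('internlm/internlm2_5-7b-chat'):
    print('TurboMind can serve this model')
else:
    print('Fall back to the PyTorch engine')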

Decision Flow

  1. Check for a TurboMind workspace model: first check whether a triton_models directory exists under the model path; if it does, the model is immediately considered supported. 2

  2. Check the model architecture: obtain the architecture via get_model_arch() and look it up in the SUPPORTED_ARCHS dictionary. 3

  3. Check the quantization method: models quantized with smooth_quant are not supported. 4

  4. Additional architecture-specific checks (a condensed sketch of the whole decision flow follows this list):

    • Baichuan models: the 13B variants (40 attention heads) are not supported 5
    • Qwen2/Llama models: head_dim must be 128 or 64 6
    • ChatGLM models: only the 40-layer variant is supported, and variants with a vision config are not 7
    • InternVL models: the architecture and head_dim of the underlying LLM are checked 8
    • Molmo models: the config must contain num_key_value_heads 9
    • DeepSeek-V2 models: variants with a vision config are not supported 10
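The sketch below condenses this flow into one function. It is not the verbatim implementation: the import path of get_model_arch follows lmdeploy/archs.py as cited later, its (architecture, config) return shape is inferred from how supported_models.py uses it, and the architecture-specific rules from step 4 are only summarized in comments.

import os

from lmdeploy.archs import get_model_arch  # assumed import path
from lmdeploy.turbomind.supported_models import SUPPORTED_ARCHS


def is_supported_sketch(model_path: str) -> bool:
    """Simplified mirror of is_supported(); see the citations for the real checks."""
    # 1) A converted TurboMind workspace (contains `triton_models/`) is always supported.
    if os.path.exists(os.path.join(model_path, 'triton_models')):
        return True
    # 2) The HF architecture must be listed in SUPPORTED_ARCHS.
    arch, cfg = get_model_arch(model_path)
    if arch not in SUPPORTED_ARCHS:
        return False
    # 3) Checkpoints quantized with smooth_quant are rejected (quant_method check).
    # 4) Architecture-specific rules follow in the real code: Baichuan-13B,
    #    head_dim in {64, 128}, ChatGLM layer count and vision config,
    #    InternVL LLM architecture, Molmo num_key_value_heads,
    #    DeepSeek-V2 vision config, ...
    return True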

Usage in the System

This check is called in several places:

  1. Automatic backend selection: the autoget_backend() function uses it to decide whether to use the TurboMind engine (see the example after this list) 11

  2. TurboMind initialization: TurboMind._from_hf() asserts that the model must be supported 12

  3. Test coverage: tests/test_lmdeploy/test_auto_backend.py contains test cases that verify this behavior 13
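A small example of the automatic selection path, assuming autoget_backend is importable from lmdeploy.archs as the citation below suggests (the model id is illustrative):

from lmdeploy.archs import autoget_backend

backend = autoget_backend('internlm/internlm2_5-7b-chat')
print(backend)  # 'turbomind' if the model is supported, otherwise 'pytorch'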

Notes

The list of supported architectures is defined in the SUPPORTED_ARCHS dictionary and covers mainstream model families such as Llama, InternLM, Qwen, Baichuan, and Mixtral. 3 The full list of supported models is documented in docs/zh_cn/supported_models/supported_models.md. 14
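As a quick sanity check, the dictionary can be queried directly; the architecture name below is taken from the SUPPORTED_ARCHS citation:

from lmdeploy.turbomind.supported_models import SUPPORTED_ARCHS

# Map an HF `architectures[0]` entry to the TurboMind model family it is registered as.
print(SUPPORTED_ARCHS.get('Qwen3ForCausalLM', 'not supported'))  # -> 'qwen3'
print(SUPPORTED_ARCHS.get('SomeUnknownArch', 'not supported'))   # -> 'not supported'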

Citations

File: lmdeploy/turbomind/supported_models.py (L7-59)

SUPPORTED_ARCHS = dict(
    # baichuan-7b
    BaiChuanForCausalLM='baichuan',
    # baichuan2-7b, baichuan-13b, baichuan2-13b
    BaichuanForCausalLM='baichuan2',
    # gpt-oss
    GptOssForCausalLM='gpt-oss',
    # internlm
    InternLMForCausalLM='llama',
    # internlm2
    InternLM2ForCausalLM='internlm2',
    # internlm3
    InternLM3ForCausalLM='llama',
    # llama, llama2, alpaca, vicuna, codellama, ultracm, yi,
    # deepseek-coder, deepseek-llm
    LlamaForCausalLM='llama',
    # Qwen 7B-72B, Qwen-VL-7B
    QWenLMHeadModel='qwen',
    # Qwen2
    Qwen2ForCausalLM='qwen2',
    Qwen2MoeForCausalLM='qwen2-moe',
    # Qwen2-VL
    Qwen2VLForConditionalGeneration='qwen2',
    # Qwen2.5-VL
    Qwen2_5_VLForConditionalGeneration='qwen2',
    # Qwen3
    Qwen3ForCausalLM='qwen3',
    Qwen3MoeForCausalLM='qwen3-moe',
    # mistral
    MistralForCausalLM='llama',
    # llava
    LlavaLlamaForCausalLM='llama',
    LlavaMistralForCausalLM='llama',
    LlavaForConditionalGeneration='llava',
    # xcomposer2
    InternLMXComposer2ForCausalLM='xcomposer2',
    # internvl
    InternVLChatModel='internvl',
    # internvl3
    InternVLForConditionalGeneration='internvl',
    InternS1ForConditionalGeneration='internvl',
    # deepseek-vl
    MultiModalityCausalLM='deepseekvl',
    DeepseekV2ForCausalLM='deepseek2',
    # MiniCPMV
    MiniCPMV='minicpmv',
    # chatglm2/3, glm4
    ChatGLMModel='glm4',
    ChatGLMForConditionalGeneration='glm4',
    # mixtral
    MixtralForCausalLM='mixtral',
    MolmoForCausalLM='molmo',
)

File: lmdeploy/turbomind/supported_models.py (L62-81)

def is_supported(model_path: str):
    """Check whether supported by turbomind engine.

    Args:
        model_path (str): the path of a model.
            It could be one of the following options:
                - i) A local directory path of a turbomind model which is
                    converted by `lmdeploy convert` command or download from
                    ii) and iii).
                - ii) The model_id of a lmdeploy-quantized model hosted
                    inside a model repo on huggingface.co, such as
                    "InternLM/internlm-chat-20b-4bit",
                    "lmdeploy/llama2-chat-70b-4bit", etc.
                - iii) The model_id of a model hosted inside a model repo
                    on huggingface.co, such as "internlm/internlm-chat-7b",
                    "Qwen/Qwen-7B-Chat ", "baichuan-inc/Baichuan2-7B-Chat"
                    and so on.
    Returns:
        support_by_turbomind (bool): Whether input model is supported by turbomind engine
    """  # noqa: E501

File: lmdeploy/turbomind/supported_models.py (L89-91)

    triton_model_path = os.path.join(model_path, 'triton_models')
    if os.path.exists(triton_model_path):
        support_by_turbomind = True

File: lmdeploy/turbomind/supported_models.py (L95-98)

        quant_method = search_nested_config(cfg.to_dict(), 'quant_method')
        if quant_method and quant_method in ['smooth_quant']:
            # tm hasn't support quantized models by applying smoothquant
            return False

File: lmdeploy/turbomind/supported_models.py (L103-107)

            if arch == 'BaichuanForCausalLM':
                num_attn_head = cfg.num_attention_heads
                if num_attn_head == 40:
                    # baichuan-13B, baichuan2-13B not supported by turbomind
                    support_by_turbomind = False

File: lmdeploy/turbomind/supported_models.py (L108-109)

            elif arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
                support_by_turbomind = _is_head_dim_supported(cfg)

File: lmdeploy/turbomind/supported_models.py (L110-115)

            elif arch in ('ChatGLMModel', 'ChatGLMForConditionalGeneration'):
                # chatglm1/2/3 is not working yet
                support_by_turbomind = cfg.num_layers == 40
                if getattr(cfg, 'vision_config', None) is not None:
                    # glm-4v-9b not supported
                    support_by_turbomind = False

File: lmdeploy/turbomind/supported_models.py (L116-122)

            elif arch == 'InternVLChatModel':
                llm_arch = cfg.llm_config.architectures[0]
                support_by_turbomind = (llm_arch in SUPPORTED_ARCHS and _is_head_dim_supported(cfg.llm_config))
            elif arch in ['LlavaForConditionalGeneration', 'InternVLForConditionalGeneration']:
                llm_arch = cfg.text_config.architectures[0]
                if llm_arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
                    support_by_turbomind = _is_head_dim_supported(cfg.text_config)

File: lmdeploy/turbomind/supported_models.py (L123-126)

            elif arch == 'MolmoForCausalLM':
                kv_heads = cfg.num_key_value_heads
                # TM hasn't supported allenai/Molmo-7B-O-0924 yet
                support_by_turbomind = kv_heads is not None

File: lmdeploy/turbomind/supported_models.py (L127-129)

            elif arch == 'DeepseekV2ForCausalLM':
                if getattr(cfg, 'vision_config', None) is not None:
                    support_by_turbomind = False

File: lmdeploy/archs.py (L38-54)

        from lmdeploy.turbomind.supported_models import is_supported as is_supported_turbomind
        turbomind_has = is_supported_turbomind(model_path)
    except ImportError:
        is_turbomind_installed = False

    if is_turbomind_installed:
        if not turbomind_has:
            logger.warning('Fallback to pytorch engine because '
                           f'`{model_path}` not supported by turbomind'
                           ' engine.')
    else:
        logger.warning('Fallback to pytorch engine because turbomind engine is not '
                       'installed correctly. If you insist to use turbomind engine, '
                       'you may need to reinstall lmdeploy from pypi or build from '
                       'source and try again.')

    backend = 'turbomind' if turbomind_has else 'pytorch'

File: lmdeploy/turbomind/turbomind.py (L260-261)

        assert is_supported(model_path), (f'turbomind does not support {model_path}. '
                                          'Plz try pytorch engine instead.')

File: tests/test_lmdeploy/test_auto_backend.py (L41-45)

    def test_turbomind_is_supported(self, turbomind_workspace, models):
        from lmdeploy.turbomind.supported_models import is_supported
        assert is_supported(turbomind_workspace) is True
        for m, flag in models:
            assert is_supported(m) is flag

File: docs/zh_cn/supported_models/supported_models.md (L1-57)

Supported Models

The following lists the models supported by the LMDeploy TurboMind engine and PyTorch engine on different software and hardware platforms.

TurboMind CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.2[2] | 1B, 3B | LLM | Yes | Yes* | Yes* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2 | 7B, 4khd-7B | MLLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Intern-S1 | 241B | MLLM | Yes | Yes | Yes | No |
| Intern-S1-mini | 8.3B | MLLM | Yes | Yes | Yes | No |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5[1] | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
| Qwen2[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes* | Yes |
| Qwen2.5[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Mistral[1] | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | LLM | Yes | Yes | Yes | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL2.5(MPO)[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3.5[3] | 1 - 241BA28B | MLLM | Yes | Yes* | Yes* | No |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
| GLM4 | 9B | LLM | Yes | Yes | Yes | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - |
| Molmo | 7B-D,72B | MLLM | Yes | Yes | Yes | No |
| gpt-oss | 20B,120B | LLM | Yes | Yes | Yes | Yes |

"-" means not verified yet.

* [1] The turbomind engine does not support window attention. For models that use window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral and Qwen1.5, please choose the pytorch engine for inference.
* [2] When a model's head_dim is not 128, turbomind does not support 4/8-bit quantization of its kv cache, nor inference with it, e.g. llama3.2-1B, qwen2-0.5B, internvl2-1B, and so on.