How the LMDeploy Project Determines Whether a Model Is Supported by the TurboMind Engine

Core Logic

The project uses the is_supported() function in lmdeploy/turbomind/supported_models.py to determine whether a model is supported by the TurboMind engine. 1
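For reference, a minimal usage sketch of this check is shown below (it assumes lmdeploy is installed; the model id is only illustrative):

from lmdeploy.turbomind.supported_models import is_supported

# The model id below is illustrative; any local path or HF repo id works.
if is_supported('internlm/internlm2_5-7b-chat'):
    print('TurboMind can serve this model')
else:
    print('Fall back to the PyTorch engine')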

Decision Flow

  1. Check for a TurboMind workspace model: first check whether a triton_models directory exists under the model path; if it does, the model is immediately considered supported. 2

  2. Check the model architecture: obtain the architecture via get_model_arch() and look it up in the SUPPORTED_ARCHS dictionary. 3

  3. Check the quantization method: models quantized with smooth_quant are not supported. 4

  4. Additional architecture-specific checks (a condensed sketch of the whole decision flow follows this list):

    • Baichuan models: the 13B variants (40 attention heads) are not supported 5
    • Qwen2/Llama models: head_dim must be 128 or 64 6
    • ChatGLM models: only the 40-layer variant is supported, and variants with a vision config are not 7
    • InternVL models: the architecture and head_dim of the underlying LLM are checked 8
    • Molmo models: the config must contain num_key_value_heads 9
    • DeepSeek-V2 models: variants with a vision config are not supported 10
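The sketch below condenses this flow into one function. It is not the verbatim implementation: the import path of get_model_arch follows lmdeploy/archs.py as cited later, its (architecture, config) return shape is inferred from how supported_models.py uses it, and the architecture-specific rules from step 4 are only summarized in comments.

import os

from lmdeploy.archs import get_model_arch  # assumed import path
from lmdeploy.turbomind.supported_models import SUPPORTED_ARCHS


def is_supported_sketch(model_path: str) -> bool:
    """Simplified mirror of is_supported(); see the citations for the real checks."""
    # 1) A converted TurboMind workspace (contains `triton_models/`) is always supported.
    if os.path.exists(os.path.join(model_path, 'triton_models')):
        return True
    # 2) The HF architecture must be listed in SUPPORTED_ARCHS.
    arch, cfg = get_model_arch(model_path)
    if arch not in SUPPORTED_ARCHS:
        return False
    # 3) Checkpoints quantized with smooth_quant are rejected (quant_method check).
    # 4) Architecture-specific rules follow in the real code: Baichuan-13B,
    #    head_dim in {64, 128}, ChatGLM layer count and vision config,
    #    InternVL LLM architecture, Molmo num_key_value_heads,
    #    DeepSeek-V2 vision config, ...
    return True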

Usage in the System

This check is called in several places:

  1. Automatic backend selection: the autoget_backend() function uses it to decide whether to use the TurboMind engine (see the example after this list) 11

  2. TurboMind initialization: TurboMind._from_hf() asserts that the model must be supported 12

  3. Test coverage: tests/test_lmdeploy/test_auto_backend.py contains test cases that verify this behavior 13
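A small example of the automatic selection path, assuming autoget_backend is importable from lmdeploy.archs as the citation below suggests (the model id is illustrative):

from lmdeploy.archs import autoget_backend

backend = autoget_backend('internlm/internlm2_5-7b-chat')
print(backend)  # 'turbomind' if the model is supported, otherwise 'pytorch'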

Notes

The list of supported architectures is defined in the SUPPORTED_ARCHS dictionary and covers mainstream model families such as Llama, InternLM, Qwen, Baichuan, and Mixtral. 3 The full list of supported models is documented in docs/zh_cn/supported_models/supported_models.md. 14
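As a quick sanity check, the dictionary can be queried directly; the architecture name below is taken from the SUPPORTED_ARCHS citation:

from lmdeploy.turbomind.supported_models import SUPPORTED_ARCHS

# Map an HF `architectures[0]` entry to the TurboMind model family it is registered as.
print(SUPPORTED_ARCHS.get('Qwen3ForCausalLM', 'not supported'))  # -> 'qwen3'
print(SUPPORTED_ARCHS.get('SomeUnknownArch', 'not supported'))   # -> 'not supported'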

Citations

File: lmdeploy/turbomind/supported_models.py (L7-59)

SUPPORTED_ARCHS = dict(
    # baichuan-7b
    BaiChuanForCausalLM='baichuan',
    # baichuan2-7b, baichuan-13b, baichuan2-13b
    BaichuanForCausalLM='baichuan2',
    # gpt-oss
    GptOssForCausalLM='gpt-oss',
    # internlm
    InternLMForCausalLM='llama',
    # internlm2
    InternLM2ForCausalLM='internlm2',
    # internlm3
    InternLM3ForCausalLM='llama',
    # llama, llama2, alpaca, vicuna, codellama, ultracm, yi,
    # deepseek-coder, deepseek-llm
    LlamaForCausalLM='llama',
    # Qwen 7B-72B, Qwen-VL-7B
    QWenLMHeadModel='qwen',
    # Qwen2
    Qwen2ForCausalLM='qwen2',
    Qwen2MoeForCausalLM='qwen2-moe',
    # Qwen2-VL
    Qwen2VLForConditionalGeneration='qwen2',
    # Qwen2.5-VL
    Qwen2_5_VLForConditionalGeneration='qwen2',
    # Qwen3
    Qwen3ForCausalLM='qwen3',
    Qwen3MoeForCausalLM='qwen3-moe',
    # mistral
    MistralForCausalLM='llama',
    # llava
    LlavaLlamaForCausalLM='llama',
    LlavaMistralForCausalLM='llama',
    LlavaForConditionalGeneration='llava',
    # xcomposer2
    InternLMXComposer2ForCausalLM='xcomposer2',
    # internvl
    InternVLChatModel='internvl',
    # internvl3
    InternVLForConditionalGeneration='internvl',
    InternS1ForConditionalGeneration='internvl',
    # deepseek-vl
    MultiModalityCausalLM='deepseekvl',
    DeepseekV2ForCausalLM='deepseek2',
    # MiniCPMV
    MiniCPMV='minicpmv',
    # chatglm2/3, glm4
    ChatGLMModel='glm4',
    ChatGLMForConditionalGeneration='glm4',
    # mixtral
    MixtralForCausalLM='mixtral',
    MolmoForCausalLM='molmo',
)

File: lmdeploy/turbomind/supported_models.py (L62-81)

def is_supported(model_path: str):
    """Check whether supported by turbomind engine.

    Args:
        model_path (str): the path of a model.
            It could be one of the following options:
                - i) A local directory path of a turbomind model which is
                    converted by `lmdeploy convert` command or download from
                    ii) and iii).
                - ii) The model_id of a lmdeploy-quantized model hosted
                    inside a model repo on huggingface.co, such as
                    "InternLM/internlm-chat-20b-4bit",
                    "lmdeploy/llama2-chat-70b-4bit", etc.
                - iii) The model_id of a model hosted inside a model repo
                    on huggingface.co, such as "internlm/internlm-chat-7b",
                    "Qwen/Qwen-7B-Chat ", "baichuan-inc/Baichuan2-7B-Chat"
                    and so on.
    Returns:
        support_by_turbomind (bool): Whether input model is supported by turbomind engine
    """  # noqa: E501

File: lmdeploy/turbomind/supported_models.py (L89-91)

    triton_model_path = os.path.join(model_path, 'triton_models')
    if os.path.exists(triton_model_path):
        support_by_turbomind = True

File: lmdeploy/turbomind/supported_models.py (L95-98)

        quant_method = search_nested_config(cfg.to_dict(), 'quant_method')
        if quant_method and quant_method in ['smooth_quant']:
            # tm hasn't support quantized models by applying smoothquant
            return False

File: lmdeploy/turbomind/supported_models.py (L103-107)

            if arch == 'BaichuanForCausalLM':
                num_attn_head = cfg.num_attention_heads
                if num_attn_head == 40:
                    # baichuan-13B, baichuan2-13B not supported by turbomind
                    support_by_turbomind = False

File: lmdeploy/turbomind/supported_models.py (L108-109)

            elif arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
                support_by_turbomind = _is_head_dim_supported(cfg)

File: lmdeploy/turbomind/supported_models.py (L110-115)

            elif arch in ('ChatGLMModel', 'ChatGLMForConditionalGeneration'):
                # chatglm1/2/3 is not working yet
                support_by_turbomind = cfg.num_layers == 40
                if getattr(cfg, 'vision_config', None) is not None:
                    # glm-4v-9b not supported
                    support_by_turbomind = False

File: lmdeploy/turbomind/supported_models.py (L116-122)

            elif arch == 'InternVLChatModel':
                llm_arch = cfg.llm_config.architectures[0]
                support_by_turbomind = (llm_arch in SUPPORTED_ARCHS and _is_head_dim_supported(cfg.llm_config))
            elif arch in ['LlavaForConditionalGeneration', 'InternVLForConditionalGeneration']:
                llm_arch = cfg.text_config.architectures[0]
                if llm_arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
                    support_by_turbomind = _is_head_dim_supported(cfg.text_config)

File: lmdeploy/turbomind/supported_models.py (L123-126)

            elif arch == 'MolmoForCausalLM':
                kv_heads = cfg.num_key_value_heads
                # TM hasn't supported allenai/Molmo-7B-O-0924 yet
                support_by_turbomind = kv_heads is not None

File: lmdeploy/turbomind/supported_models.py (L127-129)

            elif arch == 'DeepseekV2ForCausalLM':
                if getattr(cfg, 'vision_config', None) is not None:
                    support_by_turbomind = False

File: lmdeploy/archs.py (L38-54)

        from lmdeploy.turbomind.supported_models import is_supported as is_supported_turbomind
        turbomind_has = is_supported_turbomind(model_path)
    except ImportError:
        is_turbomind_installed = False

    if is_turbomind_installed:
        if not turbomind_has:
            logger.warning('Fallback to pytorch engine because '
                           f'`{model_path}` not supported by turbomind'
                           ' engine.')
    else:
        logger.warning('Fallback to pytorch engine because turbomind engine is not '
                       'installed correctly. If you insist to use turbomind engine, '
                       'you may need to reinstall lmdeploy from pypi or build from '
                       'source and try again.')

    backend = 'turbomind' if turbomind_has else 'pytorch'

File: lmdeploy/turbomind/turbomind.py (L260-261)

        assert is_supported(model_path), (f'turbomind does not support {model_path}. '
                                          'Plz try pytorch engine instead.')

File: tests/test_lmdeploy/test_auto_backend.py (L41-45)

    def test_turbomind_is_supported(self, turbomind_workspace, models):
        from lmdeploy.turbomind.supported_models import is_supported
        assert is_supported(turbomind_workspace) is True
        for m, flag in models:
            assert is_supported(m) is flag

File: docs/zh_cn/supported_models/supported_models.md (L1-57)

Supported Models

The following lists the models supported by the LMDeploy TurboMind engine and PyTorch engine on different software and hardware platforms.

TurboMind CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | LLM | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.2[2] | 1B, 3B | LLM | Yes | Yes* | Yes* | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
| InternLM3 | 8B | LLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2 | 7B, 4khd-7B | MLLM | Yes | Yes | Yes | Yes |
| InternLM-XComposer2.5 | 7B | MLLM | Yes | Yes | Yes | Yes |
| Intern-S1 | 241B | MLLM | Yes | Yes | Yes | No |
| Intern-S1-mini | 8.3B | MLLM | Yes | Yes | Yes | No |
| Qwen | 1.8B - 72B | LLM | Yes | Yes | Yes | Yes |
| Qwen1.5[1] | 1.8B - 110B | LLM | Yes | Yes | Yes | Yes |
| Qwen2[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Qwen2-MoE | 57BA14B | LLM | Yes | Yes | Yes | Yes |
| Qwen3 | 0.6B - 235B | LLM | Yes | Yes | Yes* | Yes |
| Qwen2.5[2] | 0.5B - 72B | LLM | Yes | Yes* | Yes* | Yes |
| Mistral[1] | 7B | LLM | Yes | Yes | Yes | No |
| Mixtral | 8x7B, 8x22B | LLM | Yes | Yes | Yes | Yes |
| DeepSeek-V2 | 16B, 236B | LLM | Yes | Yes | Yes | No |
| DeepSeek-V2.5 | 236B | LLM | Yes | Yes | Yes | No |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | LLM | Yes | Yes | Yes | No |
| YI | 6B - 34B | LLM | Yes | Yes | Yes | Yes |
| LLaVA(1.5,1.6) | 7B - 34B | MLLM | Yes | Yes | Yes | Yes |
| InternVL | v1.1 - v1.5 | MLLM | Yes | Yes | Yes | Yes |
| InternVL2 | 1-2B, 8B - 76B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL2.5(MPO)[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3[2] | 1 - 78B | MLLM | Yes | Yes* | Yes* | Yes |
| InternVL3.5[3] | 1 - 241BA28B | MLLM | Yes | Yes* | Yes* | No |
| ChemVLM | 8B - 26B | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-Llama3-V-2_5 | - | MLLM | Yes | Yes | Yes | Yes |
| MiniCPM-V-2_6 | - | MLLM | Yes | Yes | Yes | Yes |
| GLM4 | 9B | LLM | Yes | Yes | Yes | Yes |
| CodeGeeX4 | 9B | LLM | Yes | Yes | Yes | - |
| Molmo | 7B-D,72B | MLLM | Yes | Yes | Yes | No |
| gpt-oss | 20B,120B | LLM | Yes | Yes | Yes | Yes |

"-" means not verified yet.

* [1] The turbomind engine does not support window attention. For models that use window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral and Qwen1.5, please choose the pytorch engine for inference.
* [2] When a model's head_dim is not 128, turbomind does not support 4/8-bit quantization of its kv cache, nor inference with it, e.g. llama3.2-1B, qwen2-0.5B, internvl2-1B, and so on.