Tencent Hunyuan-DiT: Online Demo and Deployment Notes

This article describes how to create a venv with the virtualenv package and use it to deploy and run the Hunyuan-DiT model, along with some practical notes.

1. Online Demo

The code from this article has been deployed on the Baidu PaddlePaddle AI Studio platform, so you can try the Hunyuan-DiT text-to-image model online.

Project link: Tencent Hunyuan-DiT online demo

2. Virtual Environment Setup

Original GitHub repository: https://github.com/Tencent/HunyuanDiT

2.1 conda environment

The official method uses conda to create the virtual environment:

git clone https://github.com/tencent/HunyuanDiT

cd HunyuanDiT

# 1. Prepare conda environment
conda env create -f environment.yml

# 2. Activate the environment
conda activate HunyuanDiT

# 3. Install pip dependencies
python -m pip install -r requirements.txt

# 4. (Optional) Install flash attention v2 for acceleration (requires CUDA 11.6 or above)
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3

The contents of environment.yml are shown below; note that the official setup targets Python 3.8.12.

name: HunyuanDiT
channels:
  - pytorch
  - nvidia
dependencies:
  - python=3.8.12
  - pytorch=1.13.1
  - pip

2.2 virtualenv environment

On Baidu AI Studio the conda commands above cannot be used to create a virtual environment, so this article uses the virtualenv package to create and run the environment instead.

In practice, deployment and inference work fine under Python 3.10.

The virtualenv setup commands are as follows:

git clone https://github.com/tencent/HunyuanDiT

cd HunyuanDiT

pip install -U virtualenv

python -m virtualenv venv

source venv/bin/activate

pip install --upgrade pip

pip install -r requirements.txt

pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
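
With the dependencies installed, a quick sanity check can confirm that PyTorch sees the GPU and that flash-attn was built correctly (a minimal sketch; flash_attn is only present if you ran the optional acceleration step):

# sanity_check.py - run inside the activated venv
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # optional; only present if the flash-attention install step was run
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; the 'fa' infer mode will be unavailable")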

3. Model Download

The model is large (about 45 GB in total) and is downloaded with the huggingface_hub[cli] tool.

The project's Python files expect the checkpoints under HunyuanDiT/ckpts by default, so be sure to download the model files into HunyuanDiT/ckpts to avoid "model not found" errors.

Model download commands:

python -m pip install "huggingface_hub[cli]"

cd ~/HunyuanDiT

mkdir ckpts

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
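
If you prefer to drive the download from Python instead of the CLI, huggingface_hub's snapshot_download covers the same task (a minimal sketch; the HF_ENDPOINT mirror is the same one set in the shell commands above):

# download_ckpts.py - Python alternative to the huggingface-cli command above
import os

# Set the mirror before importing huggingface_hub so the endpoint takes effect.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

# Download the full repository into HunyuanDiT/ckpts (run from ~/HunyuanDiT).
snapshot_download(repo_id="Tencent-Hunyuan/HunyuanDiT", local_dir="./ckpts")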

Model file listing (obtained with cd ~/HunyuanDiT/ckpts && du -ah):

2.0K    ./.gitattributes
16K     ./README.md
1.0K    ./dialoggen/openai/clip-vit-large-patch14-336/tokenizer_config.json
513K    ./dialoggen/openai/clip-vit-large-patch14-336/merges.txt
5.0K    ./dialoggen/openai/clip-vit-large-patch14-336/config.json
512     ./dialoggen/openai/clip-vit-large-patch14-336/special_tokens_map.json
2.2M    ./dialoggen/openai/clip-vit-large-patch14-336/tokenizer.json
1.6G    ./dialoggen/openai/clip-vit-large-patch14-336/tf_model.h5
843K    ./dialoggen/openai/clip-vit-large-patch14-336/vocab.json
512     ./dialoggen/openai/clip-vit-large-patch14-336/preprocessor_config.json
1.5K    ./dialoggen/openai/clip-vit-large-patch14-336/README.md
1.6G    ./dialoggen/openai/clip-vit-large-patch14-336/pytorch_model.bin
3.2G    ./dialoggen/openai/clip-vit-large-patch14-336
3.2G    ./dialoggen/openai
4.7G    ./dialoggen/model-00001-of-00004.safetensors
4.7G    ./dialoggen/model-00002-of-00004.safetensors
512     ./dialoggen/generation_config.json
251M    ./dialoggen/model-00004-of-00004.safetensors
1.0K    ./dialoggen/special_tokens_map.json
4.6G    ./dialoggen/model-00003-of-00004.safetensors
72K     ./dialoggen/model.safetensors.index.json
2.0K    ./dialoggen/config.json
1.5K    ./dialoggen/tokenizer_config.json
482K    ./dialoggen/tokenizer.model
18G     ./dialoggen
22K     ./Notice
291K    ./asset/mllm.png
500K    ./asset/radar.png
5.0M    ./asset/long text understanding.png
356K    ./asset/framework.png
72K     ./asset/logo.png
512     ./asset/chinese elements understanding.png
123K    ./asset/cover.png
6.3M    ./asset
2.9G    ./t2i/model/pytorch_model_module.pt
5.7G    ./t2i/model/pytorch_model_ema.pt
8.5G    ./t2i/model
512     ./t2i/tokenizer/special_tokens_map.json
1.0K    ./t2i/tokenizer/tokenizer_config.json
310K    ./t2i/tokenizer/vocab.txt
107K    ./t2i/tokenizer/vocab_org.txt
422K    ./t2i/tokenizer
3.7G    ./t2i/clip_text_encoder/pytorch_model.bin
1.0K    ./t2i/clip_text_encoder/config.json
3.7G    ./t2i/clip_text_encoder
1.0K    ./t2i/sdxl-vae-fp16-fix/config.json
320M    ./t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin
320M    ./t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.safetensors
639M    ./t2i/sdxl-vae-fp16-fix
14G     ./t2i/mt5/pytorch_model.bin
512     ./t2i/mt5/tokenizer_config.json
512     ./t2i/mt5/generation_config.json
4.2M    ./t2i/mt5/spiece.model
512     ./t2i/mt5/special_tokens_map.json
3.0K    ./t2i/mt5/README.md
1.0K    ./t2i/mt5/config.json
14G     ./t2i/mt5
27G     ./t2i
15K     ./LICENSE.txt
45G     .
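
A rough way to confirm the download is complete is to total the file sizes under ckpts and compare against the ~45 GB listed above (a minimal sketch, assuming the directory layout shown):

# check_size.py - rough completeness check for the downloaded checkpoints
import os

total = 0
for root, _, files in os.walk("./ckpts"):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))

print(f"Total size: {total / 1024**3:.1f} GiB (expect roughly 45 GiB)")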

4. Running the Model

4.1 Activate the virtual environment

Activate the conda environment:

cd HunyuanDiT

conda activate HunyuanDiT

Activate the venv environment:

cd HunyuanDiT

source venv/bin/activate

4.2 Run the model

The official project provides two ways to run the model: an interactive Gradio UI and a command-line mode.

Gradio UI commands:

# By default, we start a Chinese UI.
python app/hydit_app.py

# Using Flash Attention for acceleration.
python app/hydit_app.py --infer-mode fa

# You can disable the enhancement model if the GPU memory is insufficient.
# The enhancement will be unavailable until you restart the app without the `--no-enhance` flag. 
python app/hydit_app.py --no-enhance

# Start with English UI
python app/hydit_app.py --lang en

# Start a multi-turn T2I generation UI. 
# If your GPU memory is less than 32GB, use '--load-4bit' to enable 4-bit quantization, which requires at least 22GB of memory.
python app/multiTurnT2I_app.py

Command-line mode commands:

# Prompt Enhancement + Text-to-Image. Torch mode
python sample_t2i.py --prompt "渔舟唱晚"

# Only Text-to-Image. Torch mode
python sample_t2i.py --prompt "渔舟唱晚" --no-enhance

# Only Text-to-Image. Flash Attention mode
python sample_t2i.py --infer-mode fa --prompt "渔舟唱晚"

# Generate an image with other image sizes.
python sample_t2i.py --prompt "渔舟唱晚" --image-size 1280 768

# Prompt Enhancement + Text-to-Image. DialogGen loads with 4-bit quantization, but it may reduce performance.
python sample_t2i.py --prompt "渔舟唱晚" --load-4bit

5. Practical Notes

1. The model uses more than 16 GB of system RAM, mainly while loading the mt5 model; with insufficient RAM the process is killed automatically.

2. GPU memory usage (on a 32 GB V100) exceeds 24 GB, peaking close to 30 GB; this may be specific to the V100.

3. In the downloaded checkpoints, the path on line 50 of HunyuanDiT/ckpts/dialoggen/config.json must be edited by hand; otherwise the openai model is not found locally at startup and gets re-downloaded.

Change line 50 to "mm_vision_tower": "/home/aistudio/HunyuanDiT/ckpts/dialoggen/openai/clip-vit-large-patch14-336"

Replace the /home/aistudio part with the home directory of the user on the system you actually run on. A script to apply this change automatically is sketched after the config below.

Modified config.json:

{
  "_name_or_path": "./",
  "architectures": [
    "LlavaMistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "image_aspect_ratio": "anyres",
  "image_crop_resolution": 224,
  "image_grid_pinpoints": [
    [
      336,
      672
    ],
    [
      672,
      336
    ],
    [
      672,
      672
    ],
    [
      1008,
      336
    ],
    [
      336,
      1008
    ]
  ],
  "image_split_resolution": 224,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "mm_hidden_size": 1024,
  "mm_patch_merge_type": "spatial_unpad",
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "/home/aistudio/HunyuanDiT/ckpts/dialoggen/openai/clip-vit-large-patch14-336",
  "mm_vision_tower_lr": 2e-06,
  "model_type": "llava_mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 4096,
  "tokenizer_padding_side": "left",
  "torch_dtype": "float16",
  "transformers_version": "4.37.2",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": true,
  "use_cache": true,
  "use_mm_proj": true,
  "vocab_size": 32000
}
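
If you would rather patch the path programmatically than edit line 50 by hand, here is a minimal sketch (assuming the ckpts layout above; only the mm_vision_tower field is rewritten, run from the HunyuanDiT root):

# patch_config.py - point mm_vision_tower at the local CLIP checkpoint
import json
import os

config_path = "ckpts/dialoggen/config.json"
vision_tower = os.path.abspath("ckpts/dialoggen/openai/clip-vit-large-patch14-336")

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

config["mm_vision_tower"] = vision_tower

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)

print("mm_vision_tower ->", vision_tower)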

4. How to fix two warnings emitted when loading the mt5 model (optional, since they are only warnings, but loading mt5 seems slightly faster after the change).

Change line 28 of HunyuanDiT/hydit/modules/text_encoder.py to:

self.tokenizer = AutoTokenizer.from_pretrained(model_dir, legacy=False, use_fast=False)

That is, add the legacy=False and use_fast=False arguments.

Modified text_encoder.py:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel, T5ForConditionalGeneration


class MT5Embedder(nn.Module):
    available_models = ["t5-v1_1-xxl"]

    def __init__(
        self,
        model_dir="t5-v1_1-xxl",
        model_kwargs=None,
        torch_dtype=None,
        use_tokenizer_only=False,
        conditional_generation=False,
        max_length=128,
    ):
        super().__init__()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.torch_dtype = torch_dtype or torch.bfloat16
        self.max_length = max_length
        if model_kwargs is None:
            model_kwargs = {
                # "low_cpu_mem_usage": True,
                "torch_dtype": self.torch_dtype,
            }
        model_kwargs["device_map"] = {"shared": self.device, "encoder": self.device}
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, legacy=False, use_fast=False)
        if use_tokenizer_only:
            return
        if conditional_generation:
            self.model = None
            self.generation_model = T5ForConditionalGeneration.from_pretrained(
                model_dir
            )
            return
        self.model = T5EncoderModel.from_pretrained(model_dir, **model_kwargs).eval().to(self.torch_dtype)

    def get_tokens_and_mask(self, texts):
        text_tokens_and_mask = self.tokenizer(
            texts,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )
        tokens = text_tokens_and_mask["input_ids"][0]
        mask = text_tokens_and_mask["attention_mask"][0]
        # tokens = torch.tensor(tokens).clone().detach()
        # mask = torch.tensor(mask, dtype=torch.bool).clone().detach()
        return tokens, mask

    def get_text_embeddings(self, texts, attention_mask=True, layer_index=-1):
        text_tokens_and_mask = self.tokenizer(
            texts,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt",
        )

        with torch.no_grad():
            outputs = self.model(
                input_ids=text_tokens_and_mask["input_ids"].to(self.device),
                attention_mask=text_tokens_and_mask["attention_mask"].to(self.device)
                if attention_mask
                else None,
                output_hidden_states=True,
            )
            text_encoder_embs = outputs["hidden_states"][layer_index].detach()

        return text_encoder_embs, text_tokens_and_mask["attention_mask"].to(self.device)

    @torch.no_grad()
    def __call__(self, tokens, attention_mask, layer_index=-1):
        with torch.cuda.amp.autocast():
            outputs = self.model(
                input_ids=tokens,
                attention_mask=attention_mask,
                output_hidden_states=True,
            )

        z = outputs.hidden_states[layer_index].detach()
        return z

    def general(self, text: str):
        # input_ids = input_ids = torch.tensor([list(text.encode("utf-8"))]) + num_special_tokens
        input_ids = self.tokenizer(text, max_length=128).input_ids
        print(input_ids)
        outputs = self.generation_model(input_ids)
        return outputs
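
For reference, the modified class can be exercised on its own to confirm that the tokenizer warnings are gone and that embeddings are produced (a minimal sketch; it assumes you run from the HunyuanDiT root with the mt5 weights at ckpts/t2i/mt5 as in the listing above):

# Quick standalone check of MT5Embedder after the tokenizer change.
import torch
from hydit.modules.text_encoder import MT5Embedder

embedder = MT5Embedder(model_dir="ckpts/t2i/mt5", torch_dtype=torch.float16, max_length=128)
embs, mask = embedder.get_text_embeddings(["渔舟唱晚"])
print(embs.shape, mask.shape)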
