【AIGC魔童】DeepSeek-V3 Inference Deployment: Huawei Ascend NPU / TensorRT-LLM

(1) Deploying DeepSeek for Inference with Huawei Ascend NPUs

For background, see the blog post "Huawei Ascend runs DeepSeek-R1 inference with performance rivaling high-end GPUs, and a free, unlimited API! Luchen Technology's (潞晨) in-house inference engine steps in".

The MindIE framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3.

For step-by-step instructions for Ascend NPUs, please follow the guide provided there.

(2) Deploying DeepSeek for Inference with TensorRT-LLM

GitHub: https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only quantization. Support for FP8 is currently in progress and will be released soon.

You can access the custom TensorRT-LLM branch dedicated to DeepSeek-V3 support and try the new features directly at: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3

2.8.1 Download the DeepSeek Model Weights

Download the DeepSeek-V3 weights from Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base.

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
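
If cloning with git lfs is slow or unavailable, the same weights can also be fetched programmatically with the huggingface_hub package; the snippet below is a minimal sketch (it assumes huggingface_hub is installed and that there is enough local disk space for the full checkpoint).

```python
# Alternative download path (sketch): fetch the full repository snapshot from Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3-Base",
    local_dir="./DeepSeek-V3",  # example path; match it to the paths used in later steps
)
```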

Optional: convert the FP8 weights to BF16.

This is not necessary unless you want to run the model E2E in BF16 precision.

git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference/
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/DeepSeek-V3 --output-bf16-hf-path /path/to/deepseek-v3-bf16
cp /path/to/DeepSeek-V3/config.json /path/to/DeepSeek-V3/configuration_deepseek.py /path/to/deepseek-v3-bf16/
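
After the conversion, it can be worth spot-checking that the tensors in the new checkpoint really are BF16. The snippet below is a small sketch using the safetensors package; the output path mirrors the command above, and only the first shard and a few tensors are inspected.

```python
# Spot-check (sketch): print the dtypes of a few tensors from the converted checkpoint.
import glob
from safetensors import safe_open

shards = sorted(glob.glob("/path/to/deepseek-v3-bf16/*.safetensors"))
with safe_open(shards[0], framework="pt") as f:
    for name in list(f.keys())[:5]:
        print(name, f.get_tensor(name).dtype)  # expect torch.bfloat16
```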

2.8.2 Build the TensorRT Engine

First, convert the DeepSeek weights into a TensorRT-LLM checkpoint with convert_checkpoint.py; then build the TensorRT engine from that checkpoint.

  • Model conversion

Convert to FP8 weights:

# Convert Deepseek-v3 HF Native FP8 weights to TensorRT-LLM checkpoint.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
                            --dtype bfloat16 \
                            --use_fp8_weights \
                            --tp_size 8 \
                            --workers 8 # using multiple workers can accelerate the conversion process

Optional: convert to BF16 weights:

# Convert Deepseek-v3 HF weights to TensorRT-LLM checkpoint in BF16.
python convert_checkpoint.py --model_dir ./DeepSeek-V3 \
                            --output_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
                            --dtype bfloat16 \
                            --tp_size 32 \
                            --workers 8 # using multiple workers can accelerate the conversion process

  • Build the TensorRT engine

For the FP8 model:

# Build FP8 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_8gpu_fp8 \
            --output_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8

For the BF16 model:

# Build BF16 engine
trtllm-build --checkpoint_dir ./trtllm_checkpoint_deepseek_v3_32gpu_bf16 \
            --output_dir ./trtllm_engines/deepseek_v3/bf16/tp32-sel4096-isl2048-bs4 \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16 \
            --max_batch_size 4 \
            --max_seq_len 4096 \
            --max_input_len 2048 \
            --use_paged_context_fmha enable \
            --workers 8

Caution: --max_batch_size and --max_seq_len are the main factors that determine how much GPU memory is used at runtime. When you later run scripts such as summarize.py, mmlu.py, or gptManagerBenchmark.cpp, you may need to adjust --max_batch_size and --max_seq_len accordingly to avoid out-of-memory errors (that is, rebuild the TensorRT engine with smaller --max_batch_size and --max_seq_len if needed, based on your GPU memory size). The technical document perf-best-practices.md (https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md) explains the underlying mechanism.
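
As a rough illustration of why these two flags dominate runtime memory, the sketch below estimates the KV-cache size of a hypothetical dense model with standard grouped-query attention. All model dimensions are made up; DeepSeek-V3 uses MLA with a compressed KV cache, so its per-token footprint is smaller, but the scaling behavior is the same.

```python
# Rough KV-cache sizing sketch; all model dimensions below are hypothetical.
def kv_cache_gib(max_batch_size, max_seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # keys + values: one entry per token, per layer, per KV head
    total = 2 * max_batch_size * max_seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

print(kv_cache_gib(4, 4096, 32, 8, 128))   # ~2 GiB for a generic 32-layer GQA model
print(kv_cache_gib(16, 8192, 32, 8, 128))  # 8x larger: the cache grows linearly with batch size and sequence length
```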

2.8.3 Model Inference

Test the FP8 model with the run.py script:

# run.sh
python3 ../run.py --input_text "Today is a nice day." \
        --max_output_len 30 \
        --tokenizer_dir ./DeepSeek-V3 \
        --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
        --top_p 0.95 \
        --temperature 0.3

Multi-node inference:

srun -N 2 -w node-[1-2] --gres=gpu:8 --ntasks-per-node 8 \
    --container-image tensorrt_llm/release:latest \
    --container-mounts ${PWD}:/workspace \
    sh /workspace/command/run.sh

Output:

...
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Refreshed the MPI local session
Input [Text 0]: "Today is a nice day."
Output [Text 0 Beam 0]: " I am going to the park with my friends. We are going to play soccer. We are going"

2.8.4 Model Evaluation

Evaluate the model with the mmlu.py script:

# Download MMLU dataset
mkdir mmlu_data && cd mmlu_data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
# Run MMLU evaluation
python3 mmlu.py \
        --hf_model_dir ${MODEL_DIR} \
        --engine_dir ./trtllm_engines/deepseek_v3/fp8/tp8-sel4096-isl2048-bs4 \
        --data_dir mmlu_data \
        --test_trt_llm 2>&1 | tee ${ENGINE_DIR}/test_with_mmlu.log

Output:

Average accuracy 0.926 - high_school_macroeconomics
Average accuracy 0.752 - high_school_mathematics
Average accuracy 0.954 - high_school_microeconomics
Average accuracy 0.848 - high_school_physics
Average accuracy 0.967 - high_school_psychology
Average accuracy 0.861 - high_school_statistics
Average accuracy 0.956 - high_school_us_history
Average accuracy 0.954 - high_school_world_history
Average accuracy 0.861 - human_aging
Average accuracy 0.931 - human_sexuality
Average accuracy 0.975 - international_law
Average accuracy 0.907 - jurisprudence
Average accuracy 0.920 - logical_fallacies
Average accuracy 0.848 - machine_learning
Average accuracy 0.951 - management
Average accuracy 0.957 - marketing
Average accuracy 0.950 - medical_genetics
Average accuracy 0.957 - miscellaneous
Average accuracy 0.870 - moral_disputes
Average accuracy 0.798 - moral_scenarios
Average accuracy 0.918 - nutrition
Average accuracy 0.916 - philosophy
Average accuracy 0.932 - prehistory
Average accuracy 0.869 - professional_accounting
Average accuracy 0.714 - professional_law
Average accuracy 0.956 - professional_medicine
Average accuracy 0.908 - professional_psychology
Average accuracy 0.800 - public_relations
Average accuracy 0.869 - security_studies
Average accuracy 0.960 - sociology
Average accuracy 0.950 - us_foreign_policy
Average accuracy 0.578 - virology
Average accuracy 0.930 - world_religions
Average accuracy 0.852 - math
Average accuracy 0.874 - health
Average accuracy 0.905 - physics
Average accuracy 0.936 - business
Average accuracy 0.958 - biology
Average accuracy 0.825 - chemistry
Average accuracy 0.888 - computer science
Average accuracy 0.912 - economics
Average accuracy 0.890 - engineering
Average accuracy 0.851 - philosophy
Average accuracy 0.917 - other
Average accuracy 0.932 - history
Average accuracy 0.944 - geography
Average accuracy 0.904 - politics
Average accuracy 0.936 - psychology
Average accuracy 0.949 - culture
Average accuracy 0.744 - law
Average accuracy 0.883 - STEM
Average accuracy 0.827 - humanities
Average accuracy 0.926 - social sciences
Average accuracy 0.898 - other (business, health, misc.)
Average accuracy: 0.877
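
To dig into results like the ones above programmatically, the per-subject lines can be parsed straight out of the log written by the tee command in the previous step. The snippet below is a small sketch; the log filename matches the tee target (test_with_mmlu.log).

```python
# Sketch: parse per-subject MMLU accuracies from the evaluation log.
import re

pattern = re.compile(r"Average accuracy ([\d.]+) - (.+)")
scores = {}
with open("test_with_mmlu.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            scores[m.group(2).strip()] = float(m.group(1))

# List the five lowest-scoring subjects to see where the model struggles most.
print(sorted(scores.items(), key=lambda kv: kv[1])[:5])
```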

### How to deploy DeepSeek-V3 on an NPU

#### Preparation

Before deploying the DeepSeek-V3 model on an Ascend NPU, make sure the environment already has the required dependencies and toolchain installed and configured. This typically involves setting up a Python virtual environment and installing a specific PyTorch build together with its other supporting packages.

#### Obtain the pre-trained model

Download the NPU-optimized BF16 version of the DeepSeek-V3 model files from Hugging Face. Variants of different sizes are available; pick the weights that fit your actual needs for later loading.

#### Adapt the model to NPU characteristics

Since the MindIE framework from the Huawei Ascend community has already adapted DeepSeek-V3-BF16, some hardware-specific adjustments may have been made in the process to improve performance. Developers are therefore advised to read the official documentation for the concrete changes and update their own project code to match.

#### Write the inference script

For scenarios that want efficient inference with vLLM, the corresponding part of the official guide can be followed. The simple Python snippet below shows how to load the model and interact with it:

```python
import torch
import torch_npu  # registers the Ascend NPU backend for PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "path_to_your_model"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16  # use bfloat16 precision to speed up computation
).to('npu')  # move the model onto the NPU device

input_text = "your input text here."
inputs = tokenizer(input_text, return_tensors="pt").to('npu')  # move input tensors onto the NPU device
outputs = model.generate(**inputs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```