Deploying Qwen-7B and Qwen-VL on Alibaba Cloud

This article walks through deploying two pretrained models, Qwen-7B and Qwen-VL, in a Python environment: creating a virtual environment, cloning the code, installing dependencies, modifying configuration files, and running the demos. It also covers optimization options for specific hardware (such as CUDA and fp16/bf16 support).


Part 1: Deploying Qwen-7B

1. Create a conda virtual environment

conda create -n qwen-7b python=3.10
conda activate qwen-7b

2. Clone the code

git clone https://github.com/QwenLM/Qwen-7B.git

3. Enter the Qwen-7B directory and download the model

git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat.git
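If the git clone from ModelScope is slow, the same weights can also be fetched with the ModelScope Python SDK. Below is a minimal sketch, assuming the modelscope package is installed (pip install modelscope); the cache_dir path is only an example.

# Optional alternative: download Qwen-7B-Chat via the ModelScope SDK.
# cache_dir is an example path; adjust it to your workspace.
from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen-7B-Chat', cache_dir='/mnt/workspace/Qwen-7B')
print(model_dir)  # local directory that DEFAULT_CKPT_PATH should point to in step 5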

4. Install dependencies

cd Qwen-7B
pip install -r requirements.txt
pip install -r requirements_web_demo.txt

# If you use the quantized model Qwen-7B-Chat-Int4, also install:
pip install auto-gptq optimum
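With auto-gptq and optimum installed, the Int4 checkpoint loads through transformers just like the fp16 model. A minimal sketch, assuming the quantized weights were downloaded to /mnt/workspace/Qwen-7B/Qwen-7B-Chat-Int4:

# Sketch: loading the Int4 quantized model (the local path is an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = '/mnt/workspace/Qwen-7B/Qwen-7B-Chat-Int4'
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    device_map="cuda",
    trust_remote_code=True,
).eval()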

5. Modify the configuration files

# Edit web_demo.py and cli_demo.py
DEFAULT_CKPT_PATH = '/mnt/workspace/Qwen-7B/Qwen-7B-Chat'

# Change device_map=device_map to device_map="cuda"
model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        # device_map=device_map,
        device_map="cuda",
        trust_remote_code=True,
        resume_download=True,
    ).eval()
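For a quick sanity check outside the demo scripts, the loaded model can be queried through the chat() helper that the Qwen remote code provides (a sketch; the prompt is arbitrary):

# Sanity check using the same checkpoint path as DEFAULT_CKPT_PATH above.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = '/mnt/workspace/Qwen-7B/Qwen-7B-Chat'
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, device_map="cuda", trust_remote_code=True).eval()

# Qwen-7B-Chat exposes a chat() method via trust_remote_code.
response, history = model.chat(tokenizer, "Hello", history=None)
print(response)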

6. Run

# Web demo
python web_demo.py

# Command-line demo
python cli_demo.py

7. If your device supports fp16 or bf16, we recommend installing flash-attention (FlashAttention 2 is now supported) for better efficiency and lower memory usage. (flash-attention is optional; the project runs fine without it.)

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .

# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
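After installation, a quick check that flash-attn is importable and that the GPU really supports bf16 (a sketch; this is optional, just like the installation itself):

# Optional check: flash-attn version and bf16 support on the current GPU.
import torch
import flash_attn

print(flash_attn.__version__)          # 2.x or later means FlashAttention 2
print(torch.cuda.is_bf16_supported())  # True on Ampere-class GPUs and newer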

Part 2: Deploying Qwen-VL

1. Create a conda virtual environment

conda create -n qwen-vl python=3.10
conda activate qwen-vl

2. Clone the code

git clone https://github.com/QwenLM/Qwen-VL

3. Enter the Qwen-VL directory and download the model

git clone https://www.modelscope.cn/qwen/Qwen-VL-Chat-Int4.git

4. Install dependencies

cd Qwen-VL
pip install -r requirements.txt

# In requirements_web_demo.txt, pin gradio==3.39 before installing
pip install -r requirements_web_demo.txt

5. Install torch, torchvision, and torchaudio (ROCm build)

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
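After installing the ROCm build, it is worth confirming that torch actually sees the AMD GPU; note that ROCm builds still expose the device under the "cuda" name, which is why device_map="cuda" in step 7 works unchanged (a sketch):

# Verify the ROCm build of torch can see the GPU.
import torch

print(torch.__version__)          # the ROCm wheel carries a +rocm suffix
print(torch.version.hip)          # HIP/ROCm version string; None on CUDA builds
print(torch.cuda.is_available())  # True if the AMD GPU is visible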

6. Install the ROCm (AMD GPU) build of auto-gptq

pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm542/
pip install -q optimum

7. Modify web_demo_mm.py

# Point DEFAULT_CKPT_PATH at the Int4 model downloaded in step 3
DEFAULT_CKPT_PATH = '/mnt/workspace/Qwen-VL/Qwen-VL-Chat-Int4'

# Change device_map=device_map to device_map="cuda"
model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        # device_map=device_map,
        device_map="cuda",
        trust_remote_code=True,
        resume_download=True,
    ).eval()
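Besides the web demo, the same checkpoint can be queried directly with an image plus a text prompt through the chat interface provided by the remote code. A minimal sketch; the image URL and the question are placeholders:

# Multimodal sanity check; the checkpoint path matches DEFAULT_CKPT_PATH above,
# the image URL and question are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = '/mnt/workspace/Qwen-VL/Qwen-VL-Chat-Int4'
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, device_map="cuda", trust_remote_code=True).eval()

query = tokenizer.from_list_format([
    {'image': 'https://example.com/demo.jpeg'},  # placeholder image
    {'text': 'What is in the picture?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)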

8. Run

python web_demo_mm.py
