Setting Up a Dify RAG Environment with Qwen3-8B on an RTX 3090

I. Environment Configuration

Attribute            Value
CUDA Driver Version  555.42.02
CUDA Version         12.5
OS                   Ubuntu 20.04.6 LTS
Docker               24.0.5, build 24.0.5-0ubuntu1~20.04.1
GPU                  NVIDIA GeForce RTX 3090 (24 GB VRAM)

II. Procedure

1. Create the container

docker run --runtime nvidia --gpus all -ti \
    -v $PWD:/home -w /home \
    -p 8000:8000 --ipc=host nvcr.io/nvidia/pytorch:24.03-py3 bash
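
Before going further, it is worth confirming that the container actually sees the GPU. A minimal sanity check (the NGC PyTorch image ships with torch preinstalled):

python3 - <<'EOF'
import torch
print(torch.__version__, "CUDA", torch.version.cuda)            # bundled CUDA runtime
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))  # expect the RTX 3090
EOF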

2. Download Qwen3-8B and the embedding model

cd /home
pip install modelscope
modelscope download --model Qwen/Qwen3-8B  --local_dir Qwen3-8B
modelscope download --model maidalun/bce-embedding-base_v1 --local_dir bce-embedding-base_v1
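
The same downloads can also be driven from Python instead of the CLI. A sketch using modelscope's snapshot_download; the local_dir argument assumes a reasonably recent modelscope release:

python3 - <<'EOF'
from modelscope import snapshot_download
snapshot_download('Qwen/Qwen3-8B', local_dir='Qwen3-8B')
snapshot_download('maidalun/bce-embedding-base_v1', local_dir='bce-embedding-base_v1')
EOF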

3. Install transformers

cd /home
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.51.0
pip install tokenizers==0.21
python3 setup.py install
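
To confirm the pinned transformers build can load Qwen3's chat template, a quick check (assuming the model was downloaded to /home/Qwen3-8B in step 2):

python3 - <<'EOF'
import transformers
from transformers import AutoTokenizer
print(transformers.__version__)  # expect 4.51.0
tok = AutoTokenizer.from_pretrained("/home/Qwen3-8B")
print(tok.apply_chat_template([{"role": "user", "content": "hi"}],
                              tokenize=False, add_generation_prompt=True))
EOF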

4. Install vLLM

pip install vllm
pip install flashinfer-python==v0.2.2
python3 -m pip install --upgrade 'optree>=0.13.0'
pip install 'bitsandbytes>=0.45.3' -i https://pypi.tuna.tsinghua.edu.cn/simple
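
A quick import check that all three packages are visible to Python (versions will vary with the pins above):

python3 - <<'EOF'
import vllm, flashinfer, bitsandbytes
print("vllm", vllm.__version__)
print("flashinfer", flashinfer.__version__)
print("bitsandbytes", bitsandbytes.__version__)
EOF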

5. Install flash-attention

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/
git checkout fd2fc9d85c8e54e5c20436465bca709bc1a6c5a1
python setup.py build_ext
python setup.py bdist_wheel
pip install dist/flash_attn-*.whl
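
Building flash-attention from source can take a long time; exporting TORCH_CUDA_ARCH_LIST="8.6+PTX" beforehand (as done in step 6) limits compilation to the RTX 3090's sm_86 architecture. Afterwards, verify the wheel installed cleanly:

python3 - <<'EOF'
import flash_attn
print(flash_attn.__version__)
EOF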

6. Start an OpenAI-API-compatible service

Option 1: start a vLLM server (does not support multiple tasks; a test snippet follows the command)
cd /home
export TORCH_CUDA_ARCH_LIST="8.6+PTX"
vllm serve ./Qwen3-8B/ --quantization bitsandbytes --enable-prefix-caching --dtype bfloat16
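
Once the server is up (vLLM listens on port 8000 under a /v1 prefix by default), a quick check of option 1 with the openai client, which vLLM pulls in as a dependency. A minimal sketch; the model id is whatever path was passed to vllm serve, so it is discovered via the models list here:

python3 - <<'EOF'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_id = client.models.list().data[0].id
resp = client.completions.create(model=model_id,
                                 prompt="Briefly explain RAG.", max_tokens=64)
print(resp.choices[0].text)
EOF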
Option 2: a Flask + PyTorch service exposing Qwen3-8B completions and embeddings through an OpenAI-compatible API
cat > dify_api_srv.py <<-'EOF'
import time
import uuid

import torch
from flask import Flask, request, jsonify
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

app = Flask(__name__)

# Load the Qwen3-8B model and tokenizer
model_name = "./Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load the local embedding model (used by the /v1/embeddings endpoint below)
MODEL_PATH = "./bce-embedding-base_v1"  # local model path
rerank_tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
rerank_model = AutoModel.from_pretrained(MODEL_PATH)
    
@app.route('/v1/completions', methods=['POST'])
def handle_completion():
    """处理文本补全请求"""
    data = request.get_json()
    print(data)
    
    # Parse request parameters
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 32768)
    temperature = float(data.get('temperature', 1.0))
    top_p = float(data.get('top_p', 1.0))
    
    # Build the chat-formatted model input
    messages = [{"role": "user", "content": prompt}]
    formatted_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )
    inputs = tokenizer(formatted_text, return_tensors="pt").to(model.device)
    
    # Generate text; the original hard-coded do_sample=False, which silently
    # ignored temperature/top_p, so sample only when temperature > 0
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=temperature > 0,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Parse the generated output: drop the prompt tokens, then split off any
    # thinking segment ending at the last </think> token
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()
    try:
        think_token_id = tokenizer.convert_tokens_to_ids("</think>")
        index = len(output_ids) - output_ids[::-1].index(think_token_id)
    except ValueError:
        index = 0  # no </think> token in the output
    content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

    # Assemble an OpenAI-compatible completion response
    # (the source post is cut off inside the try block above; from here on the
    # code follows the standard OpenAI text_completion schema)
    return jsonify({
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": "Qwen3-8B",
        "choices": [{
            "index": 0,
            "text": content,
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": len(inputs.input_ids[0]),
            "completion_tokens": len(output_ids),
            "total_tokens": len(inputs.input_ids[0]) + len(output_ids)
        }
    })
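
# --- The source post is truncated here. The endpoint below is a minimal
# --- reconstruction of the advertised embeddings service: per the
# --- bce-embedding-base_v1 model card, sentence embeddings are taken from
# --- the [CLS] token and L2-normalized.
@app.route('/v1/embeddings', methods=['POST'])
def handle_embeddings():
    """Handle an embedding request with an OpenAI-compatible response shape."""
    data = request.get_json()
    texts = data.get('input', [])
    if isinstance(texts, str):
        texts = [texts]

    encoded = rerank_tokenizer(texts, padding=True, truncation=True,
                               max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = rerank_model(**encoded)
    embeddings = outputs.last_hidden_state[:, 0]                         # CLS pooling
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)   # L2 normalize

    return jsonify({
        "object": "list",
        "data": [{"object": "embedding", "index": i, "embedding": e.tolist()}
                 for i, e in enumerate(embeddings)],
        "model": "bce-embedding-base_v1",
        "usage": {"prompt_tokens": 0, "total_tokens": 0}
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
EOF

Start the service, wait for the models to finish loading, then point Dify's OpenAI-compatible model provider at http://<host>:8000/v1. A quick smoke test of both endpoints (a sketch; the requests package ships with the NGC image):

python3 dify_api_srv.py &

python3 - <<'EOF'
import requests
r = requests.post("http://localhost:8000/v1/completions",
                  json={"prompt": "What is RAG?", "max_tokens": 128, "temperature": 0})
print(r.json()["choices"][0]["text"])
r = requests.post("http://localhost:8000/v1/embeddings",
                  json={"input": ["hello world"]})
print(len(r.json()["data"][0]["embedding"]))  # 768 dims for bce-embedding-base_v1
EOF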