The underlying architecture of Qwen1.5 is based primarily on the Transformer, a deep learning model proposed by Google in 2017 and used mainly for natural language processing tasks. The Transformer's core components are the self-attention mechanism and positional encoding; together they let the model process input sequences in parallel, resolving the efficiency problems that RNNs (recurrent neural networks) and similar models face on long sequences.
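For reference, the scaled dot-product attention at the core of this mechanism computes, for query, key, and value matrices $Q$, $K$, $V$ with key dimension $d_k$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$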
Specifically, Qwen1.5's architecture likely includes the following components (a minimal sketch of one decoder block follows this list):
- Embedding layer: converts the input text into vector representations.
- Multi-head self-attention: the Transformer's main innovation; it lets the model relate words at different positions and capture contextual information.
- Feed-forward network: applies a further non-linear transformation to the vectors produced by self-attention.
- Residual connections and layer normalization: these mitigate the vanishing and exploding gradients that afflict deep network training.
- Positional encoding: because the Transformer has no inherent notion of sequence order the way an RNN does, positional encodings are needed to inject position information.
- Decoder-only structure: the original Transformer paired an encoder (to understand the input sequence) with a decoder (to generate the output) for tasks such as machine translation, but Qwen1.5, like most modern LLMs, keeps only the decoder stack.
- Optimization and loss: training typically uses the Adam optimizer with a cross-entropy loss.
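To make these components concrete, here is a minimal sketch of a single decoder block in PyTorch. It is illustrative only: the dimensions are arbitrary defaults, and Qwen1.5's actual blocks differ in details (rotary position embeddings, SwiGLU activations, RMSNorm, and so on).

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm Transformer decoder block: self-attention and a
    feed-forward network, each wrapped in a residual connection with
    layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: True entries are blocked, so each token can only
        # attend to itself and earlier positions.
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual connection
        x = x + self.ffn(self.norm2(x))     # residual connection
        return x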
🏆 LMSYS Chatbot Arena Leaderboard
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
GitHub: https://github.com/QwenLM/Qwen1.5 (Qwen1.5 is the improved version of Qwen, the large language model series developed by the Qwen team at Alibaba Cloud)
Hugging Face: https://huggingface.co/Qwen
Online demo: https://huggingface.co/spaces/Qwen/Qwen1.5-72B-Chat
Quickstart
🤗 Hugging Face Transformers
Here is a code snippet demonstrating how to use the chat model with transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-72B-Chat",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-72B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
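Note that device_map="auto" relies on the accelerate package (pip install accelerate) and will shard the 72B checkpoint across all visible GPUs, so make sure enough GPU memory is available.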
For quantized models, we advise you to use the GPTQ and AWQ correspondents, namely Qwen1.5-7B-Chat-GPTQ-Int8 and Qwen1.5-7B-Chat-AWQ.
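Loading a quantized checkpoint uses the same transformers API as above; only the checkpoint name changes. A minimal sketch, assuming the auto-gptq (for GPTQ) or autoawq (for AWQ) package is installed alongside transformers:

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-7B-Chat-GPTQ-Int8",  # or "Qwen/Qwen1.5-7B-Chat-AWQ"
    torch_dtype="auto",
    device_map="auto"
)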
🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you solve issues concerning downloading checkpoints.
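For example, a download via ModelScope might look like the following (a sketch assuming the modelscope package is installed; the exact import path has moved between versions):

from modelscope import snapshot_download

# Downloads the checkpoint into the local ModelScope cache and
# returns the directory it was saved to.
model_dir = snapshot_download("qwen/Qwen1.5-7B-Chat")
print(model_dir)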
💻 Run locally
llama.cpp
Download our provided GGUF files or create them yourself, and you can use them directly with the latest llama.cpp via a one-line command:
./main -m <path-to-file> -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
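If you create the GGUF files yourself, recent llama.cpp checkouts ship a conversion script along these lines (the script name and flags have changed across versions, so check your checkout before running):

python convert-hf-to-gguf.py /path/to/Qwen1.5-7B-Chat --outfile qwen1_5-7b-chat.gguf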
Ollama
We are now on Ollama, and you can use pull and run to make things work.
ollama run qwen
You can also add a tag such as :14b to choose a different model size, as shown below. Visit ollama.ai for more information.
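For example, assuming the 14B tag is published on the Ollama registry:

ollama run qwen:14b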
LMStudio
Qwen1.5 is already supported by lmstudio.ai. You can use LMStudio directly with our GGUF files.
Web UI
Text generation web UI
You can directly use text-generation-webui to create a web UI demo. If you use GGUF, remember to install the latest wheel of llama.cpp with support for Qwen1.5.
llamafile
Clone llamafile, run the source install, and then create your own llamafile with the GGUF file following the guide here. You can then run a single command, say ./qwen.llamafile, to serve a demo.
Deployment
Now, Qwen1.5 is supported by multiple inference frameworks. Here we demonstrate the usage of vLLM and SGLang.
Note
Neither vLLM nor SGLang currently offer built-in support for function calling. If you require tool use capabilities, please refer to Qwen-Agent, which provides a wrapper around these APIs to support function calling.
vLLM
We advise you to use vLLM>=0.3.0 to build an OpenAI-compatible API service. Start the server with a chat model, e.g. Qwen1.5-7B-Chat:
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat
Then use the chat API as demonstrated below:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me something about large language models."}
    ]
  }'
Or use the openai Python client:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print(chat_response.choices[0].message.content)