【大模型系列篇】Qwen3开源全新一代大语言模型来了，深入思考，更快行动

木亦汐丫

已于 2025-04-30 14:44:43 修改

阅读量849

点赞数 35

分类专栏：大模型文章标签： Qwen3 混合推理模型原生支持MCP MoE Dense 强化学习长推理链冷启动

于 2025-04-30 14:02:28 首次发布

本文链接：https://blog.csdn.net/Jackie_vip/article/details/147630463

版权

大模型专栏收录该内容

45 篇文章

订阅专栏

`Qwen3开源模型全览`

`Qwen3是全球最强开源模型``（MoE+Dense）`

Qwen3 采用混合专家（MoE）架构，总参数量 235B，激活仅需 22B。

Qwen3 预训练数据量达 36T，并在后训练阶段多轮强化学习，将非思考模式无缝整合到思考模型中。

Qwen3 在推理、指令遵循、工具调用、多语言能力等方面均大幅增强，即创下所有国产模型及全球开源模型的性能新高：

在奥数水平的 AIME25 测评中，Qwen3 斩获 81.5 分，刷新开源纪录；
在考察代码能力的 LiveCodeBench 评测中，Qwen3 突破 70 分大关，表现甚至超过 Grok3；
在评估模型人类偏好对齐的 ArenaHard 测评中，Qwen3 以 95.6 分超越 OpenAI-o1 及 DeepSeek-R1。

性能大幅提升的同时，Qwen3 的部署成本还大幅下降，仅需 4 张 H20 即可部署 Qwen3 满血版，显存占用仅为性能相近模型的三分之一。

Qwen3 还提供了丰富的模型版本，包含 2 款 30B、235B 的 MoE 模型，以及 0.6B、1.7B、4B、8B、14B、32B 等 6 款密集模型。

每款模型均斩获同尺寸开源模型 SOTA（最佳性能）：

Qwen3 的 30B 参数 MoE 模型实现了 10 倍以上的模型性能杠杆提升，仅激活 3B 就能媲美上代 Qwen2.5-32B 模型性能；

Qwen3 的稠密模型性能继续突破，一半的参数量可实现同样的高性能，如 32B 版本的 Qwen3 模型可跨级超越 Qwen2.5-72B 性能。

同时，所有 Qwen3 模型都是混合推理模型，API 可按需设置「思考预算」（即预期最大深度思考的 tokens 数量），进行不同程度的思考，灵活满足 AI 应用和不同场景对性能和成本的多样需求。

比如，4B 模型是手机端的绝佳尺寸；8B 可在电脑和汽车端侧丝滑部署应用；32B 最受企业大规模部署欢迎，有条件的开发者也可轻松上手。

比如让它做一款记忆配对卡牌的 Web 小游戏，效果如下:

Qwen3的主要特点

Qwen3 是国内首个支持“混合推理”模型

Qwen3 原生支持思考模式与非思考模式两种工作方式，意味着既能在简单问题上快思考，秒出答案；又能在复杂问题上慢思考，展开多步推理和深入分析。这种设计让用户可以根据不同任务，轻松调整花多少费用，既省成本又保证推理效果。

思考模式：在这种模式下，模型在提供最终答案之前需要时间逐步推理。这对于需要深入思考的复杂问题来说是理想的选择。
非思考模式：在这里，该模型提供快速、近乎即时的回答，适用于速度比深度更重要的简单问题。

这种灵活性允许用户根据手头的任务控制模型执行多少 “思考”。例如，较难的问题可以通过扩展推理来解决，而较简单的问题可以毫不拖延地直接回答。至关重要的是，这两种模式的融合极大地增强了模型实施稳定高效的思维预算控制的能力。如上所述，Qwen3 表现出可扩展且流畅的性能改进，这与分配的计算推理预算直接相关。这种设计使用户能够更轻松地配置特定于任务的预算，从而在成本效率和推理质量之间实现更理想的平衡。

`Qwen3` 原生支持 MCP 协议

Qwen3 为即将到来的智能体 Agent 和大模型应用爆发提供了更好的支持。在大模型从“聊天”走向“动手做事”的关键时刻，Qwen3 的设计也跟着升级了，不再只是回答问题那么简单，而是专门为 Agent 架构做了优化，提升了执行任务的效率、响应的结构化程度，还有对各种工具的适配能力。

在评估模型 Agent 能力的 BFCL 评测中，Qwen3 创下 70.8 的新高，超越 Gemini2.5-Pro、OpenAI-o1 等顶尖模型，将大幅降低 Agent 调用工具的门槛。

同时，Qwen3 原生支持 MCP 协议，并具备强大的工具调用（function calling）能力，结合封装了工具调用模板和工具调用解析器的 Qwen-Agent 框架，将大大降低编码复杂性，实现高效的手机及电脑 Agent 操作等任务。

此外

Qwen3 系列模型依旧采用宽松的 Apache2.0 协议开源，并首次支持 119 多种语言。

全球开发者、研究机构和企业均可免费在魔搭社区、HuggingFace 等平台下载模型并商用，也可以通过阿里云百炼调用千问 3 的 API 服务。

个人用户可立即通过通义 APP 直接体验 Qwen3，夸克也即将全线接入Qwen3。

据悉，阿里通义已开源 200 余个模型，全球下载量超 3 亿次，千问衍生模型数超 10 万个，已超越美国 Llama，成为全球第一开源模型。

`Qwen3的预训练与后训练`

`Qwen3-预训练数据与三阶段流程`

Qwen3使用了近36万亿token的超大规模多语种数据集进行预训练，是Qwen2.5时期数据量的约2倍。

训练过程分为三个阶段：

Stage 1：在超30万亿token上进行初步预训练，具备通用语言理解与生成能力（上下文长度4K）。
Stage 2：引入大量STEM（科学、技术、工程、数学）数据和代码数据，在额外5万亿token上继续训练。
Stage 3：使用高质量长文本扩展上下文窗口至32K，优化长文档处理能力。

在数据构建过程中，还引入了模型辅助的数据增强技术，如使用Qwen2.5-Math、Qwen2.5-Coder生成高质量数学与代码数据，进一步强化了模型在专业领域的推理与表达能力。

在相同或更少激活参数数量下，Qwen3基础模型性能全面优于同等规模Qwen2.5 Dense模型。

`Qwen3-`后训练强化学习与思考模式融合

为了进一步优化推理链条质量与控制思考模式，Qwen3在后训练阶段设计了四阶段强化学习流程：

Stage 1：长推理链冷启动（Long-CoT Cold Start）: 微调模型在复杂任务（如数学、代码、逻辑推理）中的长推理链条。
Stage 2：推理强化学习（Reasoning RL）: 基于规则奖励机制，强化模型的推理深度与探索能力。
Stage 3：思考模式融合（Thinking Mode Fusion）: 融合思考模式与快速模式，打通推理链条与即时响应路径。
Stage 4：通用强化学习（General RL）：在指令跟随、格式遵循、工具调用等领域进一步微调，提升通用任务完成能力。

同时，针对中小尺寸模型，还引入了Strong-to-Weak蒸馏技术，将大型模型的推理能力高效迁移至轻量级模型中，保证小模型也能具备优秀的推理与交互表现。

Qwen3部署与开发

本地部署

对于部署，可以使用 sglang>=0.4.6.post1 或 vllm>=0.8.4 创建兼容 OpenAI 的 API 终端节点：

SGLang：

python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3

vLLM：

vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1

如果你用它来本地开发，你可以通过运行一个简单的命令 ollama run qwen3:30b-a3b 来使用 ollama 来玩转模型，或者你可以使用 LMStudio 或 llama.cpp 和 ktransformers 在本地构建。

开发示例

以 Qwen3-30B-A3B 为例

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switch between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

关闭思考模式

如果需要禁用思考模式，只需修改 enable_thinking=False

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # 禁用思考模式
)

高级用法

Qwen3提供了一种软切换机制，允许用户在 enable_thinking=True 时动态控制模型的行为。具体来说，您可以将 /think 和 /no_think 添加到用户提示或系统消息中，以将模型的思维模式从轮次切换到轮次。该模型将遵循多轮次对话中的最新指令。

多轮次对话的示例

from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-30B-A3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}") 
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")

代理用法

Qwen3 在工具调用功能方面表现出色。这里推荐使用 Qwen-Agent 来充分利用 Qwen3 的代理能力。Qwen-Agent 在内部封装了工具调用模板和工具调用解析器，大大降低了编码复杂度。

要给代理定义可用的工具，可以使用 MCP 配置文件，使用 Qwen-Agent 的集成工具，或自行集成其他工具。

from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-30B-A3B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #         # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #         # Do not add: When the response has been separated by reasoning_content and content.
    #         'thought_in_content': True,
    #     },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
  'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)