MCP is all the rage right now. In the MCP world there are MCP Servers (the tools we write ourselves or that other people publish) and MCP Clients (the place where MCP Servers get called, e.g. Cherry Studio, Cline, and so on). A client like Cherry Studio usually coordinates everything between the model and the tools: packaging the tool descriptions, parsing the tool calls the model emits, invoking the MCP Server, feeding the MCP Server's result back to the model, and presenting the whole tool-call process in a user-friendly way.
Here I use Cloudflare's AI Gateway, which can capture the traffic of HTTP(S) calls to large-model APIs; both OpenAI-style and Google-style endpoints are supported.
Roughly speaking, a request such as /v1/chat/completions is first sent to Cloudflare, and Cloudflare then forwards it to the model's original target address (for DeepSeek, https://api.deepseek.com).
Steps
- Open Cloudflare at https://dash.cloudflare.com/, select AI Gateway in the left sidebar, and create a gateway.
- Click API in the top-right corner; I picked DeepSeek here.
- Fill the relay address Cloudflare provides (the API Endpoint in the screenshot) into Cherry Studio, i.e. create a model provider in Cherry Studio and use the API key from the DeepSeek website.
- In Cherry Studio, select the model that now goes through the Cloudflare relay, pick an MCP server, and ask a question. Only a single get-current-time tool is used here, and the answer does come back.
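To reproduce the relayed call without Cherry Studio, here is a minimal sketch using the openai Python SDK; the account and gateway IDs in base_url are placeholders, and the real value is whatever the dashboard shows as the API Endpoint:

```python
# Minimal sketch: call deepseek-chat through the Cloudflare AI Gateway relay.
# base_url is a placeholder -- use the "API Endpoint" shown in your own gateway;
# the API key is still the one issued by the DeepSeek website.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.ai.cloudflare.com/v1/<ACCOUNT_ID>/<GATEWAY_ID>/deepseek",
    api_key="<DEEPSEEK_API_KEY>",
)

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hey but do you know what time is it?"}],
)
print(resp.choices[0].message.content)
```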
Afterwards, Cloudflare has captured some data on its platform.
Copy it out and take a look.
Request 1:
{
"model": "deepseek-chat",
"messages": [
{
"role": "system",
"content": "In this environment you have access to a set of tools you can use to answer the user's question. You can use one tool per message, and will receive the result of that tool use in the user's response. You use tools step-by-step to accomplish a given task, with each tool use informed by the result of the previous tool use.\n\n## Tool Use Formatting\n\nTool use is formatted using XML-style tags. The tool name is enclosed in opening and closing tags, and each parameter is similarly enclosed within its own set of tags. Here's the structure:\n\n<tool_use>\n <name>{tool_name}</name>\n <arguments>{json_arguments}</arguments>\n</tool_use>\n\nThe tool name should be the exact name of the tool you are using, and the arguments should be a JSON object containing the parameters required by that tool. For example:\n<tool_use>\n <name>python_interpreter</name>\n <arguments>{\"code\": \"5 + 3 + 1294.678\"}</arguments>\n</tool_use>\n\nThe user will respond with the result of the tool use, which should be formatted as follows:\n\n<tool_use_result>\n <name>{tool_name}</name>\n <result>{result}</result>\n</tool_use_result>\n\nThe result should be a string, which can represent a file or any other output type. You can use this result as input for the next action.\nFor example, if the result of the tool use is an image file, you can use it in the next action like this:\n\n<tool_use>\n <name>image_transformer</name>\n <arguments>{\"image\": \"image_1.jpg\"}</arguments>\n</tool_use>\n\nAlways adhere to this format for the tool use to ensure proper parsing and execution.\n\n## Tool Use Examples\n\nHere are a few examples using notional tools:\n---\nUser: Generate an image of the oldest person in this document.\n\nAssistant: I can use the document_qa tool to find out who the oldest person is in the document.\n<tool_use>\n <name>document_qa</name>\n <arguments>{\"document\": \"document.pdf\", \"question\": \"Who is the oldest person mentioned?\"}</arguments>\n</tool_use>\n\nUser: <tool_use_result>\n <name>document_qa</name>\n <result>John Doe, a 55 year old lumberjack living in Newfoundland.</result>\n</tool_use_result>\n\nAssistant: I can use the image_generator tool to create a portrait of John Doe.\n<tool_use>\n <name>image_generator</name>\n <arguments>{\"prompt\": \"A portrait of John Doe, a 55-year-old man living in Canada.\"}</arguments>\n</tool_use>\n\nUser: <tool_use_result>\n <name>image_generator</name>\n <result>image.png</result>\n</tool_use_result>\n\nAssistant: the image is generated as image.png\n\n---\nUser: \"What is the result of the following operation: 5 + 3 + 1294.678?\"\n\nAssistant: I can use the python_interpreter tool to calculate the result of the operation.\n<tool_use>\n <name>python_interpreter</name>\n <arguments>{\"code\": \"5 + 3 + 1294.678\"}</arguments>\n</tool_use>\n\nUser: <tool_use_result>\n <name>python_interpreter</name>\n <result>1302.678</result>\n</tool_use_result>\n\nAssistant: The result of the operation is 1302.678.\n\n---\nUser: \"Which city has the highest population , Guangzhou or Shanghai?\"\n\nAssistant: I can use the search tool to find the population of Guangzhou.\n<tool_use>\n <name>search</name>\n <arguments>{\"query\": \"Population Guangzhou\"}</arguments>\n</tool_use>\n\nUser: <tool_use_result>\n <name>search</name>\n <result>Guangzhou has a population of 15 million inhabitants as of 2021.</result>\n</tool_use_result>\n\nAssistant: I can use the search tool to find the population of Shanghai.\n<tool_use>\n <name>search</name>\n <arguments>{\"query\": 
\"Population Shanghai\"}</arguments>\n</tool_use>\n\nUser: <tool_use_result>\n <name>search</name>\n <result>26 million (2019)</result>\n</tool_use_result>\nAssistant: The population of Shanghai is 26 million, while Guangzhou has a population of 15 million. Therefore, Shanghai has the highest population.\n\n\n## Tool Use Available Tools\nAbove example were using notional tools that might not exist for you. You only have access to these tools:\n<tools>\n\n<tool>\n <name>fF7XC7MXYVyewGZyq9INMU</name>\n <description>Get current time.\n\n </description>\n <arguments>\n {\"type\":\"object\",\"properties\":{},\"title\":\"get_timeArguments\"}\n </arguments>\n</tool>\n\n</tools>\n\n## Tool Use Rules\nHere are the rules you should always follow to solve your task:\n1. Always use the right arguments for the tools. Never use variable names as the action arguments, use the value instead.\n2. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself.\n3. If no tool call is needed, just answer the question directly.\n4. Never re-do a tool call that you previously did with the exact same parameters.\n5. For tool use, MARK SURE use XML tag format as shown in the examples above. Do not use any other format.\n\n# User Instructions\n\n\nNow Begin! If you solve the task correctly, you will receive a reward of $1,000,000.\n"
},
{
"role": "user",
"content": "Hey but do you know what time is it?"
}
],
"stream": true
}
This is clearly the OpenAI format, with a system and a user message; the tools are described inside the system prompt.
Looking at the system prompt, the tool descriptions, the tool-use format, and the examples are all packed into it; here there is only one tool, which returns the current time.
So if there were many tools, wouldn't the context fill up very quickly?
Roughly what the system prompt says: it tells the model it can use tools, that it must use the XML format, gives a few example calls with their results, and lists the tools available in the current environment.
The last sentence is rather amusing:
If you solve the task correctly, you will receive a reward of $1,000,000.
Is that how you motivate a model?!
This is in fact one of the jobs of a so-called MCP Client: turning the tool descriptions into a prompt.
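A rough sketch of that conversion, assuming the tool list has already been fetched from the MCP server (via its tools/list call); the template below simply mimics the <tools> block captured above and is not Cherry Studio's actual code:

```python
import json

# Hypothetical shape: each tool has a name, a description, and a JSON Schema
# for its arguments, which is roughly what an MCP server's tools/list returns.
def render_tools_block(tools: list[dict]) -> str:
    parts = ["<tools>"]
    for t in tools:
        parts.append(
            "<tool>\n"
            f" <name>{t['name']}</name>\n"
            f" <description>{t.get('description', '')}</description>\n"
            " <arguments>\n"
            f" {json.dumps(t.get('inputSchema', {}))}\n"
            " </arguments>\n"
            "</tool>"
        )
    parts.append("</tools>")
    return "\n".join(parts)

# The single get-time tool from the capture above.
tools = [{
    "name": "fF7XC7MXYVyewGZyq9INMU",
    "description": "Get current time.",
    "inputSchema": {"type": "object", "properties": {}, "title": "get_timeArguments"},
}]

# The client splices this block into the fixed instruction text quoted at the
# end of this post to form the final system prompt.
print(render_tools_block(tools))
```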
TODO: I've heard that earlier versions of Cherry Studio used function calls instead of converting the tools directly into a system prompt; this still needs verification.
The Cherry Studio version I'm using (installed on Arch Linux via yay):
local/cherry-studio-bin 1.2.10-1
🍒 Cherry Studio is a desktop client that supports for multiple LLM providers
Now let's look at the response:
{
"id": "6d8d4563-50a3-415b-89ad-f6c95d8886a6",
"model": "deepseek-chat",
"choices": [
{
"delta": {
"role": "assistant",
"content": "I can use the time tool to get the current time for you.\n\n<tool_use>\n <name>fF7XC7MXYVyewGZyq9INMU</name>\n <arguments>{}</arguments>\n</tool_use>"
}
}
],
"usage": {
"prompt_tokens": 1245,
"completion_tokens": 53,
"total_tokens": 1298,
"prompt_tokens_details": {
"cached_tokens": 0
},
"prompt_cache_hit_tokens": 0,
"prompt_cache_miss_tokens": 1245
},
"streamed_data": [
{
"nonce": "622b8901fb",
"id": "6d8d4563-50a3-415b-89ad-f6c95d8886a6",
"object": "chat.completion.chunk",
"created": 1747480175,
"model": "deepseek-chat",
"system_fingerprint": "fp_8802369eaa_prod0425fp8",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"content": ""
},
"logprobs": null,
"finish_reason": null
}
]
},
{
"nonce": "390c",
"id": "6d8d4563-50a3-415b-89ad-f6c95d8886a6",
"object": "chat.completion.chunk",
"created": 1747480175,
"model": "deepseek-chat",
"system_fingerprint": "fp_8802369eaa_prod0425fp8",
"choices": [
{
"index": 0,
"delta": {
"content": "I"
},
"logprobs": null,
"finish_reason": null
}
]
},
{
"nonce": "b7a54f1f11",
"id": "6d8d4563-50a3-415b-89ad-f6c95d8886a6",
"object": "chat.completion.chunk",
"created": 1747480175,
"model": "deepseek-chat",
"system_fingerprint": "fp_8802369eaa_prod0425fp8",
"choices": [
{
"index": 0,
"delta": {
"content": " can"
},
"logprobs": null,
"finish_reason": null
}
]
},
...
// A long run of similar JSON objects is omitted here; since the answer is streamed, the gateway records a chunk for every generated token.
]
}
As you can see, the tool call is very simple, and it sits in the same response as the plain message text:
I can use the time tool to get the current time for you.
<tool_use>
<name>fF7XC7MXYVyewGZyq9INMU</name>
<arguments>{}</arguments>
</tool_use>
This is the second thing an MCP Client like Cherry Studio has to do: parse the tool_use XML tags out of the answer and call the MCP Server.
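A rough sketch of that parsing step, done with a plain regular expression over the format defined in the system prompt (not Cherry Studio's actual implementation):

```python
import json
import re

# Pull every <tool_use> block out of the assistant's reply and decode
# the tool name plus its JSON arguments.
TOOL_USE_RE = re.compile(
    r"<tool_use>\s*<name>(.*?)</name>\s*<arguments>(.*?)</arguments>\s*</tool_use>",
    re.DOTALL,
)

def parse_tool_uses(reply: str) -> list[tuple[str, dict]]:
    return [(name, json.loads(args)) for name, args in TOOL_USE_RE.findall(reply)]

reply = (
    "I can use the time tool to get the current time for you.\n\n"
    "<tool_use>\n <name>fF7XC7MXYVyewGZyq9INMU</name>\n <arguments>{}</arguments>\n</tool_use>"
)
print(parse_tool_uses(reply))  # [('fF7XC7MXYVyewGZyq9INMU', {})]
```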
As mentioned earlier, early versions of Cherry Studio reportedly used function calls. With function calls, the OpenAI-format response already packages the call inside the result JSON (in choices), so there is no need to write our own parser for the tool-use text. One can guess at the reasons: perhaps it was a matter of development time for the Cherry Studio team, since function calls are more convenient to implement, and only after a few releases did Cherry Studio switch to building its own system prompt and parsing the response itself (TODO: check the commit history on GitHub).
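For contrast, a sketch of what the same request would look like with native function calling (the OpenAI-compatible tools parameter, which DeepSeek's API also supports). The tool name here is a hypothetical get_time rather than the opaque id above, and this only illustrates the API shape, not what Cherry Studio actually sends:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="<DEEPSEEK_API_KEY>")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hey but do you know what time is it?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_time",  # hypothetical name for illustration
            "description": "Get current time.",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
)

# With function calling the tool call arrives pre-parsed in the response,
# so the client never has to scan the text for XML tags.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```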
Next, let's look at another record on Cloudflare.
This record, however, only has a response; the request shows "Not available".
In theory, after the model replied in XML that it wanted to call a tool, the MCP Server executed the tool and handed the result to the MCP Client, and the MCP Client sent it back to the model (probably as a user-role message); that request was not captured here.
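Judging from the format the system prompt dictates (and the prompt_tokens jump from 1298 total to 1339 in the next response), the uncaptured request presumably appends the assistant's XML reply plus a user message wrapping the tool result. A guess at its messages array, not an actual capture:

```python
# Guessed shape of the uncaptured second request, following the
# <tool_use_result> convention defined in the system prompt above.
messages = [
    {"role": "system", "content": "<the same long tool-use system prompt>"},
    {"role": "user", "content": "Hey but do you know what time is it?"},
    {"role": "assistant", "content": (
        "I can use the time tool to get the current time for you.\n\n"
        "<tool_use>\n <name>fF7XC7MXYVyewGZyq9INMU</name>\n <arguments>{}</arguments>\n</tool_use>"
    )},
    {"role": "user", "content": (
        "<tool_use_result>\n"
        " <name>fF7XC7MXYVyewGZyq9INMU</name>\n"
        " <result>19:09:42</result>\n"  # hypothetical; the final answer quotes this time
        "</tool_use_result>"
    )},
]
```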
{
"id": "c2f2f9e5-1a0a-45fc-8fe8-3e7030409e00",
"model": "deepseek-chat",
"choices": [
{
"delta": {
"role": "assistant",
"content": "The current time is 19:09:42."
}
}
],
"usage": {
"prompt_tokens": 1339,
"completion_tokens": 11,
"total_tokens": 1350,
"prompt_tokens_details": {
"cached_tokens": 1280
},
"prompt_cache_hit_tokens": 1280,
"prompt_cache_miss_tokens": 59
},
"streamed_data": [
...
]
}
This is the model's reply after it has learned the result of the tool call and put the answer together (a trivially simple one here, of course).
Now let's look at how Cherry Studio lays out these two assistant replies.
Looking closely at the screenshot, the two assistant replies in the JSON above are (visually) merged into a single message (one conversation result, i.e. everything under a single avatar). That is front-end UX work: it makes it feel as if the model thinks, then calls the tool, then produces the output. The actual process is (a minimal sketch of the loop follows the list):
- The model generates a tool call in XML format.
- Cherry Studio parses the tool call, sends a request to the MCP Server, and gets the result.
- Cherry Studio wraps the result in the XML format and sends it back to the model.
- The model combines the tool result and generates the final answer.
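Putting the four steps together, a minimal single-round loop, reusing the hypothetical helpers sketched earlier (parse_tool_uses, plus an assumed call_mcp_tool that forwards the call to the MCP server); again a sketch, not Cherry Studio's code:

```python
# Minimal single-round tool loop, under the same assumptions as the sketches above.
def run_turn(client, system_prompt: str, question: str, call_mcp_tool) -> str:
    messages = [
        {"role": "system", "content": system_prompt},    # tool descriptions baked in
        {"role": "user", "content": question},
    ]
    first = client.chat.completions.create(model="deepseek-chat", messages=messages)
    text = first.choices[0].message.content              # step 1: may contain <tool_use>

    calls = parse_tool_uses(text)                        # step 2: parse the XML tool call
    if not calls:
        return text                                      # no tool needed, answer directly
    name, args = calls[0]
    result = call_mcp_tool(name, args)                   # step 2: have the MCP Server run it

    messages += [                                        # step 3: wrap the result as a user turn
        {"role": "assistant", "content": text},
        {"role": "user", "content": (
            f"<tool_use_result>\n <name>{name}</name>\n <result>{result}</result>\n</tool_use_result>"
        )},
    ]
    final = client.chat.completions.create(model="deepseek-chat", messages=messages)
    return final.choices[0].message.content              # step 4: final answer using the result
```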
That was the simplest possible example. How an MCP Client like Cherry Studio coordinates multiple tool calls and cleans up the model's raw output is something I'll look into by reading the Cherry Studio source code when I find the time.
Finally, here is the default MCP prompt Cherry Studio used in this example:
In this environment you have access to a set of tools you can use to answer the user's question. You can use one tool per message, and will receive the result of that tool use in the user's response. You use tools step-by-step to accomplish a given task, with each tool use informed by the result of the previous tool use.
## Tool Use Formatting
Tool use is formatted using XML-style tags. The tool name is enclosed in opening and closing tags, and each parameter is similarly enclosed within its own set of tags. Here's the structure:
<tool_use>
<name>{tool_name}</name>
<arguments>{json_arguments}</arguments>
</tool_use>
The tool name should be the exact name of the tool you are using, and the arguments should be a JSON object containing the parameters required by that tool. For example:
<tool_use>
<name>python_interpreter</name>
<arguments>{"code": "5 + 3 + 1294.678"}</arguments>
</tool_use>
The user will respond with the result of the tool use, which should be formatted as follows:
<tool_use_result>
<name>{tool_name}</name>
<result>{result}</result>
</tool_use_result>
The result should be a string, which can represent a file or any other output type. You can use this result as input for the next action.
For example, if the result of the tool use is an image file, you can use it in the next action like this:
<tool_use>
<name>image_transformer</name>
<arguments>{"image": "image_1.jpg"}</arguments>
</tool_use>
Always adhere to this format for the tool use to ensure proper parsing and execution.
## Tool Use Examples
Here are a few examples using notional tools:
---
User: Generate an image of the oldest person in this document.
Assistant: I can use the document_qa tool to find out who the oldest person is in the document.
<tool_use>
<name>document_qa</name>
<arguments>{"document": "document.pdf", "question": "Who is the oldest person mentioned?"}</arguments>
</tool_use>
User: <tool_use_result>
<name>document_qa</name>
<result>John Doe, a 55 year old lumberjack living in Newfoundland.</result>
</tool_use_result>
Assistant: I can use the image_generator tool to create a portrait of John Doe.
<tool_use>
<name>image_generator</name>
<arguments>{"prompt": "A portrait of John Doe, a 55-year-old man living in Canada."}</arguments>
</tool_use>
User: <tool_use_result>
<name>image_generator</name>
<result>image.png</result>
</tool_use_result>
Assistant: the image is generated as image.png
---
User: "What is the result of the following operation: 5 + 3 + 1294.678?"
Assistant: I can use the python_interpreter tool to calculate the result of the operation.
<tool_use>
<name>python_interpreter</name>
<arguments>{"code": "5 + 3 + 1294.678"}</arguments>
</tool_use>
User: <tool_use_result>
<name>python_interpreter</name>
<result>1302.678</result>
</tool_use_result>
Assistant: The result of the operation is 1302.678.
---
User: "Which city has the highest population , Guangzhou or Shanghai?"
Assistant: I can use the search tool to find the population of Guangzhou.
<tool_use>
<name>search</name>
<arguments>{"query": "Population Guangzhou"}</arguments>
</tool_use>
User: <tool_use_result>
<name>search</name>
<result>Guangzhou has a population of 15 million inhabitants as of 2021.</result>
</tool_use_result>
Assistant: I can use the search tool to find the population of Shanghai.
<tool_use>
<name>search</name>
<arguments>{"query": "Population Shanghai"}</arguments>
</tool_use>
User: <tool_use_result>
<name>search</name>
<result>26 million (2019)</result>
</tool_use_result>
Assistant: The population of Shanghai is 26 million, while Guangzhou has a population of 15 million. Therefore, Shanghai has the highest population.
## Tool Use Available Tools
Above example were using notional tools that might not exist for you. You only have access to these tools:
<tools>
<tool>
<name>fF7XC7MXYVyewGZyq9INMU</name>
<description>Get current time.
</description>
<arguments>
{"type":"object","properties":{},"title":"get_timeArguments"}
</arguments>
</tool>
</tools>
## Tool Use Rules
Here are the rules you should always follow to solve your task:
1. Always use the right arguments for the tools. Never use variable names as the action arguments, use the value instead.
2. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself.
3. If no tool call is needed, just answer the question directly.
4. Never re-do a tool call that you previously did with the exact same parameters.
5. For tool use, MARK SURE use XML tag format as shown in the examples above. Do not use any other format.
# User Instructions
Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.