The subscription agent built on SubscriptionRunner in the previous note is a bit confusing: GitHubTrendingWatcher is the agent defined by subclassing Role, yet it is wrapped in a SubscriptionRunner. Which one is actually the agent we want, SubscriptionRunner or GitHubTrendingWatcher? Calling them agents also feels like a stretch: only one action inside GitHubTrendingWatcher, AnalyzeGitHubTrending, uses the LLM's information-extraction ability, while everything else is hand-written code. It feels less like an agent and more like an ordinary piece of software.
A real "subscription agent" should take a single natural-language instruction, for example "extract today's list from https://github.com/trending, recommend 2-3 LLM-related repositories with reasons, and send it to my WeChat at 7:30 pm", and then simply go do the work, pushing a GitHub Trending briefing to WeChat at 7:30 every evening. Or, given a new idea such as "fetch today's articles from https://www.qbitai.com/category/, summarize the main news from QbitAI, and send it to me at 5:30 pm every day", it should likewise push a daily digest of QbitAI's main articles at 5:30 pm.
There are two ways to implement this requirement:
Option 1: send the page's raw HTML directly to the LLM and let it summarize the information we need. A web page usually contains a lot of content unrelated to our topic, so most of those tokens are wasted, and because a subscription agent is triggered on a schedule, that cost keeps accumulating. With token prices still fairly high this gets expensive (as prices keep falling the issue may matter less, but saving tokens is always a good idea).
Option 2: use browser automation to fetch the page source, have the LLM write a crawler that extracts the required information from that source according to the user's requirement, run the crawler to pull out just the data the user needs, and then hand that data to the LLM to generate the report the user reads. Because the crawler code can be reused, its token cost is paid only once (to be verified!), and the report-generation step only feeds the LLM the extracted data, so it consumes far fewer tokens.
We choose Option 2 to implement a real "subscription agent", SubscriptionAssistant. The program flow is as follows (a minimal sketch of the final wiring follows the list):
- SubscriptionAssistant (Role) receives the user's original requirement, UserRequirement (Action). This action ships with MetaGPT and is currently an empty placeholder.
- SubscriptionAssistant decomposes the original requirement into the requirements the agent needs to run: the language to use, a cron expression, the crawler requirement (the URL list to crawl and the information to extract), and the post-processing to apply to the crawled information. This is ParseSubscriptionRequirement (Action).
- SubscriptionAssistant sends the crawler requirement to another agent, CrawlerEngineer (Role), the crawler engineer.
- CrawlerEngineer writes crawler code from that requirement: WriteCrawlerCode (Action).
- SubscriptionAssistant executes RunSubscription (Action) to complete the job of generating the subscription report on schedule. It calls SubscriptionRunner to run the scheduled task. SubscriptionRunner takes three arguments:
  - role: SubRole, which has a single action, SubAction: run the crawler code and turn the result into the user-facing report according to the post-processing requirement.
  - callback: wxpusher_callback, which pushes SubAction's output to WeChat. This part can be reused from the previous note.
  - trigger: CronTrigger (a generic cron trigger class that takes the scheduling condition as a parameter and can be reused later). It receives a cron expression `spec`; each time it fires, SubscriptionRunner runs once, i.e. SubAction followed by wxpusher_callback.
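As previewed, here is a minimal sketch of how that last step gets wired up, assuming the `SubRole`, `CronTrigger`, and `wxpusher_callback` pieces built later in this note; this is an illustration of the intent, not the final implementation:

```python
import asyncio

from metagpt.subscription import SubscriptionRunner


async def schedule(role, spec: str, callback):
    """Sketch: run `role` every time the cron expression `spec` fires and push its report via `callback`."""
    runner = SubscriptionRunner()
    # CronTrigger yields a Message on every cron tick; SubscriptionRunner then
    # runs the role once and hands the resulting Message to the callback.
    await runner.subscribe(role, CronTrigger(spec), callback)
    await runner.run()


# e.g. asyncio.run(schedule(SubRole(), "30 19 * * *", wxpusher_callback))
```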
Compared with the previous note, three parts are new:
- ParseSubscriptionRequirement (Action): calls the LLM to decompose the user's original requirement into the requirements the agent runs on; this introduces the concept of ActionNode.
- WriteCrawlerCode (Action): feeds the crawler requirement to the LLM to generate crawler code.
- The message-passing mechanism between agents.
Step 1: the code for ParseSubscriptionRequirement (Action), which calls the LLM to decompose the user's original requirement into the agent's runtime requirements.
import datetime
import sys
from typing import Optional
from uuid import uuid4
from aiocron import crontab
from metagpt.actions import UserRequirement
from metagpt.actions.action import Action
from metagpt.actions.action_node import ActionNode
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser, any_to_str
from metagpt.utils.parse_html import _get_soup
from pytz import BaseTzInfo
from metagpt.logs import logger
# Define the ActionNodes first
LANGUAGE = ActionNode(
key="Language",
expected_type=str,
instruction="Provide the language used in the project, typically matching the user's requirement language.",
example="en_us",
)
CRON_EXPRESSION = ActionNode(
key="Cron Expression",
expected_type=str,
instruction="If the user requires scheduled triggering, please provide the corresponding 5-field cron expression. "
"Otherwise, leave it blank.",
example="",
)
CRAWLER_URL_LIST = ActionNode(
key="Crawler URL List",
expected_type=list[str],
instruction="List the URLs user want to crawl. Leave it blank if not provided in the User Requirement.",
example=["<https://example1.com>", "<https://example2.com>"],
)
PAGE_CONTENT_EXTRACTION = ActionNode(
key="Page Content Extraction",
expected_type=str,
instruction="Specify the requirements and tips to extract from the crawled web pages based on User Requirement.",
example="Retrieve the titles and content of articles published today.",
)
CRAWL_POST_PROCESSING = ActionNode(
key="Crawl Post Processing",
expected_type=str,
instruction="Specify the processing to be applied to the crawled content, such as summarizing today's news.",
example="Generate a summary of today's news articles.",
)
INFORMATION_SUPPLEMENT = ActionNode(
key="Information Supplement",
expected_type=str,
instruction="If unable to obtain the Cron Expression, prompt the user to provide the time to receive subscription "
"messages. If unable to obtain the URL List Crawler, prompt the user to provide the URLs they want to crawl. Keep it "
"blank if everything is clear",
example="",
)
NODES = [
LANGUAGE,
CRON_EXPRESSION,
CRAWLER_URL_LIST,
PAGE_CONTENT_EXTRACTION,
CRAWL_POST_PROCESSING,
INFORMATION_SUPPLEMENT,
]
PARSE_SUBSCRIPTION_REQUIREMENTS_NODE = ActionNode.from_children("ParseSubscriptionReq", NODES)
PARSE_SUBSCRIPTION_REQUIREMENT_TEMPLATE = """
### User Requirement
{requirements}
"""
# Action that parses the subscription requirement
class ParseSubscriptionRequirement(Action):
async def run(self, requirements):
requirements = "\\n".join(i.content for i in requirements)
context = PARSE_SUBSCRIPTION_REQUIREMENT_TEMPLATE.format(requirements=requirements)
        node = await PARSE_SUBSCRIPTION_REQUIREMENTS_NODE.fill(context=context, llm=self.llm)
return node
The code above introduces ActionNode. An ActionNode can be viewed as a tree of actions: by the class definition, a parent node can access all of its child actions, so once a complete action tree is defined, every child action can be executed in tree order starting from the parent, which yields a CoT (Chain-of-Thought) effect.
After defining an ActionNode tree, you hand it to an Action subclass as a parameter, and that Action is then given to a Role as one of its actions. In that sense, an ActionNode tree can be seen as an Action with built-in CoT reasoning.
The ActionNode base class also provides format-checking and format-normalization utilities, so the content passed along during the CoT execution stays structured. This serves MetaGPT's goal of generating better, longer, less buggy code.
Taking the code above as an example:
- Six child nodes are defined first: LANGUAGE (the language the project uses), CRON_EXPRESSION (the cron expression), CRAWLER_URL_LIST (the URL list to crawl), PAGE_CONTENT_EXTRACTION (what to extract from the pages), CRAWL_POST_PROCESSING (how to post-process the extracted content), and INFORMATION_SUPPLEMENT (supplementary handling; here it is used to prompt the user when the information they provided is incomplete).
- The six child nodes are put into the list NODES in sequence, i.e. a "chain" structure. (Open question: are other structures, such as rings or trees, supported? To be verified; judging from the ActionNode source, not at the moment.)
- ActionNode.from_children builds the ParseSubscriptionReq node, PARSE_SUBSCRIPTION_REQUIREMENTS_NODE, from NODES.
- ParseSubscriptionRequirement (Action) invokes the node through its fill method: fill runs the compiled prompt against the LLM, applies format-specific prompt additions and constraints where needed, stores the result on the node itself, and returns it.
- Looking at the MetaGPT source, the call chain is fill -> simple_fill -> _aask_v1:
```python
# Depending on strgy, either run simple_fill on this node only, or run it on every child node
async def fill(self, context, llm, schema="json", mode="auto", strgy="simple"):
    self.set_llm(llm)
    self.set_context(context)
    if self.schema:
        schema = self.schema
    if strgy == "simple":
        return await self.simple_fill(schema=schema, mode=mode)
    elif strgy == "complex":
        # this implicitly assumes the node has children
        tmp = {}
        for _, i in self.children.items():
            child = await i.simple_fill(schema=schema, mode=mode)
            tmp.update(child.instruct_content.dict())
        cls = self.create_children_class()
        self.instruct_content = cls(**tmp)
        return self

# Compile the node's own context (which holds everything related to executing the action) into a prompt,
# run the prompt, and save the returned content on the node itself
async def simple_fill(self, schema, mode):
    prompt = self.compile(context=self.context, schema=schema, mode=mode)
    if schema != "raw":
        mapping = self.get_mapping(mode)
        class_name = f"{self.key}_AN"
        # content is the raw text returned by the LLM; scontent is the structured version.
        # _aask_v1 checks the structure of the LLM reply and, if it does not match the
        # expected format, asks the LLM to regenerate it.
        content, scontent = await self._aask_v1(prompt, class_name, mapping, schema=schema)
        self.content = content
        self.instruct_content = scontent
    else:
        self.content = await self.llm.aask(prompt)
        self.instruct_content = None
    return self

@retry(
    wait=wait_random_exponential(min=1, max=20),
    stop=stop_after_attempt(6),
    after=general_after_log(logger),
)
async def _aask_v1(
    self,
    prompt: str,
    output_class_name: str,
    output_data_mapping: dict,
    system_msgs: Optional[list[str]] = None,
    schema="markdown",  # compatible to original format
) -> (str, BaseModel):
    """Use ActionOutput to wrap the output of aask"""
    content = await self.llm.aask(prompt, system_msgs)
    logger.debug(f"llm raw output:\n{content}")
    output_class = self.create_model_class(output_class_name, output_data_mapping)
    if schema == "json":
        parsed_data = llm_output_postprecess(output=content, schema=output_class.schema(), req_key=f"[/{TAG}]")
    else:  # using markdown parser
        parsed_data = OutputParser.parse_data_with_mapping(content, output_data_mapping)
    logger.debug(f"parsed_data:\n{parsed_data}")
    instruct_content = output_class(**parsed_data)
    return content, instruct_content
```
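To make the call chain concrete, here is a minimal sketch of how the node gets filled and how the structured result is read back afterwards; the `parse_requirement` wrapper and the requirement string are illustrations, not part of the code above:

```python
# Sketch of what ParseSubscriptionRequirement.run effectively does, plus how a caller
# reads the result. fill() compiles the node tree into a prompt, sends it to the LLM,
# validates the reply against the six child nodes, and stores both the raw text and
# the structured result on the node.
async def parse_requirement(llm, user_requirement: str) -> dict:
    node = await PARSE_SUBSCRIPTION_REQUIREMENTS_NODE.fill(
        context=PARSE_SUBSCRIPTION_REQUIREMENT_TEMPLATE.format(requirements=user_requirement),
        llm=llm,
    )
    print(node.content)                  # raw [CONTENT]...[/CONTENT] text from the LLM
    return node.instruct_content.dict()  # validated, structured fields, e.g. req["Cron Expression"]
```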
To test ParseSubscriptionRequirement on its own, first define a SubscriptionAssistant with just this one action. The test results show that if the requirement is stated incompletely, the decomposition goes wrong and the agent certainly cannot finish the task, so some fault tolerance for the user's original requirement is needed (a sketch of one approach follows the test logs below).
# Define the Subscription Assistant role
class SubscriptionAssistant(Role):
"""Complete subscription report regularly according to user requirements."""
name: str = "Grace"
profile: str = "Subscription Assistant"
goal: str = "analyze user subscription requirements to provide personalized subscription services."
constraints: str = "utilize the same language as the User Requirement"
def __init__(self, **kwargs) -> None:
super().__init__(**kwargs)
self.set_actions([ParseSubscriptionRequirement])
async def _act(self) -> Message:
logger.info(f"{self._setting}: ready to {self.rc.todo}")
response = await self.rc.todo.run(self.rc.history)
msg = Message(
content=response.content,
instruct_content=response.instruct_content,
role=self.profile,
cause_by=self.rc.todo,
sent_from=self,
)
self.rc.memory.add(msg)
return msg
if __name__ == "__main__":
import asyncio
from metagpt.team import Team
team = Team()
team.hire([SubscriptionAssistant()])
# team.run_project("从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
# team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后发送给我")
asyncio.run(team.run())
Let's trace how the types and values of the key variables change while `_act` runs:
**print(type(self.rc.memory))**

<class 'metagpt.memory.memory.Memory'>

**print(self.rc.memory)**

storage=[Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我] index=defaultdict(<class 'list'>, {'metagpt.actions.add_requirement.UserRequirement': [Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我]}) ignore_id=False

**print(type(self.rc.history))**

<class 'list'>

**print(self.rc.history)**

[Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我]

**print(type(self.rc.todo))**

<class '__main__.ParseSubscriptionRequirement'>

**print(self.rc.todo)**

ParseSubscriptionRequirement

**response = await self.rc.todo.run(self.rc.history)**

**print(type(response))**

<class 'metagpt.actions.action_node.ActionNode'>

**print(response)**

ParseSubscriptionReq, <class 'str'>, , , [CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "55 14 * * *",
"Crawler URL List": [
"<https://pitchhub.36kr.com/financing-flash>"
],
"Page Content Extraction": "获取今天发布的所有初创企业融资信息的标题、链接和时间。",
"Crawl Post Processing": "总结今天的融资新闻,并在14:55发送。",
"Information Supplement": ""
}
[/CONTENT], {'Language': Language, <class 'str'>, Provide the language used in the project, typically matching the user's requirement language., en_us, , {}, 'Cron Expression': Cron Expression, <class 'str'>, If the user requires scheduled triggering, please provide the corresponding 5-field cron expression. Otherwise, leave it blank., , , {}, 'Crawler URL List': Crawler URL List, list[str], List the URLs user want to crawl. Leave it blank if not provided in the User Requirement., ['<https://example1.com>', '<https://example2.com>'], , {}, 'Page Content Extraction': Page Content Extraction, <class 'str'>, Specify the requirements and tips to extract from the crawled web pages based on User Requirement., Retrieve the titles and content of articles published today., , {}, 'Crawl Post Processing': Crawl Post Processing, <class 'str'>, Specify the processing to be applied to the crawled content, such as summarizing today's news., Generate a summary of today's news articles., , {}, 'Information Supplement': Information Supplement, <class 'str'>, If unable to obtain the Cron Expression, prompt the user to provide the time to receive subscription messages. If unable to obtain the URL List Crawler, prompt the user to provide the URLs they want to crawl. Keep it blank if everything is clear, , , {}}

**msg = Message(content=response.content, instruct_content=response.instruct_content, role=self.profile, cause_by=self.rc.todo, sent_from=self)**

**print(type(msg))**

<class 'metagpt.schema.Message'>

**print(msg)**

Subscription Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今天发布的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}

**self.rc.memory.add(msg)**

**print(self.rc.memory)**

storage=[Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我, Subscription Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今天发布的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}] index=defaultdict(<class 'list'>, {'metagpt.actions.add_requirement.UserRequirement': [Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我], '__main__.ParseSubscriptionRequirement': [Subscription Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今天发布的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}]}) ignore_id=False
当输入是"从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我",智能体能够很好的完成任务,运行日志:
2024-05-21 09:41:07.717 | INFO | metagpt.const:get_metagpt_package_root:29 - Package root set to D:\0GPT\playground
2024-05-21 09:41:14.539 | INFO | __main__:_act:106 - Grace(Subscription Assistant): ready to ParseSubscriptionRequirement
[CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "55 14 * * *",
"Crawler URL List": [
"<https://pitchhub.36kr.com/financing-flash>"
],
"Page Content Extraction": "获取今天发布的所有初创企业融资信息的标题、链接和时间。",
"Crawl Post Processing": "总结今天的融资新闻,并在14:55发送。",
"Information Supplement": ""
[/CONTENT]
2024-05-21 09:41:19.483 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.008 | Max budget: $10.000 | Current cost: $0.008, prompt_tokens: 463, completion_tokens: 102
当输入是"从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我",智能体自作主张的将url设定为"https://36kr.com",后续估计完成不了任务,运行日志:
2024-05-21 09:49:45.424 | INFO | metagpt.const:get_metagpt_package_root:29 - Package root set to D:\0GPT\playground
2024-05-21 09:49:50.708 | INFO | __main__:_act:106 - Grace(Subscription Assistant): ready to ParseSubscriptionRequirement
[CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "55 14 * * *",
"Crawler URL List": [
"<https://36kr.com>"
],
"Page Content Extraction": "抓取今日所有初创企业融资的新闻标题、链接及时间信息。",
"Crawl Post Processing": "整理并总结今天的融资新闻,于14:55发送。",
"Information Supplement": ""
}
[/CONTENT]
2024-05-21 09:49:56.669 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.008 | Max budget: $10.000 | Current cost: $0.008, prompt_tokens: 451, completion_tokens: 99
当输入是"从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后发送给我",智能体第一次报错,第二次运行正确,并在 "Information Supplement"提示 "如需设置定时发送,请提供所需的时间表达式。如需爬取其他网站,请告知相应的URL。",运行日志:
2024-05-21 09:50:37.115 | INFO | metagpt.const:get_metagpt_package_root:29 - Package root set to D:\0GPT\playground
2024-05-21 09:50:43.280 | INFO | __main__:_act:106 - Grace(Subscription Assistant): ready to ParseSubscriptionRequirement
[CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "", # 请根据需要提供定时触发的cron表达式。
"Crawler URL List": [
"<https://36kr.com>"
],
"Page Content Extraction": "抓取今日发布的所有初创企业融资新闻,提取标题、链接及时间信息。",
"Crawl Post Processing": "对抓取到的今日融资新闻内容进行总结。",
"Information Supplement": "如需定时接收订阅消息,请提供所需的时间信息。如需爬取其他网站,请提供相应的URL。"
}
[/CONTENT]
2024-05-21 09:50:48.597 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.008 | Max budget: $10.000 | Current cost: $0.008, prompt_tokens: 448, completion_tokens: 131
2024-05-21 09:50:48.607 | WARNING | metagpt.utils.repair_llm_raw_output:run_and_passon:268 - parse json from content inside [CONTENT][/CONTENT] failed at retry 1, exp: Expecting property name enclosed in double quotes: line 3 column 29 (char 55)
2024-05-21 09:50:48.612 | INFO | metagpt.utils.repair_llm_raw_output:repair_invalid_json:237 - repair_invalid_json, raw error: Expecting property name enclosed in double quotes: line 3 column 29 (char 55)
2024-05-21 09:50:48.614 | ERROR | metagpt.utils.common:log_it:554 - Finished call to 'metagpt.actions.action_node.ActionNode._aask_v1' after 5.328(s), this was the 1st time calling it. exp: RetryError[<Future at 0x279b27343a0 state=finished raised JSONDecodeError>]
[CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "",
"Crawler URL List": [
"<https://36kr.com>"
],
"Page Content Extraction": "抓取今日所有初创企业融资的新闻标题、链接及发布时间。",
"Crawl Post Processing": "对今日的融资新闻进行总结。",
"Information Supplement": "**如需设置定时发送,请提供所需的时间表达式。如需爬取其他网站,请告知相应的URL。**"
}
[/CONTENT]
2024-05-21 09:50:54.131 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.016 | Max budget: $10.000 | Current cost: $0.008, prompt_tokens: 448, completion_tokens: 113
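One way to add the fault tolerance raised above is to check the parsed fields before handing them downstream, and to turn a non-empty "Information Supplement" (or a missing URL list / cron expression) into a question back to the user instead of a crawling task. A minimal sketch with a hypothetical helper; the helper and the re-asking strategy are my own, not something MetaGPT provides, and the questions are in Chinese to respect the role's "same language as the User Requirement" constraint:

```python
def check_requirement(req: dict) -> list[str]:
    """Return clarification questions; an empty list means the requirement is complete."""
    questions = []
    if not req.get("Crawler URL List"):
        questions.append("请提供需要爬取的URL。")
    if not req.get("Cron Expression"):
        questions.append("请提供希望接收订阅消息的时间。")
    if req.get("Information Supplement"):
        questions.append(req["Information Supplement"])
    return questions

# Inside SubscriptionAssistant._act one could then do (sketch):
#   questions = check_requirement(response.instruct_content.dict())
#   if questions:
#       return Message(content="\n".join(questions), role=self.profile, cause_by=self.rc.todo)
```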
Step 2: implement the crawler engineer agent, CrawlerEngineer (Role), together with the action that writes crawler code from the crawler requirement, WriteCrawlerCode (Action).
from metagpt.actions.action import Action
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser, any_to_str
from metagpt.utils.parse_html import _get_soup
from pytz import BaseTzInfo
from metagpt.logs import logger
from subscriptionassistant import ParseSubscriptionRequirement
# Helper: build an outline view of the page's HTML structure
def get_outline(page):
soup = _get_soup(page.html)
outline = []
def process_element(element, depth):
name = element.name
if not name:
return
if name in ["script", "style"]:
return
element_info = {"name": element.name, "depth": depth}
if name in ["svg"]:
element_info["text"] = None
outline.append(element_info)
return
element_info["text"] = element.string
# Check if the element has an "id" attribute
if "id" in element.attrs:
element_info["id"] = element["id"]
if "class" in element.attrs:
element_info["class"] = element["class"]
outline.append(element_info)
for child in element.children:
process_element(child, depth + 1)
for element in soup.body.children:
process_element(element, 1)
return outline
CRAWLER_CODE_PROMPT_TEMPLATE = """Please complete the web page crawler parse function to achieve the User Requirement. The parse \
function should take a BeautifulSoup object as input, which corresponds to the HTML outline provided in the Context.
```python
from bs4 import BeautifulSoup
# only complete the parse function
def parse(soup: BeautifulSoup):
...
# Return the object that the user wants to retrieve, don't use print
```
## User Requirement
{requirement}
## Context
The outline of the html page to scrape is shown like below:
```tree
{outline}
```
"""
# Action that writes the crawler code
class WriteCrawlerCode(Action):
async def run(self, requirement):
requirement: Message = requirement[-1]
data = requirement.instruct_content.dict()
urls = data["Crawler URL List"]
query = data["Page Content Extraction"]
codes = {}
for url in urls:
codes[url] = await self._write_code(url, query)
return "\n".join(f"# {url}\n{code}" for url, code in codes.items())
async def _write_code(self, url, query):
page = await WebBrowserEngine().run(url)
outline = get_outline(page)
outline = "\n".join(
f"{' '*i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] if i['text'] else ''}"
for i in outline
)
code_rsp = await self._aask(CRAWLER_CODE_PROMPT_TEMPLATE.format(outline=outline, requirement=query))
code = CodeParser.parse_code(block="", text=code_rsp)
return code
# Define the Crawler Engineer role
class CrawlerEngineer(Role):
name: str = "John"
profile: str = "Crawling Engineer"
goal: str = "Write elegant, readable, extensible, efficient code"
constraints: str = "The code should conform to standards like PEP8 and be modular and maintainable"
def __init__(self, **kwargs) -> None:
super().__init__(**kwargs)
self.set_actions([WriteCrawlerCode])
self._watch([ParseSubscriptionRequirement])
First, we need to understand how messages are passed between two Roles (agents).
Once SubscriptionAssistant finishes the ParseSubscriptionRequirement action, the resulting message is detected by CrawlerEngineer's _watch, which triggers CrawlerEngineer to execute the action in its own action list, in this case WriteCrawlerCode.
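The routing key is the `cause_by` field of each Message: an Action class is identified by its import-path string, and `_watch` simply registers the strings a role reacts to. This matches the memory index seen in the walkthrough above; a small illustrative snippet (the exact path for ParseSubscriptionRequirement depends on which module defines it):

```python
from metagpt.actions import UserRequirement
from metagpt.utils.common import any_to_str
from subscriptionassistant import ParseSubscriptionRequirement

# Every Action is identified by its import path:
print(any_to_str(UserRequirement))               # 'metagpt.actions.add_requirement.UserRequirement'
print(any_to_str(ParseSubscriptionRequirement))  # e.g. '__main__.ParseSubscriptionRequirement' or
                                                 # 'subscriptionassistant.ParseSubscriptionRequirement'

# A Message produced in _act carries that string in cause_by, so
# CrawlerEngineer._watch([ParseSubscriptionRequirement]) means:
# "only react to messages whose cause_by equals any_to_str(ParseSubscriptionRequirement)".
```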
Going deeper, and building on the walkthrough of _act above, let's see how the parameters are passed when the WriteCrawlerCode action's run is executed.
print(type(requirement))
<class 'list'>
print(requirement) # the Message list from Memory, holding every Message produced by the Human and all the agents so far
[Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我, Subscription
Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['https://pitchhub.36kr.com/financing-flash'], 'Page Content Extraction': '获取今
天的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}]
requirement: Message = requirement[-1]
print(type(requirement))
<class 'metagpt.schema.Message'>
print(requirement) # WriteCrawlerCode takes the last Message in Memory, i.e. the one SubscriptionAssistant produced after ParseSubscriptionRequirement; the inputs WriteCrawlerCode needs are carried in its instruct_content (open question: why not in content? see the note after this walkthrough)
Subscription Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['https://pitchhub.36kr.com/financing-flash'], 'Page Content Extraction': '获取今天的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}
data = requirement.instruct_content.dict() # convert instruct_content into a dict so it can be worked with directly
Recall the definition of Message:
class Message(BaseModel):
"""list[<role>: <content>]"""
id: str = Field(default="", validate_default=True) # According to Section 2.2.3.1.1 of RFC 135
content: str
instruct_content: Optional[BaseModel] = Field(default=None, validate_default=True)
role: str = "user" # system / user / assistant
cause_by: str = Field(default="", validate_default=True)
sent_from: str = Field(default="", validate_default=True)
send_to: set[str] = Field(default={MESSAGE_ROUTE_TO_ALL}, validate_default=True)
print(type(data))
<class 'dict'>
print(data)
{'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['https://pitchhub.36kr.com/financing-flash'], 'Page Content Extraction': '获取今天的所有初
创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}
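This also answers the open question about `content` vs. `instruct_content`: `content` is the plain-text form of the message (what shows up in logs and gets joined into prompts, as in ParseSubscriptionRequirement.run above), while `instruct_content` is the structured pydantic model that the ActionNode builds from the validated LLM reply, so a downstream action can read fields directly instead of re-parsing text. Roughly:

```python
msg = requirement[-1]              # the Message produced after ParseSubscriptionRequirement

print(type(msg.content))           # <class 'str'>: free text, fine for logging and prompt building
print(type(msg.instruct_content))  # a pydantic BaseModel subclass generated by the ActionNode

req = msg.instruct_content.dict()  # structured access, no string parsing needed
urls = req["Crawler URL List"]
```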
Now let the two agents defined so far work together, as shown below. Note that writing one simple crawler function cost about 8K tokens!
import asyncio
from metagpt.team import Team
from crawlerengineer import CrawlerEngineer
from subscriptionassistant import SubscriptionAssistant
team = Team()
team.hire([SubscriptionAssistant(), CrawlerEngineer()])
team.run_project("从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
# team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
# team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后发送给我")
asyncio.run(team.run())
2024-05-21 13:12:38.328 | INFO | metagpt.roles.role:_act:391 - John(Crawling Engineer): to do WriteCrawlerCode(WriteCrawlerCode)
```python
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the results
results = []
# Find all the div elements with class 'item-title'
for item in soup.find_all('div', class_='item-title'):
# Extract the title, link, and time
title = item.a.get_text()
link = item.a['href']
time = item.find_next_sibling('div', class_='item-other').span.get_text()
# Append the extracted data to the results list
results.append({'title': title, 'link': link, 'time': time})
return results
```
This function will parse the provided HTML content and extract the title, link, and time for each financing news item. It returns a list of dictionaries, where each dictionary contains the title, link, and time for a news item.
2024-05-21 13:13:06.757 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.116 | Max budget: $10.000 | Current cost: $0.108, prompt_tokens: 7558, completion_tokens: 188
Step 3: implement RunSubscription (Action), which completes the job of running the crawler code on schedule and generating the subscription report. Add it to SubscriptionAssistant's action list, and set up the condition that triggers it, namely that CrawlerEngineer has finished WriteCrawlerCode.
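The trickiest part of RunSubscription below is turning the crawler source code, which arrives as a plain string inside a Message, back into a callable `parse` function at runtime. The pattern it uses is to create an empty module object and `exec` the code into its namespace; a standalone sketch of just that idea (the code string here is only an illustration):

```python
import sys
from uuid import uuid4

code = "def parse(soup):\n    return [a.get_text() for a in soup.find_all('a')]\n"

module = type(sys)(uuid4().hex)   # type(sys) is the built-in module type, so this creates an empty module
exec(code, module.__dict__)       # run the generated source inside that module's namespace

parse = getattr(module, "parse")  # now usable like any imported function
```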
Below is the complete code of SubscriptionAssistant (Role):
import datetime
import sys
from typing import Optional
from uuid import uuid4
import os
import aiohttp
from aiocron import crontab
from metagpt.actions import UserRequirement
from metagpt.actions.action import Action
from metagpt.actions.action_node import ActionNode
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser, any_to_str
from metagpt.utils.parse_html import _get_soup
from pytz import BaseTzInfo
from metagpt.logs import logger
# Define the ActionNodes first
LANGUAGE = ActionNode(
key="Language",
expected_type=str,
instruction="Provide the language used in the project, typically matching the user's requirement language.",
example="en_us",
)
CRON_EXPRESSION = ActionNode(
key="Cron Expression",
expected_type=str,
instruction="If the user requires scheduled triggering, please provide the corresponding 5-field cron expression. "
"Otherwise, leave it blank.",
example="38 17 * * *", # 增加一个例子,能够显著提高cron表达式生成的正确性
)
CRAWLER_URL_LIST = ActionNode(
key="Crawler URL List",
expected_type=list[str],
instruction="List the URLs user want to crawl. Leave it blank if not provided in the User Requirement.",
example=["https://example1.com", "https://example2.com"],
)
PAGE_CONTENT_EXTRACTION = ActionNode(
key="Page Content Extraction",
expected_type=str,
instruction="Specify the requirements and tips to extract from the crawled web pages based on User Requirement.",
example="Retrieve the titles and content of articles published today.",
)
CRAWL_POST_PROCESSING = ActionNode(
key="Crawl Post Processing",
expected_type=str,
instruction="Specify the processing to be applied to the crawled content, such as summarizing today's news.",
example="Generate a summary of today's news articles.",
)
INFORMATION_SUPPLEMENT = ActionNode(
key="Information Supplement",
expected_type=str,
instruction="If unable to obtain the Cron Expression, prompt the user to provide the time to receive subscription "
"messages. If unable to obtain the URL List Crawler, prompt the user to provide the URLs they want to crawl. Keep it "
"blank if everything is clear",
example="",
)
NODES = [
LANGUAGE,
CRON_EXPRESSION,
CRAWLER_URL_LIST,
PAGE_CONTENT_EXTRACTION,
CRAWL_POST_PROCESSING,
INFORMATION_SUPPLEMENT,
]
PARSE_SUBSCRIPTION_REQUIREMENTS_NODE = ActionNode.from_children("ParseSubscriptionReq", NODES)
PARSE_SUBSCRIPTION_REQUIREMENT_TEMPLATE = """
### User Requirement
{requirements}
"""
# Action that parses the subscription requirement
class ParseSubscriptionRequirement(Action):
async def run(self, requirements):
requirements = "\n".join(i.content for i in requirements)
context = PARSE_SUBSCRIPTION_REQUIREMENT_TEMPLATE.format(requirements=requirements)
node = await PARSE_SUBSCRIPTION_REQUIREMENTS_NODE.fill(context=context, llm=self.llm)
return node
# Trigger: crontab
class CronTrigger:
def __init__(self, spec: str, tz: Optional[BaseTzInfo] = None) -> None:
segs = spec.split(" ")
if len(segs) == 6:
spec = " ".join(segs[1:])
self.crontab = crontab(spec, tz=tz)
def __aiter__(self):
return self
async def __anext__(self):
await self.crontab.next()
return Message(datetime.datetime.now().isoformat())
# Callback: push messages to WeChat
class WxPusherClient:
def __init__(self, token: Optional[str] = None, base_url: str = "http://wxpusher.zjiecode.com"):
self.base_url = base_url
self.token = token or os.environ["WXPUSHER_TOKEN"]
async def send_message(
self,
content,
summary: Optional[str] = None,
content_type: int = 1,
topic_ids: Optional[list[int]] = None,
uids: Optional[list[int]] = None,
verify: bool = False,
url: Optional[str] = None,
):
payload = {
"appToken": self.token,
"content": content,
"summary": summary,
"contentType": content_type,
"topicIds": topic_ids or [],
"uids": uids or os.environ["WXPUSHER_UIDS"].split(","),
"verifyPay": verify,
"url": url,
}
url = f"{self.base_url}/api/send/message"
return await self._request("POST", url, json=payload)
async def _request(self, method, url, **kwargs):
async with aiohttp.ClientSession() as session:
async with session.request(method, url, **kwargs) as response:
response.raise_for_status()
return await response.json()
async def wxpusher_callback(msg: Message):
client = WxPusherClient()
await client.send_message(msg.content, content_type=3)
SUB_ACTION_TEMPLATE = """
## Requirements
Answer the question based on the provided context {process}. If the question cannot be answered, please summarize the context.
## context
{data}"
"""
# Action that runs the subscription agent
class RunSubscription(Action):
async def run(self, msgs):
        from metagpt.roles.role import Role  # note: the modules used below need to be imported here, inside run
from metagpt.subscription import SubscriptionRunner
import asyncio
code = msgs[-1].content
req = msgs[-2].instruct_content.dict()
urls = req["Crawler URL List"]
process = req["Crawl Post Processing"]
spec = req["Cron Expression"]
SubAction = self.create_sub_action_cls(urls, code, process)
SubRole = type("SubRole", (Role,), {})
role = SubRole()
role.set_actions([SubAction])
runner = SubscriptionRunner()
callbacks = []
callbacks.append(wxpusher_callback)
async def callback(msg):
await asyncio.gather(*(call(msg) for call in callbacks))
await runner.subscribe(role, CronTrigger(spec), callback)
await runner.run()
@staticmethod
def create_sub_action_cls(urls: list[str], code: str, process: str):
modules = {}
for url in urls[::-1]:
code, current = code.rsplit(f"# {url}", maxsplit=1)
name = uuid4().hex
module = type(sys)(name)
exec(current, module.__dict__)
modules[url] = module
class SubAction(Action):
async def run(self, *args, **kwargs):
pages = await WebBrowserEngine().run(*urls)
if len(urls) == 1:
pages = [pages]
data = []
for url, page in zip(urls, pages):
# data.append(getattr(modules[url], "parse")(page.soup))
try:
data.append(getattr(modules[url], "parse")(page.soup))
except AttributeError:
# Handle the exception here
print("Error: 'NoneType' object has no attribute 'text'")
return await self.llm.aask(SUB_ACTION_TEMPLATE.format(process=process, data=data))
return SubAction
# Define the Subscription Assistant role
class SubscriptionAssistant(Role):
"""Complete subscription report regularly according to user requirements."""
name: str = "Grace"
profile: str = "Subscription Assistant"
goal: str = "analyze user subscription requirements to provide personalized subscription services."
constraints: str = "utilize the same language as the User Requirement"
def __init__(self, **kwargs) -> None:
from crawlerengineer import WriteCrawlerCode
super().__init__(**kwargs)
self.set_actions([ParseSubscriptionRequirement, RunSubscription])
self._watch([UserRequirement, WriteCrawlerCode])
    async def _think(self) -> bool:  # _think must be overridden to decide how to respond to the different messages picked up by _watch
        from crawlerengineer import WriteCrawlerCode  # imported here, inside the method, to avoid a circular import with crawlerengineer
cause_by = self.rc.history[-1].cause_by
if cause_by == any_to_str(UserRequirement):
state = 0
elif cause_by == any_to_str(WriteCrawlerCode):
state = 1
if self.rc.state == state:
self.rc.todo = None
return False
self._set_state(state)
return True
async def _act(self) -> Message:
logger.info(f"{self._setting}: ready to {self.rc.todo}")
response = await self.rc.todo.run(self.rc.history)
msg = Message(
content=response.content,
instruct_content=response.instruct_content,
role=self.profile,
cause_by=self.rc.todo,
sent_from=self,
)
self.rc.memory.add(msg)
return msg
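A subtlety in the `_think` override above: the state integer indexes into the list passed to `set_actions`, so state 0 selects ParseSubscriptionRequirement and state 1 selects RunSubscription, and `_set_state(state)` makes that action the next `self.rc.todo` (roughly how the Role base class behaves in this MetaGPT version), while the `if self.rc.state == state` guard prevents re-running the same action for the same trigger. The routing can be restated as a small hypothetical lookup, equivalent in spirit to the if/elif above:

```python
from metagpt.actions import UserRequirement
from metagpt.utils.common import any_to_str
from crawlerengineer import WriteCrawlerCode

# Hypothetical restatement of the routing in _think: the cause of the newest
# watched message picks an index into set_actions([...]).
STATE_BY_CAUSE = {
    any_to_str(UserRequirement): 0,   # -> ParseSubscriptionRequirement
    any_to_str(WriteCrawlerCode): 1,  # -> RunSubscription
}

def pick_state(history) -> int:
    """Return the index of the next action to run, given the role's message history."""
    return STATE_BY_CAUSE[history[-1].cause_by]
```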
Building on the earlier walkthroughs of _act and of how parameters reach WriteCrawlerCode, let's look at how the parameters are passed when RunSubscription runs. (The difference between content and instruct_content in a Message was covered in the note after the WriteCrawlerCode walkthrough above.)
**print(type(msgs))**
<class 'list'>
**print(msgs) # the Message list from Memory, holding every Message produced by the Human and all the agents so far**
[Human: 从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我, Subscription
Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今
天的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}, Crawling Engineer: # <https://pitchhub.36kr.com/financing-flash>
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the parsed data
financing_info = []
# Find all the div elements with class 'item-title'
item_titles = soup.find_all('div', class_='item-title')
# Iterate through each item-title div
for item in item_titles:
# Extract the title, link, and time from each item
title = item.a.get_text()
link = item.a['href']
time = item.find_next_sibling('div', class_='item-other').find('span', class_='time').get_text()
# Append the extracted data to the list
financing_info.append({
'title': title,
'link': link,
'time': time
})
return financing_info
]
**print(msgs[-1])**
Crawling Engineer: # <https://pitchhub.36kr.com/financing-flash>
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the parsed data
financing_info = []
# Find all the div elements with class 'item-title'
item_titles = soup.find_all('div', class_='item-title')
# Iterate through each item-title div
for item in item_titles:
# Extract the title, link, and time from each item
title = item.a.get_text()
link = item.a['href']
time = item.find_next_sibling('div', class_='item-other').find('span', class_='time').get_text()
# Append the extracted data to the list
financing_info.append({
'title': title,
'link': link,
'time': time
})
return financing_info
**print(msgs[-1].content)**
# <https://pitchhub.36kr.com/financing-flash>
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the parsed data
financing_info = []
# Find all the div elements with class 'item-title'
item_titles = soup.find_all('div', class_='item-title')
# Iterate through each item-title div
for item in item_titles:
# Extract the title, link, and time from each item
title = item.a.get_text()
link = item.a['href']
time = item.find_next_sibling('div', class_='item-other').find('span', class_='time').get_text()
# Append the extracted data to the list
financing_info.append({
'title': title,
'link': link,
'time': time
})
return financing_info
**print(msgs[-2])**
Subscription Assistant: {'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今天的所有初创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}
**code = msgs[-1].content**
**print(type(code))**
<class 'str'>
**print(code)**
# <https://pitchhub.36kr.com/financing-flash>
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the parsed data
financing_info = []
# Find all the div elements with class 'item-title'
item_titles = soup.find_all('div', class_='item-title')
# Iterate through each item-title div
for item in item_titles:
# Extract the title, link, and time from each item
title = item.a.get_text()
link = item.a['href']
time = item.find_next_sibling('div', class_='item-other').find('span', class_='time').get_text()
# Append the extracted data to the list
financing_info.append({
'title': title,
'link': link,
'time': time
})
return financing_info
**req = msgs[-2].instruct_content.dict()**
**print(type(req))**
<class 'dict'>
**print(req)**
{'Language': 'zh_cn', 'Cron Expression': '55 14 * * *', 'Crawler URL List': ['<https://pitchhub.36kr.com/financing-flash>'], 'Page Content Extraction': '获取今天的所有初
创企业融资信息的标题、链接和时间。', 'Crawl Post Processing': '总结今天的融资新闻,并在14:55发送。', 'Information Supplement': ''}
Define a runner program to test the result:
import asyncio
from metagpt.team import Team
from crawlerengineer import CrawlerEngineer
from subscriptionassistant import SubscriptionAssistant
team = Team()
team.hire([SubscriptionAssistant(), CrawlerEngineer()])
team.run_project("从 <https://www.jiqizhixin.com/> 获取机器之心今天推送的文章,总结今天的主要资讯,然后在每天下午3点35分发送给我")
# team.run_project("从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
# team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:55发送给我")
# team.run_project("从36kr创投平台爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后发送给我")
asyncio.run(team.run())
Run log:
2024-05-21 15:31:49.882 | INFO | metagpt.const:get_metagpt_package_root:29 - Package root set to D:\0GPT\playground
2024-05-21 15:31:55.472 | INFO | subscriptionassistant:_act:246 - Grace(Subscription Assistant): ready to ParseSubscriptionRequirement
[CONTENT]
{
"Language": "zh_cn",
"Cron Expression": "35 15 * * *",
"Crawler URL List": [
"https://www.jiqizhixin.com/"
],
"Page Content Extraction": "获取机器之心今日推送的文章标题和内容。",
"Crawl Post Processing": "总结今天的主要资讯,形成摘要。",
"Information Supplement": ""
}
[/CONTENT]
2024-05-21 15:32:00.295 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.008 | Max budget: $10.000 | Current cost: $0.008, prompt_tokens: 460, completion_tokens: 93
2024-05-21 15:32:00.309 | INFO | metagpt.roles.role:_act:391 - John(Crawling Engineer): to do WriteCrawlerCode(WriteCrawlerCode)
```python
from bs4 import BeautifulSoup
def parse(soup: BeautifulSoup):
# Initialize an empty list to store the article titles and summaries
articles = []
# Find all the article containers in the soup
for article in soup.find_all('article', class_='article-item__container'):
# Extract the title and summary
title = article.find('a', class_='article-item__title t-strong js-open-modal').text.strip()
summary = article.find('p', class_='u-text-limit--two article-item__summary').text.strip()
# Append the title and summary to the articles list
articles.append({
'title': title,
'summary': summary
})
return articles
```
This function will return a list of dictionaries, where each dictionary contains the title and summary of an article.
2024-05-21 15:32:21.725 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.118 | Max budget: $10.000 | Current cost: $0.110, prompt_tokens: 7708, completion_tokens: 176
2024-05-21 15:32:21.737 | INFO | subscriptionassistant:_act:246 - Grace(Subscription Assistant): ready to RunSubscription
2024-05-21 15:35:00.007 | INFO | metagpt.roles.role:_act:391 - (): to do SubAction(SubAction)
今天的主要资讯摘要如下:
1. 联想集团推出了搭载高通骁龙X Elite处理器的下一代Copilot+ PC,包括联想Yoga Slim 7x和联想ThinkPad T14s Gen 6。
2. 火山引擎官网更新了豆包大模型的定价详情,展示了不同版本和规格的价格信息,支持国内最高并发标准。
3. WOT大会日程上线,将邀请数十位大模型实践企业分享经验,探讨大模型在企业级场景中的应用挑战。
4. 意大利罗马第二大学的研究人员提出了一种基于扩散模型的机器学习方法,用于在高雷诺数的三维湍流中生成单粒子轨迹。
5. 《Nature》发表文章,探讨如何通过心理学和神经科学来破解AI大模型的“思考”过程。
6. 一款AI视频换脸神器可以让安吉丽娜·朱莉瞬间变成“女版”马斯克。
7. 2024世界人工智能大会将于7月初在上海举办,持续广泛征集“人工智能+”应用场景。
8. 腾讯助力大模型进入“实用”时代,5分钟创建智能助手。
9. Karpathy称赞LLaMa3项目,从零开始实现,半天内获得1.5k个Star。
10. 首个GPU高级语言获得了8500个Star,支持大规模并行处理。
11. 一项研究探讨了数据量与数据质量在计算预算有限的情况下的选择问题。
以上就是今天的资讯摘要。
2024-05-21 15:35:18.751 | INFO | metagpt.utils.cost_manager:update_cost:57 - Total running cost: $0.014 | Max budget: $10.000 | Current cost: $0.014, prompt_tokens: 657, completion_tokens: 309
Parsing the user requirement took 553 tokens, writing the code took 7,884 tokens, and generating one day's report took 966 tokens.
If the HTML document were thrown straight at the LLM to generate the report, it would likely cost around 9,000 tokens every day, whereas this approach needs less than 1,000 tokens per day from now on, a bit over 10% of the cost. Not bad!
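A rough back-of-the-envelope for a month of daily reports, using the numbers from the logs above (the 9,000-token figure for the naive send-the-whole-HTML approach is the estimate from the text, not a measurement):

```python
# One-off costs, paid once per subscription
parse_requirement = 553    # tokens, parsing the user requirement
write_crawler     = 7884   # tokens, writing the parse() function

# Recurring costs
daily_report = 966         # tokens per day, summarizing only the extracted data
naive_daily  = 9000        # estimated tokens per day if the raw HTML were sent to the LLM

days = 30
option_two = parse_requirement + write_crawler + daily_report * days  # ~37k tokens/month
option_one = naive_daily * days                                       # ~270k tokens/month
print(f"{option_two / option_one:.0%}")                               # ~14% of the naive cost, amortized
```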