browser-use: A Technical Deep Dive into a New Era of Agent Web Browsing

Latest recommended article published on 2025-05-08 07:52:43

栋搞西搞

Tags: learning, artificial intelligence, language models, AI

Article link: https://blog.csdn.net/thomas20/article/details/147197697

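I. Overview

1. Project background

Manus and OpenAI's Deep Research have recently surged in popularity, and they have pulled a Python library called browser-use into the spotlight with them. The library has also attracted investors: it raised 17 million USD from backers including YC, and it is growing fast. Its founder, Müller, reportedly validated the idea in a single weekend. The team began by open-sourcing the project for the community, and browser-use took off from there. The lesson is that, in the AI community, validating ideas through fast iteration and sharing them as open source is a very effective way to build.

This write-up is aimed at people who may use browser-use in an agent product or team. I hope it is useful, and I look forward to more technical sharing on the topic. Because plenty of browser-use tutorials already exist (links are included in the references at the end), this write-up focuses on how browser-use works internally, plus a few interesting technical details I came across.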

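2. The problem it solves

browser-use exists mainly to let agents browse web pages with ease.

You might ask why not simply use web search. AI assistants have indeed long been able to fetch up-to-date information by plugging into a search engine.

However, web search has clear drawbacks:

Limited access to information: what we call is a search engine API, not the raw internet, so the available information is limited. Search engines usually cannot see large amounts of in-site data, such as the latest ticket prices on Ctrip (携程) or the latest product details on Taobao (淘宝). This kind of vertical, in-site information can only be obtained from inside each site.
Limited functionality: for hands-on tasks such as buying a train ticket, web search falls short because it can only look up basic information. browser-use operates web pages the way a person does, which unlocks far more functionality.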

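3. How it solves the problem

Built on Playwright: Playwright is a browser automation framework originally designed for end-to-end web testing. browser-use relies on its Python library, which is commonly used for automated browser testing and scraping.

Implementation approaches

Text-only: browser-use parses the web page (its HTML, CSS and JavaScript) and passes the extracted page elements to the large language model (LLM) as text. This approach is general and undemanding: the model only needs to support function calling, or to be good at following instructions and sticking to an output format. For example, it is compatible with all LangChain LLMs, including GPT-4, Claude 3 and Llama 2.
Text + vision: for multimodal models, supplying a screenshot together with the page elements improves results noticeably. A multimodal LLM can, for instance, read the images on a page, which a text-only LLM cannot.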
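To make the text-only approach more concrete, here is a minimal sketch that uses only Python's standard html.parser. It is not browser-use's actual extraction code; the tag set, the ElementIndexer class and the render_for_llm function are hypothetical names chosen for illustration. It numbers interactive elements and leaves other text as unnumbered context, loosely mirroring the indexed element listing that appears in the agent log later in this article.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not browser-use's real rules).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Interactive tags that have no closing tag.
VOID_INTERACTIVE_TAGS = {"input"}


class ElementIndexer(HTMLParser):
    """Collects interactive elements and plain text in document order."""

    def __init__(self):
        super().__init__()
        self.lines = []      # rendered output lines
        self._open = None    # [tag, text_parts] of the interactive element being read
        self._next_index = 0

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS or self._open is not None:
            return
        if tag in VOID_INTERACTIVE_TAGS:
            # <input> has no inner text, so describe it by placeholder or value.
            attr_map = dict(attrs)
            self._emit(tag, attr_map.get("placeholder") or attr_map.get("value") or "")
        else:
            self._open = [tag, []]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open is not None:
            self._open[1].append(text)
        else:
            self.lines.append(text)  # context-only text gets no index

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self._emit(tag, " ".join(self._open[1]))
            self._open = None

    def _emit(self, tag, text):
        self.lines.append(f"[{self._next_index}]<{tag} {text}/>")
        self._next_index += 1


def render_for_llm(html: str) -> str:
    """Return the page as indexed text that could be placed in an LLM prompt."""
    indexer = ElementIndexer()
    indexer.feed(html)
    indexer.close()
    return "\n".join(indexer.lines)


page = (
    '<h1>Blog</h1><a href="/blog">博客</a>'
    '<input placeholder="search"><button>搜索</button><p>Hello</p>'
)
print(render_for_llm(page))
```

The model can then refer to an element by its index alone (for example "click element 2"), which is why this approach works with models that have no vision capability.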

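4. Pricing

Besides the free open-source version, the maintainers offer paid services, mainly technical consulting and support for private deployment. For most use cases, the free open-source version is all you need.

II. Hands-on test

Below is the test code, which I adapted. Its task: visit the 栋搞西搞 CSDN blog, open the first article, and list a summary of its content.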

from langchain_openai import ChatOpenAI
from browser_use import Agent
import asyncio
from pydantic import SecretStr
from browser_use import BrowserConfig, Browser
import base64
import os

# Basic configuration
config = BrowserConfig(
    # Set to True to run the browser in headless mode
    headless=False,
)

browser = Browser(config=config)

# I use the DeepSeek V3 model hosted on Volcano Engine here; swap in the provider you prefer and replace the model name and API key
llm=ChatOpenAI(base_url='https://ark.cn-beijing.volces.com/api/v3',
               model='{your ark model endpoint ID}',
               api_key=SecretStr('{your api key}'))


def save_base64_image(base64_data, output_path):
    """
    Save a base64-encoded image to the given local path.

    Args:
        base64_data (str): base64-encoded image data, with or without a header (e.g. 'data:image/jpeg;base64,')
        output_path (str): full path for the saved image, including file name and extension

    Returns:
        bool: True if the image was saved, False otherwise
    """
    try:
        # Strip the header if base64_data has one
        if ';base64,' in base64_data:
            base64_data = base64_data.split(';base64,')[1]

        # Make sure the output directory exists
        output_dir = os.path.dirname(output_path)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)

        # Decode the base64 data and write it to the file
        with open(output_path, 'wb') as f:
            f.write(base64.b64decode(base64_data))

        return True
    except Exception as e:
        print(f"Failed to save image: {e}")
        return False

async def main():
    agent = Agent(
        browser=browser,
        task="访问 https://blog.csdn.net/thomas20，并访问第一篇文章，将文章内容概要列出来",
        llm=llm,
        use_vision=False,  # Must be False when the model is not multimodal
        save_conversation_path="logs/conversation"  # Save chat logs
    )

    history = await agent.run()

    # Example of accessing history
    # Access (some) useful information
    # Save the screenshots taken during the run
    for i, imgStr in enumerate(history.screenshots()):
        save_base64_image(imgStr, f'logs/{i}.png')

    # Print some key information
    print(f'history.urls(): {history.urls()}')  # List of visited URLs
    print(f'history.action_names(): {history.action_names()}') # Names of executed actions
    print(f'history.extracted_content(): {history.extracted_content()}') # Content extracted during execution
    print(f'history.errors(): {history.errors()}') # Any errors that occurred
    print(f'history.model_actions(): {history.model_actions()}') # All actions with their parameters

asyncio.run(main())

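Key points in the code:

The code above uses the DeepSeek V3 model on Volcano Engine, mainly because the OpenAI API is not reachable from mainland China. In practice you can switch to whichever provider you prefer and replace the model name and API key accordingly.
By default, running the code opens a browser window that shows the model operating the browser in real time. To run headless instead, set headless=True.
I added a save_base64_image function to store the page screenshots captured while the code runs.
Setting save_conversation_path="logs/conversation" saves the agent's conversation log, which helps in understanding how browser-use works.
Many people hit errors when using a model without multimodal support, because the default configuration assumes a multimodal model. Setting use_vision=False on the Agent resolves this.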
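Since swapping providers means changing the base URL, model name and API key, one option is to read them from the environment instead of editing the script. The sketch below uses only the standard library; the variable names LLM_BASE_URL, LLM_MODEL and LLM_API_KEY and the LLMSettings class are hypothetical, chosen for this example rather than taken from browser-use or LangChain.

```python
import os
from dataclasses import dataclass

# Default endpoint taken from the test script above.
DEFAULT_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    model: str
    api_key: str


def load_llm_settings(env=None) -> LLMSettings:
    """Read endpoint settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    missing = [name for name in ("LLM_MODEL", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", DEFAULT_BASE_URL),
        model=env["LLM_MODEL"],
        api_key=env["LLM_API_KEY"],
    )


# Passing a plain dict makes the function easy to try without touching os.environ.
settings = load_llm_settings({"LLM_MODEL": "my-endpoint-id", "LLM_API_KEY": "secret"})
print(settings.base_url, settings.model)
```

The resulting values could then be passed to the ChatOpenAI constructor from the test script, for example ChatOpenAI(base_url=settings.base_url, model=settings.model, api_key=SecretStr(settings.api_key)).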

Screenshots from a test run of the code above:

Step 1: Open the specified URL

Step 2: Extract the page content

The final output:

“The article titled 'Cursor 深度体验：如何通过 AI 重塑你的编程体验' by 栋搞西搞 was published on 2024-09-04. It discusses the AI-powered code editor Cursor, highlighting its unique features and advantages over other tools like GitHub Copilot. Key points include its recent popularity due to endorsements and a case study of an 8-year-old using it to develop a web app. The author strongly recommends Cursor for improving programming efficiency and experience.”

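The summary is fairly accurate, but I noticed one problem: my task description was written in Chinese, yet the final answer came back in English. This is probably because most of browser-use's built-in default prompts are written in English. It may need tuning before production use.

Overall, browser-use is very easy to use, and its web content extraction works reasonably well.

III. How it works

Running the example above also produces the agent's raw conversation log, which helps us understand how browser-use works. The log contains the System Prompt, the agent's action history, and other information.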
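One simple mitigation to try is stating the desired output language explicitly in the task itself. The helper below is a hypothetical sketch (with_output_language is not part of browser-use), and whether a given model honors the instruction would still need to be tested.

```python
def with_output_language(task: str, language: str = "中文") -> str:
    """Append an explicit output-language instruction to the task text."""
    instruction = f"请使用{language}输出最终结果。"
    task = task.rstrip()
    if task.endswith(instruction):
        return task  # already present, so do not add it twice
    return f"{task}\n{instruction}"


task = with_output_language(
    "访问 https://blog.csdn.net/thomas20，并访问第一篇文章，将文章内容概要列出来"
)
print(task)
```

The returned string can be passed as the task argument of the Agent in the test script.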

1. System Prompt

The full System Prompt is long, but it helps in understanding how things work, so I am including it here in full.

https://github.com/browser-use/browser-use/blob/main/browser_use/agent/system_prompt.md

You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.

# Input Format
Task
Previous steps
Current URL
Open Tabs
Interactive Elements
[index]<type>text</type>
- index: Numeric identifier for interaction
- type: HTML element type (button, input, etc.)
- text: Element description
Example:
[33]<button>Submit Form</button>

- Only elements with numeric indexes in [] are interactive
- elements without [] provide only context

# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
{{"current_state": {{"evaluation_previous_goal": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not",
"memory": "Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz",
"next_goal": "What needs to be done with the next immediate action"}},
"action":[{{"one_action_name": {{// action-specific parameter}}}}, // ... more actions in sequence]}}

2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. Use maximum {max_actions} actions per sequence.
Common action sequences:
- Form filling: [{{"input_text": {{"index": 1, "text": "username"}}}}, {{"input_text": {{"index": 2, "text": "password"}}}}, {{"click_element": {{"index": 3}}}}]
- Navigation and extraction: [{{"go_to_url": {{"url": "https://example.com"}}}}, {{"extract_content": {{"goal": "extract the names"}}}}]
- Actions are executed in the given order
- If the page changes after an action, the sequence is interrupted and you get the new state.
- Only provide the action sequence until an action which changes the page state significantly.
- Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page
- only use multiple actions if it makes sense.

3. ELEMENT INTERACTION:
- Only use indexes of the interactive elements
- Elements marked with "[]Non-interactive text" are non-interactive

4. NAVIGATION & ERROR HANDLING:
- If no suitable elements exist, use other functions to complete the task
- If stuck, try alternative approaches - like going back to a previous page, new search, new tab etc.
- Handle popups/cookies by accepting or closing them
- Use scroll to find elements you are looking for
- If you want to research something, open a new tab instead of using the current tab
- If captcha pops up, try to solve it - else try a different approach
- If the page is not fully loaded, use wait action

5. TASK COMPLETION:
- Use the done action as the last action as soon as the ultimate task is complete
- Dont use "done" before you are done with everything the user asked you, except you reach the last step of max_steps.
- If you reach your last step, use the done action even if the task is not fully finished. Provide all the information you have gathered so far. If the ultimate task is completely finished set success to true. If not everything the user asked for is completed set success in done to false!
- If you have to do something repeatedly for example the task says for "each", or "for all", or "x times", count always inside "memory" how many times you have done it and how many remain. Don't stop until you have completed like the task asked you. Only call done after the last step.
- Don't hallucinate actions
- Make sure you include everything you found out for the ultimate task in the done text parameter. Do not just say you are done, but include the requested information of the task.

6. VISUAL CONTEXT:
- When an image is provided, use it to understand the page layout
- Bounding boxes with labels on their top right corner correspond to element indexes

7. Form filling:
- If you fill an input field and your action sequence is interrupted, most often something changed e.g. suggestions popped up under the field.

8. Long tasks:
- Keep track of the status and subresults in the memory.
- You are provided with procedural memory summaries that condense previous task history (every N steps). Use these summaries to maintain context about completed actions, current progress, and next steps. The summaries appear in chronological order and contain key information about navigation history, findings, errors encountered, and current state. Refer to these summaries to avoid repeating actions and to ensure consistent progress toward the task goal.

9. Extraction:
- If your task is to find information - call extract_content on the specific pages to get and store the information.
Your responses must be always JSON with the specified format.
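A note on the doubled braces in the prompt above: they are consistent with the file being used as a Python str.format template, where {{ and }} stand for literal braces and {max_actions} is a placeholder filled in at runtime. That reading is my assumption from the text itself. The sketch below shows the mechanism on a shortened, made-up template; the value 10 is arbitrary.

```python
import json

# A shortened stand-in for the prompt file, not the real template.
template = (
    "Use maximum {max_actions} actions per sequence.\n"
    'Respond as {{"action": [{{"done": {{"success": true}}}}]}}'
)

# str.format fills {max_actions} and turns every {{ or }} into a single brace.
rendered = template.format(max_actions=10)
print(rendered)

# After formatting, the example on the second line is plain, valid JSON.
example = json.loads(rendered.splitlines()[1].split("Respond as ", 1)[1])
print(example)
```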

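As you can see, the System Prompt defines the model's role, the input and output format requirements, and other important details. A few parts are worth explaining:

Input Format: the format of the input given to the model
Response Rules: the format of the model's output, including the exact JSON structure the model must return
Actions: the actions the agent is allowed to perform; common ones include
- input_text: type text into a field
- click_element: click an element
- go_to_url: navigate to a URL
- extract_content: extract and summarize specific information from the page
TASK COMPLETION: the criteria for deciding that the task is complete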
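The rules about action sequences (run in order, one action name per item, stop once the page changes) can be sketched as a small dispatcher. This is a toy model under stated assumptions, not browser-use's controller: FakePage, the handler functions and the rule that every click changes the page are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Stand-in for a browser page; a real agent would drive a browser instead."""
    url: str = "about:blank"
    fields: dict = field(default_factory=dict)


def go_to_url(page, url):
    page.url = url
    return True   # the page changed


def input_text(page, index, text):
    page.fields[index] = text
    return False  # the page did not change


def click_element(page, index):
    return True   # toy assumption: every click changes the page


HANDLERS = {"go_to_url": go_to_url, "input_text": input_text, "click_element": click_element}


def run_actions(page, actions, max_actions=10):
    """Run actions in order and stop early once one of them changes the page."""
    executed = []
    for item in actions[:max_actions]:
        if len(item) != 1:
            raise ValueError("each item must contain exactly one action name")
        (name, params), = item.items()
        if name not in HANDLERS:
            raise ValueError(f"unknown action: {name}")
        page_changed = HANDLERS[name](page, **params)
        executed.append(name)
        if page_changed:
            break  # the model should see the new page state before continuing
    return executed


page = FakePage()
executed = run_actions(page, [
    {"input_text": {"index": 1, "text": "username"}},
    {"input_text": {"index": 2, "text": "password"}},
    {"click_element": {"index": 3}},
    {"input_text": {"index": 4, "text": "never typed"}},
])
print(executed)
```

In this run the fourth action is skipped because the click is treated as a page change, which matches the prompt's rule that the sequence is interrupted and the model receives the new state.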

2. Action history

Below is the rest of the log beyond the System Prompt, i.e. the action history, for reference:

HumanMessage 
Your ultimate task is: """访问 https://blog.csdn.net/thomas20，并访问第一篇文章，将文章内容概要列出来.""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.

 HumanMessage 
Example output:

 AIMessage 


 ToolMessage 
Browser started

 HumanMessage 
[Your task history memory starts here]

 AIMessage 


 ToolMessage 


 HumanMessage 
Action result: 🔗  Navigated to https://blog.csdn.net/thomas20

 AIMessage 


 ToolMessage 


 HumanMessage 
Action result: 🖱️  Clicked button with index 54:

 AIMessage 


 ToolMessage 


 HumanMessage 
Action result: 📄  Extracted from page
: ```json
{
  "title": "Cursor 深度体验：如何通过 AI 重塑你的编程体验",
  "author": "栋搞西搞",
  "publication_date": "2024-09-04",
  "views": "5.7k",
  "likes": 24,
  "bookmarks": 31,
  "tags": ["人工智能"],
  "summary": {
    "introduction": "Cursor 是一款基于 AI 的代码编辑器，近期因其出色的 AI 集成和用户体验设计而受到广泛关注。文章分享了作者一周的使用体验，认为 Cursor 在提升编程效率方面表现出色，并带来了一些惊喜。",
    "key_points": [
      {
        "title": "Cursor 的出圈",
        "content": "Cursor 近期因 AI 大佬 Andrej Karpathy 的推荐和一位 8 岁小女孩使用 Cursor 开发网页应用的案例而受到广泛关注。"
      },
      {
        "title": "Cursor 的独特之处",
        "content": "与其他 AI 编程工具（如 GitHub Copilot、商汤代码小浣熊等）相比，Cursor 在 AI 与编程的深度结合上做得更好，用户体验设计尤为出色。"
      },
      {
        "title": "核心优势",
        "content": "Cursor 通过 AI 技术帮助开发者更高效地编写、优化代码，并提供智能化的调试与改进建议。其功能包括代码生成、上下文感知和多语言支持等。"
      }
    ],
    "conclusion": "作者强烈推荐程序员尝试 Cursor，认为它在提升编程效率和体验方面具有显著优势。"
  },
  "images": [
    "https://img-blog.csdnimg.cn/img_convert/a166db45af08f1c3b8b476413bdb5e45.png",
    "https://img-blog.csdnimg.cn/img_convert/8c4b3df4833c34706800d067efde4182.jpeg",
    "https://img-blog.csdnimg.cn/img_convert/6daa9ddfcbffd4ce583c391ef6cd2bf2.png",
    "https://img-blog.csdnimg.cn/img_convert/d3c5a2a9c824461e63ce6d169c27518f.png"
  ],
  "url": "https://blog.csdn.net/thomas20/article/details/141901654"
}
```

 HumanMessage 
[Task history memory ends]
[Current state starts here]
The following is one-time information - if you need to remember it write it to memory:
Current url: https://blog.csdn.net/thomas20/article/details/141901654?spm=1001.2014.3001.5501
Available tabs:
[TabInfo(page_id=0, url='https://blog.csdn.net/thomas20', title='栋搞西搞-CSDN博客'), TabInfo(page_id=1, url='https://blog.csdn.net/thomas20/article/details/141901654?spm=1001.2014.3001.5501', title='Cursor 深度体验：如何通过 AI 重塑你的编程体验_cursor composer-CSDN博客')]
Interactive elements from top layer of the current page inside the viewport:
[Start of page]
[0]<a />
[1]<img />
[2]<a 博客/>
[3]<a 下载/>
[4]<a 学习/>
[5]<a 社区/>
[6]<a GitCode/>
[7]<img />
[8]<a InsCodeAI/>
[9]<a 会议/>
[10]<input text;cursor composer/>
[11]<button 搜索/>
C 知道
[12]<a 登录/>
[13]<a 会员中心/>
[14]<img />
[15]<a 消息/>
[16]<a 历史/>
[17]<a 创作中心/>
[18]<a 创作/>
Cursor 深度体验：如何通过 AI 重塑你的编程体验
[19]<img />
[20]<a />
[21]<img />
[22]<a 栋搞西搞/>
[23]<img />
于 2024-09-04 19:20:36 发布
[24]<img />
阅读量5.7k
[25]<a 收藏
31/>
[26]<img />
[27]<img />
点赞数
                            24
文章标签：
[28]<a 人工智能/>
[29]<a 版权/>
[30]<img />
1.
Cursor
介绍
1.1 Cursor出圈了
最近，AI
代码编辑器
Cursor 变得非常火热，一方面，是因为 AI 大佬Andrej Karpathy 在 X 上推荐了它。
[31]<img />
同时，
国外一位 8 岁的小女孩使用 Cursor 和 Composer 从零开始开发了一个网页应用
，并将过程发布在
[32]<a YouTube/>
上，引起了广泛关注。
[33]<a />
[34]<img />
[35]<a 栋搞西搞/>
博客等级
[36]<img />
码龄12年
[37]<a 18
原创/>
158
点赞
177
收藏
113
粉丝
[38]<a 关注/>
[39]<a 私信/>
[40]<a />
[41]<img />
[42]<a />
[43]<img />
[44]<a 🚀
一键安装管理服务器！
超过1500万台服务器
装宝塔，2分钟安装好！
界面美观！可视化操作！/>
[45]<a 点击了解万维广告联盟>广告/>
[46]<img />
热门文章
[47]<a Cursor 深度体验：如何通过 AI 重塑你的编程体验
5740/>
[48]<img />
[49]<a LLM基础课: 跟着大神 Andrej Karpathy 学习 Byte Pair Encoding
2109/>
[50]<img />
[51]<a 喵星工程师：机器猫的制造日志
2064/>
[52]<img />
[53]<a 论文分享：当AI遇上世界模型，超长上下文中的多模态探险 World Model on Million-Length Video And Language With RingAttention
1799/>
[54]<img />
[55]<a 手把手一起造个机器猫
935/>
[56]<img />
分类专栏
[57]<a 如何写出诗一样优雅的代码/>
[58]<img />
3篇
[59]<a 东搞西搞/>
[60]<img />
1篇
[61]<a Golang 编程技巧/>
[62]<img />
4篇
[63]<a 前辈传记与随想/>
[64]<img />
3篇
最新评论
[65]<a Cursor 深度体验：如何通过 AI 重塑你的编程体验/>
[66]<a 栋搞西搞:/>
上网要讲究科学
[67]<a Cursor 深度体验：如何通过 AI 重塑你的编程体验/>
[68]<a mychinabc:/>
Cursor 在国内可以使用的啊
[69]<a Cursor 深度体验：如何通过 AI 重塑你的编程体验/>
[70]<a CSDN-Ada助手:/>
不知道 Python入门 技能树是否可以帮到你：https://edu.csdn.net/skill/python?utm_source=AI_act_python
[71]<a LLM基础课: 跟着大神 Andrej Karpathy 学习 Byte Pair Encoding/>
[72]<a CSDN-Ada助手:/>
恭喜您发布了第16篇博客！学习 Andrej Karpathy 大神的 Byte Pair Encoding 确实是个不错的选择。接下来，我建议您可以深入研究一些实际应用场景，或者结合其他的编码算法进行比较分析，这样可以更加全面地了解和掌握相关知识。期待您的下一篇作品！愿您在创作的道路上不断进步，谢谢您的分享。
[73]<a 论文分享：当AI遇上世界模型，超长上下文中的多模态探险 World Model on Million-Length Video And Language With RingAttention/>
[74]<a CSDN-Ada助手:/>
恭喜您发布第15篇博客！标题“论文分享：当AI遇上世界模型，超长上下文中的多模态探险”听起来非常专业和引人入胜。对于AI和世界模型的探索，您的文章一定会给读者带来深刻的启发和思考。接下来，我建议您可以考虑进一步深入讨论多模态探险的具体案例，或者探索AI在不同领域的应用，这样能够为读者提供更丰富的内容和启发。期待您更多精彩的创作！
登录后您可以享受以下权益：
[75]<img />
免费复制代码
[76]<img />
和博主大V互动
[77]<img />
下载海量资源
[78]<img />
发动态/写文章/加入社区
×
[79]<button 立即登录/>
[80]<a 点击体验
DeepSeekR1满血版/>
[81]<img />
... 3195 pixels below - scroll or extract content to see more ...
Current step: 4/100Current date and time: 2025-04-13 15:18

 RESPONSE
{
  "current_state": {
    "evaluation_previous_goal": "Success - Extracted the content summary of the article",
    "memory": "Extracted the content summary of the article. The task is now complete.",
    "next_goal": "Complete the task"
  },
  "action": [
    {
      "done": {
        "text": "The article titled 'Cursor \u6df1\u5ea6\u4f53\u9a8c\uff1a\u5982\u4f55\u901a\u8fc7 AI \u91cd\u5851\u4f60\u7684\u7f16\u7a0b\u4f53\u9a8c' by \u680b\u641e\u897f\u641e was published on 2024-09-04. It discusses the AI-powered code editor Cursor, highlighting its unique features and advantages over other tools like GitHub Copilot. Key points include its recent popularity due to endorsements and a case study of an 8-year-old using it to develop a web app. The author strongly recommends Cursor for improving programming efficiency and experience.",
        "success": true
      }
    }
  ]
}
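The RESPONSE above follows the JSON structure required by the System Prompt. As a rough illustration of how such a reply could be checked and its final answer read out, here is a standard-library sketch; parse_agent_response and final_answer are hypothetical helpers, not browser-use functions, and the sample reply is a shortened version of the one in the log.

```python
import json

REQUIRED_STATE_KEYS = {"evaluation_previous_goal", "memory", "next_goal"}


def parse_agent_response(raw: str) -> dict:
    """Parse a model reply and check the structure the System Prompt asks for."""
    data = json.loads(raw)
    state = data.get("current_state", {})
    missing = REQUIRED_STATE_KEYS - state.keys()
    if missing:
        raise ValueError(f"current_state is missing: {sorted(missing)}")
    actions = data.get("action")
    if not isinstance(actions, list) or not actions:
        raise ValueError("action must be a non-empty list")
    for item in actions:
        if not isinstance(item, dict) or len(item) != 1:
            raise ValueError("each action item must hold exactly one action name")
    return data


def final_answer(data: dict):
    """Return (text, success) if the last action is done, otherwise None."""
    last = data["action"][-1]
    if "done" in last:
        return last["done"]["text"], last["done"]["success"]
    return None


# Raw string so the \u escapes reach the JSON parser unchanged.
raw = r'''{"current_state": {"evaluation_previous_goal": "Success",
 "memory": "The task is now complete.", "next_goal": "Complete the task"},
 "action": [{"done": {"text": "Cursor \u6df1\u5ea6\u4f53\u9a8c", "success": true}}]}'''

text, success = final_answer(parse_agent_response(raw))
print(text, success)
```

Note that the \u escapes in the logged reply are ordinary JSON string escapes, so json.loads turns them back into the Chinese title.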

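As we can see, browser-use parses the interactive elements in the current viewport, tags each one with a numeric index, and hands the list to the LLM so it can decide which action to execute next.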

Available tabs:
[TabInfo(page_id=0, url='https://blog.csdn.net/thomas20', title='栋搞西搞-CSDN博客'), TabInfo(page_id=1, url='https://blog.csdn.net/thomas20/article/details/141901654?spm=1001.2014.3001.5501', title='Cursor 深度体验：如何通过 AI 重塑你的编程体验_cursor composer-CSDN博客')]
Interactive elements from top layer of the current page inside the viewport:
[Start of page]
[0]<a />
[1]<img />
[2]<a 博客/>
[3]<a 下载/>

... (intermediate data omitted) ...

[80]<a 点击体验
DeepSeekR1满血版/>
[81]<img />
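To see how an index chosen by the model maps back to an element, the listing format can be parsed with a regular expression. This is a sketch for inspecting logs only; parse_elements is a hypothetical helper, and the pattern assumes element text never contains the sequence "/>".

```python
import re

# [index]<tag optional text/>, where the text may span several lines.
ELEMENT_RE = re.compile(r"\[(\d+)\]<(\w+)\s*(.*?)/>", re.DOTALL)


def parse_elements(listing: str) -> dict:
    """Map each element index to a (tag, text) pair."""
    elements = {}
    for index, tag, text in ELEMENT_RE.findall(listing):
        # Collapse line breaks and runs of spaces inside the element text.
        elements[int(index)] = (tag, " ".join(text.split()))
    return elements


listing = """[2]<a 博客/>
[10]<input text;cursor composer/>
[11]<button 搜索/>
C 知道
[25]<a 收藏
31/>
[26]<img />"""

elements = parse_elements(listing)
print(elements[11])
```

Lines without an index, such as "C 知道", are skipped, which matches the prompt's statement that only elements with a numeric index in [] are interactive.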

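IV. Next steps

This time I only tested the most basic usage of browser-use. There are more advanced uses that interested readers are welcome to try, for example:

How can a browser-use instance run inside a Docker container, so that the agent can run in the background on a server rather than only on the user's local machine?
In headless mode, how can screenshots be obtained closer to real time? (At present they are only available after the task finishes.)
Measure how much a multimodal model improves the results.
The maintainers also provide a web-ui project for visualizing what browser-use is doing (it supports recording, for example). It is worth a try: https://github.com/browser-use/web-ui

If you have run any of these experiments, please share your progress.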