Tarsier Usage Tutorial

温姬尤Lee

Published 2025-03-31 09:43:26

Article link: https://blog.csdn.net/gitblog_00711/article/details/146800313

tarsier: Vision utilities for web interaction agents 👀 Project repository: https://gitcode.com/gh_mirrors/tarsie/tarsier

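1. Project Overview

Tarsier is an open-source project developed by Reworkd that provides vision utilities for web interaction agents. It attaches a tag to each interactive element on a page and maps those tags to actions chosen by an LLM (large language model), which lets the model drive the page automatically. Tarsier also includes an OCR algorithm that converts a page screenshot into a string laid out like ASCII art, so that even an LLM without vision capabilities can understand what the page contains.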
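To make the "ASCII-art-like" representation more concrete, here is a minimal sketch of the idea, not Tarsier's actual algorithm. It assumes OCR has already produced words with pixel coordinates; the `Word` class, the `render_as_text` function, the cell sizes, and the bracketed tags in the usage note are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """One OCR-detected word (hypothetical structure); coordinates are pixels."""
    text: str
    x: int  # left edge of the word
    y: int  # top edge of the word


def render_as_text(words: List[Word], cell_w: int = 10, cell_h: int = 20) -> str:
    """Place words on a character grid so their relative layout is preserved."""
    # Group words into grid rows by their vertical position.
    rows: Dict[int, List[Word]] = {}
    for w in words:
        rows.setdefault(w.y // cell_h, []).append(w)
    if not rows:
        return ""

    lines = []
    for r in range(max(rows) + 1):
        line = ""
        for w in sorted(rows.get(r, []), key=lambda w: w.x):
            col = w.x // cell_w
            if line and len(line) >= col:
                # The previous word already reaches this column: keep one space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += w.text
        lines.append(line)
    return "\n".join(lines)
```

For instance, calling `render_as_text` on `Word("[1]", 0, 0)`, `Word("Login", 50, 0)`, `Word("[2]", 0, 40)`, and `Word("Search", 50, 40)` yields two text lines separated by a blank line, with each tag sitting to the left of its label, which is roughly the kind of string a text-only LLM could read.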

2. Quick Start

First, make sure Python is installed in your environment. Then install Tarsier with the following command:

pip install tarsier

Next, you need an asynchronous browser automation library such as Playwright. The following simple example shows how Tarsier and Playwright work together:

import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Load the Google Cloud service account key used by the OCR service.
    # The file path is a placeholder; point it at your own key file.
    with open('./google_service_acc_key.json') as f:
        google_cloud_credentials = json.load(f)

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    # Launch a Playwright browser
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto('https://news.ycombinator.com')

        # Get the page text and the tag-to-XPath mapping
        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        # Print the tag-to-XPath mapping
        print(tag_to_xpath)

        # Print the text representation of the page
        print(page_text)

        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

Make sure './google_service_acc_key.json' points to your own Google Cloud service account key file. As in the project's README, the parsed contents of that JSON file are what GoogleVisionOCRService receives as credentials.

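3. Use Cases and Best Practices

Automating Web Interaction

With Tarsier, you can automate common interactions on a web page, such as clicking buttons and filling in forms. The following example automates a login flow; the fill and click calls themselves come from Playwright's page API: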

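# ... (setup code from the quick start)
# The selectors below are placeholders; adjust them to the target page.
await page.fill('input[name="username"]', 'your_username')
await page.fill('input[name="password"]', 'your_password')
await page.click('button#login')
# ... (subsequent code)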
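The login snippet hard-codes its selectors. In an agent loop, the LLM would instead name a tag, and the tag-to-XPath mapping would supply the element. The sketch below shows one way this could look. It assumes `tag_to_xpath` maps integer tag IDs to XPath strings (newer Tarsier versions may return richer metadata, so check the version you install), and the `CLICK <tag>` / `TYPE <tag> <text>` action grammar, `resolve_action`, and `perform` are hypothetical names invented for this illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical action grammar: "CLICK <tag>" or "TYPE <tag> <text>".
ACTION_RE = re.compile(r"^(CLICK|TYPE)\s+(\d+)(?:\s+(.*))?$")


def resolve_action(
    action: str, tag_to_xpath: Dict[int, str]
) -> Tuple[str, str, Optional[str]]:
    """Turn an LLM action string into (verb, xpath, text)."""
    match = ACTION_RE.match(action.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    verb, tag, text = match.group(1), int(match.group(2)), match.group(3)
    if tag not in tag_to_xpath:
        raise KeyError(f"Unknown tag: {tag}")
    if verb == "TYPE" and text is None:
        raise ValueError("TYPE needs the text to enter")
    return verb, tag_to_xpath[tag], text


async def perform(page, action: str, tag_to_xpath: Dict[int, str]) -> None:
    """Apply the action to a Playwright page (assumed to be an async Page)."""
    verb, xpath, text = resolve_action(action, tag_to_xpath)
    locator = page.locator(f"xpath={xpath}")
    if verb == "CLICK":
        await locator.click()
    else:
        await locator.fill(text)
```

Inside the quick-start `main()`, a call such as `await perform(page, "CLICK 3", tag_to_xpath)` would then click whichever element was tagged 3, assuming that tag exists on the page. Keeping the parsing in a separate pure function makes it easy to validate the model's output before touching the browser.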