UI-TARS



1. About UI-TARS


We also provide UI-TARS-desktop, a version that runs on your local personal device. To use it, visit https://github.com/bytedance/UI-TARS-desktop

To use UI-TARS for web automation, you can refer to the open-source project Midscene.js.


⚠️ Important Announcement: GGUF Model Performance

The GGUF model has been quantized, but unfortunately its performance cannot be guaranteed. As a result, we have decided to deprecate it.

💡 Alternative Solutions
You can use cloud deployment or local deployment [vLLM] instead (if you have sufficient GPU resources).

We appreciate your understanding and patience as we work to ensure the best experience.


Overview



Core Features


Perception
  • Comprehensive GUI understanding: processes multimodal input (text, images, interactions) to build a coherent understanding of the interface.
  • Real-time interaction: continuously monitors dynamic GUIs and responds accurately to changes in real time.

Action
  • Unified action space: standardized action definitions across platforms (desktop, mobile, and web).
  • Platform-specific actions: supports additional operations such as hotkeys, long presses, and platform-specific gestures.

Reasoning
  • System-1 and System-2 reasoning: combines fast, intuitive responses with deliberate, high-level planning for complex tasks.
  • Task decomposition and reflection: supports multi-step planning, reflection, and error correction for robust task execution.

Memory
  • Short-term memory: captures task-specific context for situational awareness.
  • Long-term memory: retains historical interactions and knowledge to improve decision-making.

Capabilities

  • Cross-platform interaction: supports desktop, mobile, and web environments through a unified action framework.
  • Multi-step task execution: trained to handle complex tasks through multi-step trajectories and reasoning.
  • Learning from synthetic and real data: combines large-scale annotated and synthetic datasets for better generalization and robustness.

2. Deployment


Cloud Deployment

We recommend using HuggingFace Inference Endpoints for fast deployment. We provide two documents for users to refer to:

English version: GUI Model Deployment Guide

Chinese version: GUI模型部署教程


Local Deployment [Transformers]

We follow the same setup as Qwen2-VL; check this tutorial for more details.


Local Deployment [vLLM]

We recommend using vLLM for fast deployment and inference; vllm>=0.6.1 is required:

pip install -U transformers
VLLM_VERSION=0.6.6
CUDA_VERSION=cu124
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}

Download the Model

We provide three model sizes on Hugging Face: 2B, 7B, and 72B. For the best performance, we recommend the 7B-DPO or 72B-DPO model (depending on your GPU configuration).
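As a minimal sketch (not an official snippet from the project), a checkpoint can be fetched with the huggingface_hub Python API. The repo_id below is an assumed example; substitute the model card you actually selected on Hugging Face:

from huggingface_hub import snapshot_download

# Download the chosen UI-TARS checkpoint into a local directory.
# NOTE: the repo_id is a hypothetical example; replace it with the
# 2B/7B/72B (SFT or DPO) variant you picked on Hugging Face.
model_path = snapshot_download(
    repo_id="bytedance-research/UI-TARS-7B-DPO",
    local_dir="./UI-TARS-7B-DPO",
)
print(model_path)  # pass this path to vLLM's --model flag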


Start the OpenAI API Service

Run the following command to start an OpenAI-compatible API service:

python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>

You can then call the chat API with a GUI prompt (choose the mobile or computer template) and a base64-encoded local screenshot, as shown below (see the OpenAI API protocol documentation for more details). You can also use the service from UI-TARS-desktop.

import base64
from openai import OpenAI

instruction = "search for today's weather"
screenshot_path = "screenshot.png"
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="empty",
)
## Below is the prompt for computer
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 
## Output Format
```\nThought: ...
Action: ...\n```
## Action Space

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.

## Note
- Use Chinese in `Thought` part.
- Summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
"""

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
response = client.chat.completions.create(
    model="ui-tars",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
            ],
        },
    ],
    frequency_penalty=1,
    max_tokens=128,
)
print(response.choices[0].message.content)
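The reply arrives as a `Thought: ... Action: ...` block. As a rough, hedged illustration (not the project's official parser), the action name and its coordinates can be pulled out with a regex such as:

import re

# Minimal sketch: parse a "Thought/Action" reply of the form shown above.
# Illustrative only; it handles single actions with "(x,y)" coordinates.
def parse_action(reply: str):
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", reply, re.S)
    action = re.search(r"Action:\s*(\w+)\((.*)\)", reply, re.S)
    if not action:
        return None
    name, args = action.group(1), action.group(2)
    # Extract every "(x,y)" coordinate pair from the argument list.
    coords = [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", args)]
    return {"thought": thought.group(1) if thought else "", "action": name, "coords": coords}

print(parse_action("Thought: Click the search box.\nAction: click(start_box='(235,512)')"))
# {'thought': 'Click the search box.', 'action': 'click', 'coords': [(235, 512)]}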

For single-step grounding tasks, or for inference on grounding datasets such as SeeClick, refer to the following script:

import base64
from openai import OpenAI

instruction = "search for today's weather"
screenshot_path = "screenshot.png"
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="empty",
)
## Below is the prompt for grounding
prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
response = client.chat.completions.create(
    model="ui-tars",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
                {"type": "text", "text": prompt + instruction}
            ],
        },
    ],
    frequency_penalty=1,
    max_tokens=128,
)
print(response.choices[0].message.content)

Prompt Templates

We currently provide two prompt templates for stable running and performance: one for mobile scenarios and one for computer (PC) scenarios.

  • Mobile prompt template:
## Below is the prompt for mobile
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 
## Output Format
```\nThought: ...
Action: ...\n```
## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
press_home()
press_back()
finished(content='') # Submit the task regardless of whether it succeeds or fails.
## Note
- Use English in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
"""

  • Computer prompt template:
## Below is the prompt for computer
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 
## Output Format
```\nThought: ...
Action: ...\n```
## Action Space

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.

## Note
- Use Chinese in `Thought` part.
- Summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
"""

Local Deployment [Ollama]

Ollama support is coming soon. Please be patient~ 😊


Interpreting Inference Results


Coordinate Mapping

The model outputs 2D coordinates that represent relative positions. To convert them into image-relative coordinates, divide each component by 1000 to obtain values in the range [0, 1]. The absolute coordinates required by an Action can then be computed as:

  • X absolute = X relative × image width
  • Y absolute = Y relative × image height

For example, given a screen size of 1920×1080 and a model coordinate output of (235, 512): the absolute X is round(1920 × 235 / 1000) = 451, the absolute Y is round(1080 × 512 / 1000) = 553, so the absolute coordinate is (451, 553).
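A short Python sketch of this conversion (the helper name is illustrative, not part of the UI-TARS codebase):

# Map a model coordinate in [0, 1000] to absolute pixel coordinates.
def to_absolute(x_rel, y_rel, width, height):
    return round(width * x_rel / 1000), round(height * y_rel / 1000)

print(to_absolute(235, 512, 1920, 1080))  # (451, 553), matching the example above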


三、用于桌面和Web自动化

要在桌面中体验用户界面-tars代理,您可以参考 UI-TARS-desktop 。我们建议在桌面上使用7B/72BDPO模型

Midsite. js是一个开源的web自动化SDK,已经支持了UI-TARS模型,开发者可以使用javascript和自然语言来控制浏览器,有关设置模型的更多详细信息,请参见本指南



4. Performance

Perception Capability Evaluation

| Model | VisualWebBench | WebSRC | SQAshort |
| --- | --- | --- | --- |
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| UI-TARS-2B | 72.9 | 89.2 | 86.4 |
| UI-TARS-7B | 79.7 | 93.6 | 87.7 |
| UI-TARS-72B | 82.8 | 89.3 | 88.6 |

Grounding Capability Evaluation

  • ScreenSpot Pro
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | 1.1 |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | 1.6 |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | 3.7 |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | 7.7 |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | 7.7 |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | 11.3 |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | 16.5 |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | 17.1 |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | 18.9 |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | 31.1 |
| UI-TARS-2B | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | 27.7 |
| UI-TARS-7B | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 20.8 | 9.4 | 18.0 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 | 47.8 | 16.2 | 35.7 |
| UI-TARS-72B | 63.0 | 17.3 | 40.8 | 57.1 | 15.4 | 39.6 | 18.8 | 12.5 | 17.2 | 64.6 | 20.9 | 45.7 | 63.3 | 26.4 | 54.8 | 42.1 | 15.7 | 30.1 | 50.9 | 17.5 | 38.1 |

  • ScreenSpot
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Agent Framework* |  |  |  |  |  |  |  |
| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | 48.8 |
| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | 75.6 |
| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.3 |
| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| *Agent Model* |  |  |  |  |  |  |  |
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | 55.3 |
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.8 |
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
| Claude Computer Use | - | - | - | - | - | - | 83.0 |
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | 84.0 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
| Aguvis-72B | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 89.2 |
| *Our Model* |  |  |  |  |  |  |  |
| UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
| UI-TARS-7B | 94.5 | 85.2 | 95.9 | 85.7 | 90.0 | 83.5 | 89.5 |
| UI-TARS-72B | 94.9 | 82.5 | 89.7 | 88.6 | 88.7 | 85.0 | 88.4 |

  • ScreenSpot v2
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Agent Framework* |  |  |  |  |  |  |  |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | 63.6 |
| GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | 79.1 |
| GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | 94.0 | 79.8 | 87.1 |
| *Agent Model* |  |  |  |  |  |  |  |
| SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
| OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | 71.9 |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
| *Our Model* |  |  |  |  |  |  |  |
| UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |

Offline Agent Capability Evaluation

  • Multimodal Mind2Web
| Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Agent Framework* |  |  |  |  |  |  |  |  |  |
| GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
| GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
| GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
| GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
| *Agent Model* |  |  |  |  |  |  |  |  |  |
| GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
| GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
| GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
| GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
| Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
| Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
| CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
| Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
| *Our Model* |  |  |  |  |  |  |  |  |  |
| UI-TARS-2B | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
| UI-TARS-7B | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
| UI-TARS-72B | 74.7 | 92.5 | 68.6 | 72.4 | 91.2 | 63.5 | 68.9 | 91.8 | 62.1 |

  • Android Control and GUI Odyssey
| Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
| SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
| InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
| Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
| Aria-UI | - | 87.7 | 67.3 | - | 43.2 | 10.2 | - | 86.8 | 36.5 |
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
| OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
| Aguvis-7B | - | - | 80.5 | - | - | 61.5 | - | - | - |
| Aguvis-72B | - | - | 84.4 | - | - | 66.4 | - | - | - |
| UI-TARS-2B | 98.1 | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
| UI-TARS-7B | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
| UI-TARS-72B | 98.1 | 89.9 | 91.3 | 85.2 | 81.5 | 74.7 | 95.4 | 91.4 | 88.6 |

Online Agent Capability Evaluation

| Method | OSWorld (Online) | AndroidWorld (Online) |
| --- | --- | --- |
| *Agent Framework* |  |  |
| GPT-4o (UGround) | - | 32.8 |
| GPT-4o (Aria-UI) | 15.2 | 44.8 |
| GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
| GPT-4o (Aguvis-72B) | 17.0 | - |
| GPT-4o (OS-Atlas-7B) | 14.6 | - |
| *Agent Model* |  |  |
| GPT-4o | 5.0 | 34.5 (SoM) |
| Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
| Aguvis-72B | 10.3 | 26.1 |
| Claude Computer-Use | 14.9 (15 steps) | 27.9 |
| Claude Computer-Use | 22.0 (50 steps) | - |
| *Our Model* |  |  |
| UI-TARS-7B-SFT | 17.7 (15 steps) | 33.0 |
| UI-TARS-7B-DPO | 18.7 (15 steps) | - |
| UI-TARS-72B-SFT | 18.8 (15 steps) | 46.6 |
| UI-TARS-72B-DPO | 22.7 (15 steps) | - |
| UI-TARS-72B-DPO | 24.6 (50 steps) | - |
