Task 1
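Contents
- Preface
- I. Running the Demo
- step 1: Environment setup
- step 2: Import libraries and modules
- step 3: Set parameters and configuration
- step 4: Define functions
- step 5: Load the caption model, the OCR and icon detection models, and initialize variables and folders
- step 6: Enter the main loop and act on the screen information and the user instruction
- II. Learning notes and related problems
- 1. Key and difficult points
- 2. Problems and preliminary answers
- Summary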
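Preface
This summer camp focuses on Mobile-Agent, a multi-agent framework that supports deep integration of AI with mobile devices and other terminals. This post records Task 1, running the first Mobile-Agent demo. Later posts will look deeper into how the framework works and build more applications in which a large model drives an agent to operate a phone.
This post mainly records and shares my learning process in the summer camp and the difficulties I ran into.
I. Running the Demo
step 1: Environment setup
See the official Datawhale document: https://datawhaler.feishu.cn/wiki/BbEuwzZMXiWwxbkfFuHcflwrneg?from=from_copylink
You may need to log in to read it. It contains detailed instructions on how to download and install Android Studio, how to create a project and an Android emulator, how to set up the Mobile-Agent framework in VS Code, and so on.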
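step 2: Import libraries and modules
Import the required libraries and modules: operating-system utilities, time, deep copy, PyTorch, image processing, and so on.
import os
import time
import copy
import torch
import shutil
from PIL import Image, ImageDraw
Import the phone-operation modules: text localization, icon localization, and controller operations.
from MobileAgent.text_localization import ocr
from MobileAgent.icon_localization import det
from MobileAgent.controller import get_screenshot, tap, slide, type, back, home
# get_process_prompt is called by the planning step in the main loop, so it must be imported too
from MobileAgent.prompt import get_action_prompt, get_reflect_prompt, get_memory_prompt, get_process_prompt
Import the modules for the multimodal conversation model.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope import snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from dashscope import MultiModalConversation
import dashscope
# generate_api uses concurrent.futures.ThreadPoolExecutor, so import the submodule explicitly
import concurrent.futures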
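step 3: Set parameters and configuration
A few settings have to be edited by hand to configure the assistant for your machine, such as the ADB path and the model to use.
# Your ADB path
# On macOS the default is "/Users/<username>/Library/Android/sdk/platform-tools/adb".
# On Windows, use the full path of "adb.exe" (typically under C:\Users\<username>\AppData\Local\Android\Sdk\platform-tools, or wherever a custom SDK install put it).
adb_path = ""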
# Your instruction
instruction = "Read the Screen, tell me what day it is today. Then open Play Store."
# Choose between "api" and "local". api: use the qwen api. local: use the local qwen checkpoint
caption_call_method = "api"
# Choose between "qwen-vl-plus" and "qwen-vl-max" if use api method. Choose between "qwen-vl-chat" and "qwen-vl-chat-int4" if use local method.
caption_model = ""
# If you choose the api caption call method, input your Qwen api here
qwen_api = ""
# You can add operational knowledge to help Agent operate more accurately.
add_info = "If you want to tap an icon of an app, use the action \"Open app\". If you want to exit an app, use the action \"Home\""
# Reflection Setting: If you want to improve the operating speed, you can disable the reflection agent. This may reduce the success rate.
reflection_switch = True
# Memory Setting: If you want to improve the operating speed, you can disable the memory unit. This may reduce the success rate.
memory_switch = True
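Because an empty or mismatched setting only fails later, inside a model call, it can help to check the configuration up front. The following is a minimal sketch, not part of the demo: `check_config` and `guess_adb_path` are hypothetical helpers, and the allowed model names are copied from the comments in the configuration block above. The key check is unconditional because the main loop in step 6 passes `qwen_api` to the decision call regardless of the caption method.

```python
import shutil

# Allowed caption models per call method, copied from the configuration comments.
ALLOWED_CAPTION_MODELS = {
    "api": {"qwen-vl-plus", "qwen-vl-max"},
    "local": {"qwen-vl-chat", "qwen-vl-chat-int4"},
}


def check_config(adb_path, caption_call_method, caption_model, qwen_api):
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    if not adb_path:
        problems.append("adb_path is empty")
    if caption_call_method not in ALLOWED_CAPTION_MODELS:
        problems.append('caption_call_method must be "api" or "local"')
    elif caption_model not in ALLOWED_CAPTION_MODELS[caption_call_method]:
        problems.append(
            f"caption_model {caption_model!r} does not match method {caption_call_method!r}"
        )
    if not qwen_api:
        # The decision step calls the API even when captions are generated locally.
        problems.append("qwen_api is empty")
    return problems


def guess_adb_path():
    """Look for adb on PATH; returns None when it is not found (hypothetical helper)."""
    return shutil.which("adb")


for problem in check_config("", "api", "qwen-vl-chat", ""):
    print("config problem:", problem)
```

Running this before step 5 surfaces all problems at once instead of one crash per run.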
step 4: Define functions
init_action_chat(), init_reflect_chat(), and init_memory_chat() initialize the conversation histories for the action, reflection, and memory agents. Each returns a list that contains a single system prompt.
def init_action_chat():
    operation_history = []
    system_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user's instruction."
    operation_history.append({'role': 'system', 'content': [{'text': system_prompt}]})
    return operation_history

def init_reflect_chat():
    operation_history = []
    system_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append({'role': 'system', 'content': [{'text': system_prompt}]})
    return operation_history

def init_memory_chat():
    operation_history = []
    system_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append({'role': 'system', 'content': [{'text': system_prompt}]})
    return operation_history
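call_with_local_file(chat, api_key, model) calls the multimodal conversation API with the chat messages, the API key, and the model name, and returns the text of the reply.
def call_with_local_file(chat, api_key, model):
    response = MultiModalConversation.call(model=model, messages=chat, api_key=api_key)
    # The reply text is the first content item of the first choice
    return response.output.choices[0].message.content[0]["text"]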
add_response(role, prompt, chat_history, image=None) appends one message from the given role to a copy of the conversation history. The message always carries text and can optionally carry an image.
def add_response(role, prompt, chat_history, image=None):
    # Work on a deep copy so the caller's history is never modified in place
    new_chat_history = copy.deepcopy(chat_history)
    if image:
        content = [
            {'text': prompt},
            {'image': image},
        ]
    else:
        content = [
            {'text': prompt},
        ]
    new_chat_history.append({'role': role, 'content': content})
    return new_chat_history
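To make the message layout concrete, here is a small self-contained sketch that repeats the logic of add_response in condensed form and prints what one call produces. The prompt text and the screenshot path are made-up illustration values.

```python
import copy


def add_response(role, prompt, chat_history, image=None):
    """Condensed restatement of add_response: copy the history, append one message."""
    new_chat_history = copy.deepcopy(chat_history)
    content = [{'text': prompt}]
    if image:
        content.append({'image': image})
    new_chat_history.append({'role': role, 'content': content})
    return new_chat_history


history = [{'role': 'system',
            'content': [{'text': 'You are a helpful AI mobile phone operating assistant.'}]}]
updated = add_response("user", "What is on the screen?", history,
                       "./screenshot/screenshot.jpg")

print(len(history), len(updated))   # the original list keeps its single system message
print(updated[-1])
```

Returning a new list instead of mutating the argument means each agent (action, reflection, memory) can branch its own history from a shared starting point.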
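add_response_two_image(role, prompt, chat_history, image) works like the previous function, but image is a list of two image paths and both are attached to the message.
def add_response_two_image(role, prompt, chat_history, image):
    new_chat_history = copy.deepcopy(chat_history)
    content = [
        {'text': prompt},
        {'image': image[0]},
        {'image': image[1]},
    ]
    # Use the same dict layout as add_response so every message in the history has one shape
    new_chat_history.append({'role': role, 'content': content})
    return new_chat_history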
get_all_files_in_folder(folder_path) returns the list of file names in the given folder.
def get_all_files_in_folder(folder_path):
    file_list = []
    for file_name in os.listdir(folder_path):
        file_list.append(file_name)
    return file_list
draw_coordinates_on_image(image_path, coordinates) draws a red dot at each coordinate and saves the result to ./screenshot/output_image.png. crop(image, box, i) crops the image to the given box and saves it as ./temp/{i}.jpg, skipping boxes that are 10 pixels or less in width or height.
def draw_coordinates_on_image(image_path, coordinates):
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    point_size = 10
    for coord in coordinates:
        draw.ellipse((coord[0] - point_size, coord[1] - point_size, coord[0] + point_size, coord[1] + point_size), fill='red')
    output_image_path = './screenshot/output_image.png'
    image.save(output_image_path)
    return output_image_path

def crop(image, box, i):
    image = Image.open(image)
    x1, y1, x2, y2 = int(box[0]), int(box[1]), int(box[2]), int(box[3])
    # Skip boxes that are too small to be a meaningful icon
    if x1 >= x2-10 or y1 >= y2-10:
        return
    cropped_image = image.crop((x1, y1, x2, y2))
    cropped_image.save(f"./temp/{i}.jpg")
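The size guard in crop() is easy to misread, so the sketch below isolates it as a pure function and runs it on a few boxes. `usable_box` is a hypothetical name used only for this illustration, and the sample boxes are made up.

```python
def usable_box(box, min_size=10):
    """Mirror the guard in crop(): keep a box only if it is more than min_size px wide and tall."""
    x1, y1, x2, y2 = (int(v) for v in box)
    return not (x1 >= x2 - min_size or y1 >= y2 - min_size)


boxes = [
    [100, 200, 180, 280],       # 80 x 80, kept
    [100, 200, 108, 280],       # only 8 px wide, skipped
    [50.7, 60.2, 61.9, 90.0],   # truncated to 50, 60, 61, 90: 11 x 30, kept
]
for box in boxes:
    print(box, usable_box(box))
```

Note that a box exactly 10 pixels wide is also skipped, because the comparison is `>=`.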
generate_local(tokenizer, model, image_file, query) runs the locally loaded model: it takes the tokenizer, the model, an image file, and the query text, and returns the model's reply.
def generate_local(tokenizer, model, image_file, query):
    query = tokenizer.from_list_format([
        {'image': image_file},
        {'text': query},
    ])
    response, _ = model.chat(tokenizer, query=query, history=None)
    return response
process_image(image, query) sends one image and the query to the multimodal conversation API and returns the reply text. If the reply cannot be parsed, it falls back to a generic description.
def process_image(image, query):
    dashscope.api_key = qwen_api
    image = "file://" + image
    messages = [{
        'role': 'user',
        'content': [
            {'image': image},
            {'text': query},
        ]
    }]
    response = MultiModalConversation.call(model=caption_model, messages=messages)
    try:
        response = response['output']['choices'][0]['message']['content'][0]["text"]
    except Exception:
        # The call failed or returned an unexpected shape: use a generic caption
        response = "This is an icon."
    return response
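The fallback in process_image can be tried without any network access. The sketch below applies the same indexing path to plain dictionaries. `extract_caption` is a hypothetical helper, and both sample dictionaries are made-up stand-ins shaped like the path indexed above, not real API responses.

```python
FALLBACK_CAPTION = "This is an icon."


def extract_caption(response):
    """Read the reply text using the indexing path from process_image, with a fallback."""
    try:
        return response['output']['choices'][0]['message']['content'][0]['text']
    except (KeyError, IndexError, TypeError):
        return FALLBACK_CAPTION


# Made-up stand-ins for a successful and a failed response.
ok = {'output': {'choices': [{'message': {'content': [{'text': 'A blue round icon.'}]}}]}}
failed = {'output': None}

print(extract_caption(ok))
print(extract_caption(failed))
```

Catching only the lookup-related exceptions keeps unrelated failures (for example a keyboard interrupt) visible, which is one possible refinement over a blanket handler.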
generate_api(images, query) submits one process_image call per image to a thread pool and returns a dictionary that maps the 1-based index of each image to its description.
def generate_api(images, query):
    icon_map = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Remember which list index each future belongs to
        futures = {executor.submit(process_image, image, query): i for i, image in enumerate(images)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            response = future.result()
            icon_map[i + 1] = response
    return icon_map
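The thread-pool pattern in generate_api can be exercised offline by swapping the API call for a stub. In this sketch `fake_describe` is a made-up stand-in for process_image and the file names are illustrative; the submit / as_completed structure is the same as above.

```python
import concurrent.futures


def fake_describe(image, query):
    """Stand-in for process_image: no network call, just echoes the file name."""
    return f"description of {image}"


def describe_all(images, query):
    icon_map = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each future back to the position of its image in the input list
        futures = {executor.submit(fake_describe, image, query): i
                   for i, image in enumerate(images)}
        for future in concurrent.futures.as_completed(futures):
            icon_map[futures[future] + 1] = future.result()
    return icon_map


result = describe_all(["temp/3.jpg", "temp/7.jpg"], "describe this icon")
print(sorted(result.items()))
```

Futures finish in arbitrary order, which is why the index is looked up from the `futures` dictionary rather than taken from a loop counter.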
merge_text_blocks(text_list, coordinates_list) merges vertically adjacent text blocks into one. Two blocks are merged when their left edges are within 10 pixels of each other, the vertical gap between them is at least -10 and less than 30 pixels, and their heights differ by less than 10 pixels.
def merge_text_blocks(text_list, coordinates_list):
    merged_text_blocks = []
    merged_coordinates = []

    # Process blocks top to bottom, then left to right
    sorted_indices = sorted(range(len(coordinates_list)), key=lambda k: (coordinates_list[k][1], coordinates_list[k][0]))
    sorted_text_list = [text_list[i] for i in sorted_indices]
    sorted_coordinates_list = [coordinates_list[i] for i in sorted_indices]

    num_blocks = len(sorted_text_list)
    merge = [False] * num_blocks

    for i in range(num_blocks):
        if merge[i]:
            continue

        anchor = i
        group_text = [sorted_text_list[anchor]]
        group_coordinates = [sorted_coordinates_list[anchor]]

        for j in range(i+1, num_blocks):
            if merge[j]:
                continue

            if abs(sorted_coordinates_list[anchor][0] - sorted_coordinates_list[j][0]) < 10 and \
               sorted_coordinates_list[j][1] - sorted_coordinates_list[anchor][3] >= -10 and sorted_coordinates_list[j][1] - sorted_coordinates_list[anchor][3] < 30 and \
               abs(sorted_coordinates_list[anchor][3] - sorted_coordinates_list[anchor][1] - (sorted_coordinates_list[j][3] - sorted_coordinates_list[j][1])) < 10:
                group_text.append(sorted_text_list[j])
                group_coordinates.append(sorted_coordinates_list[j])
                merge[anchor] = True
                anchor = j
                merge[anchor] = True

        # The merged box is the bounding box of the whole group
        merged_text = "\n".join(group_text)
        min_x1 = min(group_coordinates, key=lambda x: x[0])[0]
        min_y1 = min(group_coordinates, key=lambda x: x[1])[1]
        max_x2 = max(group_coordinates, key=lambda x: x[2])[2]
        max_y2 = max(group_coordinates, key=lambda x: x[3])[3]

        merged_text_blocks.append(merged_text)
        merged_coordinates.append([min_x1, min_y1, max_x2, max_y2])

    return merged_text_blocks, merged_coordinates
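A worked example makes the three merge conditions easier to follow. The sketch below is a condensed restatement of merge_text_blocks with the conditions given names (`same_left`, `close_below`, `same_height` are labels introduced here), run on four made-up OCR boxes.

```python
def merge_text_blocks(text_list, coordinates_list):
    """Condensed restatement of the merge logic with the three conditions named."""
    order = sorted(range(len(coordinates_list)),
                   key=lambda k: (coordinates_list[k][1], coordinates_list[k][0]))
    texts = [text_list[i] for i in order]
    boxes = [coordinates_list[i] for i in order]
    merged = [False] * len(boxes)
    out_texts, out_boxes = [], []
    for i in range(len(boxes)):
        if merged[i]:
            continue
        anchor = i
        group_text, group_boxes = [texts[i]], [boxes[i]]
        for j in range(i + 1, len(boxes)):
            if merged[j]:
                continue
            a, b = boxes[anchor], boxes[j]
            same_left = abs(a[0] - b[0]) < 10                         # left edges line up
            close_below = -10 <= b[1] - a[3] < 30                     # b starts just under a
            same_height = abs((a[3] - a[1]) - (b[3] - b[1])) < 10     # similar line height
            if same_left and close_below and same_height:
                group_text.append(texts[j])
                group_boxes.append(b)
                merged[anchor] = True
                anchor = j
                merged[anchor] = True
        out_texts.append("\n".join(group_text))
        out_boxes.append([min(g[0] for g in group_boxes), min(g[1] for g in group_boxes),
                          max(g[2] for g in group_boxes), max(g[3] for g in group_boxes)])
    return out_texts, out_boxes


# Made-up OCR output: "Wi-Fi" and "and Bluetooth" are two lines of one label.
texts = ["Settings", "Wi-Fi", "and Bluetooth", "Battery"]
coords = [[40, 100, 200, 140], [40, 300, 120, 340], [42, 345, 260, 383], [400, 300, 520, 340]]

merged_texts, merged_coords = merge_text_blocks(texts, coords)
print(merged_texts)
print(merged_coords)
```

In this sample only "Wi-Fi" and "and Bluetooth" merge: "Settings" is too far above, and "Battery" sits on the same row but 360 pixels to the right.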
get_perception_infos(adb_path, screenshot_file) collects the perception information for the current screen: it takes a screenshot, runs OCR, detects icons, crops and captions each icon, and finally returns the list of perception entries together with the screen width and height.
def get_perception_infos(adb_path, screenshot_file):
    get_screenshot(adb_path)

    width, height = Image.open(screenshot_file).size

    # Text: OCR, then merge multi-line blocks
    text, coordinates = ocr(screenshot_file, ocr_detection, ocr_recognition)
    text, coordinates = merge_text_blocks(text, coordinates)

    center_list = [[(coordinate[0]+coordinate[2])/2, (coordinate[1]+coordinate[3])/2] for coordinate in coordinates]
    draw_coordinates_on_image(screenshot_file, center_list)

    perception_infos = []
    for i in range(len(coordinates)):
        perception_info = {"text": "text: " + text[i], "coordinates": coordinates[i]}
        perception_infos.append(perception_info)

    # Icons: detect boxes, caption them later
    coordinates = det(screenshot_file, "icon", groundingdino_model)

    for i in range(len(coordinates)):
        perception_info = {"text": "icon", "coordinates": coordinates[i]}
        perception_infos.append(perception_info)

    image_box = []
    image_id = []
    for i in range(len(perception_infos)):
        if perception_infos[i]['text'] == 'icon':
            image_box.append(perception_infos[i]['coordinates'])
            image_id.append(i)

    # Each icon crop is saved as temp/<index in perception_infos>.jpg
    for i in range(len(image_box)):
        crop(screenshot_file, image_box[i], image_id[i])

    images = get_all_files_in_folder(temp_file)
    if len(images) > 0:
        images = sorted(images, key=lambda x: int(x.split('/')[-1].split('.')[0]))
        image_id = [int(image.split('/')[-1].split('.')[0]) for image in images]
        icon_map = {}
        prompt = 'This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence.'
        if caption_call_method == "local":
            for i in range(len(images)):
                image_path = os.path.join(temp_file, images[i])
                icon_width, icon_height = Image.open(image_path).size
                # Very large crops are unlikely to be icons, so they are not captioned
                if icon_height > 0.8 * height or icon_width * icon_height > 0.2 * width * height:
                    des = "None"
                else:
                    des = generate_local(tokenizer, model, image_path, prompt)
                icon_map[i+1] = des
        else:
            for i in range(len(images)):
                images[i] = os.path.join(temp_file, images[i])
            icon_map = generate_api(images, prompt)
        for i, j in zip(image_id, range(1, len(image_id)+1)):
            if icon_map.get(j):
                perception_infos[i]['text'] = "icon: " + icon_map[j]

    # Replace each box with its center point, which is what tap actions use
    for i in range(len(perception_infos)):
        perception_infos[i]['coordinates'] = [int((perception_infos[i]['coordinates'][0]+perception_infos[i]['coordinates'][2])/2), int((perception_infos[i]['coordinates'][1]+perception_infos[i]['coordinates'][3])/2)]

    return perception_infos, width, height
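The value returned by get_perception_infos is easier to picture with a concrete sample. The sketch below shows the last step of the function, where each bounding box is collapsed to its integer center point. `to_center` is a hypothetical helper name, and the two entries are made-up examples of the "text: ..." and "icon: ..." formats built above.

```python
def to_center(box):
    """Collapse [x1, y1, x2, y2] to the integer center point [x, y]."""
    return [int((box[0] + box[2]) / 2), int((box[1] + box[3]) / 2)]


# Made-up entries in the format built by get_perception_infos.
perception_infos = [
    {"text": "text: Settings", "coordinates": [40, 100, 200, 140]},
    {"text": "icon: a round blue icon with a white gear", "coordinates": [300, 500, 381, 581]},
]

for info in perception_infos:
    info["coordinates"] = to_center(info["coordinates"])

for info in perception_infos:
    print(info)
```

After this step every entry carries a single point, so an action such as a tap can use the coordinates directly.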
step 5: Load the caption model, the OCR and icon detection models, and initialize variables and folders
Load the caption model: depending on the chosen call method (local or API) and the caption model name, load the corresponding caption model and tokenizer.
### Load caption model ###
device = "cpu"
torch.manual_seed(1234)
if caption_call_method == "local":
    if caption_model == "qwen-vl-chat":
        model_dir = snapshot_download('qwen/Qwen-VL-Chat', revision='v1.1.0')
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True).eval()
        model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)
    elif caption_model == "qwen-vl-chat-int4":
        model_dir = snapshot_download("qwen/Qwen-VL-Chat-Int4", revision='v1.0.0')
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True, use_safetensors=True).eval()
        model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True, do_sample=False)
    else:
        print("If you choose local caption method, you must choose the caption model from \"qwen-vl-chat\" and \"qwen-vl-chat-int4\"")
        exit(0)
    # Both branches store the checkpoint path in model_dir, so the tokenizer loads for either model
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
elif caption_call_method == "api":
    pass
else:
    print("You must choose the caption model call function from \"local\" and \"api\"")
    exit(0)
Load the OCR and icon detection models: download and load the OCR (optical character recognition) models and the icon detection model, which process the text and icons in each screenshot.
### Load ocr and icon detection model ###
groundingdino_dir = snapshot_download('AI-ModelScope/GroundingDINO', revision='v1.0.0')
groundingdino_model = pipeline('grounding-dino-task', model=groundingdino_dir, device="cpu")
ocr_detection = pipeline(Tasks.ocr_detection, model='damo/cv_resnet18_ocr-detection-line-level_damo', device="cpu")
ocr_recognition = pipeline(Tasks.ocr_recognition, model='damo/cv_convnextTiny_ocr-recognition-document_damo', device="cpu")
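Initialize variables and folders: set up the state used by the loop (thought, summary, and action histories, the current summary and action, completed requirements, memory, and insight), then create the temporary folder and the screenshot folder.
thought_history = []
summary_history = []
action_history = []
summary = ""
action = ""
completed_requirements = ""
memory = ""
insight = ""
temp_file = "temp"
screenshot = "screenshot"
# Start every run with an empty temp folder
if not os.path.exists(temp_file):
    os.mkdir(temp_file)
else:
    shutil.rmtree(temp_file)
    os.mkdir(temp_file)
if not os.path.exists(screenshot):
    os.mkdir(screenshot)
error_flag = False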
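The "delete and recreate" pattern for the temp folder appears here and again inside the main loop, so it could be factored into a helper. The sketch below is one way to do that. `reset_folder` is a hypothetical name, and the demo runs inside a throwaway directory from `tempfile` so it does not touch a real project folder.

```python
import os
import shutil
import tempfile


def reset_folder(path):
    """Delete the folder if it exists, then create it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.mkdir(path)


base = tempfile.mkdtemp()                 # throwaway parent directory for the demo
temp_file = os.path.join(base, "temp")

reset_folder(temp_file)                   # first call creates the folder
with open(os.path.join(temp_file, "0.jpg"), "w") as f:
    f.write("stale crop")                 # pretend an icon crop was left behind
reset_folder(temp_file)                   # second call removes the stale file

leftover = os.listdir(temp_file)
print(leftover)

shutil.rmtree(base)                       # clean up the demo directory
```

Clearing the folder matters because get_perception_infos lists every file in it, so crops from a previous screenshot would otherwise be captioned again.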
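step 6: Enter the main loop and act on the screen information and the user instruction
- Take a screenshot and collect the perception information.
- Build the action prompt from the perception information.
- Send it to the model and extract the thought, summary, and action from the reply.
- Execute the action: open an app, tap, swipe, type text, and so on.
- If the memory unit is enabled, build the memory prompt and update the memory.
- If the reflection agent is enabled, build the reflection prompt, run the reflection, and depending on its result either continue, go back, or mark an error.
- Update the histories and the completed requirements as needed.
- Repeat these steps until the model issues a Stop action.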
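The third item in the list above relies on splitting the model's reply on fixed section markers. Here is a minimal sketch of that parsing. The three markers are the ones used in the main loop; `parse_decision` is a hypothetical helper, the sample reply is made up, and the whitespace cleanup is simplified compared with the chained `replace` calls in the demo.

```python
def squeeze(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def parse_decision(output_action):
    """Split a reply into (thought, action, summary) using the three section markers."""
    thought = output_action.split("### Thought ###")[-1].split("### Action ###")[0]
    action = output_action.split("### Action ###")[-1].split("### Operation ###")[0]
    summary = output_action.split("### Operation ###")[-1]
    return squeeze(thought), squeeze(action), squeeze(summary)


# Made-up reply that follows the marker layout.
sample = (
    "### Thought ###\nThe date is shown at the top of the screen.\n"
    "### Action ###\nHome\n"
    "### Operation ###\nReturn to the home screen."
)

thought, action, summary = parse_decision(sample)
print(thought)
print(action)
print(summary)
```

If a marker is missing from the reply, `split` returns the whole string for that field, so a stricter version might check for the markers first.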
iter = 0
while True:
    iter += 1
    if iter == 1:
        # First iteration: perceive the initial screen
        screenshot_file = "./screenshot/screenshot.jpg"
        perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
        shutil.rmtree(temp_file)
        os.mkdir(temp_file)

        # The keyboard counts as active if "ADB Keyboard" appears in the bottom 10% of the screen
        keyboard = False
        keyboard_height_limit = 0.9 * height
        for perception_info in perception_infos:
            if perception_info['coordinates'][1] < keyboard_height_limit:
                continue
            if 'ADB Keyboard' in perception_info['text']:
                keyboard = True
                break

    # Decision: ask the model for the next action
    prompt_action = get_action_prompt(instruction, perception_infos, width, height, keyboard, summary_history, action_history, summary, action, add_info, error_flag, completed_requirements, memory)
    chat_action = init_action_chat()
    chat_action = add_response("user", prompt_action, chat_action, screenshot_file)
    output_action = call_with_local_file(chat=chat_action, api_key=qwen_api, model='qwen-vl-plus')
    thought = output_action.split("### Thought ###")[-1].split("### Action ###")[0].replace("\n", " ").replace(":", "").replace("  ", " ").strip()
    summary = output_action.split("### Operation ###")[-1].replace("\n", " ").replace("  ", " ").strip()
    action = output_action.split("### Action ###")[-1].split("### Operation ###")[0].replace("\n", " ").replace("  ", " ").strip()
    chat_action = add_response("system", output_action, chat_action)
    status = "#" * 50 + " Decision " + "#" * 50
    print(status)
    print(output_action)
    print('#' * len(status))

    # Memory: keep content from the current screen that may be needed later
    if memory_switch:
        prompt_memory = get_memory_prompt(insight)
        chat_action = add_response("user", prompt_memory, chat_action)
        output_memory = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
        chat_action = add_response("system", output_memory, chat_action)
        status = "#" * 50 + " Memory " + "#" * 50
        print(status)
        print(output_memory)
        print('#' * len(status))
        output_memory = output_memory.split("### Important content ###")[-1].split("\n\n")[0].strip() + "\n"
        if "None" not in output_memory and output_memory not in memory:
            memory += output_memory

    # Execute the chosen action on the device
    if "Open app" in action:
        app_name = action.split("(")[-1].split(")")[0]
        text, coordinate = ocr(screenshot_file, ocr_detection, ocr_recognition)
        for ti in range(len(text)):
            if app_name == text[ti]:
                name_coordinate = [int((coordinate[ti][0] + coordinate[ti][2])/2), int((coordinate[ti][1] + coordinate[ti][3])/2)]
                # Tap one label-height above the app name, where the icon sits
                tap(adb_path, name_coordinate[0], name_coordinate[1] - int(coordinate[ti][3] - coordinate[ti][1]))

    elif "Tap" in action:
        coordinate = action.split("(")[-1].split(")")[0].split(", ")
        x, y = int(coordinate[0]), int(coordinate[1])
        tap(adb_path, x, y)

    elif "Swipe" in action:
        coordinate1 = action.split("Swipe (")[-1].split("), (")[0].split(", ")
        coordinate2 = action.split("), (")[-1].split(")")[0].split(", ")
        x1, y1 = int(coordinate1[0]), int(coordinate1[1])
        x2, y2 = int(coordinate2[0]), int(coordinate2[1])
        slide(adb_path, x1, y1, x2, y2)

    elif "Type" in action:
        if "(text)" not in action:
            text = action.split("(")[-1].split(")")[0]
        else:
            text = action.split(" \"")[-1].split("\"")[0]
        type(adb_path, text)

    elif "Back" in action:
        back(adb_path)

    elif "Home" in action:
        home(adb_path)

    elif "Stop" in action:
        break

    time.sleep(5)

    # Keep the previous screen for reflection, then perceive the new one
    last_perception_infos = copy.deepcopy(perception_infos)
    last_screenshot_file = "./screenshot/last_screenshot.jpg"
    last_keyboard = keyboard
    if os.path.exists(last_screenshot_file):
        os.remove(last_screenshot_file)
    os.rename(screenshot_file, last_screenshot_file)

    perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
    shutil.rmtree(temp_file)
    os.mkdir(temp_file)

    keyboard = False
    for perception_info in perception_infos:
        if perception_info['coordinates'][1] < keyboard_height_limit:
            continue
        if 'ADB Keyboard' in perception_info['text']:
            keyboard = True
            break

    if reflection_switch:
        # Reflection: compare the screens before and after the action
        prompt_reflect = get_reflect_prompt(instruction, last_perception_infos, perception_infos, width, height, last_keyboard, keyboard, summary, action, add_info)
        chat_reflect = init_reflect_chat()
        chat_reflect = add_response_two_image("user", prompt_reflect, chat_reflect, [last_screenshot_file, screenshot_file])
        # Send the reflection chat (with both screenshots), not chat_action
        output_reflect = call_with_local_file(chat_reflect, api_key=qwen_api, model='qwen-vl-plus')
        reflect = output_reflect.split("### Answer ###")[-1].replace("\n", " ").strip()
        chat_reflect = add_response("system", output_reflect, chat_reflect)
        status = "#" * 50 + " Reflection " + "#" * 50
        print(status)
        print(output_reflect)
        print('#' * len(status))

        if 'A' in reflect:
            # The action worked: record it and update the progress
            thought_history.append(thought)
            summary_history.append(summary)
            action_history.append(action)

            prompt_planning = get_process_prompt(instruction, thought_history, summary_history, action_history, completed_requirements, add_info)
            chat_planning = init_memory_chat()
            chat_planning = add_response("user", prompt_planning, chat_planning)
            # Send the planning chat, not chat_action
            output_planning = call_with_local_file(chat_planning, api_key=qwen_api, model='qwen-vl-plus')
            chat_planning = add_response("system", output_planning, chat_planning)
            status = "#" * 50 + " Planning " + "#" * 50
            print(status)
            print(output_planning)
            print('#' * len(status))
            completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()

            error_flag = False

        elif 'B' in reflect:
            # The action led to a wrong page: go back
            error_flag = True
            back(adb_path)

        elif 'C' in reflect:
            # The action changed nothing
            error_flag = True

    else:
        # Reflection disabled: always record the action and update the progress
        thought_history.append(thought)
        summary_history.append(summary)
        action_history.append(action)

        prompt_planning = get_process_prompt(instruction, thought_history, summary_history, action_history, completed_requirements, add_info)
        chat_planning = init_memory_chat()
        chat_planning = add_response("user", prompt_planning, chat_planning)
        output_planning = call_with_local_file(chat_planning, api_key=qwen_api, model='qwen-vl-plus')
        chat_planning = add_response("system", output_planning, chat_planning)
        status = "#" * 50 + " Planning " + "#" * 50
        print(status)
        print(output_planning)
        print('#' * len(status))
        completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()

    os.remove(last_screenshot_file)
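The string handling in the action branch of the loop can be tested without a device by separating parsing from execution. The sketch below reuses the same `split` expressions and the same order of checks. `parse_action` is a hypothetical helper, and the sample action strings are assumptions inferred from those expressions rather than quoted model output.

```python
def parse_action(action):
    """Turn an action string into (name, arguments), checking in the same order as the main loop."""
    if "Open app" in action:
        return "open_app", action.split("(")[-1].split(")")[0]
    if "Tap" in action:
        x, y = action.split("(")[-1].split(")")[0].split(", ")
        return "tap", (int(x), int(y))
    if "Swipe" in action:
        start = action.split("Swipe (")[-1].split("), (")[0].split(", ")
        end = action.split("), (")[-1].split(")")[0].split(", ")
        return "swipe", (int(start[0]), int(start[1]), int(end[0]), int(end[1]))
    if "Type" in action:
        if "(text)" not in action:
            return "type", action.split("(")[-1].split(")")[0]
        return "type", action.split(' "')[-1].split('"')[0]
    for name in ("Back", "Home", "Stop"):
        if name in action:
            return name.lower(), None
    return "unknown", None


# Assumed example strings, shaped to match the split expressions above.
samples = [
    "Open app (Play Store)",
    "Tap (540, 1200)",
    "Swipe (500, 1500), (500, 600)",
    "Type (hello world)",
    "Home",
]
parsed = [parse_action(s) for s in samples]
for sample, result in zip(samples, parsed):
    print(sample, "->", result)
```

Keeping the parser pure makes it possible to try edge cases, such as a reply with no recognizable action, before connecting a phone or emulator.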
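These six steps implement an assistant that works from screenshots and a conversation with the model: it acts on the screen information and the user's instruction, and it has memory and reflection. The demo runs!
II. Learning notes and related problems
1. Key and difficult points
Source: https://datawhaler.feishu.cn/wiki/BbEuwzZMXiWwxbkfFuHcflwrneg?from=from_copylink
2. Problems and preliminary answers
Question 1: Why does an error appear when starting the emulator?
Answer: This warning appears because Android Studio was not installed in the default location on the C drive. Changing the file location and then setting the environment variables fixes it. For the exact steps, see: https://blog.csdn.net/qq_43224762/article/details/121255382
Question 2: Why does an error appear when Android Studio starts?
Answer: For this warning, download the .zip file from the link in the message and move it into the project's wrapper folder. First locate that folder; mine is E:\AndroidStudioProjects\MobileAgent\gradle\wrapper, where MobileAgent is the project folder. Inside wrapper, find the dists folder (create it if it does not exist) and copy the .zip file into it. Then, in the .properties file in the wrapper folder (gradle-wrapper.properties), change distributionUrl to the path of your local .zip file.
Question 3: Why does the following error appear when installing the framework: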
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If using 'conda activate' from a batch script, change your
invocation to 'CALL conda.bat activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- cmd.exe
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
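Answer: This error is easy to fix. Switch from PowerShell (which I was using) to cmd, run the following commands one line at a time, then restart the terminal and activate the environment again: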
CALL conda.bat activate
conda init cmd.exe
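Summary
This is only the summary for Task 1; the summary of the whole series will come in the third post.
That covers the walkthrough of the Mobile-Agent demo, together with some learning notes and the problems I ran into. I will keep studying, dig deeper into how the framework works, design more applications, and keep sharing.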