【2024】Datawhale AI夏令营-从零上手Mobile Agent-Task1笔记

本文链接：https://blog.csdn.net/Mocode/article/details/141650999

Task1主要任务是跑通Mobile-Agent Demo。

一、主要步骤

1、申领大模型API

教程推荐使用阿里云百炼平台，申领个人的大模型API后，可通过API KEY调用平台上的视觉-语言大模型。后续使用的视觉-语言大模型为qwen-vl-plus。

2、下载Android Studio

3、在Android Studio中新建项目，创建并启动虚拟手机

虚拟手机启动后，为后续代码的运行做如下准备：将Calendar应用拖至虚拟手机的桌面。建议在运行代码文件前先打开一次Calendar，因为第一次打开应用往往会展示应用的欢迎页面而非直接进入功能页面，这些欢迎页面可能会影响大模型的视觉感知和后续动作。

4、安装Mobile-Agent框架

使用Anaconda创建虚拟环境以安装Mobile-Agent所需的库。

5、运行Mobile-Agent示例代码

使用Pycharm或VSCode打开Mobile-Agent项目文件夹，配置好虚拟环境
修改run.py中的adb_path和qwen_api变量。

adb_path是电脑中adb工具（adb.exe）的存储位置，一般为C:/Users/{your username}/AppData/Local/Android/Sdk/platform-tools/adb.exe；qwen_api则是在阿里云百炼平台申请的API KEY。
运行run.py

第一次运行时会下载较多工具（总共约4.3GB），下载时间较长。下载结束后，静待一段时间可以看到虚拟手机的界面发生变化，说明大模型正在控制手机以完成指令要求。

详细的步骤可参考Datawhale AI夏令营的Task1教程：Task1：跑通Mobile-Agent Demo。

二、Mobile-Agent框架

generate_local

generate_api

初始化操作历史记录，向其中添加一个描述AI助手角色的系统提示。

def init_action_chat():
    '''
    args:
    	None    
    returns:
   		operation_history: list，包含一个字典，表示系统提供的提示文本
    '''
    
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user\'s instruction."
    operation_history.append({'role': 'system','content': [{'text': sysetm_prompt}]})
    return operation_history

初始化操作历史记录，向其中添加一个描述AI助手角色的系统提示。（代码和init_memory_chat的一样）

def init_reflect_chat():
    '''
    args:
    	None    
    returns:
    	operation_history: list，包含一个字典，表示系统提供的提示文本
    '''
    
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append({'role': 'system','content': [{'text': sysetm_prompt}]})
    return operation_history

为一张图片（或单纯的prompt）添加操作记录

def add_response(role, prompt, chat_history, image=None):
    '''
    args:
        role: str，表示提供提示的角色，有system和user两种
        prompt:？
        chat_history: []，其中元素为形如{'role': str, 'content': [{'text': str}]}的字典，表示历史操作信息
        image: str（可选），图片的存储路径   
    returns:
    	new_chat_history: []，其中元素为形如{'role': str, 'content': [{'text': str}]}的字典，表示历史操作信息。与chat_history相比，以role角色添加了新的prompt记录
    '''
    
    new_chat_history = copy.deepcopy(chat_history)
    if image: ## 如果有图片，将prompt信息作为历史操作信息的文本，同时提供图片的路径
        content = [
            {
                'text': prompt
            },
            {
                'image': image
            },
        ]
    else: ## 如果没有图片，仅将prompt信息作为历史操作信息的文本
        content = [
            {
            "text": prompt
            },
        ]
    new_chat_history.append({'role': role, 'content': content}) ## 以role的角色提供新的操作信息
    return new_chat_history

为两张图片添加操作记录

def add_response_two_image(role, prompt, chat_history, image):
    '''
    args:
        role: str，表示提供提示的角色，有system和user两种
        prompt:？
        chat_history: []，其中元素为形如{'role': str, 'content': [{'text': str}]}的字典，表示历史操作信息
        image: [str, str]，对应上一张屏幕截图、初始屏幕截图两张图片的存储路径
    returns:
    	new_chat_history: []，其中元素为形如{'role': str, 'content': [{'text': str}]}的字典，表示历史操作信息。与chat_history相比，以role角色添加了新的prompt记录
    '''
        
    new_chat_history = copy.deepcopy(chat_history)
    content = [
        {
            "text": prompt
        },
        {
            'image': image[0]
        },
        {
            'image': image[1]
        },
    ]

    new_chat_history.append([role, content])
    return new_chat_history

获取文件夹中的所有文件（名称）：

def get_all_files_in_folder(folder_path):
    '''
    args:
    	folder_path: str，文件夹路径
    returns:
    	file_list: list[str]，文件夹中所有文件的名称
    '''
    
    file_list = []
    for file_name in os.listdir(folder_path):
        file_list.append(file_name)
    return file_list

在图片中根据坐标信息绘制椭圆

def draw_coordinates_on_image(image_path, coordinates):
    '''
    args:
    	image_path: str，图片的存储路径
    	coordinates: [[float, float]]，坐标信息，即(x, y)
    returns:
    	output_image_path: str，修改后（在坐标上添加椭圆）的图片的存储路径
    '''
    
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image) ## 创建ImageDraw.Draw对象，用于在图片上绘制图形。这个对象与image变量关联，绘制的图形将会出现在image变量代表的图片上
    ## 根据坐标信息，在图片上绘制椭圆，椭圆的中心点为输入的坐标信息
    point_size = 10
    for coord in coordinates:
        draw.ellipse((coord[0] - point_size, coord[1] - point_size, coord[0] + point_size, coord[1] + point_size), fill='red')
    ## 保存修改后的图片
    output_image_path = './screenshot/output_image.png'
    image.save(output_image_path)
    return output_image_path

裁剪图片，保存裁剪出的图片。

def crop(image, box, i):
    '''
    args:
    	image: str，图片的存储路径
    	box: [float, float, float, float]（大致），裁剪框的左下角和右上角的坐标（大致）
    	i: int，裁剪目标的序号
    returns:
    	None
    '''
    
    image = Image.open(image)
    x1, y1, x2, y2 = int(box[0]), int(box[1]), int(box[2]), int(box[3])
    if x1 >= x2-10 or y1 >= y2-10: ## 如果裁剪框的宽度或高度超过10，不进行剪裁，直接返回
        return
    cropped_image = image.crop((x1, y1, x2, y2))
    cropped_image.save(f"./temp/{i}.jpg") ## 根据图片的序号存储裁剪出的图片

def generate_local(tokenizer, model, image_file, query):
    '''
    args:
    	tokenizer: 分词器
    	model: 视觉-语言大模型
    	image_file: str，图片的存储路径
    	query: str，需要回答的请求（如：This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence.）
    returns:
    	response: 本地视觉-语言大模型对请求的回答内容
    '''
    
    query = tokenizer.from_list_format([
        {'image': image_file},
        {'text': query},
    ])
    response, _ = model.chat(tokenizer, query=query, history=None)
    return response

def merge_text_blocks(text_list, coordinates_list):
    '''
    args:
    	text_list: [str, str, ...]，每个str表示一个文本信息
    	coordinates_list: [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]，每个[x1, y1, x2, y2]表示一个坐标信息
    returns:
    	merged_text_blocks: [str, str, ...]，每个str表示合并若干个文本块后得到的文本信息
    	merged_coordinates: [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]，每个[x1, y1, x2, y2]表示合并若干个文本块后得到的坐标信息
    '''
    
    merged_text_blocks = []
    merged_coordinates = []

    sorted_indices = sorted(range(len(coordinates_list)), key=lambda k: (coordinates_list[k][1], coordinates_list[k][0])) ## 对坐标的索引进行排序，依据y坐标和x坐标进行排序
    sorted_text_list = [text_list[i] for i in sorted_indices] ## 根据坐标的排序索引对应排序列表中的文本内容
    sorted_coordinates_list = [coordinates_list[i] for i in sorted_indices] ## 根据坐标的排序索引对应排序列表中的坐标信息

    num_blocks = len(sorted_text_list)
    merge = [False] * num_blocks

    for i in range(num_blocks):
        if merge[i]: ## 若第i个坐标的文本与坐标信息已合并，跳过
            continue
        
        anchor = i ## 将当前的索引值设置为锚点
        
        group_text = [sorted_text_list[anchor]] ## 获取锚点的文本信息
        group_coordinates = [sorted_coordinates_list[anchor]] ## 获取锚点的坐标

        for j in range(i+1, num_blocks): ## 遍历锚点之后的其他点
            if merge[j]:
                continue

            if abs(sorted_coordinates_list[anchor][0] - sorted_coordinates_list[j][0]) < 10 and \ ## 当前点和锚点的x坐标之差＜10
            sorted_coordinates_list[j][1] - sorted_coordinates_list[anchor][3] >= -10 and sorted_coordinates_list[j][1] - sorted_coordinates_list[anchor][3] < 30 and \ ## 当前点和锚点的y坐标之差介于[-10, 30]
            abs(sorted_coordinates_list[anchor][3] - sorted_coordinates_list[anchor][1] - (sorted_coordinates_list[j][3] - sorted_coordinates_list[j][1])) < 10: ## 锚点的高度与当前点的高度之差＜10
                group_text.append(sorted_text_list[j]) ## 获取当前点的文本信息
                group_coordinates.append(sorted_coordinates_list[j]) ## 获取当前点的坐标信息
                merge[anchor] = True
                anchor = j
                merge[anchor] = True

        merged_text = "\n".join(group_text) ## 将当前组的所有文本块连接成一个字符串，使用换行符作为分隔符
        min_x1 = min(group_coordinates, key=lambda x: x[0])[0] ## 取当前组的所有文本块中的最小x值，下同
        min_y1 = min(group_coordinates, key=lambda x: x[1])[1]
        max_x2 = max(group_coordinates, key=lambda x: x[2])[2]
        max_y2 = max(group_coordinates, key=lambda x: x[3])[3]

        merged_text_blocks.append(merged_text) ## 若干个小文本块合并后的文本信息
        merged_coordinates.append([min_x1, min_y1, max_x2, max_y2]) ## 若干个小文本块合并后的坐标信息

    return merged_text_blocks, merged_coordinates

def process_image(image, query):
    '''
    args:
    	image: str，图片的存储路径
    	query: str，需要回答的请求（如：This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence.）
    returns:
    	response: str，（远程调用）视觉-大语言模型返回的回答
    '''
    
    dashscope.api_key = qwen_api ## qwen_api为自己提供的百炼云平台的API KEY
    image = "file://" + image
    ## 组装信息
    messages = [{
        'role': 'user',
        'content': [
            {
                'image': image
            },
            {
                'text': query
            },
        ]
    }]
    response = MultiModalConversation.call(model=caption_model, messages=messages) ## 调用视觉-语言大模型，针对信息生成回答。此处的caption_model='qwen-vl-plus'
    
    try:
        response = response['output']['choices'][0]['message']['content'][0]["text"]
    except:
        response = "This is an icon."
    
    return response

def generate_api(images, query):
    '''
    args:
    	images: [str, str, ...]，图片的存储路径的列表
    	query: str，需要回答的请求（如：This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence.）
    returns:
    	icon_map: {}，键是图片索引，值是process_image函数的处理结果
    '''
    
    icon_map = {}
    with concurrent.futures.ThreadPoolExecutor() as executor: ## 创建线程池，并发执行多个任务
        futures = {executor.submit(process_image, image, query): i for i, image in enumerate(images)} ## 并发执行process_image函数，将image和query作为该函数的参数。submit方法返回一个Future对象，代表操作的结果。字典推导式将这些Future对象映射到它们对应的图片索引i。
        
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            response = future.result()
            icon_map[i + 1] = response ## 使用图片索引作为键，process_image函数的处理结果，即（远程调用）视觉-大语言模型针对query生成的回答为值
    
    return icon_map

使用多模态对话模型处理聊天信息，返回模型生成的响应。

def call_with_local_file(chat, api_key, model):
    '''
    args:
        chat: 
        model:
        api_key:
    returns:
    	多模态对话模型生成的响应的文本
    '''
    
    response = MultiModalConversation.call(model=model, messages=chat, api_key=api_key)
    return response.output.choices[0].message.content[0]["text"]

def get_perception_infos(adb_path, screenshot_file):
    '''
    args:
        adb_path: str，adb工具的路径
        screenshot_file: str，屏幕截图的存储路径 
    returns:
        perception_infos: list，感知信息列表，每个元素都是形如{"text": str, "coordinates": [int, int]}的字典，表示屏幕截图中文本块或图标的文本内容和中心点坐标
        width: float（不一定是float），屏幕截图的宽度
        height: float（不一定是float），屏幕截图的高度
    '''
    
    get_screenshot(adb_path) ## 调用MobileAgent.controller下的get_screenshot方法，获取屏幕截图
    
    width, height = Image.open(screenshot_file).size ## 使用PIL库打开传入的截图，获取其宽度和高度
    
    text, coordinates = ocr(screenshot_file, ocr_detection, ocr_recognition) ## 使用OCR模型检测、识别屏幕截图中的文字，返回识别到的文本和对应的坐标信息
    text, coordinates = merge_text_blocks(text, coordinates) ## 合并识别到的文本和对应的坐标信息
    
    center_list = [[(coordinate[0]+coordinate[2])/2, (coordinate[1]+coordinate[3])/2] for coordinate in coordinates] ## 计算每个文本的中心点坐标
    draw_coordinates_on_image(screenshot_file, center_list) ## 在屏幕截图上绘制文本的中心点
    
    ## 生成初步感知信息：遍历OCR模型检测、识别到的所有文本，生成感知信息（含文本内容和对应的坐标）
    perception_infos = []
    for i in range(len(coordinates)):
        perception_info = {"text": "text: " + text[i], "coordinates": coordinates[i]}
        perception_infos.append(perception_info)
    
    ## 使用图标检测模型检测屏幕截图中的图标，返回图标的坐标信息，并将图标的感知信息添加到感知信息列表中（含图标及对应的坐标）
    coordinates = det(screenshot_file, "icon", groundingdino_model)
    for i in range(len(coordinates)):
        perception_info = {"text": "icon", "coordinates": coordinates[i]}
        perception_infos.append(perception_info)
    
    image_box = [] ## 存储屏幕截图中图标的坐标信息（坐标信息可以确定图标的框）
    image_id = [] ## 存储屏幕截图中图标的id（将遍历顺序作为id）
    for i in range(len(perception_infos)):
        if perception_infos[i]['text'] == 'icon':
            image_box.append(perception_infos[i]['coordinates'])
            image_id.append(i)
    ## 根据检测到的图标的坐标信息，从屏幕截图中裁剪得到图标的图片
    for i in range(len(image_box)):
        crop(screenshot_file, image_box[i], image_id[i])
    images = get_all_files_in_folder(temp_file) ## 获取裁剪出的图标的图片文件。temp_file = 'temp'
    if len(images) > 0:
        images = sorted(images, key=lambda x: int(x.split('/')[-1].split('.')[0]))
        image_id = [int(image.split('/')[-1].split('.')[0]) for image in images]
        icon_map = {} ## 存储图标的描述信息，键是图标图片的遍历顺序id，值是图标图片的描述信息
        prompt = 'This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence.'
        ## 根据说明文字生成方法的不同（有local和api两种），调用generate_local或generate_api方法为图标图片生成描述信息
        if caption_call_method == "local":
            for i in range(len(images)):
                image_path = os.path.join(temp_file, images[i])
                icon_width, icon_height = Image.open(image_path).size ## 根据图标图片的路径打开图标图片，并获取图片的宽度和高度
                if icon_height > 0.8 * height or icon_width * icon_height > 0.2 * width * height:
                    des = "None" ## 当前图标的高度超过截图高度的80%，或图标的面积超过截图面积的20%，不对图标生成描述信息
                else:
                    des = generate_local(tokenizer, model, image_path, prompt) ## 根据图标图片生成描述信息
                icon_map[i+1] = des
        else:
            for i in range(len(images)):
                images[i] = os.path.join(temp_file, images[i])
            icon_map = generate_api(images, prompt) ## 根据图标图片生成描述信息
        ## 将感知信息列表中所有图标的文本内容都更新为生成的图标图片描述信息
        for i, j in zip(image_id, range(1, len(image_id)+1)):
            if icon_map.get(j):
                perception_infos[i]['text'] = "icon: " + icon_map[j]
    ## 将感知信息列表中每个元素（文本或者图标）的坐标信息都修改为对应元素的中心点坐标
    for i in range(len(perception_infos)):
        perception_infos[i]['coordinates'] = [int((perception_infos[i]['coordinates'][0]+perception_infos[i]['coordinates'][2])/2), int((perception_infos[i]['coordinates'][1]+perception_infos[i]['coordinates'][3])/2)]
        
    return perception_infos, width, height

iter = 0
while True:
    iter += 1
    if iter == 1: ## 执行截图操作，获取屏幕信息，并检查是否存在键盘
        ## 对屏幕进行截图，并获取关于屏幕截图的感知信息
        screenshot_file = "./screenshot/screenshot.jpg"
        perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
        shutil.rmtree(temp_file)
        os.mkdir(temp_file)
        ## 判断是否存在键盘（如果屏幕信息中包含特定文本（如'ADB Keyboard'），则存在键盘）
        keyboard = False
        keyboard_height_limit = 0.9 * height
        for perception_info in perception_infos:
            if perception_info['coordinates'][1] < keyboard_height_limit:
                continue
            if 'ADB Keyboard' in perception_info['text']:
                keyboard = True
                break
    ## 调用get_action_prompt函数获取操作提示
    prompt_action = get_action_prompt(instruction, perception_infos, width, height, keyboard, summary_history, action_history, summary, action, add_info, error_flag, completed_requirements, memory)
    ## 初始化聊天操作信息
    chat_action = init_action_chat()
    ## 在chat_action（已有聊天操作信息）的基础上，针对屏幕截图和操作提示，添加新的操作记录（含图片和操作提示）
    chat_action = add_response("user", prompt_action, chat_action, screenshot_file)
    ## 使用qwen-vl-plus这一视觉-语言大模型对聊天操作信息生成回答
    output_action = call_with_local_file(chat=chat_action, api_key=qwen_api, model='qwen-vl-plus')
    ## 从回答中提取思维（thought）、摘要（summary）和动作（action）
    thought = output_action.split("### Thought ###")[-1].split("### Action ###")[0].replace("\n", " ").replace(":", "").replace("  ", " ").strip()
    summary = output_action.split("### Operation ###")[-1].replace("\n", " ").replace("  ", " ").strip()
    action = output_action.split("### Action ###")[-1].split("### Operation ###")[0].replace("\n", " ").replace("  ", " ").strip()
    ## 在chat_action（已有聊天操作信息）的基础上，添加新的操作记录（将大模型的完整回答作为系统角色的回答）
    chat_action = add_response("system", output_action, chat_action)
    status = "#" * 50 + " Decision " + "#" * 50
    print(status)
    print(output_action)
    print('#' * len(status))
    
    if memory_switch: ## 如果启用了记忆功能（memory_switch），则还会获取记忆提示，并通过API获取记忆响应，将其添加到聊天中
        prompt_memory = get_memory_prompt(insight) ## 获取记忆提示模板，insight=""
        chat_action = add_response("user", prompt_memory, chat_action) ## 在chat_action（已有聊天操作信息）的基础上，添加新的操作记录（将记忆提示模板作为用户角色的追问）
        output_memory = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus') ## 使用qwen-vl-plus这一视觉-语言大模型对聊天操作信息生成回答（回答对记忆的追问）
        chat_action = add_response("system", output_memory, chat_action) ## 在chat_action（已有聊天操作信息）的基础上，添加新的操作记录（将大模型的完整回答作为系统角色的回答）
        status = "#" * 50 + " Memory " + "#" * 50
        print(status)
        print(output_memory)
        print('#' * len(status))
        output_memory = output_memory.split("### Important content ###")[-1].split("\n\n")[0].strip() + "\n" ## 从回答中提取重要内容（Important content）
        if "None" not in output_memory and output_memory not in memory:
            memory += output_memory ## 将此处的记忆内容添加到memory中（初始状态下，memory=""）
    
    ## 根据action变量的值，执行不同的动作，例如打开应用、点击、滑动、输入文本、返回、回到主屏幕等
    if "Open app" in action: ## 打开应用
        app_name = action.split("(")[-1].split(")")[0]
        text, coordinate = ocr(screenshot_file, ocr_detection, ocr_recognition)
        tap_coordinate = [0, 0]
        for ti in range(len(text)):
            if app_name == text[ti]:
                name_coordinate = [int((coordinate[ti][0] + coordinate[ti][2])/2), int((coordinate[ti][1] + coordinate[ti][3])/2)]
                tap(adb_path, name_coordinate[0], name_coordinate[1]- int(coordinate[ti][3] - coordinate[ti][1]))# 
    
    elif "Tap" in action: ## 点击
        coordinate = action.split("(")[-1].split(")")[0].split(", ")
        x, y = int(coordinate[0]), int(coordinate[1])
        tap(adb_path, x, y)
    
    elif "Swipe" in action: ## 滑动
        coordinate1 = action.split("Swipe (")[-1].split("), (")[0].split(", ")
        coordinate2 = action.split("), (")[-1].split(")")[0].split(", ")
        x1, y1 = int(coordinate1[0]), int(coordinate1[1])
        x2, y2 = int(coordinate2[0]), int(coordinate2[1])
        slide(adb_path, x1, y1, x2, y2)
        
    elif "Type" in action: ## 输入文本
        if "(text)" not in action:
            text = action.split("(")[-1].split(")")[0]
        else:
            text = action.split(" \"")[-1].split("\"")[0]
        type(adb_path, text)
    
    elif "Back" in action: ## 返回
        back(adb_path)
    
    elif "Home" in action: ## 回到主屏幕
        home(adb_path)
        
    elif "Stop" in action: ## 任务结束，程序终止
        break
    
    ## 本轮迭代即将结束，保存当前感知信息、屏幕截图和键盘信息，并获取新的感知信息和键盘信息
    time.sleep(5)
    
    last_perception_infos = copy.deepcopy(perception_infos)
    last_screenshot_file = "./screenshot/last_screenshot.jpg"
    last_keyboard = keyboard
    if os.path.exists(last_screenshot_file):
        os.remove(last_screenshot_file)
    os.rename(screenshot_file, last_screenshot_file) ## 将当前的截图文件重命名为上一次的截图文件
    
    perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
    shutil.rmtree(temp_file)
    os.mkdir(temp_file)
    keyboard = False
    for perception_info in perception_infos:
        if perception_info['coordinates'][1] < keyboard_height_limit:
            continue
        if 'ADB Keyboard' in perception_info['text']:
            keyboard = True
            break
    
    if reflection_switch: ## 如果启用了反思功能（reflection_switch），则会根据上一次和当前的感知信息获取反思提示，并通过API获取反思响应。根据反思响应的内容，可能会更新错误标志（error_flag），执行返回操作或更新已完成的要求列表（completed_requirements）
        ## 根据上一次和当前的感知信息获取反思提示
        prompt_reflect = get_reflect_prompt(instruction, last_perception_infos, perception_infos, width, height, last_keyboard, keyboard, summary, action, add_info) 
        ## 初始化聊天操作（反思）信息
        chat_reflect = init_reflect_chat()
        ## 在chat_reflect（已有聊天操作信息）的基础上，针对上一次、当前屏幕截图和反思操作提示，添加新的操作记录（含两张图片和操作提示）
        chat_reflect = add_response_two_image("user", prompt_reflect, chat_reflect, [last_screenshot_file, screenshot_file])
		## 使用qwen-vl-plus这一视觉-语言大模型对聊天操作信息生成回答
        output_reflect = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
        ## 从回答中提取反思内容（Answer）
        reflect = output_reflect.split("### Answer ###")[-1].replace("\n", " ").strip()
        ## 在chat_reflect（已有聊天操作信息）的基础上，添加新的操作记录（将大模型的完整回答作为系统角色的回答）
        chat_reflect = add_response("system", output_reflect, chat_reflect)
        status = "#" * 50 + " Reflcetion " + "#" * 50
        print(status)
        print(output_reflect)
        print('#' * len(status))
    
        if 'A' in reflect: ## 若反思内容中包含A（这可能表示模型对自己回答的评价，A表示认为自己回答较好）
            thought_history.append(thought)
            summary_history.append(summary)
            action_history.append(action)
            ## 根据已有的思维、摘要和动作，新建记忆内容并使用模型进行回答
            prompt_planning = get_process_prompt(instruction, thought_history, summary_history, action_history, completed_requirements, add_info)
            chat_planning = init_memory_chat()
            chat_planning = add_response("user", prompt_planning, chat_planning)
            output_planning = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
            chat_planning = add_response("system", output_planning, chat_planning)
            status = "#" * 50 + " Planning " + "#" * 50
            print(status)
            print(output_planning)
            print('#' * len(status))
            completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()
            
            error_flag = False
        ## 若反思内容中包含B或C，不会记录本轮迭代产生的思维、摘要和动作
        elif 'B' in reflect:
            error_flag = True
            back(adb_path)            
        elif 'C' in reflect:
            error_flag = True
    
    else: ## 若没有开启反思功能，每次迭代产生的思维、摘要和动作都会直接进行记录，然后根据上一次和当前的感知信息获取反思提示，并通过API获取反思响应。根据反思响应的内容，可能会更新错误标志（error_flag），执行返回操作或更新已完成的要求列表（completed_requirements）
        thought_history.append(thought)
        summary_history.append(summary)
        action_history.append(action)
        
        prompt_planning = get_process_prompt(instruction, thought_history, summary_history, action_history, completed_requirements, add_info)
        chat_planning = init_memory_chat()
        chat_planning = add_response("user", prompt_planning, chat_planning)
        output_planning = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
        chat_planning = add_response("system", output_planning, chat_planning)
        status = "#" * 50 + " Planning " + "#" * 50
        print(status)
        print(output_planning)
        print('#' * len(status))
        completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()
         
    os.remove(last_screenshot_file)

三、曲折

GPU的必要性

使用GPU

安装Android Studio和启动虚拟手机的过程都比较顺利，在后面创建虚拟环境和下载的时候遇到一些问题。

首先，在配置python环境前观察了run.py和win_requirements.txt，发现需要安装torch和tensorflow。这两个库一般需要N卡GPU才能使用，但是我的本地电脑（安装了Android Studio和启动虚拟手机的设备）没有N卡，只有远程服务器上有N卡，但是run.py中需要填写adb_path，于是思考：当程序运行在远程服务器、虚拟设备和adb在本地的时候，还可以运行吗？

查阅GPT，GPT给出了一个方案：

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

我觉得使用SSH隧道转发是比较适合我的，于是追问如何使用SSH隧道转发：

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

这里的方法没有进行验证。

不使用GPU

后来想到pytorch、tensorflow都有支持CPU的版本，因此觉得应该不影响项目的运行，于是安装配置环境。或者远程服务器和本地同步安装配置环境。

远程服务器上安装遇到的问题

在远程服务器中，首先遇到问题：

执行git lfs install失败，报错：

git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
        log

尝试安装git-lfs：

sudo apt-get install git-lfs

报错：

eading package lists... Done
Building dependency tree       
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 update-manager : Depends: update-manager-core (= 1:20.04.10.22) but 1:20.04.10.21 is to be installed
 update-manager-core : Depends: python3-update-manager (= 1:20.04.10.21) but 1:20.04.10.22 is to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

没有继续排错，后面因为git-lfs没安装好，在执行pip install -r win_requirements.txt时也失败了。后面没有再纠结在远程服务器上安装。

本地安装遇到的问题

使用vscode

可以创建环境，但不能激活环境，即可以执行conda create -n moblieagent python=3.9.19，但不能执行conda activate moblieagent，执行的报错信息如下：

usage: conda-script.py [-h] [--no-plugins] [-V] COMMAND ...
notices', 'package', 'remove', 'uninstall', 'rename', 'run', 'search', 'update', 'upgrade', 'build', 'convert', 'debug', 'develop', 'doctor', 'index', 'inspect', 'metapackage', 'render', 'skeleton', 'repo', 'verify', 'content-trust', 'token', 'env', 'pack', 'server')

这可能与未在vscode中配置anaconda虚拟环境相关的设置有关。暂未解决。

使用Pycharm

因为以前使用Pycharm可以给项目配置anaconda环境，因此改用Pycharm。使用anaconda的prompt窗口，在命令行中激活虚拟环境，进入到代码文件夹中，然后执行pip install -r win_requirements.txt。等待一段时间就安装好了。

遇到的问题

使用vscode

可以创建环境，但不能激活环境，即可以执行conda create -n moblieagent python=3.9.19，但不能执行conda activate moblieagent，执行的报错信息如下：

usage: conda-script.py [-h] [--no-plugins] [-V] COMMAND ...
notices', 'package', 'remove', 'uninstall', 'rename', 'run', 'search', 'update', 'upgrade', 'build', 'convert', 'debug', 'develop', 'doctor', 'index', 'inspect', 'metapackage', 'render', 'skeleton', 'repo', 'verify', 'content-trust', 'token', 'env', 'pack', 'server')

这可能与未在vscode中配置anaconda虚拟环境相关的设置有关。暂未解决。