Data Labeling – Training on Cats

Captcha_Bypass

Last modified: 2024-07-30 15:50:48

Tags: python, big data

First published: 2024-07-29 16:39:36

Article link: https://blog.csdn.net/Captcha_Bypass/article/details/140773299

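At some point, while digging into automation, you run into the need for data labeling, even though just a couple of weeks ago you and the phrase "data labeling" were standing in different rooms at a party called "making money online". Or rather, you were standing by the pool while data labeling was up on a third-floor balcony, smoking with machine learning experts. How did we meet? Roughly like this: someone pushed it off the balcony into the pool, and I helped it climb out, not caring that I got soaked.

So now the two of you are sitting in the kitchen, sharing one cigarette, trying to figure out what each of you does and how you could be useful to each other.

In short, why I needed it doesn't matter; how it worked out is far more interesting. You've probably had enough of the preamble (or not), so let's get to the point.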

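The task

The phrase in the title, "training on cats", is not a metaphor. It states exactly what has to be done: determine what (which animal) is shown in a photo. What the customer does with that information afterwards is beside the point.

From a technical standpoint, the task is as clear as it gets and, most importantly, as simple as it gets. All that's left is to implement it, and we'll do that through a data labeling service. Seriously, was I supposed to write labeling tools by hand? I wouldn't even know where to start.

So we're taking the simplified route.

The task is set and the approach is chosen, so let's go over the details:

As input, we submit a set of photos and pictures of various animals. The expected response is a text description of the animal shown in each image. There is a lot of data, so uploading it by hand isn't an option. We'll send everything through the API and write a simple script to do it.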

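The script

First, import the libraries we need. The script uses requests for HTTP requests, base64 for encoding images, os for working with the file system, and json for handling JSON data.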

import requests
import base64
import os
import json
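
To make the encoding step concrete, here is a minimal sketch of how raw image bytes become the `data:` URI string that the script sends. The helper name `to_data_uri` is my own, used only for illustration; it is not part of the service's API. The MIME type is guessed with the standard `mimetypes` module, with JPEG as the fallback.

```python
import base64
import mimetypes


def to_data_uri(file_name, raw_bytes):
    """Build a data URI from a file name and its raw bytes (hypothetical helper)."""
    # Guess the MIME type from the extension; fall back to JPEG if it is unknown.
    mime_type, _ = mimetypes.guess_type(file_name)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Three bytes of text stand in for real image data here.
print(to_data_uri("cat.png", b"cat"))  # data:image/png;base64,Y2F0
```

Base64 output is roughly a third larger than the input, which is worth remembering when the photos are big.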

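Function for creating a task

Now let's write the function that creates a task on the data labeling service's server: create_task. It takes the API URL, the project ID, the image path, and the API key.

Steps:

Open the image and encode it in base64.
Build the task specification (task_spec) containing the encoded image.
Create the payload for the request.
Set the request headers, including the API key.
Send a POST request to the 2Captcha server.
Handle the server's response and return the result.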

def create_task(api_url, project_id, image_path, api_key):
    """Create one labeling task for the image at image_path; return True on success."""
    try:
        # Read the image and encode it as base64 text
        with open(image_path, "rb") as image_file:
            encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

        # The script accepts .png as well as .jpg/.jpeg, so label the data URI accordingly
        mime_type = "image/png" if image_path.lower().endswith(".png") else "image/jpeg"
        image_data = f"data:{mime_type};base64,{encoded_image}"

        task_spec = [
            {
                "image_with_animal": image_data
            }
        ]

        payload = {
            "project_id": project_id,
            "task_spec": task_spec  # Array with one object inside
        }

        headers = {
            "Content-Type": "application/json",
            "Authorization": api_key
        }

        # Log the data before sending (this includes the full base64 string)
        print("Data to be sent:", json.dumps(payload, indent=4))

        response = requests.post(api_url + "/tasks", json=payload, headers=headers)

        if response.status_code == 201:
            print(f"The task was successfully created for the file {image_path}")
            return True
        else:
            print(f"Error when creating a task for the file {image_path}: {response.status_code}, {response.text}")
            return False
    except Exception as e:
        print(f"An error occurred when creating a task for the file {image_path}: {str(e)}")
        return False
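
One practical note: create_task prints the whole payload before sending it, and for a real photo that means a very long base64 string in the console. As a hedged alternative, the sketch below shortens long strings before logging. The name `preview_payload` and the 40-character cut-off are my own choices, not something the service requires.

```python
import json


def preview_payload(payload, max_chars=40):
    """Return the payload as JSON with long strings shortened (hypothetical helper)."""
    def shorten(value):
        # Long strings are cut and annotated with their full length.
        if isinstance(value, str) and len(value) > max_chars:
            return f"{value[:max_chars]}... ({len(value)} chars)"
        if isinstance(value, dict):
            return {key: shorten(item) for key, item in value.items()}
        if isinstance(value, list):
            return [shorten(item) for item in value]
        return value

    return json.dumps(shorten(payload), indent=4)


# Same payload shape as create_task builds; the long string stands in for a photo.
example = {
    "project_id": 64,
    "task_spec": [{"image_with_animal": "data:image/jpeg;base64," + "A" * 5000}],
}
print(preview_payload(example))
```

The original payload is left untouched, so the shortened copy is safe to print right before the real request goes out.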

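Function for validating the project ID

While testing the script, I had to add some foolproofing. The validate_project_id function checks whether the given project ID is correct: it sends a GET request to the server and returns the result of the check.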

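def validate_project_id(api_url, project_id, api_key):
    """Return True if the server answers 200 for the given project ID."""
    headers = {
        "Authorization": api_key
    }
    # Ask the server for the project; any status other than 200 counts as invalid
    response = requests.get(f"{api_url}/projects/{project_id}", headers=headers)
    return response.status_code == 200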
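
Since validate_project_id only returns True or False, a rejected API key and a wrong project ID look the same from the outside. The sketch below maps a status code to a short hint. It relies on the standard HTTP meanings of 401, 403, and 404; whether this particular service actually answers with those codes is my assumption and is not confirmed by anything above. The name `explain_status` is hypothetical.

```python
def explain_status(status_code):
    """Map an HTTP status code to a short hint (standard HTTP meanings only)."""
    hints = {
        200: "project found",
        401: "the API key was rejected",
        403: "the API key has no access to this project",
        404: "no project with this ID",
    }
    return hints.get(status_code, f"unexpected response ({status_code})")


for code in (200, 401, 404, 500):
    print(code, "->", explain_status(code))
```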

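Image processing function

Since the images are not uploaded to the project all at once but in several batches, we need a function that processes them and checks for duplicates. The process_images function handles every image in the given directory: it validates the project ID, reads the images, checks whether they have already been sent, and creates tasks for the new ones.

Steps:

Validate the project ID.
Load the history of images already sent.
Iterate over all files in the given directory.
Check whether each file is an image and whether it has been sent before.
Create a task for every new image.
Update the history of sent images.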

def process_images(api_url, project_id, images_dir, api_key):
    try:
        if not validate_project_id(api_url, project_id, api_key):
            print(f"Incorrect `project_id`: {project_id}")
            return

        sent_images = set()
        # File that stores the names of images already sent
        history_file = "sent_images.json"

        # Load the history of sent images if the file exists
        if os.path.exists(history_file):
            with open(history_file, "r") as file:
                sent_images = set(json.load(file))

        # Go through every file in the directory
        for filename in os.listdir(images_dir):
            image_path = os.path.join(images_dir, filename)
            # Check that the file is an image and has not been sent before
            if os.path.isfile(image_path) and filename.lower().endswith(('.png', '.jpg', '.jpeg')) and filename not in sent_images:
                print(f"File processing: {image_path}")
                if create_task(api_url, project_id, image_path, api_key):
                    sent_images.add(filename)

        # Update the history of sent images
        with open(history_file, "w") as file:
            json.dump(list(sent_images), file)
        print("Process completed successfully.")
    except Exception as e:
        print(f"An error occurred while processing images: {str(e)}")
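
process_images writes the history file once, at the very end, so if a long run is interrupted the names of images sent so far are not recorded and those images would be sent again next time. Here is a minimal sketch of one way around that: save the history after every successful task, writing to a temporary file first and then swapping it in with `os.replace`. The helper names `load_history` and `save_history` are mine, and this is a variation on the script, not the author's method.

```python
import json
import os
import tempfile


def load_history(history_file):
    """Return the set of already-sent file names, or an empty set if no history exists."""
    if not os.path.exists(history_file):
        return set()
    with open(history_file, "r", encoding="utf-8") as file:
        return set(json.load(file))


def save_history(history_file, sent_images):
    """Write the history to a temporary file, then swap it in with os.replace."""
    directory = os.path.dirname(os.path.abspath(history_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as file:
        json.dump(sorted(sent_images), file)
    os.replace(tmp_path, history_file)


# Demo in a throwaway directory: record one file name at a time.
with tempfile.TemporaryDirectory() as demo_dir:
    history_path = os.path.join(demo_dir, "sent_images.json")
    sent = load_history(history_path)
    for name in ("cat.jpg", "dog.png"):
        sent.add(name)                    # in the real loop: only after create_task succeeds
        save_history(history_path, sent)  # saved after every file, not once at the end
    reloaded = load_history(history_path)
    print(sorted(reloaded))
```

Note that this still tracks images by file name, exactly as the script does, so the same picture saved under two names would be sent twice.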

Main module

In the main module, we set the parameters (API URL, project ID, image directory, and API key) and then call process_images.

# Example usage
if __name__ == "__main__":
    api_url = "http://dataapi.2captcha.com"  # API URL for creating tasks
    project_id = 64  # Replace with your project ID
    images_dir = "C:/images"  # Directory with the images
    api_key = "Your API key"  # Replace with your API key

    # Make sure the directory exists and contains images
    if not os.path.isdir(images_dir):
        print(f"The directory {images_dir} does not exist")
    else:
        image_files = [f for f in os.listdir(images_dir) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
        if not image_files:
            print(f"There are no images to process in the directory {images_dir}")
        else:
            print(f"Found {len(image_files)} images for processing")
            process_images(api_url, project_id, images_dir, api_key)
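
Hardcoding the API key in the file is fine for a quick experiment, but it is easy to leak when the script gets shared. As a hedged alternative, the sketch below reads the same four settings from environment variables. The variable names (`DATA_API_KEY`, `DATA_PROJECT_ID`, `DATA_API_URL`, `DATA_IMAGES_DIR`) and the helper `load_settings` are invented for this example; the defaults are the values used in the script.

```python
import os


def load_settings(env=None):
    """Read the script's settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    missing = [name for name in ("DATA_API_KEY", "DATA_PROJECT_ID") if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {
        # Stray spaces or a trailing slash would end up inside the request URL.
        "api_url": env.get("DATA_API_URL", "http://dataapi.2captcha.com").strip().rstrip("/"),
        "project_id": int(env["DATA_PROJECT_ID"]),
        "images_dir": env.get("DATA_IMAGES_DIR", "C:/images").strip(),
        "api_key": env["DATA_API_KEY"],
    }


# A plain dict stands in for os.environ in this demo.
settings = load_settings({"DATA_API_KEY": "Your API key", "DATA_PROJECT_ID": "64"})
print(settings["api_url"], settings["project_id"], settings["images_dir"])
```

The resulting dictionary can be unpacked straight into the existing function: `process_images(**settings)`.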

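This gives me a Python script that automatically creates tasks on the data labeling platform. It reads images from a directory, encodes them in base64, sends them to the server, and keeps a history of what has been sent. The full script is below.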

import requests
import base64
import os
import json

def create_task(api_url, project_id, image_path, api_key):
    try:
        with open(image_path, "rb") as image_file:
            encoded_image = base64.b64encode(image_file.read()).decode('utf-8')
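            mime_type = "image/png" if image_path.lower().endswith(".png") else "image/jpeg"
            image_data = f"data:{mime_type};base64,{encoded_image}"  # MIME type matches the file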
        task_spec = [
            {
                "image_with_animal": image_data
            }
        ]
        
        payload = {
            "project_id": project_id,
            "task_spec": task_spec  # Array with one object inside 
        } 

        headers = { 
            "Content-Type": "application/json",
            "Authorization": f"{api_key}" 
        } 

        print("Data to be sent:", json.dumps(payload, indent=4))  # Logging data before sending 

        response = requests.post(api_url + "/tasks", json=payload, headers=headers)
         
        if response.status_code == 201: 
            print(f"The task was successfully created for the file {image_path}")
            return True 
        else: 
            print(f"An error occurred when creating a task for the file {image_path}: {response.status_code}, {response.text}")
            return False 
    except Exception as e: 
        print(f"An error occurred when creating a task for the file {image_path}: {str(e)}")
        return False

def validate_project_id(api_url, project_id, api_key):
    headers = {
        "Authorization": f"{api_key}"
    }
    response = requests.get(f"{api_url}/projects/{project_id}", headers=headers)
    return response.status_code == 200

def process_images(api_url, project_id, images_dir, api_key):
    try:
        if not validate_project_id(api_url, project_id, api_key):
            print(f"Incorrect `project_id`: {project_id}")
            return

        sent_images = set()
        # File that stores the names of images already sent
        history_file = "sent_images.json"

        # Load the history of sent images if the file exists
        if os.path.exists(history_file):
            with open(history_file, "r") as file:
                sent_images = set(json.load(file))

        # Go through every file in the directory
        for filename in os.listdir(images_dir):
            image_path = os.path.join(images_dir, filename)
            # Check that the file is an image and has not been sent before
            if os.path.isfile(image_path) and filename.lower().endswith(('.png', '.jpg', '.jpeg')) and filename not in sent_images:
                print(f"File processing: {image_path}")
                if create_task(api_url, project_id, image_path, api_key):
                    sent_images.add(filename)

        # Update the history of sent images
        with open(history_file, "w") as file:
            json.dump(list(sent_images), file)
        print("Process completed successfully.")
    except Exception as e:
        print(f"An error occurred while processing images: {str(e)}")

# Example usage
if __name__ == "__main__":
    api_url = "http://dataapi.2captcha.com"  # API URL for creating tasks
    project_id = 64  # Replace with your project ID
    images_dir = "C:/images"  # Directory with the images
    api_key = "Your API key"  # Replace with your API key

    # Make sure the directory exists and contains images
    if not os.path.isdir(images_dir):
        print(f"The directory {images_dir} does not exist")
    else:
        image_files = [f for f in os.listdir(images_dir) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
        if not image_files:
            print(f"There are no images to process in the directory {images_dir}")
        else: 
            print(f"Found {len(image_files)} images for processing")
            process_images(api_url, project_id, images_dir, api_key)

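Configuring the script

Now, for the script to work, you need to get everything ready:

Create a folder and, inside it, a file with the .py extension; name it whatever you like. I have no imagination, so mine is script.py.

Next, inside that folder, create a subfolder to hold the images. Its path goes on line 84 of the full script (the images_dir variable).

Now we need the API key, the API URL, and the project number: lines 85, 82, and 83 of the full script, respectively (the api_key, api_url, and project_id variables).

All of this information comes from the data labeling service.

The API key and the project number are in your dashboard.

The API URL comes from the service's API documentation; I've already put it in the script for you. If you need something more complex than labeling animals, though, the documentation is worth exploring on your own.

You also need to create the project itself in the data labeling service, so that there is somewhere to send the images. In theory all of this can be done through the API, but even I got scared off and did it by hand. If you feel like it, you can work out how to do it all through the API yourself.

So, click the "Add project" button.

Fill in the "Title", "Description", and "Public description" fields. The difference between the two descriptions: the first is a short description, while the public one is the description of the task. Don't ask me why; that's beyond me.

Choose the language (1) and create two specifications (2 and 3). (2) These are the fields used to send the images. There are only two options, image or text; in this example we need to send images, so choose image.

(3) These are the fields the workers will use, that is, the fields where they write down the answers for us (the animal labels). Since I need an answer to the question of which animal is in the picture, I use input. Besides input, there are also select, radio, and checkbox. All in all, plenty of options.

In the screenshot below you can see that the "Required" checkbox is ticked. It is a safeguard for yourself (a double check) against sending empty tasks: with it ticked, a task cannot be created until the condition is met (in this example, that an image is present).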
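
The "Required" checkbox guards the server side. A similar check can be made on the client before anything is sent; the sketch below is my own addition, not a feature of the service. It skips files that are empty or that do not start with a PNG or JPEG signature, whatever their extension says. The function name `looks_like_image` is hypothetical.

```python
import os
import tempfile

# File signatures: PNG starts with these 8 bytes, JPEG with FF D8 FF.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_image(path):
    """Return True if the file is non-empty and starts with a PNG or JPEG signature."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as file:
        header = file.read(8)
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)


# Demo: an empty file with an image extension versus a file with a PNG header.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_path = os.path.join(demo_dir, "empty.jpg")
    png_path = os.path.join(demo_dir, "tiny.png")
    open(empty_path, "wb").close()
    with open(png_path, "wb") as file:
        file.write(PNG_SIGNATURE + b"rest of the file")
    results = (looks_like_image(empty_path), looks_like_image(png_path))
    print(results)
```

In the script, such a check would sit next to the extension test in process_images, before create_task is called.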

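It is also possible to get the results straight from the server, but I didn't need that. I probably will soon, just not this time.

That's really all. Save the project, copy its number, paste it into the script (line 83), and you're ready to go! Run it from the developer console with the command python script.py.

The tasks then go out to the workers quickly, and once they are solved, the answers appear in your account in the following format

That's it, task solved.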
