[YOLOv3/YOLOv5/YOLOv6/YOLOv7/YOLOv8] One-click automatic dataset splitting with shuffle: training set, validation set (and optional test set), by ratio or by count

zyt_820

Last modified on 2024-01-17 22:00:15

Tags: YOLO, python, object detection

First published on 2024-01-17 19:09:10

Article link: https://blog.csdn.net/zyt_820/article/details/135654076

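This article describes in detail how to use Python to split a dataset automatically into training, validation, and test parts by ratio, to meet the requirements of YOLO models.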

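You are keen to try object detection and have already annotated your dataset with a tool such as labelme or labelimg. Now you need to split the dataset so that it matches the layout YOLO models expect.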

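Your dataset looks like this (each label file must share its image's base name, because the scripts below look up the image name with a .txt extension):

📂 your_dataset
├── 📂 images
│ ├── 📄 image1.jpg
│ ├── 📄 image2.jpg
│ └── ...
├── 📂 labels
│ ├── 📄 image1.txt
│ ├── 📄 image2.txt
│ └── ...
└── ...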

You want to split it randomly into this (the file names in the tree are placeholders; the scripts below keep each file's original name):

📂 dataset
├── 📂 images
│ ├── 📂 train
│ │ ├── 📄 train_image1.jpg
│ │ ├── 📄 train_image2.jpg
│ │ └── ...
│ ├── 📂 val
│ │ ├── 📄 val_image1.jpg
│ │ ├── 📄 val_image2.jpg
│ │ └── ...
│ └── 📂 test
│   ├── 📄 test_image1.jpg
│   ├── 📄 test_image2.jpg
│   └── ...
└── 📂 labels
  ├── 📂 train
  │ ├── 📄 train_label1.txt
  │ ├── 📄 train_label2.txt
  │ └── ...
  ├── 📂 val
  │ ├── 📄 val_label1.txt
  │ ├── 📄 val_label2.txt
  │ └── ...
  └── 📂 test
    ├── 📄 test_label1.txt
    ├── 📄 test_label2.txt
    └── ...

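If so, give the Python code below a try; it performs the split automatically. My classmates think it is of little use, but I find it very convenient!

"/path/to/your/images": the path in your dataset where the images are stored

"/path/to/your/labels": the path in your dataset where the labels are stored

"/path/to/your/output/folder": the output folder, named however you like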

① Random 6:2:2 split: training 6, validation 2, test 2

Replace "/path/to/your/images", "/path/to/your/labels", and "/path/to/your/output/folder" with your actual folder paths.
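Every script in this post looks up each label by the image's base name, so a single missing label file stops the copy partway through with a FileNotFoundError. A minimal sketch of a pre-check is shown here; the helper name `find_unlabeled_images` and its `extensions` parameter are illustrative assumptions, not part of the original scripts.

```python
import os

def find_unlabeled_images(image_folder, label_folder, extensions=('.jpg',)):
    """Return sorted image file names that have no same-named .txt label file.

    Hypothetical helper for illustration; extensions are compared in lower case.
    """
    missing = []
    for name in sorted(os.listdir(image_folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in extensions:
            continue  # not an image the split scripts would pick up
        if not os.path.isfile(os.path.join(label_folder, stem + '.txt')):
            missing.append(name)
    return missing
```

Calling `find_unlabeled_images("/path/to/your/images", "/path/to/your/labels")` and getting an empty list back suggests the split can run without hitting a missing label.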

import os
import shutil
import random

def split_dataset(input_image_folder, input_annotation_folder, output_folder, validation_ratio=0.2, test_ratio=0.2):
    # Collect all image files (only .jpg / .JPG files are picked up)
    images = [file for file in os.listdir(input_image_folder) if file.endswith(('.jpg', '.JPG'))]

    # Shuffle the file list in place, then work out how many images each split gets
    random.shuffle(images)
    total_images = len(images)
    validation_count = int(validation_ratio * total_images)
    test_count = int(test_ratio * total_images)

    # Validation and test take their shares first; training gets everything left over
    val_images = images[:validation_count]
    test_images = images[validation_count: validation_count + test_count]
    train_images = images[validation_count + test_count:]

    # Create the output folders, then copy (not move) each image and its label
    # file into the folder of the split it belongs to
    for split_name, split_images in (('train', train_images), ('val', val_images), ('test', test_images)):
        image_dir = os.path.join(output_folder, 'images', split_name)
        label_dir = os.path.join(output_folder, 'labels', split_name)
        os.makedirs(image_dir, exist_ok=True)
        os.makedirs(label_dir, exist_ok=True)

        for img in split_images:
            # The label file shares the image's base name, with a .txt extension
            label = os.path.splitext(img)[0] + '.txt'
            shutil.copy(os.path.join(input_image_folder, img), os.path.join(image_dir, img))
            shutil.copy(os.path.join(input_annotation_folder, label), os.path.join(label_dir, label))

# Folder that contains the original images
input_image_folder = "/path/to/your/images"

# Folder that contains the original label files
input_annotation_folder = "/path/to/your/labels"

# Folder in which the split dataset will be stored
output_dataset_folder = "/path/to/your/output/folder"

# Validation ratio (default 0.2, i.e. 20% of the data is used for validation)
validation_ratio = 0.2

# Test ratio (default 0.2, i.e. 20% of the data is used for testing)
test_ratio = 0.2

# Run the split
split_dataset(input_image_folder, input_annotation_folder, output_dataset_folder, validation_ratio, test_ratio)

To change the ratio, for example to 8:1:1, just change these two parameters: validation_ratio = 0.1, test_ratio = 0.1
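To see what a given pair of ratios produces before copying any files, the count arithmetic from the script can be pulled out on its own. This is a small sketch; `split_counts` is a hypothetical helper name, and it mirrors the script's use of `int()`, which truncates, so any rounding remainder ends up in the training set.

```python
def split_counts(total_images, validation_ratio=0.2, test_ratio=0.2):
    """Return (train, val, test) counts using the same arithmetic as the script."""
    validation_count = int(validation_ratio * total_images)  # truncates toward zero
    test_count = int(test_ratio * total_images)
    train_count = total_images - validation_count - test_count  # the remainder
    return train_count, validation_count, test_count

print(split_counts(1000, 0.2, 0.2))  # (600, 200, 200)
print(split_counts(1000, 0.1, 0.1))  # (800, 100, 100)
print(split_counts(7, 0.2, 0.2))     # (5, 1, 1)
```

The last line shows the truncation effect on a tiny dataset: 20% of 7 is 1.4, so validation and test get one image each and training gets the other five.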

② Random 8:2 split: training 8, validation 2

Replace "/path/to/your/images", "/path/to/your/labels", and "/path/to/your/output/folder" with your actual folder paths.

import os
import shutil
import random

def split_dataset(input_image_folder, input_annotation_folder, output_folder, validation_ratio=0.2):
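    # Collect all image files (only .jpg / .JPG files are picked up)
    images = [file for file in os.listdir(input_image_folder) if file.endswith(('.jpg', '.JPG'))]

    # Shuffle the file list in place, then work out the validation count
    random.shuffle(images)
    total_images = len(images)
    validation_count = int(validation_ratio * total_images)

    # Validation takes its share first; training gets everything left over
    val_images = images[:validation_count]
    train_images = images[validation_count:]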

    # Create the output folders, then copy (not move) each image and its label
    # file into the folder of the split it belongs to
    for split_name, split_images in (('train', train_images), ('val', val_images)):
        image_dir = os.path.join(output_folder, 'images', split_name)
        label_dir = os.path.join(output_folder, 'labels', split_name)
        os.makedirs(image_dir, exist_ok=True)
        os.makedirs(label_dir, exist_ok=True)

        for img in split_images:
            # The label file shares the image's base name, with a .txt extension
            label = os.path.splitext(img)[0] + '.txt'
            shutil.copy(os.path.join(input_image_folder, img), os.path.join(image_dir, img))
            shutil.copy(os.path.join(input_annotation_folder, label), os.path.join(label_dir, label))

# Folder that contains the original images
input_image_folder = "/path/to/your/images"

# Folder that contains the original label files
input_annotation_folder = "/path/to/your/labels"

# Folder in which the split dataset will be stored
output_dataset_folder = "/path/to/your/output/folder"

# Validation ratio (default 0.2, i.e. 20% of the data is used for validation)
validation_ratio = 0.2

# Run the split
split_dataset(input_image_folder, input_annotation_folder, output_dataset_folder, validation_ratio)
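The scripts call `random.shuffle` without a seed, so each run produces a different split, and `os.listdir` returns names in arbitrary order. If you want the same split every time, one option is to sort the names and shuffle with a seeded generator. This is only a sketch: `shuffled_copy` and the default seed of 42 are illustrative choices, not part of the original scripts.

```python
import random

def shuffled_copy(file_names, seed=42):
    """Return a shuffled copy of file_names that is the same on every run."""
    ordered = sorted(file_names)          # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ordered)  # private generator; global random state is untouched
    return ordered
```

In a script, replacing `random.shuffle(images)` with `images = shuffled_copy(images)` would be enough to make the split repeatable for a fixed seed.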

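③ Randomly draw a fixed number of images for the training and validation sets; everything left over becomes the test set

This example uses 600 images for training and 200 for validation, with all remaining images going into the test set. Remember to replace "/path/to/your/images", "/path/to/your/labels", and "/path/to/your/output/folder" with your actual folder paths.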

import os
import shutil
import random

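def split_dataset(input_image_folder, input_annotation_folder, output_folder, train_size=600, validation_size=200):
    # Collect all image files (only .jpg / .JPG files are picked up)
    images = [file for file in os.listdir(input_image_folder) if file.endswith(('.jpg', '.JPG'))]

    # Shuffle the file list in place
    random.shuffle(images)

    # Make sure the training and validation sizes never exceed the number of images
    # (max(0, ...) keeps train_size from going negative when validation_size alone is too large)
    train_size = max(0, min(train_size, len(images) - validation_size))
    validation_count = min(validation_size, len(images) - train_size)

    # Slice the shuffled list: training first, then validation, the rest is test
    train_images = images[:train_size]
    validation_images = images[train_size: train_size + validation_count]
    test_images = images[train_size + validation_count:]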

    # Create the output folders, then copy (not move) each image and its label
    # file into the folder of the split it belongs to
    for split_name, split_images in (('train', train_images), ('val', validation_images), ('test', test_images)):
        image_dir = os.path.join(output_folder, 'images', split_name)
        label_dir = os.path.join(output_folder, 'labels', split_name)
        os.makedirs(image_dir, exist_ok=True)
        os.makedirs(label_dir, exist_ok=True)

        for img in split_images:
            # The label file shares the image's base name, with a .txt extension
            label = os.path.splitext(img)[0] + '.txt'
            shutil.copy(os.path.join(input_image_folder, img), os.path.join(image_dir, img))
            shutil.copy(os.path.join(input_annotation_folder, label), os.path.join(label_dir, label))

# Folder that contains the original images
input_image_folder = "/path/to/your/images"

# Folder that contains the original label files
input_annotation_folder = "/path/to/your/labels"

# Folder in which the split dataset will be stored
output_dataset_folder = "/path/to/your/output/folder"

# Number of images in the training set and in the validation set
train_size = 600
validation_size = 200

# Run the split
split_dataset(input_image_folder, input_annotation_folder, output_dataset_folder, train_size, validation_size)
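After any of the three scripts has run, a quick sanity check is to count the files in each output folder and confirm that images and labels match per split. The sketch below assumes the images/ and labels/ layout shown at the top of the post; `count_split_files` is a hypothetical helper name, and a missing folder (for example test/ after the 8:2 script) is simply counted as zero.

```python
import os

def count_split_files(output_folder, splits=('train', 'val', 'test')):
    """Return {split: (image_count, label_count)} for the split folders."""
    counts = {}
    for split in splits:
        image_dir = os.path.join(output_folder, 'images', split)
        label_dir = os.path.join(output_folder, 'labels', split)
        # A folder that does not exist is treated as empty
        n_images = len(os.listdir(image_dir)) if os.path.isdir(image_dir) else 0
        n_labels = len(os.listdir(label_dir)) if os.path.isdir(label_dir) else 0
        counts[split] = (n_images, n_labels)
    return counts
```

For the 600/200 example, `count_split_files("/path/to/your/output/folder")` should report 600 images and 600 labels for train and 200 of each for val, provided the source folder held at least 800 images and the output folder was empty before the run.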