YOLO Dataset Splitting Tutorial: How to Split Training, Validation, and Test Sets

小白熊XBX

Last modified: 2024-12-01 21:53:01

Tags: YOLO, sklearn, machine learning, artificial intelligence, python

First published: 2024-10-04 16:01:05

Article link: https://blog.csdn.net/m0_59197405/article/details/142703955

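About the author

Author: 小白熊

Author bio: proficient in Python, MATLAB, and C#; works on machine learning, deep learning, machine vision, object detection, image classification, pose recognition, semantic segmentation, path planning, intelligent optimization algorithms, data analysis, and combinations of these techniques.

Contact email: xbx3144@163.com

For research tutoring, paid Q&A, custom work, or other collaboration, please contact the author.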

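Foreword

In object detection, YOLO is a very popular family of detection models. When training a YOLO model, the dataset is usually split into a training set, a validation set, and a test set so that the model's performance can be evaluated on data it was not trained on. This article shows how to do the split in Python and copy the image and label files into separate folders according to the chosen ratios.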

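Steps

Suppose we have an annotated YOLO dataset made up of image files (for example, jpg) and the matching label files (txt). We want to split these files into a training set (train), a validation set (val), and a test set (test) according to given ratios. The steps are as follows: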

Step 1: Import the dependencies

import os
import shutil
from sklearn.model_selection import train_test_split

os: works with file paths and directories.
shutil: copies files.
train_test_split: splits the dataset by randomly assigning items to different subsets.

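Step 2: Set the parameters

Before splitting the dataset, we need to define a few parameters:

val_size: the proportion of the validation set;
test_size: the proportion of the test set;
postfix: the image file extension (for example, jpg);
imgpath: the directory containing the image files;
txtpath: the directory containing the label files;
new_imgpath: where the image files are saved after the split;
new_txtpath: where the label files are saved after the split.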

val_size = 0.1
test_size = 0.2
_val_size = val_size / (1 - test_size) # validation share of what remains after the test set is removed

postfix = 'jpg'
imgpath = './data1/images'
txtpath = './data1/labels'

new_imgpath = './data/images'
new_txtpath = './data/labels'

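In this example the validation set is 10% of the data and the test set is 20%, which leaves 70% for training. Because the validation set is taken from the 80% that remains after the test set is removed, the second split uses _val_size = 0.1 / 0.8 = 0.125.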
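The arithmetic behind `_val_size` can be checked directly. The following is a small sketch; the helper name `split_fractions` is hypothetical and exists only for this illustration.

```python
def split_fractions(val_size, test_size):
    """Return the overall share of each subset after a two-stage split."""
    remaining = 1 - test_size           # share left once the test set is removed
    inner_val = val_size / remaining    # validation share *within* that remainder
    return {
        "train": remaining * (1 - inner_val),
        "val": remaining * inner_val,
        "test": test_size,
    }


fractions = split_fractions(val_size=0.1, test_size=0.2)
for name, share in fractions.items():
    print(f"{name}: {share:.3f}")
```

Passing `val_size` unchanged to the second split would instead give a validation set of 0.1 × 0.8 = 8% of the whole dataset, which is why the rescaling is needed.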

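Step 3: Create the target directories

So that the split files can be stored correctly, we create the target directories in advance. Separate folders are created for the training, validation, and test sets.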

os.makedirs(os.path.join(new_imgpath, 'train'), exist_ok=True)
os.makedirs(os.path.join(new_imgpath, 'val'), exist_ok=True)
os.makedirs(os.path.join(new_imgpath, 'test'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'train'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'val'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'test'), exist_ok=True)

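exist_ok=True ensures that no error is raised if a folder already exists. Note that files left in those folders by an earlier run are not removed.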
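The six `os.makedirs` calls can also be written as a loop. This is a minimal sketch, not the article's code: the helper name `make_split_dirs` is hypothetical, and the demo uses a temporary directory in place of `new_imgpath` and `new_txtpath`.

```python
import os
import tempfile


def make_split_dirs(roots, splits=("train", "val", "test")):
    """Create <root>/<split> for every combination and return the paths."""
    created = []
    for root in roots:
        for split in splits:
            path = os.path.join(root, split)
            os.makedirs(path, exist_ok=True)  # no error if it already exists
            created.append(path)
    return created


# Demo in a throwaway directory; in the tutorial the roots would be
# new_imgpath and new_txtpath.
with tempfile.TemporaryDirectory() as tmp:
    roots = [os.path.join(tmp, "images"), os.path.join(tmp, "labels")]
    for path in make_split_dirs(roots):
        print(os.path.relpath(path, tmp))
```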

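Step 4: Split the dataset

We list all txt files in the label directory and split them, together with their corresponding images, into training, validation, and test sets. The split is done with train_test_split in two stages: the test set is separated first, then the validation set is taken from what remains.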

listdir = sorted(i for i in os.listdir(txtpath) if i.endswith('.txt'))  # sorted: os.listdir order is arbitrary
train, test = train_test_split(listdir, test_size=test_size, shuffle=True, random_state=0)
train, val = train_test_split(train, test_size=_val_size, shuffle=True, random_state=0)

print(f'train set size:{len(train)} val set size:{len(val)} test set size:{len(test)}')

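train_test_split: randomly splits the dataset according to the given proportion.
shuffle=True: shuffles the data before splitting so that the split is random.
random_state=0: fixes the random seed so that the same input list yields the same split on every run, which helps with debugging and reproducibility.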
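To make the idea of a seeded shuffle-and-split concrete without depending on scikit-learn, here is a sketch that uses only the standard library. It is a swapped-in technique: it does not reproduce `train_test_split`'s shuffling or rounding, so it will not produce the same subsets as the code above. The function name `simple_split` and the generated file names are made up for the example.

```python
import random


def simple_split(items, val_size=0.1, test_size=0.2, seed=0):
    """Shuffle with a fixed seed, then slice into train / val / test."""
    items = sorted(items)            # fixed starting order, so the seed is meaningful
    rng = random.Random(seed)        # private RNG; does not touch global state
    rng.shuffle(items)
    n_test = round(len(items) * test_size)
    n_val = round(len(items) * val_size)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


names = [f"img_{k:03d}.txt" for k in range(100)]  # hypothetical label file names
train, val, test = simple_split(names)
print(len(train), len(val), len(test))
```

Sorting before shuffling matters for reproducibility: a fixed seed only gives the same result when the input arrives in the same order.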

Step 5: Copy the files

Next, based on the split, we copy each image and label file into the corresponding folder. The training set comes first:

for i in train:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'train/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'train/{}'.format(i)))
    except Exception as e:
        print(e)

shutil.copy: copies a file from one directory to another.
i[:-4]: strips the extension (.txt) from the label file name to get the name of the matching image file.
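Slicing off the last four characters works here because every label name ends in `.txt`. As an alternative sketch, `os.path.splitext` does the same job without assuming the extension length; the file names below are invented for the example.

```python
import os

label_name = "img_001.txt"                           # hypothetical label file
stem_by_slice = label_name[:-4]                      # relies on a 4-character ".txt"
stem_by_splitext = os.path.splitext(label_name)[0]   # splits at the last dot
image_name = f"{stem_by_splitext}.jpg"

# With a longer extension the two approaches differ:
other = "photo.jpeg"
sliced = other[:-4]                     # leaves a trailing dot
split_off = os.path.splitext(other)[0]  # clean stem

print(stem_by_slice, stem_by_splitext, image_name)
print(repr(sliced), repr(split_off))
```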

The validation and test set files are copied in the same way:

for i in val:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'val/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'val/{}'.format(i)))
    except Exception as e:
        print(e)

for i in test:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'test/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'test/{}'.format(i)))
    except Exception as e:
        print(e)
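Since the three loops differ only in the subset name, they could be folded into one function. The following is a sketch under stated assumptions rather than the article's code: `copy_split` is a hypothetical helper, it expects the target folders from Step 3 to exist already, and it returns the labels whose image is missing instead of printing the exception.

```python
import os
import shutil


def copy_split(names, split, imgpath, txtpath, new_imgpath, new_txtpath, postfix="jpg"):
    """Copy each label in `names` and its image into the `split` sub-folders.

    Returns the label names that were skipped because the image was not found.
    """
    missing = []
    for name in names:
        stem = os.path.splitext(name)[0]
        img_src = os.path.join(imgpath, f"{stem}.{postfix}")
        if not os.path.isfile(img_src):
            missing.append(name)      # skip the pair, as the try/except version does
            continue
        shutil.copy(img_src, os.path.join(new_imgpath, split, f"{stem}.{postfix}"))
        shutil.copy(os.path.join(txtpath, name), os.path.join(new_txtpath, split, name))
    return missing


# Possible usage with the variables from the earlier steps:
# for split, names in (("train", train), ("val", val), ("test", test)):
#     skipped = copy_split(names, split, imgpath, txtpath, new_imgpath, new_txtpath, postfix)
#     print(split, "skipped:", skipped)
```

Returning the skipped names makes it easier to notice labels without images, which are otherwise easy to miss among printed error messages.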

Complete code

import os, shutil
from sklearn.model_selection import train_test_split

val_size = 0.1
test_size = 0.2
_val_size = val_size / (1 - test_size) # validation share of what remains after the test set is removed

postfix = 'jpg'
imgpath = './data1/images'
txtpath = './data1/labels'

new_imgpath = './data/images'
new_txtpath = './data/labels'

os.makedirs(os.path.join(new_imgpath, 'train'), exist_ok=True)
os.makedirs(os.path.join(new_imgpath, 'val'), exist_ok=True)
os.makedirs(os.path.join(new_imgpath, 'test'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'train'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'val'), exist_ok=True)
os.makedirs(os.path.join(new_txtpath, 'test'), exist_ok=True)

listdir = sorted(i for i in os.listdir(txtpath) if i.endswith('.txt'))  # sorted: os.listdir order is arbitrary
train, test = train_test_split(listdir, test_size=test_size, shuffle=True, random_state=0)
train, val = train_test_split(train, test_size=_val_size, shuffle=True, random_state=0)
print(f'train set size:{len(train)} val set size:{len(val)} test set size:{len(test)}')

for i in train:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'train/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'train/{}'.format(i)))
    except Exception as e:
        print(e)

for i in val:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'val/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'val/{}'.format(i)))
    except Exception as e:
        print(e)

for i in test:
    try:
        shutil.copy('{}/{}.{}'.format(imgpath, i[:-4], postfix),
                    os.path.join(new_imgpath, 'test/{}.{}'.format(i[:-4], postfix)))
        shutil.copy('{}/{}'.format(txtpath, i), os.path.join(new_txtpath, 'test/{}'.format(i)))
    except Exception as e:
        print(e)
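After running the script, it can be worth checking the output folders, especially if the script was run more than once with different ratios or seeds, because files from an earlier run stay in place. The sketch below is an optional check and not part of the original code; `check_split` is a hypothetical helper that counts files per subset, lists images or labels without a partner, and reports file stems that appear in more than one subset.

```python
import os


def check_split(new_imgpath, new_txtpath, splits=("train", "val", "test")):
    """Return (per-split report, stems found in more than one split)."""
    report = {}
    first_seen = {}
    leaked = set()
    for split in splits:
        img_stems = {os.path.splitext(f)[0]
                     for f in os.listdir(os.path.join(new_imgpath, split))}
        txt_stems = {os.path.splitext(f)[0]
                     for f in os.listdir(os.path.join(new_txtpath, split))}
        report[split] = {
            "images": len(img_stems),
            "labels": len(txt_stems),
            "unpaired": sorted(img_stems ^ txt_stems),  # image or label without a partner
        }
        for stem in img_stems | txt_stems:
            if stem in first_seen:
                leaked.add(stem)          # same sample present in two subsets
            else:
                first_seen[stem] = split
    return report, sorted(leaked)


# Possible usage with the paths from the script above:
# report, leaked = check_split('./data/images', './data/labels')
# print(report)
# print('samples in more than one subset:', leaked)
```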