A Deep Learning Dataset Curation Tool (Evenly Picking a Fraction of Files from a Huge Training Set)

SimonLiu009

Last modified: 2022-05-21 19:15:23

First published: 2022-05-21 19:12:15

Article link: https://blog.csdn.net/toopoo/article/details/124901426

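In theory, the bigger a deep learning training set, the better. For quick validation and testing, though, we often do not need a huge amount of training data (unlabeled data in particular), so picking an evenly distributed subset becomes a practical need.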

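So I reinvented a wheel to solve this (with a progress bar, too). Note that "even" here is defined over the sorted file names. For example, for images extracted from videos with the code from another blog post of mine, picking at a fixed interval after sorting is roughly even, so the frames of every video get represented.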
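
The selection rule described above can be sketched in a few lines. This is only a minimal illustration, not the script itself: `pick_every_nth` and the sample file names are hypothetical, and the sketch assumes that sorting the names keeps the frames of each video next to each other.

```python
import math


def pick_every_nth(names, n):
    """Return every n-th name after sorting (hypothetical helper for illustration)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return sorted(names)[::n]


# Hypothetical frame names from two videos, zero-padded so they sort in frame order.
frames = [f"video{v}_{i:03d}.jpg" for v in (1, 2) for i in range(6)]
picked = pick_every_nth(frames, 4)
print(picked)
# Picking indices 0, n, 2n, ... yields ceil(len(frames) / n) files.
print(len(picked) == math.ceil(len(frames) / 4))
```

Because the slice walks the whole sorted list at a fixed step, both videos contribute frames to the result, which is the sense of "even" used here.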

Usage example:

# Pick one out of every 9 files from the "~/Movies/cat_video_frames" folder; by default the destination folder is "<source folder>_picked"
python3 pick_files.py --path="~/Movies/cat_video_frames" --n=9

Result:
[Image]

Here is the code:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
###
# File: /Users/simonliu/Documents/python/pick_files/pick_files.py
# Project: /Users/simonliu/Documents/python/pick_files
# Created Date: 2022-05-20 20:21:14
# Author: Simon Liu
# -----
# Last Modified: 2022-05-21 19:03:44
# Modified By: Simon Liu
# -----
# Copyright (c) 2022 SimonLiu Inc.
# 
# May the force be with you.
# -----
# HISTORY:
# Date      	By	Comments
# ----------	---	----------------------------------------------------------
###
import argparse
from pathlib import Path
import shutil
import time


def get_total_file_count(directory):
    """Count all regular files under `directory`, including subfolders."""
    return sum(1 for f in directory.glob('**/*') if f.is_file())

def main():
    parser = argparse.ArgumentParser(description="Pick every n-th file from a folder.")
    parser.add_argument("--path", type=str, default="./", help="source folder")
    parser.add_argument("--n", type=int, default=2, help="pick one file out of every n")
    args = parser.parse_args()
    n = args.n
    if n < 1:
        print("--n must be a positive integer.")
        return
    # expanduser() handles "~"; resolve() gives "./" a real folder name for the destination.
    src_path = Path(args.path).expanduser().resolve()
    if not src_path.is_dir():
        print(f"Folder {args.path} does not exist. Please check the folder name.")
        return
    dst_path = src_path.parent / f"{src_path.name}_picked"
    dst_path.mkdir(parents=True, exist_ok=True)
    print(f"Source folder: {src_path}\nDestination folder: {dst_path}\nPick interval: {n}")
    
    src_count = get_total_file_count(src_path)
    # Picking indices 0, n, 2n, ... yields ceil(src_count / n) files.
    dst_count = (src_count + n - 1) // n
    if dst_count == 0:
        print("No files found in the source folder; nothing to copy.")
        return
    # Index of the current file while walking the source files
    src_index = 0
    # Number of files copied so far
    picked_count = 0
    # Width of the progress bar in characters
    bar_width = 60
    print(f"About to pick {dst_count} of {src_count} files at even intervals.")
    src_file_list = sorted(src_path.glob("**/*"), key=lambda f: f.stem)

    start = time.perf_counter()
    for f in src_file_list:
        if f.is_file():
            if src_index % n == 0:
                # Note: files with the same name in different subfolders overwrite each other.
                shutil.copy(f, dst_path / f.name)
                picked_count += 1
                # Scale the copied count to the bar width
                filled = bar_width * picked_count // dst_count
                finish = "▓" * filled
                rest = "-" * (bar_width - filled)
                progress = picked_count / dst_count * 100
                dur = time.perf_counter() - start
                print("\r{:>3.0f}% {}{} {:.2f}s".format(progress, finish, rest, dur), end="")
            src_index += 1
    # End the progress bar line
    print()
 
if __name__ == "__main__":
    main()
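
One caveat with sorting by file name: plain string order places `frame_10` before `frame_2`, so frames whose numbers are not zero-padded would not be visited in frame order. A possible workaround, sketched here under the assumption that the names contain embedded integers, is a natural-sort key. `natural_key` is a hypothetical helper and not part of the script above.

```python
import re
from pathlib import Path


def natural_key(path):
    """Split the stem into text and integer chunks so that 2 sorts before 10."""
    parts = re.split(r"(\d+)", Path(path).stem)
    # With a capturing group, odd positions hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


names = ["frame_10.jpg", "frame_2.jpg", "frame_1.jpg"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric order
```

If the frame names are not zero-padded, a key like this could be passed as the `key` argument of the `sorted(...)` call in the script.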