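In theory, the bigger a deep-learning training set the better. But sometimes, for quick validation and testing, we don't need a huge amount of training data (especially unlabeled data), so picking an evenly distributed subset becomes a practical need.
So I built a little wheel of my own to solve this (with a progress bar, too). Note that this "even" picking is based on file-name order. For example, for images extracted from videos with the code from this blog post of mine, sorting them and then picking at a fixed interval gives a roughly uniform sample, so the frames of every video are represented.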
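The core idea can be sketched in a few lines: sort the names, then take every n-th one. This is only a minimal illustration, not the script itself. The helper name `pick_every_nth` and the frame names are made up for the example, and it assumes the frame numbers are zero-padded so that lexicographic order matches frame order.

```python
def pick_every_nth(names, n):
    """Return every n-th name from the sorted list, starting with the first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return sorted(names)[::n]


# Hypothetical frame names from two videos (zero-padded on purpose).
frames = [f"video{v}_frame{i:03d}.jpg" for v in (1, 2) for i in range(6)]
picked = pick_every_nth(frames, 4)
print(picked)
```

With 12 names and an interval of 4, indices 0, 4 and 8 are picked, so both videos contribute at least one frame. If the frame numbers are not zero-padded (`frame10` sorts before `frame2`), the sample is less even.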
Usage example:
# Pick one out of every 9 images in the "~/Movies/cat_video_frames" folder; by default the destination folder is "<source folder>_picked"
python3 pick_files.py --path="~/Movies/cat_video_frames" --n=9
Result:
Here is the code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
###
# File: /Users/simonliu/Documents/python/pick_files/pick_files.py
# Project: /Users/simonliu/Documents/python/pick_files
# Created Date: 2022-05-20 20:21:14
# Author: Simon Liu
# -----
# Last Modified: 2022-05-21 19:03:44
# Modified By: Simon Liu
# -----
# Copyright (c) 2022 SimonLiu Inc.
#
# May the force be with you.
# -----
# HISTORY:
# Date By Comments
# ---------- --- ----------------------------------------------------------
###
import argparse
import math
import shutil
import time
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(
        description="Pick every n-th file from a folder (sorted by file name).")
    parser.add_argument("--path", type=str, default="./", help="source folder")
    parser.add_argument("--n", type=int, default=2, help="pick one file out of every n")
    args = parser.parse_args()

    # expanduser() handles a quoted "~"; resolve() gives "./" a real folder name.
    src_path = Path(args.path).expanduser().resolve()
    n = args.n
    if not src_path.is_dir():
        print(f"Folder {args.path} does not exist; please check the folder name.")
        return
    if n < 1:
        print("--n must be a positive integer.")
        return

    # Default destination: a sibling folder named "<source folder>_picked".
    dst_path = src_path.parent / f"{src_path.name}_picked"
    dst_path.mkdir(parents=True, exist_ok=True)
    print(f"Source folder: {src_path}\nDestination folder: {dst_path}\nPick interval: {n}")

    # Sort files by name (stem) so that picking every n-th one is roughly uniform.
    src_file_list = sorted(
        (f for f in src_path.glob("**/*") if f.is_file()), key=lambda f: f.stem)
    src_count = len(src_file_list)
    # Indices 0, n, 2n, ... are picked, so the count is ceil(src_count / n).
    dst_count = math.ceil(src_count / n)
    if dst_count == 0:
        print("No files found; nothing to copy.")
        return
    print(f"About to pick {dst_count} files evenly from {src_count} files.")

    # Progress bar length in characters
    bar_width = 60
    start = time.perf_counter()
    for picked_count, f in enumerate(src_file_list[::n], start=1):
        # Files with the same name in different subfolders overwrite each other here.
        shutil.copy(f, dst_path / f.name)
        filled = picked_count * bar_width // dst_count
        finish = "▓" * filled
        rest = "-" * (bar_width - filled)
        progress = picked_count / dst_count * 100
        dur = time.perf_counter() - start
        print("\r{:>3.0f}% {}{} {:.2f}s".format(progress, finish, rest, dur), end="")
    # Finish the progress line with a newline
    print()


if __name__ == "__main__":
    main()
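The progress bar can also be pulled out into a small pure function, which makes it easy to check on its own. The following is just a sketch of that idea: `render_progress` and its parameters are names I made up for this example, and it uses integer arithmetic so the bar always has exactly `width` characters.

```python
def render_progress(done, total, width=60):
    """Return a progress line such as ' 50% ▓▓▓▓▓-----' (no trailing newline)."""
    if total <= 0:
        # Nothing to do: show a full bar rather than dividing by zero.
        return "100% " + "▓" * width
    done = min(max(done, 0), total)   # clamp into [0, total]
    filled = done * width // total    # number of filled cells
    percent = done * 100 // total
    return f"{percent:>3d}% " + "▓" * filled + "-" * (width - filled)


# Redraw the same terminal line by starting each update with "\r".
total = 20
for i in range(total + 1):
    print("\r" + render_progress(i, total, width=30), end="", flush=True)
print()
```

Keeping the rendering separate from the copy loop means the loop only has to count copied files, and the edge cases (zero files, the last file) are handled in one place.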