[Colab] How to upload very large or very numerous files to Colab and read them

Latest recommended article published on 2024-06-02 09:35:07

K_steven

Category: OWOD study notes. Tags: python, remote work, deep learning, database

Article link: https://blog.csdn.net/stevenZXZ/article/details/132328113

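0. Background

While doing deep learning on Colab I needed a very large dataset: 150,000+ annotation files (100+ MB in total) and 160,000+ image files (24 GB). Uploading them the ordinary way led to missing files, slow transfers, and interrupted reads. After trying several approaches I settled on the method below. It currently solves part of the problem; suggestions for improvement are welcome.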

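1. Split the files into multiple directories

Split the 150,000+ files into 155 directories of 1,000 files each; if fewer than 1,000 files remain at the end, they go into the 155th directory.
Code: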

# -*- coding: utf-8 -*-
import os
import shutil

# Source directory containing the 150,000+ annotation files
# (raw strings keep the Windows backslashes from being read as escapes)
source_directory = r'D:\Download\Annotations\Annotations'

# Destination directory where the subfolders will be created
destination_directory = r'D:\Download\Annotations\Annotations_split'

# Number of files per subfolder
group_size = 1000

# Get the filenames in the source directory and sort them
# (assuming they are named in a way that can be sorted)
sorted_files = sorted(os.listdir(source_directory))

# Split the sorted filenames into groups of 1000; the last group may be smaller
file_groups = [sorted_files[i:i + group_size] for i in range(0, len(sorted_files), group_size)]

# Copy each group into its own subfolder: Annotations_1, Annotations_2, ...
# Subfolders are created on demand, so the count is not hard-coded to 155
for i, group in enumerate(file_groups, start=1):
    subfolder_path = os.path.join(destination_directory, f'Annotations_{i}')
    os.makedirs(subfolder_path, exist_ok=True)
    for filename in group:
        source_path = os.path.join(source_directory, filename)
        destination_path = os.path.join(subfolder_path, filename)
        shutil.copy(source_path, destination_path)

print(f"Copied {len(sorted_files)} files into {len(file_groups)} subfolders.")
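Since missing files were one of the original problems, it may be worth checking the split before compressing it. The following is only a sketch: `summarize_split` is a helper name I made up, and the demo builds a tiny throwaway directory tree instead of touching the real dataset.

```python
import os
import tempfile


def summarize_split(split_root):
    """Return (number_of_subfolders, total_files, largest_subfolder_size)."""
    counts = []
    for name in sorted(os.listdir(split_root)):
        subfolder = os.path.join(split_root, name)
        if os.path.isdir(subfolder):
            counts.append(len(os.listdir(subfolder)))
    return len(counts), sum(counts), max(counts, default=0)


# Self-contained demo: three subfolders holding 2, 2 and 1 files.
with tempfile.TemporaryDirectory() as demo_root:
    for folder_index, file_count in enumerate([2, 2, 1], start=1):
        subfolder = os.path.join(demo_root, f"Annotations_{folder_index}")
        os.makedirs(subfolder)
        for file_index in range(file_count):
            open(os.path.join(subfolder, f"{folder_index}_{file_index}.xml"), "w").close()
    demo_summary = summarize_split(demo_root)
    print(demo_summary)
```

On the real data, the total should equal the number of files in the source directory and the largest subfolder should hold no more than 1,000 files.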

2. Compress locally, then upload to Colab

I used the zip format here; later, in Colab, unzip is all that is needed to extract it.
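The compression itself can also be scripted rather than done by hand. This is a minimal sketch using the standard library's `shutil.make_archive`; the helper name `zip_directory` and the demo paths are my own placeholders, not something from the original workflow.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(directory, archive_base):
    """Zip `directory` into archive_base + '.zip', keeping the folder name as the top-level entry."""
    parent, folder_name = os.path.split(os.path.abspath(directory))
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder_name)


# Self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as work_dir:
    demo_dir = os.path.join(work_dir, "Annotations_split")
    os.makedirs(os.path.join(demo_dir, "Annotations_1"))
    with open(os.path.join(demo_dir, "Annotations_1", "000001.xml"), "w") as f:
        f.write("<annotation/>")
    # The archive is written next to the folder, not inside it.
    archive_path = zip_directory(demo_dir, os.path.join(work_dir, "annotations"))
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()
    print(archived_names)
```

Keeping the folder name as the top-level entry means the archive extracts into a single directory instead of scattering files across the working directory.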
[Note ⚠] When uploading the archive to Colab, upload it to the root directory first (under /content/, not under drive/), because extracting under /content/ is much faster.

If uploading this way fails because the file is too large, try uploading it through Google Drive instead.
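If the archive does go through Google Drive, one way to follow the note above is to copy it from the mounted Drive to /content/ before extracting. This is a sketch under assumptions: `copy_archive_to_local` is a hypothetical helper, the Drive path in the comment is only an example, and the mount call is shown as a comment because it works only inside a Colab notebook.

```python
import os
import shutil


def copy_archive_to_local(archive_on_drive, local_dir="/content"):
    """Copy an archive from the mounted Drive to the local disk and return the new path."""
    os.makedirs(local_dir, exist_ok=True)
    return shutil.copy(archive_on_drive, local_dir)


# In a Colab notebook the Drive would first be mounted with:
#     from google.colab import drive
#     drive.mount('/content/drive')
# and the call would then look like this (the Drive path is a made-up example):
#     copy_archive_to_local('/content/drive/MyDrive/annotations.zip')
```

The copy is a single large file, so it should avoid the per-file overhead that makes working directly under drive/ slow.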

3. Unzip the zip file

!unzip [path_to_file.zip]

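During extraction, a message may appear in the output (the original post showed it in a screenshot).
You can ignore it; the process is still running.
Once extraction finishes, the files are in the root directory (/content/).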
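To confirm that nothing was lost during extraction, the number of files in the archive can be compared with the number of files on disk. The sketch below uses Python's `zipfile` module instead of the `!unzip` command; `extract_and_count` is a name I chose, and it assumes the target directory is empty beforehand.

```python
import os
import tempfile
import zipfile


def extract_and_count(archive_path, target_dir):
    """Extract the archive and return (files_in_archive, files_on_disk)."""
    with zipfile.ZipFile(archive_path) as archive:
        # Directory entries are skipped so only real files are counted.
        expected = sum(1 for info in archive.infolist() if not info.is_dir())
        archive.extractall(target_dir)
    # Counts every file under target_dir, hence the empty-directory assumption.
    on_disk = sum(len(filenames) for _, _, filenames in os.walk(target_dir))
    return expected, on_disk


# Self-contained demo with a two-file archive.
with tempfile.TemporaryDirectory() as work_dir:
    demo_archive = os.path.join(work_dir, "demo.zip")
    with zipfile.ZipFile(demo_archive, "w") as archive:
        archive.writestr("Annotations_1/000001.xml", "<annotation/>")
        archive.writestr("Annotations_1/000009.xml", "<annotation/>")
    counts = extract_and_count(demo_archive, os.path.join(work_dir, "extracted"))
    print(counts)
```

If the two numbers differ, the extraction was interrupted or incomplete and should be repeated.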

4. Move the files to the target location

!mv /content/xxx.zip /content/drive/MyDrive/target_dir/

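Estimate roughly how long this should take based on the extraction time, wait somewhat longer than that, and then check in Google Drive whether the transfer succeeded.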
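Rather than only waiting and checking by eye, the move can be paired with a simple size check. This is a sketch, with `move_and_verify` as a hypothetical helper; it only confirms what the Colab VM sees at the destination path, not that Drive has finished syncing to the cloud.

```python
import os
import shutil
import tempfile


def move_and_verify(source_path, target_dir):
    """Move a file into target_dir and report whether its size is unchanged."""
    size_before = os.path.getsize(source_path)
    os.makedirs(target_dir, exist_ok=True)
    moved_path = shutil.move(source_path, target_dir)
    return moved_path, os.path.getsize(moved_path) == size_before


# Self-contained demo with a small placeholder file.
with tempfile.TemporaryDirectory() as work_dir:
    demo_source = os.path.join(work_dir, "xxx.zip")
    with open(demo_source, "wb") as f:
        f.write(b"placeholder bytes")
    moved_path, size_matches = move_and_verify(demo_source, os.path.join(work_dir, "target_dir"))
    source_still_exists = os.path.exists(demo_source)
    print(moved_path, size_matches)

# In Colab, google.colab.drive also offers flush_and_unmount(), which I believe
# waits for pending writes to reach Drive; treat that as an assumption to verify.
```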

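5. Rewrite the file-reading part of your code (for reference)

Because files that used to sit in one directory are now spread across many, code that looks a file up in the original directory will no longer find it. The lookup has to be rewritten to search the subdirectories.
The following code can serve as a reference if needed: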

import os

# Root directory containing the Annotations_1 ... Annotations_155 subfolders
# (raw string so the Windows backslashes are not treated as escapes)
annotation_dirname = r"D:\Download\Annotations\Annotations_split"

for fileid in ['000001', '000009', '0223']:
    anno_file = ""
    anno_file_name = fileid + ".xml"
    # Walk the subfolders until the annotation file is found
    for dirpath, dirnames, filenames in os.walk(annotation_dirname):
        if anno_file_name in filenames:
            print("found %s in %s !!!" % (anno_file_name, dirpath))
            anno_file = os.path.join(dirpath, anno_file_name)
            break
    if anno_file == "":
        print("%s does not exist" % anno_file_name)
        continue
    print(anno_file)
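The loop above walks the directory tree again for every file ID, which can become slow with 150,000+ files. A possible alternative, sketched here with a made-up helper `build_file_index`, is to walk the tree once and keep a dictionary from file name to full path. It assumes file names are unique across subfolders, which holds for the split described in step 1.

```python
import os
import tempfile


def build_file_index(root_dir, extension=".xml"):
    """Walk root_dir once and map each matching file name to its full path."""
    index = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(extension):
                index[filename] = os.path.join(dirpath, filename)
    return index


# Self-contained demo: one annotation exists, the other does not.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "Annotations_1"))
    open(os.path.join(demo_root, "Annotations_1", "000001.xml"), "w").close()
    demo_index = build_file_index(demo_root)
    for fileid in ["000001", "0223"]:
        anno_file = demo_index.get(fileid + ".xml")
        if anno_file is None:
            print("%s.xml does not exist" % fileid)
            continue
        print(anno_file)
```

Each lookup after the initial walk is a dictionary access, so the cost of scanning the subfolders is paid only once.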
