Splitting and Restoring Large Archive Files (Python)

1. Introduction

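        I once needed to split a large archive for transfer and then restore it on the other side. I found a useful Python program for the job.

        This package compresses and splits large files. Starting from a root directory, it walks the subdirectories and scans every file. Any file whose size exceeds the threshold is compressed and split into multiple archive files, each no larger than the partition size. Compression and splitting work for any file extension.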

Example:

Given the directory

$ tree --du -h ~/MyFolder

└── [415M]  My Datasets
│   ├── [6.3K]  Readme.txt
│   └── [415M]  Data on Leaf-Tailed Gecko
│       ├── [ 35M]  DatasetA.zip
│       ├── [ 90M]  DatasetB.zip
│       ├── [130M]  DatasetC.zip
│       └── [160M]  Books
│           ├── [ 15M]  RegularBook.pdf
│           └── [145M]  BookWithPictures.pdf
└── [818M]  Video Conference Meetings
    ├── [817M]  Discussion_on_Fermi_Paradox.mp4
    └── [1.1M]  Notes_on_Discussion.pdf

running

$ python3 src/main.py  --root_dir ~/MyFolder

turns the directory into

$ tree --du -h ~/MyFolder

└── [371M]  My Datasets
│   ├── [6.3K]  Readme.txt
│   └── [371M]  Data on Leaf-Tailed Gecko
│       ├── [ 35M]  DatasetA.zip
│       ├── [ 90M]  DatasetB.zip
│       ├── [ 95M]  DatasetC.zip.7z.001
│       ├── [ 18M]  DatasetC.zip.7z.002
│       └── [133M]  Books
│           ├── [ 15M]  RegularBook.pdf
│           ├── [ 95M]  BookWithPictures.pdf.7z.001
│           └── [ 23M]  BookWithPictures.pdf.7z.002
└── [794M]  Video Conference Meetings
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.001
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.002
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.003
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.004
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.005
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.006
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.007
    ├── [ 95M]  Discussion_on_Fermi_Paradox.mp4.7z.008
    ├── [ 33M]  Discussion_on_Fermi_Paradox.mp4.7z.009
    └── [1.1M]  Notes_on_Discussion.pdf

Running

$ python3 src/reverse.py  --root_dir ~/MyFolder

restores the original files.
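Because the split step deletes the original files by default, it can help to preview which files would be affected before running it. The following is a minimal sketch, not part of the original package: `find_large_files` is a hypothetical helper that only lists files at or above a byte threshold and changes nothing on disk.

```python
import os
import tempfile


def find_large_files(root_dir, threshold_bytes):
    """Return sorted (path, size) pairs for files of at least threshold_bytes."""
    large = []
    for root, _, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                large.append((path, size))
    return sorted(large)


if __name__ == "__main__":
    # Tiny demo tree with a 10-byte threshold so the example runs instantly.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "sub"))
        with open(os.path.join(tmp, "small.txt"), "wb") as fh:
            fh.write(b"x" * 5)
        with open(os.path.join(tmp, "sub", "big.bin"), "wb") as fh:
            fh.write(b"x" * 50)
        for path, size in find_large_files(tmp, 10):
            print(os.path.relpath(path, tmp), size)
```

For a real run one would pass the actual root directory and a threshold such as 100 * 10**6 bytes.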

2. Environment setup

2.1 Python 3

        Python 3.x.x is already installed locally.

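2.2 Downloading and installing 7z

        Both scripts call the 7z executable, so 7-Zip must be installed and available on PATH.

        Directory traversal in src/main.py is serial, but the compression and splitting of each file by 7z is parallel by default.
        Restoring with src/reverse.py is entirely serial.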
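A quick way to confirm that 7z is reachable before running either script is to look it up on PATH. This is a small sketch under assumptions: `find_7z` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out for testing. The scripts in this post call the executable as `7z`, so a binary installed under another name would still need an alias or a renamed copy.

```python
import shutil


def find_7z(candidates=("7z",), which=shutil.which):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_7z()
    if location:
        print("7z found at:", location)
    else:
        print("7z not found on PATH; install 7-Zip and add it to PATH first.")
```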

3. Splitting

The code used to split large files, main.py, is as follows:

import sys  # used to exit the program
import os  # file and directory operations
import shutil  # used to locate the 7z executable
import subprocess  # used to run the 7z command
import argparse  # command-line argument parsing


def str2bool(value):
    # argparse's type=bool turns any non-empty string (even "False") into True, so parse explicitly
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got %r" % value)


def parse_arguments():
    # Parse the command-line arguments
    parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles')

    parser.add_argument('--root_dir', type=str, default=os.getcwd(),
                        help="Root directory to start traversing. Defaults to current working directory.")
    parser.add_argument('--delete_original', type=str2bool, default=True,
                        help="Do you want to delete the original (large) file after compressing to archives?")
    parser.add_argument('--partition_ext', type=str, default="7z", choices=["7z", "xz", "bzip2", "gzip", "tar", "zip", "wim"],
                        help="Extension of the partitions. Recommended: 7z due to compression ratio and inter-OS compatibility.")
    parser.add_argument('--cmds_into_7z', type=str, default="a",
                        help="Commands to pass in to 7z.")
    parser.add_argument('--threshold_size', type=int, default=100,
                        help="Max threshold of the original file size to split into archive. I.e. files with sizes below this arg are ignored.")
    parser.add_argument('--threshold_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
                        help="Unit of the threshold size specified (bytes, kilobytes, megabytes, gigabytes).")
    parser.add_argument('--partition_size', type=int, default=95,
                        help="Max size of an individual archive. May result in actual partition size to be higher than this value due to disk formatting. In that case, reduce this arg value.")
    parser.add_argument('--partition_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
                        help="Unit of the partition size specified (bytes, kilobytes, megabytes, gigabytes).")

    args = parser.parse_args()
    return args


def check_7z_install():
    # Check that 7z is installed; exit the program if it is not
    if shutil.which("7z"):
        return True
    else:
        sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")


def is_over_threshold(f_full_dir, args):
    # Check whether the file size reaches the threshold (decimal units: 1 m = 10**6 bytes)
    size_dict = {
        "b": 1e-0,
        "k": 1e-3,
        "m": 1e-6,
        "g": 1e-9
    }
    return os.stat(f_full_dir).st_size * size_dict[args.threshold_size_unit] >= args.threshold_size


def traverse_root_dir(args):
    # Walk the files under the given directory and compress the large ones
    for root, _, files in os.walk(args.root_dir):
        for f in files:
            f_full_dir = os.path.join(root, f)

            if is_over_threshold(f_full_dir, args):
                # Append the archive extension to the full file name, so a file
                # without an extension does not end up with a double dot ("name..7z")
                archive_path = f_full_dir + "." + args.partition_ext
                # Compress and split into volumes with 7z
                prc = subprocess.run(["7z", "-v" + str(args.partition_size) + args.partition_size_unit, args.cmds_into_7z,
                                      archive_path, f_full_dir])

                # Delete the original only if 7z reported success
                if args.delete_original and prc.returncode == 0:
                    os.remove(f_full_dir)


if __name__ == '__main__':
    check_7z_install()  # make sure 7z is installed
    traverse_root_dir(parse_arguments())  # compress and split the files

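This code walks all subdirectories starting from root_dir and compresses every file of at least 100 MB into smaller archive files of at most about 95 MB each. By default the original (large) file is deleted after compression, but this can be turned off through the --delete_original argument.

Execution log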
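The two size limits do not use the same kind of megabyte. `is_over_threshold` multiplies by 1e-6, so the threshold is in decimal units, while the run logs in this post show that 7z writes a 95m volume as 99614720 bytes (95 MiB), a binary unit. The sketch below works through that arithmetic; the two dictionaries and helper names are illustrative and not part of main.py.

```python
# Decimal factors, mirroring the size_dict used by is_over_threshold in main.py.
THRESHOLD_FACTORS = {"b": 1, "k": 10**3, "m": 10**6, "g": 10**9}
# Binary factors, matching the volume size that 7z reports in its log.
VOLUME_FACTORS = {"b": 1, "k": 2**10, "m": 2**20, "g": 2**30}


def threshold_bytes(size, unit):
    """Smallest file size in bytes that counts as over the threshold."""
    return size * THRESHOLD_FACTORS[unit]


def volume_bytes(size, unit):
    """Maximum size in bytes of a single 7z volume."""
    return size * VOLUME_FACTORS[unit]


print(threshold_bytes(100, "m"))  # 100000000
print(volume_bytes(95, "m"))      # 99614720
```

With the defaults, a file only just over the threshold (100000000 bytes) is still larger than one volume (99614720 bytes), so if it barely compresses it ends up as two volumes rather than one.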

D:\tmp\git_di>python main.py  --root_dir "D:\tmp\git_di"

7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20

Scanning the drive:
1 file, 3329165073 bytes (3175 MiB)

Creating archive: D:\tmp\git_di\testfile.zip.7z

Add new data to archive: 1 file, 3329165073 bytes (3175 MiB)


Files read from disk: 1
Archive size: 3304152719 bytes (3152 MiB)
Volumes: 34
Everything is Ok

You can see that multiple archive volumes (testfile.zip.7z.001, testfile.zip.7z.002 ......) were generated in the current directory.
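The volume count in the log can be checked by hand: divide the archive size by the volume size and round up. The sketch below does this for the numbers in the log above (an archive of 3304152719 bytes and 95 MiB volumes); `volume_layout` is a hypothetical helper written for this check.

```python
def volume_layout(archive_bytes, volume_bytes):
    """Return (number_of_volumes, size_of_last_volume) for a split archive."""
    if archive_bytes <= 0 or volume_bytes <= 0:
        raise ValueError("sizes must be positive")
    count = -(-archive_bytes // volume_bytes)  # ceiling division on integers
    last = archive_bytes - (count - 1) * volume_bytes
    return count, last


# Archive size taken from the 7z log; 95 MiB is the default volume size.
print(volume_layout(3304152719, 95 * 2**20))  # (34, 16866959)
```

The result agrees with the "Volumes: 34" line: 33 full volumes plus a final one of roughly 16 MiB.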

4. Restoring

The code used to restore large files, reverse.py, is as follows:

import sys  # used to exit the program
import os  # file and directory operations
import shutil  # used to locate the 7z executable
import subprocess  # used to run the 7z command
import argparse  # command-line argument parsing


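def str2bool(value):
    # argparse's type=bool turns any non-empty string (even "False") into True, so parse explicitly
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got %r" % value)


def parse_arguments():
    # Parse the command-line arguments
    parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles_reverse')

    parser.add_argument('--root_dir', type=str, default=os.getcwd(),
                        help="Root directory to start traversing. Defaults to current working directory.")
    parser.add_argument('--delete_partitions', type=str2bool, default=True,
                        help="Do you want to delete the partition archives after extracting the original files?")

    args = parser.parse_args()
    return args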


def check_7z_install():
    # Check that 7z is installed; exit the program if it is not
    if shutil.which("7z"):
        return True
    else:
        sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")


def is_partition(f_full_dir):
    # Check whether the file is the first volume (.001) of a split archive
    return any(f_full_dir.endswith(ext) for ext in
               [".7z.001", ".xz.001", ".bzip2.001", ".gzip.001", ".tar.001", ".zip.001", ".wim.001"])


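def reverse_root_dir(args):
    # Walk the files under the given directory and extract every split archive
    for root, _, files in os.walk(args.root_dir):
        for f in files:
            f_full_dir = os.path.join(root, f)
            if is_partition(f_full_dir):
                # Extract with 7z; given the .001 volume, it reads the remaining volumes itself
                prc = subprocess.run(["7z", "e", f_full_dir, "-o" + root])
                if args.delete_partitions and prc.returncode == 0:
                    f_noext, _ = os.path.splitext(f)
                    # Delete the volumes with os.remove rather than the Unix-only "rm"
                    # shell command, and without changing the working directory
                    for name in files:
                        volume_base, volume_ext = os.path.splitext(name)
                        if volume_base == f_noext and volume_ext[1:].isdigit():
                            os.remove(os.path.join(root, name))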


if __name__ == '__main__':
    check_7z_install()  # make sure 7z is installed
    reverse_root_dir(parse_arguments())  # extract the split archives

Test

Place the archive volumes (testfile.zip.7z.001, testfile.zip.7z.002 ......) in the directory D:\tmp\git_di, with reverse.py in the same directory.

Execution log
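After a transfer it is worth confirming that no volume went missing before extracting. The sketch below is an assumption-laden helper, not part of reverse.py: `missing_volumes` treats any file name ending in a dot and at least three digits as a volume, and reports gaps in the numbering. It cannot tell whether trailing volumes are missing, since the file names alone do not record the total count.

```python
import re
from collections import defaultdict

# A volume name is "<base>.<index>" where the index has three or more digits.
VOLUME_RE = re.compile(r"^(?P<base>.+)\.(?P<index>\d{3,})$")


def missing_volumes(filenames):
    """Map each archive base name to the volume numbers missing from 1..max."""
    found = defaultdict(set)
    for name in filenames:
        match = VOLUME_RE.match(name)
        if match:
            found[match.group("base")].add(int(match.group("index")))
    gaps = {}
    for base, indexes in found.items():
        missing = sorted(set(range(1, max(indexes) + 1)) - indexes)
        if missing:
            gaps[base] = missing
    return gaps


names = ["testfile.zip.7z.001", "testfile.zip.7z.003", "notes.txt"]
print(missing_volumes(names))  # {'testfile.zip.7z': [2]}
```

In practice the list of names would come from something like `os.listdir` on the directory holding the volumes.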

D:\tmp\git_di>python reverse.py --root_dir "D:\tmp\git_di"

7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20

Scanning the drive for archives:
1 file, 99614720 bytes (95 MiB)

Extracting archive: D:\tmp\git_di\testfile.zip.7z.001
--
Path = D:\tmp\git_di\testfile.zip.7z.001
Type = Split
Physical Size = 99614720
Volumes = 34
Total Physical Size = 3304152719
----
Path = testfile.zip.7z
Size = 3304152719
--
Path = testfile.zip.7z
Type = 7z
Physical Size = 3304152719
Headers Size = 162
Method = LZMA2:24
Solid = -
Blocks = 1

Everything is Ok

Size:       3329165073
Compressed: 3304152719
'rm' is not recognized as an internal or external command,
operable program or batch file.

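You can see that the file testfile.zip was newly created. The 'rm' error at the end of the log comes from a version of the script that deletes the volumes by calling the Unix rm command, which does not exist in the Windows command prompt. Extraction itself succeeded; only the cleanup of the volume files was skipped.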

5. Conclusion

Reference GitHub repository:

https://github.com/sisl/GitHub-ForceLargeFiles

over.
