[Gitee] Splitting a large text file into small files and uploading them to a Gitee repository with gitpython

Problem description

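  I had about 1 GB of text data in a local file that I wanted to upload to a Gitee repository. The upload limit for a single file is 100 MB, so I split the data into 11 small files of at most 99 MB each. Pushing them all together from the local repository still failed: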

[Figure: screenshot of the failed push]
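  The Compressing phase completed normally, but the connection dropped during the Writing phase. This suggests that the limit applies not only to the size of a single file but also to the total size of one push from the local repository to the remote, because when I manually added the small files (99 MB each) one at a time and pushed after each one, it worked. Adding and pushing by hand is tedious, though.
  Solution: combine the file splitting with automated pushes through gitpython, so that a push happens each time one small file is split off. This avoids the dropped connection caused by pushing all the small files at once.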

Splitting the text file

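  There are two basic ways to split a text file. The first reads the source file line by line and divides the lines, for example writing 100 lines into 2 target files of 50 lines each. The second reads the source file by bytes and writes those bytes to the target files. This article splits by bytes, which makes it possible to control the size of each file. The core code for reading by bytes: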

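# Read; num_bytes is the number of bytes to read
# (named num_bytes so that it does not shadow the built-in bytes type)
content = yourfile1.read(num_bytes)
# Write
yourfile2.write(content)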

  Because the data is read by bytes, the files must be opened in binary mode: pass mode rb to open for reading and mode wb for writing.
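As a small illustration of the byte-wise read loop, here is a minimal sketch run on a tiny temporary file. The helper name iter_chunks, the 250-byte file, and the 100-byte chunk size are assumptions made for the demo and are not part of the original script.

```python
import math
import os
import tempfile


def iter_chunks(path, chunk_bytes):
    """Yield successive blocks of at most chunk_bytes bytes from a file."""
    with open(path, "rb") as source:
        while True:
            content = source.read(chunk_bytes)
            if not content:  # read() returns b"" at end of file
                break
            yield content


# Demo: a 250-byte temporary file read in 100-byte chunks.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"x" * 250)
    sizes = [len(chunk) for chunk in iter_chunks(demo_path, 100)]

# The number of chunks is the file size divided by the chunk size, rounded up.
expected_count = math.ceil(250 / 100)
print(sizes, expected_count)
```

Every chunk except possibly the last has exactly the requested size, which is what keeps each small file under the limit.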

import math
import os

from tqdm import tqdm


def SplitFiles(src_dir, tgt_dir, sub_file_size=99, max_size=100):
    """
    Split big files into small files which are not bigger than sub_file_size MB.
    :param src_dir: *source directory containing the files you want to split.*
    :param tgt_dir: *target directory where the small files are placed after splitting.*
    :param sub_file_size: *size of each small file in MB, default 99.*
    :param max_size: *files bigger than this size in MB are split, default 100.*
    :return: *None.*
    """
    chunk_bytes = int(sub_file_size * 1024 ** 2)
    for file in os.listdir(src_dir):
        src_path = os.path.join(src_dir, file)
        file_bytes = os.stat(src_path).st_size
        # Files within the size limit are left alone.
        if file_bytes <= max_size * 1024 ** 2:
            continue
        # Number of small files needed, rounded up (no empty trailing file).
        sub_file_nums = math.ceil(file_bytes / chunk_bytes)
        # Insert "sub{i}" before the extension, e.g. data.txt -> datasub0.txt.
        stem, ext = os.path.splitext(file)
        with open(src_path, "rb") as source:
            with tqdm(range(sub_file_nums), desc=file + " processing") as tbar:
                # For every small file, build its path and write the next chunk into it.
                for i in tbar:
                    target_path = os.path.join(tgt_dir, "{}sub{}{}".format(stem, i, ext))
                    with open(target_path, "wb") as target:
                        target.write(source.read(chunk_bytes))
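One way to check that byte-wise splitting loses nothing is to concatenate the parts in index order and compare the result with the source. The sketch below does this on a small in-memory sample. The helper join_parts, the file name data.txt, and the 30-byte chunk size are hypothetical; the part names assume the datasub0.txt, datasub1.txt, ... pattern that SplitFiles produces for a file name with a single dot.

```python
import os
import shutil
import tempfile


def join_parts(tgt_dir, stem, ext, out_path):
    """Concatenate <stem>sub0<ext>, <stem>sub1<ext>, ... until one is missing.

    Counting upward keeps numeric order; a plain sorted() of the names
    would place sub10 before sub2.
    """
    count = 0
    with open(out_path, "wb") as out:
        while True:
            part = os.path.join(tgt_dir, "{}sub{}{}".format(stem, count, ext))
            if not os.path.exists(part):
                break
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
            count += 1
    return count


# Sample data with multi-byte UTF-8 characters, 400 bytes in total.
original = ("line of text 文本\n" * 20).encode("utf-8")
chunk_bytes = 30

with tempfile.TemporaryDirectory() as tmp:
    parts_dir = os.path.join(tmp, "parts")
    os.mkdir(parts_dir)
    # Write the parts the same way a byte-wise splitter would.
    for start in range(0, len(original), chunk_bytes):
        name = "datasub{}.txt".format(start // chunk_bytes)
        with open(os.path.join(parts_dir, name), "wb") as f:
            f.write(original[start:start + chunk_bytes])
    restored_path = os.path.join(tmp, "restored.txt")
    count = join_parts(parts_dir, "data", ".txt", restored_path)
    with open(restored_path, "rb") as f:
        restored = f.read()

print(count, restored == original)
```

Note that a byte boundary can fall in the middle of a line or of a multi-byte character, so an individual part may not decode as UTF-8 on its own; only the concatenation is guaranteed to match the source.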

Automating the upload to Gitee with gitpython

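  gitpython provides many interfaces that achieve the same effect as typing commands in Git Bash. This article uses the style closest to the Git Bash commands, which is easier to remember. Oddly, when I ran this I had not set up any password-free login, yet the push to Gitee went straight through. The way the code below pushes may differ from what you are used to; this has to do with how I manage my Gitee repositories. Interested readers can click: Reference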

import os

from git import Repo


def PushRemote(dir_path, remote_repo_url):
    """
    Test commands which are used to upload files to Gitee.
    :param dir_path: *path of the directory which will be used as the local repository.*
    :param remote_repo_url: *url of the remote repository, it's like "git@gitee.com:·····.git"*
    :return: *None.*
    """
    with open(os.path.join(dir_path, "hello.txt"), "w", encoding="utf-8") as file:
        file.write("hello!")
    # Initialize local repository.
    repository_local = Repo.init(dir_path)
    # Add and commit.
    repository_local.git.add("hello.txt")
    repository_local.git.commit("-m", "upload file hello.txt")
    # Create new branch and checkout.
    repository_local.git.branch("new_branch")
    repository_local.git.checkout("new_branch")
    # Under new branch, connect remote repository and push.
    repository_local.git.remote("add", "remote", remote_repo_url)
    repository_local.git.push("remote", "new_branch")

Combined code

import math
import os

from git import Repo
from tqdm import tqdm


def FilesSplitAndUpload(src_dir, tgt_dir, remote_repo_url, sub_file_size=99, max_size=100):
    """
    Split big files and upload them to Gitee, pushing once per small file.
    :param src_dir: *source directory containing the files you want to split.*
    :param tgt_dir: *target directory where the small files are placed after splitting.*
    :param remote_repo_url: *url of the remote Gitee repository, it's like "git@gitee.com:···.git"*
    :param sub_file_size: *size of each small file in MB, default 99.*
    :param max_size: *files bigger than this size in MB are split, default 100.*
    :return: *None.*
    """
    # The target directory name doubles as the branch name and the remote name.
    # normpath strips a trailing slash, which would make basename return "".
    name = os.path.basename(os.path.normpath(tgt_dir))
    # Initialize local repository.
    repo_local = Repo.init(tgt_dir)
    # Create temporary file to commit before creating new branch.
    temp_path = os.path.join(tgt_dir, "temp.txt")
    with open(temp_path, "w", encoding="utf-8") as file:
        file.write("hello!")
    repo_local.git.add("temp.txt")
    repo_local.git.commit("-m", "create a temporary file")
    repo_local.git.branch(name)
    repo_local.git.checkout(name)
    repo_local.git.remote("add", name, remote_repo_url)
    # Delete temp.txt so that the branch tip holds only the target files.
    os.remove(temp_path)
    chunk_bytes = int(sub_file_size * 1024 ** 2)
    for file in os.listdir(src_dir):
        src_path = os.path.join(src_dir, file)
        file_bytes = os.stat(src_path).st_size
        # Files within the size limit are skipped.
        if file_bytes <= max_size * 1024 ** 2:
            continue
        # Number of small files needed, rounded up (no empty trailing file).
        sub_file_nums = math.ceil(file_bytes / chunk_bytes)
        # Insert "sub{i}" before the extension, e.g. data.txt -> datasub0.txt.
        stem, ext = os.path.splitext(file)
        with open(src_path, "rb") as source:
            with tqdm(range(sub_file_nums), desc=file + " processing") as tbar:
                for i in tbar:
                    target_name = "{}sub{}{}".format(stem, i, ext)
                    with open(os.path.join(tgt_dir, target_name), "wb") as target:
                        target.write(source.read(chunk_bytes))
                    # After writing each new file, commit it and push the local repository.
                    repo_local.git.add(".")
                    repo_local.git.commit("-m", "upload new data file {}".format(target_name))
                    repo_local.git.push(name, name)
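Since the original failure was a dropped connection during push, one optional hardening step is to retry a failed push a few times before giving up. The sketch below is an assumption on my part and not part of the original script: the helper retry and its parameters are hypothetical, and the push is simulated by a fake function so the example runs without a network or a repository.

```python
import time


def retry(action, attempts=3, delay_seconds=5.0):
    """Call action() until it succeeds or the attempts run out.

    The last error is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


# Simulated push that fails twice and then succeeds.
calls = {"count": 0}


def flaky_push():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "pushed"


result = retry(flaky_push, attempts=5, delay_seconds=0)
print(result, calls["count"])

# In the combined function, the push line could be wrapped like this
# (repo_local is the Repo object from the original code):
#     retry(lambda: repo_local.git.push(os.path.basename(tgt_dir), os.path.basename(tgt_dir)))
```

Catching Exception keeps the sketch self-contained; in real use it may be preferable to catch only the error type that gitpython raises for a failed command.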

Summary

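  For convenience, I designed the script to run from the command line: python myfile.py --source_dir=a/b/c --target_dir=d/e/f --remote_repo_url=git@gitee.com:···.git. You only need to give the directory containing the files to split (source_dir), the directory where the split files are placed (target_dir), and the Gitee repository address (remote_repo_url). source_dir must contain only the files to be split, and target_dir must be an empty folder before processing.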
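The command-line form described above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the author's actual entry point: the function name parse_args is hypothetical, and only the three option names come from the command shown in the text. The repository address is left elided exactly as in the article.

```python
import argparse


def parse_args(argv=None):
    """Parse the three options named in the article's example command."""
    parser = argparse.ArgumentParser(
        description="Split big files and upload them to Gitee."
    )
    parser.add_argument("--source_dir", required=True,
                        help="directory containing only the files to split")
    parser.add_argument("--target_dir", required=True,
                        help="empty directory that receives the small files")
    parser.add_argument("--remote_repo_url", required=True,
                        help="address of the Gitee repository")
    return parser.parse_args(argv)


# Demo with the argument values from the article's example command.
args = parse_args([
    "--source_dir=a/b/c",
    "--target_dir=d/e/f",
    "--remote_repo_url=git@gitee.com:···.git",
])
print(args.source_dir, args.target_dir, args.remote_repo_url)

# In a real script, calling parse_args() with no argument reads sys.argv,
# and the result would be passed on to the combined function:
#     FilesSplitAndUpload(args.source_dir, args.target_dir, args.remote_repo_url)
```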

Full source code

Source code of the project for splitting text files and uploading them to Gitee

