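Problem Description
I have about 1 GB of text data on my machine that I want to upload to a Gitee repository. Gitee limits a single file to 100 MB, so I split the data into 11 small files of at most 99 MB each. Pushing them all together from the local repository still failed:
The Compressing stage finished normally, but the connection dropped during the Writing stage. This suggests that the limit may apply not only to each single file but also to the total size of one push from the local repository to the remote, because when I added the small files (99 MB) one at a time by hand and pushed after each one, it worked. Adding and pushing by hand like this is tedious, though.
Solution: combine text splitting with automatic pushing through gitpython, so that every time one small file is split off it is pushed right away. This avoids the dropped connection caused by pushing all the small files at once.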
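Splitting the Text File
There are two basic ways to split a text file. The first reads the source file line by line and divides the lines, for example 100 lines split into two parts of 50 lines and written to two target files. The second reads the source file by bytes and writes them to the target files. This article splits by bytes, which makes it possible to control the size of each file. The core code for reading by bytes: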
# Read; bytes is the number of bytes to read
content = yourfile1.read(bytes)
# Write
yourfile2.write(content)
Because the file is read by bytes, it must be opened in binary mode: pass mode rb to open for reading and mode wb to open for writing.
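To illustrate why binary mode matters here, the small experiment below (a sketch, not part of the original script) writes a short UTF-8 string to a temporary file and reads it back in both modes. The file name demo.txt and the sample text are made up for the demonstration. In binary mode read(n) counts bytes, while in text mode it counts characters, so only binary mode gives direct control over the size on disk.

```python
import os
import tempfile

# Sample text (an assumption for this demo): each of these characters
# takes 3 bytes when encoded as UTF-8.
text = "你好世界"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: read(n) returns at most n bytes.
    with open(path, "rb") as f:
        as_bytes = f.read(3)

    # Text mode: read(n) returns at most n characters,
    # so the number of bytes consumed depends on the content.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read(3)

    size_on_disk = os.path.getsize(path)

print(len(as_bytes), len(as_text.encode("utf-8")), size_on_disk)
```

The same read(3) call consumes 3 bytes in binary mode but 9 bytes in text mode for this sample.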
import os

from tqdm import tqdm


def SplitFiles(src_dir, tgt_dir, sub_file_size=99, max_size=100):
    """
    Split big text files into small files which are not bigger than 100 MB.
    :param src_dir: *source directory containing the files you want to split.*
    :param tgt_dir: *target directory where the small files are placed after splitting.*
    :param sub_file_size: *size of each small file in MB, default 99.*
    :param max_size: *files bigger than this size in MB are split, default 100.*
    :return: *None.*
    """
    chunk_bytes = sub_file_size * 1024 ** 2
    for file in os.listdir(src_dir):
        src_path = os.path.join(src_dir, file)
        # For every file in the source directory, get its size in bytes.
        size_bytes = os.stat(src_path).st_size
        if size_bytes <= max_size * 1024 ** 2:
            # Files within the limit are left untouched.
            continue
        # Number of small files needed, rounded up. Unlike "// + 1", this does
        # not produce an empty trailing file when the size is an exact multiple.
        sub_file_nums = (size_bytes + chunk_bytes - 1) // chunk_bytes
        # Insert the index before the extension only, even if the name has several dots.
        stem, ext = os.path.splitext(file)
        with open(src_path, "rb") as source:
            with tqdm(range(sub_file_nums), desc=file + " processing") as tbar:
                for i in tbar:
                    # Build the path of each small file and write the next chunk into it.
                    target_path = os.path.join(tgt_dir, "{}sub{}{}".format(stem, i, ext))
                    with open(target_path, "wb") as target:
                        target.write(source.read(chunk_bytes))
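Splitting by bytes can cut a line, or even a multi-byte character, in the middle, so the pieces are only meaningful once they are joined again in order. The article does not show that step; the following is a minimal sketch of how it might look. The helper names merge_files and sha256_of, the file names, and the 4096-byte piece size are assumptions chosen for a small demo rather than anything from the original script.

```python
import hashlib
import os
import tempfile


def merge_files(piece_paths, merged_path, buffer_size=1024 * 1024):
    """Concatenate the pieces, in the given order, into merged_path."""
    with open(merged_path, "wb") as merged:
        for piece_path in piece_paths:
            with open(piece_path, "rb") as piece:
                while True:
                    block = piece.read(buffer_size)
                    if not block:
                        break
                    merged.write(block)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file (reads it whole; fine for a demo)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small stand-in for the big text file: 1000 lines of 12 bytes each.
    original = os.path.join(tmp_dir, "data.txt")
    with open(original, "wb") as f:
        f.write(("line 你好\n" * 1000).encode("utf-8"))

    # Split it by bytes into pieces of at most 4096 bytes.
    piece_paths = []
    with open(original, "rb") as source:
        index = 0
        while True:
            chunk = source.read(4096)
            if not chunk:
                break
            piece_path = os.path.join(tmp_dir, "datasub{}.txt".format(index))
            with open(piece_path, "wb") as target:
                target.write(chunk)
            piece_paths.append(piece_path)
            index += 1

    # Join the pieces again and compare checksums with the original.
    restored = os.path.join(tmp_dir, "restored.txt")
    merge_files(piece_paths, restored)
    identical = sha256_of(original) == sha256_of(restored)

print(len(piece_paths), identical)
```

The pieces are passed as an explicit ordered list because sorting the file names alphabetically would place sub10 before sub2.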
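Automating the Upload to Gitee with gitpython
gitpython provides many interfaces that have the same effect as typing commands in Git Bash. This article uses the style closest to the Git Bash commands, which is easy to remember. Oddly, when I ran the code, the push to Gitee went through without my setting up any passwordless login. The way push is done in the code below may differ from what you need, because it depends on how I manage my Gitee repository; interested readers can follow the reference link.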
from git import Repo


def PushRemote(dir_path, remote_repo_url):
    """
    Test the commands which are used to upload files to Gitee.
    :param dir_path: *path of the directory which will be used as the local repository.*
    :param remote_repo_url: *url of the remote repository, it's like "git@gitee.com:·····.git"*
    :return: *None.*
    """
    with open(dir_path + "/hello.txt", "w", encoding="utf-8") as file:
        file.write("hello!")
    # Initialize the local repository.
    repository_local = Repo.init(dir_path)
    # Add and commit.
    repository_local.git.add("hello.txt")
    repository_local.git.commit("-m", "upload file hello.txt")
    # Create a new branch and check it out.
    repository_local.git.branch("new_branch")
    repository_local.git.checkout("new_branch")
    # On the new branch, connect the remote repository and push.
    repository_local.git.remote("add", "remote", remote_repo_url)
    repository_local.git.push("remote", "new_branch")
Combined Code
import os

from git import Repo
from tqdm import tqdm


def FilesSplitAndUpload(src_dir, tgt_dir, remote_repo_url, sub_file_size=99, max_size=100):
    """
    Split big files and upload them to Gitee one piece at a time.
    :param src_dir: *source directory containing the files you want to split.*
    :param tgt_dir: *target directory where the small files are placed after splitting.*
    :param remote_repo_url: *url of the remote Gitee repository, it's like "git@gitee.com:···.git"*
    :param sub_file_size: *size of each small file in MB, default 99.*
    :param max_size: *files bigger than this size in MB are split, default 100.*
    :return: *None.*
    """
    # The target directory name is used as both the branch name and the remote name.
    # normpath drops a trailing slash, which would otherwise make basename return "".
    name = os.path.basename(os.path.normpath(tgt_dir))
    # Initialize the local repository.
    repo_local = Repo.init(tgt_dir)
    # Create a temporary file to commit before creating the new branch.
    temp_path = os.path.join(tgt_dir, "temp.txt")
    with open(temp_path, "w", encoding="utf-8") as file:
        file.write("hello!")
    repo_local.git.add("temp.txt")
    repo_local.git.commit("-m", "create a temporary file")
    repo_local.git.branch(name)
    repo_local.git.checkout(name)
    repo_local.git.remote("add", name, remote_repo_url)
    # Delete the temporary file temp.txt so it is not kept alongside the target files.
    os.remove(temp_path)
    chunk_bytes = sub_file_size * 1024 ** 2
    for file in os.listdir(src_dir):
        src_path = os.path.join(src_dir, file)
        # For every file in the source directory, get its size in bytes.
        size_bytes = os.stat(src_path).st_size
        if size_bytes <= max_size * 1024 ** 2:
            # Files within the limit are skipped.
            continue
        # Number of small files needed, rounded up so that no empty piece is committed.
        sub_file_nums = (size_bytes + chunk_bytes - 1) // chunk_bytes
        stem, ext = os.path.splitext(file)
        with open(src_path, "rb") as source:
            with tqdm(range(sub_file_nums), desc=file + " processing") as tbar:
                for i in tbar:
                    # Build the name of each small file and write the next chunk into it.
                    sub_name = "{}sub{}{}".format(stem, i, ext)
                    with open(os.path.join(tgt_dir, sub_name), "wb") as target:
                        target.write(source.read(chunk_bytes))
                    # After writing each new file, commit it and push the local repository.
                    repo_local.git.add(".")
                    repo_local.git.commit("-m", "upload new data file {}".format(sub_name))
                    repo_local.git.push(name, name)
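Since the original problem was a dropped connection during push, one possible refinement is to retry a failed push a few times before giving up. This is not part of the original script. The helper push_with_retry below is a hypothetical sketch that takes any zero-argument callable, so in the combined function it could wrap something like lambda: repo_local.git.push(name, name). The demo uses a fake push so that it runs without Git or a network.

```python
import time


def push_with_retry(push, attempts=3, delay_seconds=5.0):
    """Call push() until it succeeds or the attempts run out.

    push is any zero-argument callable. The last error is re-raised
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return push()
        except Exception as error:  # e.g. GitPython's GitCommandError
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Demo with a fake push that fails twice and then succeeds.
calls = {"count": 0}


def flaky_push():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "pushed"


result = push_with_retry(flaky_push, attempts=3, delay_seconds=0)
print(result, calls["count"])
```

Retrying is safe here because each piece is already committed locally before the push, so a repeated push only resends what the remote is still missing.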
Summary
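For convenience, I made the script usable from the command line. Run it with:
python myfile.py --source_dir=a/b/c --target_dir=d/e/f --remote_repo_url=git@gitee.com:···.git
You only need to supply the directory that contains the files to split (source_dir), the directory where the split files will be placed (target_dir), and the Gitee repository URL (remote_repo_url). source_dir must contain only the files to be split, and target_dir must be an empty folder before processing.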
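The article does not show how the command-line arguments are parsed. One common way is the standard argparse module, sketched below under that assumption. The function name build_parser, the help texts, and the repository URL git@gitee.com:user/repo.git are placeholders for illustration; only the three option names come from the command shown in the text.

```python
import argparse


def build_parser():
    """Build a parser for the three options named in the article."""
    parser = argparse.ArgumentParser(
        description="Split large files and push each piece to Gitee."
    )
    parser.add_argument("--source_dir", required=True,
                        help="directory containing only the files to split")
    parser.add_argument("--target_dir", required=True,
                        help="empty directory that will receive the pieces")
    parser.add_argument("--remote_repo_url", required=True,
                        help="URL of the remote Gitee repository")
    return parser


# In the real script this would be build_parser().parse_args(), which reads
# sys.argv. An explicit list is used here so the sketch runs on its own.
args = build_parser().parse_args([
    "--source_dir=a/b/c",
    "--target_dir=d/e/f",
    "--remote_repo_url=git@gitee.com:user/repo.git",  # placeholder URL
])
print(args.source_dir, args.target_dir, args.remote_repo_url)
```

The parsed values could then be passed straight to FilesSplitAndUpload(args.source_dir, args.target_dir, args.remote_repo_url).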
Full source code