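1. Background
When cloning a remote repository with git clone, you often want only one directory or nested subdirectory. For example, when building a wide&deep training job on top of the open-source mindspore/models repository, you need its "official/recommend/wide_and_deep" directory locally as the code baseline for your own training and inference work.
The usual approach is to git clone the whole project. If you need to pin the code to an exact commit ID, you run git reset --hard XXXX (the commit ID) and then extract the directory you need as your development baseline. This is fine for a handful of clones, but if you need to implement dozens of training models for the mindspore framework, do you really want to download the same project dozens of times? The mindspore project is not large, but with a repository over 1 GB and a slow network connection, the cost becomes a real problem.
One might object: why not download the repository once and take each model directory from that copy? That works when every model uses the same baseline version. If the versions differ (different commit IDs, for example mindspore r1.1, r1.2, r1.3, r1.4, r1.5, r1.6), a single copy of the repository is no longer enough.
This article focuses on the following problems:
- Precise cloning is required: only one directory, or nested subdirectory, of the remote repository is needed. This happens more than once, so time and storage costs matter.
- The downloaded directory needs version control; it is not always the latest code.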
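2. Approach
A web search turns up many articles of mixed quality. Some suggest TortoiseSVN, which works but loses the git metadata, so version control is no longer possible. Others use git but do not explain how to fetch a nested subdirectory. In fact, a directory directly under the repository root and a nested subdirectory are handled the same way.
Git 1.7.0 introduced sparse checkout mode, which lets you check out only specified files or folders. Sparse checkout limits what is written to the working tree; it does not by itself reduce what git pull fetches.
The walkthrough below uses the Gitee repository https://gitee.com/mindspore/models and checks out the wide_and_deep training model code for r1.5 as the example. A sample script is included at the end of the article.
Main steps: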
# create test folder
mkdir test; cd test
# required so the directory is under git version control
git init
git remote add origin https://gitee.com/mindspore/models.git
git config core.sparsecheckout true
# only the nested subdirectory "official/recommend/wide_and_deep" is needed
echo 'official/recommend/wide_and_deep' >> .git/info/sparse-checkout
# master is used here; substitute another branch if needed
git pull origin master
# pin the code to the commit ID that supports mindspore r1.5
git reset --hard 5a4ff4e3dc9bcb46dbb71b6b16fbadbb68c5e8dc
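Special note: when the target is a directory directly under the repository root, put "/" before and after its name, e.g. "/build/". This matches only the root-level directory; without the slashes, directories with the same name elsewhere in the tree also match, and redundant files are checked out.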
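The effect of the leading and trailing "/" can be tried offline on a throwaway repository. The sketch below is not part of the original walkthrough: the directory names (build, src/build, docs), the helper sparse_pull and the temporary paths are made up for illustration, and it assumes git and mktemp are available.

```shell
#!/bin/bash
set -e
work=$(mktemp -d)

# Throwaway source repository with a root-level build/ and a nested src/build/ (hypothetical layout)
git init -q "$work/src_repo"
cd "$work/src_repo"
mkdir -p build src/build docs
echo root > build/a.txt
echo nested > src/build/b.txt
echo doc > docs/c.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "init"
branch=$(git symbolic-ref --short HEAD)

# Same steps as the walkthrough, wrapped in a helper: $1 = target dir, $2 = sparse pattern
sparse_pull() {
    git init -q "$1"
    git -C "$1" remote add origin "$work/src_repo"
    git -C "$1" config core.sparsecheckout true
    mkdir -p "$1/.git/info"
    echo "$2" >> "$1/.git/info/sparse-checkout"
    git -C "$1" pull -q origin "$branch"
}

sparse_pull "$work/loose" 'build'       # unanchored name
sparse_pull "$work/anchored" '/build/'  # anchored to the repository root

# List the files each working tree ended up with
(cd "$work/loose" && find . -path ./.git -prune -o -type f -print)
(cd "$work/anchored" && find . -path ./.git -prune -o -type f -print)
```

Under these assumptions, the first listing should include both build/a.txt and src/build/b.txt, while the second should contain only build/a.txt.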
3. Session log
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp
$ mkdir test
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp
$ cd test
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test
$ git init
Initialized empty Git repository in E:/codes/tmp/test/.git/
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master)
$ git remote add origin https://gitee.com/mindspore/models.git
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master)
$ git remote -v
origin https://gitee.com/mindspore/models.git (fetch)
origin https://gitee.com/mindspore/models.git (push)
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master)
$ git config core.sparsecheckout true
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ echo 'official/recommend/wide_and_deep' >> .git/info/sparse-checkout
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ cat .git/info/sparse-checkout
official/recommend/wide_and_deep
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ tree -a
.
`-- .git
    |-- HEAD
    |-- config
    |-- description
    |-- hooks
    |   |-- applypatch-msg.sample
    |   |-- commit-msg.sample
    |   |-- fsmonitor-watchman.sample
    |   |-- post-update.sample
    |   |-- pre-applypatch.sample
    |   |-- pre-commit.sample
    |   |-- pre-merge-commit.sample
    |   |-- pre-push.sample
    |   |-- pre-rebase.sample
    |   |-- pre-receive.sample
    |   |-- prepare-commit-msg.sample
    |   |-- push-to-checkout.sample
    |   `-- update.sample
    |-- info
    |   |-- exclude
    |   `-- sparse-checkout
    |-- objects
    |   |-- info
    |   `-- pack
    `-- refs
        |-- heads
        `-- tags

9 directories, 18 files
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ git pull origin master
remote: Enumerating objects: 5904, done.
remote: Counting objects: 100% (5904/5904), done.
remote: Compressing objects: 100% (2797/2797), done.
remote: Total 18385 (delta 3664), reused 4338 (delta 2996), pack-reused 12481
Receiving objects: 100% (18385/18385), 64.92 MiB | 5.93 MiB/s, done.
Resolving deltas: 100% (10353/10353), done.
From https://gitee.com/mindspore/models
* branch master -> FETCH_HEAD
* [new branch] master -> origin/master
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ tree -L 4
.
`-- official
    `-- recommend
        `-- wide_and_deep
            |-- README.md
            |-- README_CN.md
            |-- ascend310_infer
            |-- default_config.yaml
            |-- eval.py
            |-- export.py
            |-- mindspore_hub_conf.py
            |-- postprocess.py
            |-- preprocess.py
            |-- requirements.txt
            |-- script
            |-- src
            |-- train.py
            |-- train_and_eval.py
            |-- train_and_eval_auto_parallel.py
            |-- train_and_eval_distribute.py
            |-- train_and_eval_parameter_server_distribute.py
            `-- train_and_eval_parameter_server_standalone.py

6 directories, 15 files
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ git reset --hard 5a4ff4e3dc9bcb46dbb71b6b16fbadbb68c5e8dc
HEAD is now at 5a4ff4e3 !813 add ascend310 infer Merge pull request !813 from jkmopl/master
Administrator@DESKTOP-NRIERC2 MINGW64 /e/codes/tmp/test (master|SPARSE)
$ git log
commit 5a4ff4e3dc9bcb46dbb71b6b16fbadbb68c5e8dc (HEAD -> master)
Merge: 789f442c 48f30e33
Author: i-robot <huawei_ci_bot@163.com>
Date:   Mon Nov 22 07:53:50 2021 +0000

    !813 add ascend310 infer
    Merge pull request !813 from jkmopl/master

commit 789f442c1b2273989dbc4a4c2ce5c762ed5cff8f
Merge: 9c321da6 646c369f
Author: i-robot <huawei_ci_bot@163.com>
Date:   Mon Nov 22 01:53:51 2021 +0000

    !171 [哈尔滨工业大学威海][高校贡献][mindspore][deeplabv3plus]-310提交
    Merge pull request !171 from kzx2020/master
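In the log above, git pull still received 18385 objects (64.92 MiB): sparse checkout trims the working tree, not the transfer. One possible way to cut the transfer as well is to combine it with a shallow fetch (--depth 1). The sketch below tries this against a local throwaway repository reached through a file:// URL. The layout and names (models/wd, the branch name work) are hypothetical, and a shallow fetch only brings the branch tip; whether a given server allows shallow-fetching an arbitrary older commit ID is an assumption you would need to check.

```shell
#!/bin/bash
set -e
work=$(mktemp -d)

# Throwaway source repository with two commits (hypothetical content)
git init -q "$work/src_repo"
cd "$work/src_repo"
mkdir -p models/wd models/other
echo v1 > models/wd/train.py
echo x > models/other/x.py
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "first"
echo v2 > models/wd/train.py
git -c user.name=demo -c user.email=demo@example.com commit -q -am "second"
branch=$(git symbolic-ref --short HEAD)

# Sparse working directory, same setup as in the walkthrough
git init -q "$work/shallow"
cd "$work/shallow"
git remote add origin "file://$work/src_repo"
git config core.sparsecheckout true
mkdir -p .git/info
echo 'models/wd' >> .git/info/sparse-checkout

# Fetch only the tip commit of the branch, then check it out into a local branch
git fetch -q --depth 1 origin "$branch"
git checkout -q -b work FETCH_HEAD

# Number of commits available locally, then the files in the working tree
git rev-list --count HEAD
find . -path ./.git -prune -o -type f -print
```

The trade-off is that a shallow repository has no history to reset back into, so this fits the "latest code of one branch" case better than the pinned-commit case.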
4. Configuring the sparse-checkout file
See reference [1].
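As a small illustration of the pattern syntax: when the file is written by hand as in this article, its patterns follow gitignore-style rules, so a "!" pattern can exclude part of an included directory. The sketch below keeps one directory but drops one of its subdirectories. The repository layout (pkg/, pkg/skip/, other/) and the paths are hypothetical, created only for this demo.

```shell
#!/bin/bash
set -e
work=$(mktemp -d)

# Throwaway source repository (hypothetical layout)
git init -q "$work/src_repo"
cd "$work/src_repo"
mkdir -p pkg/keep pkg/skip other
echo 1 > pkg/top.txt
echo 2 > pkg/keep/k.txt
echo 3 > pkg/skip/s.txt
echo 4 > other/o.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "init"
branch=$(git symbolic-ref --short HEAD)

git init -q "$work/partial"
cd "$work/partial"
git remote add origin "$work/src_repo"
git config core.sparsecheckout true
mkdir -p .git/info
# Later patterns override earlier ones: include pkg/, then exclude pkg/skip/
printf '%s\n' '/pkg/' '!/pkg/skip/' >> .git/info/sparse-checkout
git pull -q origin "$branch"

# List the files that were checked out
find . -path ./.git -prune -o -type f -print
```

With these two patterns the listing should show pkg/top.txt and pkg/keep/k.txt, but nothing under pkg/skip or other.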
5. Application example
Wrapper script git_clone_only_given_folder.sh:
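#!/bin/bash
# Clone only one folder of a remote repository using sparse checkout.
echo_help()
{
    echo "usage:"
    echo "  # if project_given_folder is directly under the repo root, wrap it in '/' such as '/build/'"
    echo "  ./git_clone_only_given_folder.sh absolute_target_path download_url project_given_folder [branch]"
    echo "  when the branch parameter is omitted, the master branch is used"
}
# quote "$1" so the test does not break when no argument is given
if [ "$1" == "--help" ] || [ "$1" == "-h" ]; then
    echo_help
    exit 0
fi
# the first three arguments are required
if [ $# -lt 3 ]; then
    echo_help
    exit 1
fi
path=$1
url=$2
folder=$3
branch=master
# the branch is the optional 4th argument
if [ $# -ge 4 ]; then
    branch=$4
fi
# WARNING: this removes any existing content at the target path
rm -rf "$path"
mkdir -p "$path"
cd "$path" || exit 1
git init
git remote add origin "$url"
git config core.sparsecheckout true
echo "$folder" >> .git/info/sparse-checkout
git pull origin "$branch"
cd ..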
Example invocation of the script:
./git_clone_only_given_folder.sh '/home/test/' 'https://gitee.com/mindspore/models.git' 'official/recommend/wide_and_deep' master
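For the multi-version case from the background section, the same steps can be repeated once per version, each in its own directory and each reset to its own commit. The sketch below shows the loop on a local throwaway repository. The labels v1 and v2, the directory layout and the helper name checkout_version are invented for the demo and stand in for real commit IDs such as the r1.5 one used earlier.

```shell
#!/bin/bash
set -e
work=$(mktemp -d)

# Throwaway source repository with two commits standing in for two releases
git init -q "$work/src_repo"
cd "$work/src_repo"
mkdir -p models/wd models/other
echo v1 > models/wd/train.py
echo x > models/other/x.py
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "first"
c1=$(git rev-parse HEAD)
echo v2 > models/wd/train.py
git -c user.name=demo -c user.email=demo@example.com commit -q -am "second"
c2=$(git rev-parse HEAD)
branch=$(git symbolic-ref --short HEAD)

# $1 = target dir, $2 = folder to keep, $3 = commit ID to pin
checkout_version() {
    git init -q "$1"
    git -C "$1" remote add origin "$work/src_repo"
    git -C "$1" config core.sparsecheckout true
    mkdir -p "$1/.git/info"
    echo "$2" >> "$1/.git/info/sparse-checkout"
    git -C "$1" pull -q origin "$branch"
    git -C "$1" reset -q --hard "$3"
}

# One sparse working directory per version label
for pair in "v1:$c1" "v2:$c2"; do
    ver=${pair%%:*}
    commit=${pair#*:}
    checkout_version "$work/wd_$ver" 'models/wd' "$commit"
    echo "$ver: $(cat "$work/wd_$ver/models/wd/train.py")"
done
```

Each directory keeps its own .git, so every version can still be inspected or moved to another commit independently, at the cost of one fetch per directory.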
6. References
[1] yanlong107, git sparse checkout (稀疏检出), https://www.jianshu.com/p/680f2c6c84de