How to find/identify large commits in git history?

Latest recommended article published on 2024-02-04 11:11:05

xfxf996

Tags: git

Original link: https://oldbug.net/q/Izjt/How-to-find-identify-large-commits-in-git-history

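This article describes how to find heavy files, such as videos or large images, that were accidentally committed to the history of a large git repository. With scripts and tools (a script that lists large objects, text files combined to recover file information, and BFG Repo-Cleaner), these files can be located and, if desired, deleted, which reduces the size of the git repository.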

This article is translated from: How to find/identify large commits in git history?

I've got a git repo of 300 MB. My currently checked-out files weigh 2 MB, and the rest of the git repo weighs 298 MB. This is basically a code-only repo that should not weigh more than a few MB.

Most likely, somebody at some point committed some heavy files by accident (videos, huge images, etc.) and then removed them, but not from git, so we have a history with useless large files. How can I track down the large files in the git history? There are 400+ commits, so going through them one by one would be time-consuming.

NOTE: my question is not about how to remove the file, but how to find it in the first place.

Reply #1

Reference: https://stackoom.com/question/Izjt/如何在git历史中查找-识别大型提交

Reply #2

I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:

http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

#!/bin/bash
#set -x 

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by size, take top 10
objects=$(git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head)

echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
allObjects=$(git rev-list --all --objects)
for y in $objects
do
    # extract the size in bytes and convert to kB
    # (awk splits on runs of whitespace, so the padded type column does not shift the fields)
    size=$(( $(echo "$y" | awk '{print $3}') / 1024 ))
    # extract the compressed size in bytes and convert to kB
    compressedSize=$(( $(echo "$y" | awk '{print $4}') / 1024 ))
    # extract the SHA
    sha=$(echo "$y" | awk '{print $1}')
    # find the object's location in the repository tree
    other=$(echo "${allObjects}" | grep "$sha")
    output="${output}\n${size},${compressedSize},${other}"
done

echo -e "$output" | column -t -s ', '

That will give you the object name (SHA1sum) of the blob, and then you can use a script like this one:

Which commit has this blob?

... to find the commit that points to each of those blobs.
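As one possible way to do that lookup without a separate script, the sketch below leans on the `--find-object` option of `git log`, which assumes Git 2.16 or newer. The function name `commits_for_blob` is made up for illustration and is not part of the linked answer.

```shell
# List every commit, on any ref, whose diff adds or removes the given blob.
# Assumes Git 2.16+ (for --find-object). Merge commits are skipped because
# git log does not diff them by default.
commits_for_blob() {
    local blob_sha="$1"
    git log --all --date=short --format='%h %ad %s' --find-object="$blob_sha"
}
```

Call it as `commits_for_blob <sha>` with a SHA taken from the script output above. A blob that was committed and later deleted typically shows up twice: once for the commit that added it and once for the commit that removed it.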

Reply #3

Step 1: Write all file SHA1s to a text file:

git rev-list --objects --all | sort -k 2 > allfileshas.txt

Step 2: Sort the blobs from biggest to smallest and write the results to a text file:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | grep -E "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Step 3a: Combine both text files to get file name/SHA1/size information:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done;

Step 3b: If you have file names or path names containing spaces, try this variation of Step 3a. It uses cut instead of awk to get the desired columns, including the spaces, from column 7 to the end of the line:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | cut -d ' ' -f'1,3,7-' >> bigtosmall.txt
done;

Now you can look at the file bigtosmall.txt in order to decide which files you want to remove from your Git history.
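Steps 1 to 3 can also be collapsed into a single pipeline. The sketch below is a different technique from the one in this answer: it asks `git cat-file --batch-check` for each object's size instead of parsing `git verify-pack` output, so it reports uncompressed sizes and needs no intermediate text files. The function name `largest_blobs` is hypothetical.

```shell
# Print the N largest blobs reachable from any ref as "<size-bytes> <sha> <path>".
# Sizes are uncompressed object sizes, not the on-disk size inside the pack.
largest_blobs() {
    local count="${1:-10}"
    git rev-list --objects --all \
      | git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
      | awk '$1 == "blob"' \
      | cut -d' ' -f2- \
      | sort -k1,1nr \
      | head -n "$count"
}
```

For example, `largest_blobs 20 > bigtosmall.txt` writes a list comparable to the one built above. Paths containing spaces survive intact, because `%(rest)` carries everything after the SHA on each `git rev-list` line.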

Step 4: To perform the removal (note that this part is slow, since it examines every commit in your history for data about the file you identified):

git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD

Source

Steps 1-3a were copied from Finding and Purging Big Files From Git History

EDIT

The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.

Reply #4

I've found a one-liner solution on the ETH Zurich Department of Physics wiki page (close to the end of that page). Just do a git gc to remove stale junk, and then

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

will give you the 10 largest files in the repository.

There's also a lazier solution now available: GitExtensions has a plugin that does this in the UI (and handles history rewrites as well).

GitExtensions 'Find large files' dialog

Reply #5

You should use BFG Repo-Cleaner.

According to the website:

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

Removing Crazy Big Files
Removing Passwords, Credentials & other Private data

The classic procedure for reducing the size of a repository would be:

git clone --mirror git://example.com/some-big-repo.git
java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git
cd some-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
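To confirm that a cleanup like this actually shrank the repository, one option is to compare the packed size before and after. The sketch below reads the `size-pack` field of `git count-objects -v`, which Git reports in KiB. The helper name `repo_pack_kib` is an assumption for illustration and is not part of BFG.

```shell
# Print the disk space used by pack files, in KiB, for the repository
# in the current directory (works in normal, bare and mirror clones).
repo_pack_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

Run `repo_pack_kib` inside some-big-repo.git once before the BFG step and once after `git gc --prune=now --aggressive`. The second number should be noticeably smaller if large blobs were removed.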

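Reply #6

If you only want to have a list of large files, then I'd like to provide you with the following one-liner (source at renuo). Note that the path objects/pack/*.idx is relative, so the command has to be run from inside a bare or mirror clone (or from inside the .git directory):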

join -o "1.1 1.2 2.3" <(git rev-list --objects --all | sort) <(git verify-pack -v objects/pack/*.idx | sort -k3 -n | tail -5 | sort) | sort -k3 -n

The output will be:

object ID    file name                                                size in bytes

72e1e6d20... db/players.sql                                           818314
ea20b964a... app/assets/images/background_final2.png                  6739212
f8344b9b5... data_test/pg_xlog/000000010000000000000001               1625545
1ecc2395c... data_development/pg_xlog/000000010000000000000001        16777216
bc83d216d... app/assets/images/background_1forfinal.psd               95533848

The last entry in the list points to the largest file in your git history.

You can use this output to make sure that you're not using BFG to delete things you still need in your history.
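One way to act on that check is to test whether a large blob is still part of the current HEAD tree before stripping it. The following is a minimal sketch, assuming you pass the full (unabbreviated) object ID; the function name `blob_in_head` is hypothetical.

```shell
# Succeed (exit status 0) if the given blob is still present in the HEAD tree.
# git ls-tree -r prints "<mode> <type> <sha><TAB><path>", so the SHA is field 3.
blob_in_head() {
    local blob_sha="$1"
    git ls-tree -r HEAD \
      | awk -v sha="$blob_sha" '$3 == sha { found = 1 } END { exit !found }'
}
```

For example, `blob_in_head <full-sha> && echo "still in use"` flags blobs that the current checkout still depends on, while blobs that exist only in older commits are candidates for removal.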