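A newcomer using git for the first time is like someone arriving in a foreign country where they can neither read the script nor speak the language. As long as you know where you are and where you are going, everything is fine; once you get lost, the trouble starts.
There are already plenty of articles online about learning the basic git commands. This is not one of them; it tries a different approach.
Beginners are always intimidated by git, and in fact it is hard not to be. git is certainly a powerful tool, but it is not a friendly one. There are lots of new concepts, commands that behave completely differently depending on whether or not they are given a file as an argument, cryptic feedback, and so on.
I think the way past this first hurdle is to do more than just run git commit/push. If we take the time to really understand what git is made of, we will save ourselves a lot of trouble.
A first look at .git
So let's get started. When you create a repository with git init, git creates a magic directory: .git. This directory contains all the information git needs to work. Put plainly, if you want to remove git from your project but keep the project files, you only need to delete the .git folder. But are you sure you want to do that?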
    ├── HEAD
    ├── branches
    ├── config
    ├── description
    ├── hooks
    │   ├── pre-commit.sample
    │   ├── pre-push.sample
    │   └── ...
    ├── info
    │   └── exclude
    ├── objects
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags
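This is what your .git directory looks like before your first commit. Here is what each entry is for:
- HEAD: we will come back to this one later.
- config: this file holds the settings of your repository, for example the URL of your remote, your email address, your user name and so on. Whenever you run "git config ..." in the console without --global or --system, this is the file that changes.
- description: used by gitweb (git's bundled web interface, a kind of ancestor of GitHub) to display the description of the repository.
- hooks: this is an interesting feature. Git ships with a set of scripts that you can have run automatically at each meaningful phase of git's work (the bundled .sample files are inactive until you remove the .sample suffix). These scripts, called hooks, can run before or after a commit, rebase, pull and so on. The name of a script tells you when it is run. For example, a useful pre-push hook might check that all style rules are respected, to keep the remote repository consistent.
- info/exclude: you can list the files you do not want git to deal with in a .gitignore file. The exclude file does the same job, except that it is not shared, which is handy when, for example, you do not want to track your own IDE-specific configuration files. In most cases .gitignore is enough, though (tell me in the comments if you use this one).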
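To make the hooks entry more concrete, here is a minimal sketch of a hook written in Python. It assumes a hypothetical style rule (no trailing whitespace) and is meant to be saved as an executable file named .git/hooks/pre-commit; the function names are invented for this illustration, and for simplicity it reads the files from the working directory rather than the exact staged content.

```python
#!/usr/bin/env python3
import subprocess


def find_trailing_whitespace(text):
    """Return the 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def staged_files():
    """Ask git for the paths added, copied or modified in the next commit."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def main():
    """Return 1 if any staged file breaks the (hypothetical) style rule."""
    failed = False
    for path in staged_files():
        with open(path, encoding="utf-8", errors="replace") as handle:
            bad_lines = find_trailing_whitespace(handle.read())
        if bad_lines:
            failed = True
            print(f"{path}: trailing whitespace on lines {bad_lines}")
    # git aborts the commit when a pre-commit hook exits with a non-zero status.
    return 1 if failed else 0


# When installed as a hook, end the file with:  raise SystemExit(main())
print(find_trailing_whitespace("ok\nbad \n\tfine\nalso bad\t\n"))  # [2, 4]
```

The check itself is kept in a pure function so it can be tried out without a repository; only main() talks to git.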
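The truth about commit
Every time you create a file and track it, git compresses it and stores it in its own data structure. The compressed object gets a unique name, a hash, and is stored under the objects directory.
Before exploring the objects directory, we should ask ourselves what a commit actually is. A commit can roughly be seen as a snapshot of your working directory, but it is also more than just a snapshot.
In fact, when you commit, git does only two things to create the snapshot of your working directory:
- If a file has not changed, git just puts the name of its compressed file (the hash) into the snapshot.
- If a file has changed, git compresses it and stores the compressed file in the objects directory, then puts the name of the compressed file (the hash) into the snapshot.
This is only a simplification; the whole process is a bit more complicated and will be covered in a future post.
Once the snapshot is created, it is itself compressed and named with a hash. So where do all these compressed objects live? In the objects directory.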
    ├── 4c
    │   └── f44f1e3fe4fb7f8aa42138c324f63f5ac85828  // hash
    ├── 86
    │   └── 550c31847e518e1927f95991c949fc14efc711  // hash
    ├── e6
    │   └── 9de29bb2d1d6434b8b29ae775ad8c2e48c5391  // hash
    ├── info  // let's ignore that
    └── pack  // let's ignore that too
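This is what the objects directory looks like after I created an empty file, file_1.txt, and committed it. Note that if the hash of your file is "89faaee...", git stores the file in a directory named "89" and names the file "faaee...".
You will see three hashes. One corresponds to file_1.txt and another to the snapshot created at commit time. What about the third? It exists because the commit itself is also an object, which is likewise compressed and stored in the objects directory.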
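The naming and storage scheme described so far can be reproduced without git at all. The sketch below is written in Python (chosen here so it runs without a git installation) and assumes the classic SHA-1 loose-object format: the hash is taken over a small header plus the raw content, and the zlib-compressed result is written under a two-character directory. The function names and the temporary directory are illustrative only.

```python
import hashlib
import os
import tempfile
import zlib


def object_name(obj_type, data):
    """Hash the header '<type> <size>' plus a NUL byte, then the raw content."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def store_object(objects_dir, obj_type, data):
    """Write a zlib-compressed loose object and return its hash."""
    name = object_name(obj_type, data)
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    # The first two hex digits become the directory, the rest the file name.
    directory = os.path.join(objects_dir, name[:2])
    path = os.path.join(directory, name[2:])
    if not os.path.exists(path):  # unchanged content is never stored twice
        os.makedirs(directory, exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(zlib.compress(header + data))
    return name


objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
empty_blob = store_object(objects_dir, "blob", b"")
print(empty_blob)                 # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(os.listdir(objects_dir))    # ['e6']
```

The printed hash is the same "e69de29..." that appears in the listing, which is what you would expect for an empty file: identical content always gets an identical name.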
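For now, what you need to remember is that a commit is made of four things:
- the hash of the working directory snapshot
- the commit message
- information about the committer
- the hash of the parent commit
If we decompress a commit, you can see for yourself what is in it: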
    # by looking at the history you can easily find your commit hash
    # you also don't have to paste the whole hash, only enough
    # characters to make the hash unique
    git cat-file -p 4cf44f1e3fe4fb7f8aa42138c324f63f5ac85828
This is what I get:

    tree 86550c31847e518e1927f95991c949fc14efc711
    author Pierre De Wulf <test@gmail.com> 1455775173 -0500
    committer Pierre De Wulf <test@gmail.com> 1455775173 -0500

    commit A
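As you can see, we get what we expected: the hash of the snapshot, the author, and the commit message. Two things are important here:
- As expected, the snapshot hash "86550..." is also an object and can be found in the objects directory.
- Because this was my first commit, there is no parent.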
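The structure of that commit text is simple enough to pick apart by hand. Here is a small Python sketch that splits it into header fields and message; parse_commit is a name made up for this example, and it assumes the plain layout shown above (headers, one blank line, then the message), ignoring multi-line headers such as signatures.

```python
def parse_commit(text):
    """Split the output of `git cat-file -p <commit>` into headers and message."""
    head, _, message = text.partition("\n\n")
    headers = {"parent": []}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            headers["parent"].append(value)  # merge commits have several parents
        else:
            headers[key] = value
    return headers, message.strip()


sample = (
    "tree 86550c31847e518e1927f95991c949fc14efc711\n"
    "author Pierre De Wulf <test@gmail.com> 1455775173 -0500\n"
    "committer Pierre De Wulf <test@gmail.com> 1455775173 -0500\n"
    "\n"
    "commit A\n"
)
headers, message = parse_commit(sample)
print(headers["tree"])    # 86550c31847e518e1927f95991c949fc14efc711
print(headers["parent"])  # [] because a first commit has no parent
print(message)            # commit A
```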
So what is actually in my snapshot?

    git cat-file -p 86550c31847e518e1927f95991c949fc14efc711
    100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    file_1.txt
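And here we find the last of the objects we saw earlier, the only one that sits inside our snapshot. It is a blob (a binary large object holding the file's content), and we will not dig into it any further here.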
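The listing above is the pretty-printed form of the snapshot. On disk, a tree object is a binary sequence of entries, each one a mode, a space, a name, a NUL byte and the 20 raw bytes of the hash. The following Python sketch builds and re-reads such a body under that assumption; build_tree and parse_tree are illustrative names, and sorting by plain name is a simplification of git's real ordering rules for directories.

```python
def build_tree(entries):
    """Serialize (mode, name, hex_hash) tuples into a raw tree object body."""
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda entry: entry[1]):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_hash)  # 20 raw bytes, not 40 hex digits
    return body


def parse_tree(body):
    """Inverse of build_tree: return a list of (mode, name, hex_hash) tuples."""
    entries = []
    position = 0
    while position < len(body):
        space = body.index(b" ", position)
        nul = body.index(b"\x00", space)
        mode = body[position:space].decode()
        name = body[space + 1:nul].decode()
        hex_hash = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, hex_hash))
        position = nul + 21
    return entries


entries = [("100644", "file_1.txt", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")]
body = build_tree(entries)
print(len(body))         # 38
print(parse_tree(body))  # the same single entry we started with
```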
Branches, tags and HEAD are all the same family
So now you know that every object in git has a unique hash. Now let's look at HEAD! What is inside HEAD?

    cat HEAD
    ref: refs/heads/master
So HEAD is not a hash, which makes sense, because HEAD can be seen as a pointer to the branch you are currently on. If we look at refs/heads/master, this is what we find:

    cat refs/heads/master
    4cf44f1e3fe4fb7f8aa42138c324f63f5ac85828
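Look familiar? Yes, it is exactly the hash of our first commit. This shows that branches and tags are nothing more than pointers to a commit. Once you understand this, you can delete any branch or tag you like and the commits they pointed to will still be there, only harder to reach (at least until git's garbage collection eventually prunes commits that nothing references). If you want to know more about this, see the git book.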
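The two cat commands above amount to following one pointer. A minimal Python sketch of that lookup, assuming the simple case where the branch is stored as a loose file under refs/ (it does not read packed-refs), could look like this; resolve_head is a name invented for the example.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Follow HEAD to the commit hash it ultimately points at."""
    with open(os.path.join(git_dir, "HEAD")) as handle:
        content = handle.read().strip()
    if not content.startswith("ref: "):
        return content  # detached HEAD: the file already holds a hash
    ref_path = os.path.join(git_dir, *content[len("ref: "):].split("/"))
    if not os.path.exists(ref_path):
        return None     # no loose ref file (no commit yet, or the ref is packed)
    with open(ref_path) as handle:
        return handle.read().strip()


# Rebuild the article's situation in a throwaway directory.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as handle:
    handle.write("ref: refs/heads/master\n")
before = resolve_head(git_dir)
print(before)           # None: nothing committed yet
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as handle:
    handle.write("4cf44f1e3fe4fb7f8aa42138c324f63f5ac85828\n")
after = resolve_head(git_dir)
print(after)            # 4cf44f1e3fe4fb7f8aa42138c324f63f5ac85828
```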
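Conclusion
By now you should understand that what git does when you commit is "zip" your current working directory and store it in the objects directory along with some other information. But if you know git well enough, you know that you have complete control over which files go into a commit and which do not.
What I mean is that a commit is not really a snapshot of your current working directory, but a snapshot of the files you want to commit. And where does git keep the files you want to commit before the commit happens? In the index file.
About the index file:
The following is excerpted from git/Documentation/technical/index-format.txt: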
Git index format
================

== The Git index file has the following format

  All binary numbers are in network byte order. Version 2 is described
  here unless stated otherwise.

   - A 12-byte header consisting of

     4-byte signature:
       The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")

     4-byte version number:
       The current supported versions are 2, 3 and 4.

     32-bit number of index entries.

   - A number of sorted index entries (see below).

   - Extensions

     Extensions are identified by signature. Optional extensions can
     be ignored if Git does not understand them.

     Git currently supports cached tree and resolve undo extensions.

     4-byte extension signature. If the first byte is 'A'..'Z' the
     extension is optional and can be ignored.

     32-bit size of the extension

     Extension data

   - 160-bit SHA-1 over the content of the index file before this
     checksum.

== Index entry

  Index entries are sorted in ascending order on the name field,
  interpreted as a string of unsigned bytes (i.e. memcmp() order, no
  localization, no special casing of directory separator '/'). Entries
  with the same name are sorted by their stage field.

  32-bit ctime seconds, the last time a file's metadata changed
    this is stat(2) data

  32-bit ctime nanosecond fractions
    this is stat(2) data

  32-bit mtime seconds, the last time a file's data changed
    this is stat(2) data

  32-bit mtime nanosecond fractions
    this is stat(2) data

  32-bit dev
    this is stat(2) data

  32-bit ino
    this is stat(2) data

  32-bit mode, split into (high to low bits)

    4-bit object type
      valid values in binary are 1000 (regular file), 1010 (symbolic link)
      and 1110 (gitlink)

    3-bit unused

    9-bit unix permission. Only 0755 and 0644 are valid for regular files.
    Symbolic links and gitlinks have value 0 in this field.

  32-bit uid
    this is stat(2) data

  32-bit gid
    this is stat(2) data

  32-bit file size
    This is the on-disk size from stat(2), truncated to 32-bit.

  160-bit SHA-1 for the represented object

  A 16-bit 'flags' field split into (high to low bits)

    1-bit assume-valid flag

    1-bit extended flag (must be zero in version 2)

    2-bit stage (during merge)

    12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
    is stored in this field.

  (Version 3 or later) A 16-bit field, only applicable if the
  "extended flag" above is 1, split into (high to low bits).

    1-bit reserved for future

    1-bit skip-worktree flag (used by sparse checkout)

    1-bit intent-to-add flag (used by "git add -N")

    13-bit unused, must be zero

  Entry path name (variable length) relative to top level directory
    (without leading slash). '/' is used as path separator. The special
    path components ".", ".." and ".git" (without quotes) are disallowed.
    Trailing slash is also disallowed.

    The exact encoding is undefined, but the '.' and '/' characters
    are encoded in 7-bit ASCII and the encoding cannot contain a NUL
    byte (iow, this is a UNIX pathname).

  (Version 4) In version 4, the entry path name is prefix-compressed
    relative to the path name for the previous entry (the very first
    entry is encoded as if the path name for the previous entry is an
    empty string). At the beginning of an entry, an integer N in the
    variable width encoding (the same encoding as the offset is encoded
    for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
    by a NUL-terminated string S. Removing N bytes from the end of the
    path name for the previous entry, and replacing it with the string S
    yields the path name for this entry.

  1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
  while keeping the name NUL-terminated.

  (Version 4) In version 4, the padding after the pathname does not
  exist.

  Interpretation of index entries in split index mode is completely
  different. See below for details.

== Extensions

=== Cached tree

  Cached tree extension contains pre-computed hashes for trees that can
  be derived from the index. It helps speed up tree object generation
  from index for a new commit.

  When a path is updated in index, the path must be invalidated and
  removed from tree cache.

  The signature for this extension is { 'T', 'R', 'E', 'E' }.

  A series of entries fill the entire extension; each of which
  consists of:

  - NUL-terminated path component (relative to its parent directory);

  - ASCII decimal number of entries in the index that is covered by the
    tree this entry represents (entry_count);

  - A space (ASCII 32);

  - ASCII decimal number that represents the number of subtrees this
    tree has;

  - A newline (ASCII 10); and

  - 160-bit object name for the object that would result from writing
    this span of index as a tree.

  An entry can be in an invalidated state and is represented by having
  a negative number in the entry_count field. In this case, there is no
  object name and the next entry starts immediately after the newline.
  When writing an invalid entry, -1 should always be used as entry_count.

  The entries are written out in the top-down, depth-first order. The
  first entry represents the root level of the repository, followed by the
  first subtree--let's call this A--of the root level (with its name
  relative to the root level), followed by the first subtree of A (with
  its name relative to A), ...

=== Resolve undo

  A conflict is represented in the index as a set of higher stage entries.
  When a conflict is resolved (e.g. with "git add path"), these higher
  stage entries will be removed and a stage-0 entry with proper resolution
  is added.

  When these higher stage entries are removed, they are saved in the
  resolve undo extension, so that conflicts can be recreated (e.g. with
  "git checkout -m"), in case users want to redo a conflict resolution
  from scratch.

  The signature for this extension is { 'R', 'E', 'U', 'C' }.

  A series of entries fill the entire extension; each of which
  consists of:

  - NUL-terminated pathname the entry describes (relative to the root of
    the repository, i.e. full pathname);

  - Three NUL-terminated ASCII octal numbers, entry mode of entries in
    stage 1 to 3 (a missing stage is represented by "0" in this field);
    and

  - At most three 160-bit object names of the entry in stages from 1 to 3
    (nothing is written for a missing stage).

=== Split index

  In split index mode, the majority of index entries could be stored
  in a separate file. This extension records the changes to be made on
  top of that to produce the final index.

  The signature for this extension is { 'l', 'i', 'n', 'k' }.

  The extension consists of:

  - 160-bit SHA-1 of the shared index file. The shared index file path
    is $GIT_DIR/sharedindex.<SHA-1>. If all 160 bits are zero, the
    index does not require a shared index file.

  - An ewah-encoded delete bitmap, each bit represents an entry in the
    shared index. If a bit is set, its corresponding entry in the
    shared index will be removed from the final index. Note, because
    a delete operation changes index entry positions, but we do need
    original positions in replace phase, it's best to just mark
    entries for removal, then do a mass deletion after replacement.

  - An ewah-encoded replace bitmap, each bit represents an entry in
    the shared index. If a bit is set, its corresponding entry in the
    shared index will be replaced with an entry in this index
    file. All replaced entries are stored in sorted order in this
    index. The first "1" bit in the replace bitmap corresponds to the
    first index entry, the second "1" bit to the second entry and so
    on. Replaced entries may have empty path names to save space.

  The remaining index entries after replaced ones will be added to the
  final index. These added entries are also sorted by entry name then
  stage.

== Untracked cache

  Untracked cache saves the untracked file list and necessary data to
  verify the cache. The signature for this extension is { 'U', 'N',
  'T', 'R' }.

  The extension starts with

  - A sequence of NUL-terminated strings, preceded by the size of the
    sequence in variable width encoding. Each string describes the
    environment where the cache can be used.

  - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from
    ctime field until "file size".

  - Stat data of core.excludesfile

  - 32-bit dir_flags (see struct dir_struct)

  - 160-bit SHA-1 of $GIT_DIR/info/exclude. Null SHA-1 means the file
    does not exist.

  - 160-bit SHA-1 of core.excludesfile. Null SHA-1 means the file does
    not exist.

  - NUL-terminated string of per-dir exclude file name. This usually
    is ".gitignore".

  - The number of following directory blocks, variable width
    encoding. If this number is zero, the extension ends here with a
    following NUL.

  - A number of directory blocks in depth-first-search order, each
    consists of

    - The number of untracked entries, variable width encoding.

    - The number of sub-directory blocks, variable width encoding.

    - The directory name terminated by NUL.

    - A number of untracked file/dir names terminated by NUL.

  The remaining data of each directory block is grouped by type:

  - An ewah bitmap, the n-th bit marks whether the n-th directory has
    valid untracked cache entries.

  - An ewah bitmap, the n-th bit records "check-only" bit of
    read_directory_recursive() for the n-th directory.

  - An ewah bitmap, the n-th bit indicates whether SHA-1 and stat data
    is valid for the n-th directory and exists in the next data.

  - An array of stat data. The n-th data corresponds with the n-th
    "one" bit in the previous ewah bitmap.

  - An array of SHA-1. The n-th SHA-1 corresponds with the n-th "one" bit
    in the previous ewah bitmap.

  - One NUL.
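
To tie the excerpt back to something you can try, here is a short Python sketch that decodes the 12-byte header and the bit-packed mode and flags fields exactly as the format above lays them out. The function names are made up for this example, and the sample header is built by hand rather than read from a real repository.

```python
import struct


def parse_index_header(data):
    """Decode the 12-byte header: signature, version, number of entries."""
    # ">" selects network (big-endian) byte order, as the format requires.
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    if version not in (2, 3, 4):
        raise ValueError(f"unsupported index version {version}")
    return version, entry_count


def split_mode(mode):
    """Split the mode field into (4-bit object type, 9-bit unix permission)."""
    object_type = (mode >> 12) & 0b1111
    permission = mode & 0o777
    return object_type, permission


def split_flags(flags):
    """Split the 16-bit flags field, from high to low bits."""
    return {
        "assume_valid": bool((flags >> 15) & 1),
        "extended": bool((flags >> 14) & 1),
        "stage": (flags >> 12) & 0b11,
        "name_length": flags & 0xFFF,
    }


header = struct.pack(">4sII", b"DIRC", 2, 1)  # hand-built sample header
print(parse_index_header(header))             # (2, 1)
print(split_mode(0o100644))                   # (8, 420), i.e. 0b1000 and 0o644
print(split_flags(0x000A)["name_length"])     # 10
```

To look at a real repository, you could pass the first bytes of its .git/index file (opened in binary mode) to parse_index_header instead of the sample header.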