How are git branches and tags stored on disk?

I recently cloned one of my git repositories, which has over 10,000 branches and over 30,000 tags. The total repository size after the fresh clone is about 12 GB. I am sure there is no reason to have 10,000 branches, so I suspect they occupy a significant amount of disk space. My questions are these:

  • How are branches and tags stored on disk? For example, what data structure is used, and what information is stored for each branch?
  • How can I get metadata about a branch? For example, when was the branch created, and how much space does it take up?
3 answers

All git references (branches, tags, notes, stashes, etc.) use the same system. It consists of:

  • the references themselves, and
  • "reflogs"

Reflogs are stored in .git/logs/refs/ based on the reference name, with one exception: reflogs for HEAD are stored in .git/logs/HEAD , not .git/logs/refs/HEAD .

References come either "loose" or "packed". Packed refs are stored in .git/packed-refs , which is a flat file of (SHA-1, refname) pairs for simple refs, plus extra information for annotated tags. "Loose" refs are stored in .git/refs/name . These files contain either a raw SHA-1 (probably the most common case) or the literal string ref: followed by the name of another reference, for symbolic refs (usually only HEAD , but you can create others). Symbolic refs are not packed (or at least I cannot make that happen :-)).

Packing tags and dormant branches (those that are not being actively updated) saves space and time. You can use git pack-refs for this. However, git gc runs git pack-refs for you, so you usually do not need to do it yourself.
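
For illustration, here is a rough shell sketch of what this looks like on disk (the branch name master is just an example; your ref names will differ):

    # A loose ref is just a small text file holding a SHA-1
    cat .git/refs/heads/master       # e.g. 3f21a6b... (assuming a branch named "master" exists)

    # HEAD is usually a symbolic ref
    cat .git/HEAD                    # ref: refs/heads/master

    # Pack the loose refs by hand (git gc normally does this for you)
    git pack-refs --all
    head .git/packed-refs            # "<sha1> refs/heads/master" lines; "^<sha1>" lines are peeled annotated tags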


So, I will go a little off topic here and explain how Git stores things in general. This will explain what information is stored, and what exactly matters for the size of the repository. As a fair warning: this answer is quite long :)

Git objects

Git is essentially a database of objects. These objects come in four different types, and they are all identified by the SHA1 hash of their contents. The four types are blobs, trees, commits, and tags.

Blob

A blob is the simplest type of object. It stores the contents of a file. So for every file content stored in your Git repository, there is one blob in the object database. Since it stores only the contents of a file, and no metadata such as file names, it is also the mechanism that prevents multiple files with identical contents from being stored more than once.
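
As a small illustration (the file name and contents are made up), you can create and inspect a blob directly:

    echo 'hello world' > greeting.txt
    sha=$(git hash-object -w greeting.txt)   # -w writes the blob into .git/objects
    git cat-file -t "$sha"                   # prints: blob
    git cat-file -p "$sha"                   # prints: hello world

    # An identical file produces the identical blob, so the content is stored only once
    cp greeting.txt copy.txt
    git hash-object copy.txt                 # same hash as above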

Tree

Going one level up, a tree is an object that arranges blobs into a directory structure. A single tree corresponds to one directory. It is essentially a list of files and subdirectories; each entry contains a file mode, the name of the file or directory, and a reference to the Git object that belongs to that entry. For subdirectories, this reference points to the tree object describing the subdirectory; for files, it points to the blob that stores the file contents.
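
A quick sketch of inspecting a tree; the hashes and file names shown in the comments are invented example output:

    # Show the tree of the current commit's root directory
    git cat-file -p 'HEAD^{tree}'
    # 100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    README.md
    # 040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src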

Commit

Blobs and trees are already enough to represent a complete file system. To add history on top of that, we have commit objects. A commit object is created whenever you commit something in Git. Each commit is a snapshot in the version history.

It contains a reference to the tree object that describes the root directory of the repository. This also means that every commit that actually introduces changes requires at least one new tree object (most likely more).

A commit also contains a reference to its parent commit(s). Usually there is exactly one parent (for a linear history), but a commit can have any number of parents, in which case it is usually called a merge commit. Most workflows will only ever make you merge two parents, but you can really have any number.

Finally, a commit also contains the metadata you would expect from a commit: author and committer (name and time) and, of course, the commit message.
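
For example, you can print a raw commit object and see exactly these fields; the hashes, name, and message below are placeholders:

    git cat-file -p HEAD
    # tree 9bedf67800b2923982bdf60c89c57ce6fd2e9213
    # parent 3f21a6b0e3a6e3a0f1f8e5b1c4d9d2a7e0c1b2a3
    # author Alice Example <alice@example.com> 1700000000 +0000
    # committer Alice Example <alice@example.com> 1700000000 +0000
    #
    # Fix the frobnicator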

This is all that is needed for a complete version control system; but of course there is another type of object:

Tag

Tag objects are one way to store tags. To be precise, tag objects store annotated tags, which are tags that, like commits, carry some meta information. They are created using git tag -a (or when creating a signed tag) and require a tag message. They also contain a reference to the commit object that they point to, and a tagger (name and time).
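
A minimal sketch; the tag name and message are made up:

    git tag -a v1.0 -m 'Release 1.0'
    git cat-file -t v1.0      # prints: tag   (the annotated tag object itself)
    git cat-file -p v1.0
    # object <sha1 of the tagged commit>
    # type commit
    # tag v1.0
    # tagger Alice Example <alice@example.com> 1700000000 +0000
    #
    # Release 1.0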

References

So far, we have a complete version control system with annotated tags, but all of our objects are identified by their SHA1 hashes. That is of course a little annoying, so we have one more thing to make life easier: references.

References come in different flavors, but the most important thing about them is this: they are simple text files containing 40 characters, the SHA1 hash of the object they point to. Because they are so simple, they are very cheap, so having many references is not a problem. They create no real overhead, and there is no reason not to use them.

There are usually three kinds of references: branches, tags, and remote-tracking branches. They all work the same way, and they all point to commit objects, with the exception of annotated tags, which point to tag objects (lightweight tags point directly to commits). The difference between them is how you create them and in which subpath of refs/ they are stored. I will not go into that, since it is explained in almost every Git tutorial; just remember: references, i.e. branches, are extremely cheap, so feel free to create them for almost anything.
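
To see just how cheap they are, here is a rough sketch (the branch name is hypothetical):

    git branch experiment
    cat .git/refs/heads/experiment       # the 40-character SHA-1 of the commit it points to
    wc -c < .git/refs/heads/experiment   # 41 bytes: the hash plus a newline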

Compression

Now, since torek mentioned something about Git's compression in his answer, I want to clarify that a bit. Unfortunately, he mixed things up a little.

So, initially, for new repositories, all Git objects are stored in .git/objects as files named after their SHA1 hash. The first two characters are taken off the file name and used to partition the files into multiple folders, just to make things a bit easier for the file system.

At some point, when the history grows bigger or when it is triggered by something else, Git will start to compress objects. It does this by packing several objects into a single pack file. How exactly that works does not really matter here; it reduces the number of individual Git objects and stores them efficiently in single, indexed archives (this is where Git applies delta compression, by the way). The pack files are then stored in .git/objects/pack and can easily reach several hundred megabytes.
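
You can see how much of a repository is loose versus packed with something like this:

    git count-objects -v    # "count"/"size" cover loose objects, "in-pack"/"size-pack" cover pack files
    ls .git/objects/pack/   # pack-*.pack files and their .idx indexes
    git gc                  # repacks loose objects (and packs refs) for you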

For references, the situation is somewhat similar, although much simpler. All current references are stored in .git/refs , e.g. branches in .git/refs/heads , tags in .git/refs/tags and remote-tracking branches in .git/refs/remotes/<remote> . As mentioned above, these are simple text files containing only the 40-character identifier of the object they point to.

At some point, Git will move older references of any type into a single lookup file: .git/packed-refs . This file is just a long list of hashes and reference names, one entry per line. References that are stored there are removed from the refs directory.

Reflogs

Torek also mentioned these already. Reflogs keep track of what happens to references. Whenever you do something that affects a reference (commit, checkout, reset, etc.), a new log entry is added to record what happened. This also gives you a way to go back after you have done something wrong. A common use case is looking at the reflog after accidentally resetting a branch to somewhere it should not go. You can then use git reflog to view the log and see where the reference pointed before. Since Git objects that become unreachable are not deleted right away (and objects that are part of the history are never deleted), you can usually restore the previous situation easily.

Reflogs, however, are local: they only track what happens in your local repository. They are not shared with remotes and are never transferred. A freshly cloned repository will have a reflog with a single entry, the clone itself. Reflogs are also limited to a certain length, after which older entries are pruned, so they will not become a storage problem.
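
A small sketch of using the reflog for recovery; the branch name and the @{1} position are assumptions, so check the reflog output first:

    git reflog                   # the reflog of HEAD
    git reflog show somebranch   # the reflog of a specific branch
    # If the last operation moved HEAD somewhere wrong, go back to where it was before:
    git reset --hard 'HEAD@{1}'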

Some final words

So, back to your actual question. When you clone a repository, Git will usually transfer the objects already in packed format. This is done to save transfer time. References are very cheap, so they are never the cause of a large repository. However, due to the nature of Git, a single current commit object drags an entire acyclic graph behind it that ultimately reaches back to the very first commit, the very first tree and the very first blob. So the repository always contains all the information for all versions, and that is what makes repositories with a long history big. Unfortunately, there is not really anything you can do about that. Well, you could cut off the older history, but that leaves you with an incomplete, shallow repository (you do this by cloning with --depth ).
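
For completeness, a sketch of such a shallow clone; the URL is a placeholder:

    git clone --depth 1 https://example.com/big-repo.git
    # and later, if you change your mind and want the full history:
    git fetch --unshallow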

And as for your second question: as I explained above, branches are just references to commits, and references are just pointers to Git objects. So no, there is no per-branch metadata you can get out of them. The only thing that can give you an idea is the first commit at which the branch forked off in your history. But the existence of a branch reference does not automatically mean that there really is a branching structure in the history (fast-forward merges and rebasing work against that), and just because the history forks somewhere does not mean that a branch (reference, pointer) still exists for it.
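
If you still want an approximation, here is a hedged sketch: locally, the oldest reflog entry for a branch is roughly when that ref appeared in your clone, and the merge base shows where it forked off. The branch names are hypothetical, and the reflog only covers your local repository:

    git reflog show --date=iso somebranch | tail -1   # the oldest local reflog entry for that branch
    git merge-base master somebranch                  # the commit where the two histories diverge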


Note: as for pack-refs , the process of creating them should be much faster with Git 2.2+ (November 2014).

See commit 9540ce5 by Jeff King ( peff ):

refs: write packed_refs file using stdio

We write each line of the new packed-refs file individually, using a write() syscall (and sometimes two, if the ref is peeled). Since each line is only about 50-100 bytes long, this creates a lot of system call overhead.

We can instead open a stdio handle around our descriptor and use fprintf to write to it. The extra buffering is not a problem for us, because nobody will read our new packed-refs file until we call commit_lock_file (by which point we have flushed everything).

On a pathological repository with 8.5 million refs, this dropped the time to run git pack-refs from 20 seconds to 6 seconds.


September 2016 update: Git 2.11+ also improves how tags are handled in packed-refs (see " chain tags and git clone --single-branch --branch tag ").

And that same Git 2.11 will now make full use of the pack bitmap index file.

See commit 645c432 , commit 702d1b9 (September 10, 2016) by Kirill Smelkov ( navytux ).
Helped: Jeff King ( peff ) .
(Merged by Junio C Hamano -- gitster -- in commit 7f109ef , 21 Sep 2016)

pack-objects : use reachability bitmap index when generating non-stdout pack

Pack bitmaps were introduced in Git 2.0 ( commit 6b8fda2 , December 2013), based on Google's work for JGit.

We use the bitmap API to perform the Counting Objects phase in pack-objects, rather than a traditional walk through the object graph.

Now (2016):

Starting with 6b8fda2 (pack-objects: use bitmaps when packing objects), if the repository has a bitmap index, pack-objects can nicely speed up the "Counting objects" phase of the object graph walk.
However, this was only done for the case when the resulting pack is sent to stdout , not written into a file .

You might want to generate pack files on disk, for example for specialized object transfer.
It would be helpful to have some way to override this heuristic:
to tell pack-objects that even though it needs to generate a pack file on disk, it can still use the reachability bitmaps to do the traversal.
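
If you want to experiment with this on your own repository, here is a hedged sketch; the config key and flag exist in modern Git, but exact behavior varies by version:

    git config repack.writeBitmaps true   # write a bitmap index alongside full repacks
    git repack -A -d -b                   # -b / --write-bitmap-index builds the bitmap for the new single pack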


Note: Git 2.12 illustrates that using bitmaps has a side effect on git gc --auto .

See commit 1c409a7 , commit bdf56de (December 28, 2016) by David Turner ( csusbdt ) .
(Merged by Junio C Hamano -- gitster -- in commit cf417e2 , 18 Jan 2017)

Bitmap indexes only work with single packs, so requesting an incremental repack with bitmap indexes makes no sense.

Incremental repacks are incompatible with bitmap indexes


Git 2.14 refines pack-objects

See commit da5a1f8 , commit 9df4a60 (May 09, 2017) by Jeff King ( peff ) .
(Merged by Junio C Hamano -- gitster -- in commit 137a261 , 29 May 2017)

pack-objects : disable pack reuse for object-selection options

If certain options like --honor-pack-keep , --local , or --incremental are used with pack-objects, then we need to feed each potential object to want_object_in_pack() to see if it should be filtered out.
But when the bitmap reuse_packfile optimization is in effect, we do not call that function at all, and in fact skip adding the objects to the to_pack list entirely.

This means we have a bug: for certain requests we will silently ignore those options and include objects in the pack that should not be there.

The problem has existed since the pack-reuse code was introduced in 6b8fda2 (pack-objects: use bitmaps when packing objects, 2013-12-21), but it was unlikely to come up in practice.
These options are typically used for on-disk packs, not for transfer packs (which go to stdout ), but we never allowed pack reuse for non- stdout packs (before 645c432 , we did not even use the bitmaps, which the reuse optimization relies on; after that, we explicitly disabled reuse when not packing to stdout ).

