git gc --aggressive vs git repack

I am looking for ways to reduce the size of a git repository. Searching leads me to git gc --aggressive most of the time. I have also read that this is not the preferred approach.

Why? What should I know before running gc --aggressive ?

git repack -a -d --depth=250 --window=250 is recommended instead of gc --aggressive . Why? How does repack reduce the repository size? Also, I don't quite understand --depth and --window .

What should I choose between gc and repack ? When should each be used?

+70
git version-control github
Feb 25 '15 at 13:23
5 answers

Currently there is no difference: git gc --aggressive works according to the proposal Linus made in 2007; see below. As of version 2.11 (Q4 2016), git defaults the aggressive depth to 50. A window of 250 is fine, because it scans a larger section of each object, but a depth of 250 is bad, because it makes every chain refer to very deep old objects, which slows down all future git operations for marginally lower disk usage.
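As a rough, hypothetical illustration of that window/depth trade-off, you can repack a throwaway clone with different explicit values and compare the resulting pack sizes reported by git count-objects -v (the repo contents and numbers here are only examples, not a benchmark):

```shell
#!/bin/sh
# Sketch: measure the pack-size effect of --window/--depth on a throwaway
# repo. The numbers are illustrative; real repositories behave differently.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
# Build a little history so there is something to delta-compress.
i=1
while [ "$i" -le 50 ]; do
  seq 1 "$i" > data.txt
  git add data.txt && git commit -qm "rev $i"
  i=$((i + 1))
done
git repack -a -d -f --window=10 --depth=10
shallow=$(git count-objects -v | awk '/size-pack/ {print $2}')
git repack -a -d -f --window=250 --depth=250
deep=$(git count-objects -v | awk '/size-pack/ {print $2}')
echo "window=10/depth=10   -> ${shallow} KiB"
echo "window=250/depth=250 -> ${deep} KiB"
```

On a toy repo like this the difference is tiny; the trade-off only becomes visible on repositories with long histories.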




Historical note

Linus suggested (see the full mailing-list post below) using git gc --aggressive only when you have, in his words, a "really bad pack" or "really horribly bad deltas"; however, "almost always, in other cases, it's actually really bad to use". The result may even leave your repository in worse condition than when you started!

The command he proposes to run right after importing "a long and involved history" is

 git repack -a -d -f --depth=250 --window=250 

But this assumes that you have already removed unwanted cruft from your repository history and that you have followed the checklist for shrinking a repository found in the git filter-branch documentation.

git filter-branch can be used to get rid of a subset of files, usually with some combination of --index-filter and --subdirectory-filter . People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because Git tries hard not to lose your objects until you tell it to. First, make sure that:

  • You really removed all variants of a filename, if a blob was moved over its lifetime: git log --name-only --follow --all -- filename can help you find renames.

  • You really filtered all refs: use --tag-name-filter cat -- --all when calling git filter-branch .
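A minimal sketch of such a rewrite, assuming a hypothetical file name ("secrets.txt") and using the --tag-name-filter cat -- --all form from the checklist above (the environment variable only silences the deprecation pause modern Git adds to filter-branch):

```shell
#!/bin/sh
# Sketch: rewrite history with git filter-branch, dropping a hypothetical
# file ("secrets.txt") from every commit and rewriting tags as well.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
echo hunter2 > secrets.txt
echo hello > keep.txt
git add . && git commit -qm initial
git tag v1
# Drop secrets.txt from every commit on every ref, and rewrite tags too.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch secrets.txt' \
  --tag-name-filter cat -- --all
```

After this, secrets.txt is gone from every rewritten commit, and the tag v1 points at the rewritten history; the original refs survive under refs/original/ until you clean them up as described below.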

Then there are two ways to get a smaller repository. The safer way is to make a clone; that keeps your original intact.

  • Clone it with git clone file:///path/to/repo . The clone will not have the removed objects. See git clone . (Note that cloning with a plain path just hardlinks everything!)

If for some reason you really do not want to clone it, check the items below instead (in this order). This is a much more destructive approach, so make a backup or go back to cloning. You have been warned.

  • Remove the original refs backed up by git filter-branch:

     git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d 

  • Expire all reflogs with git reflog expire --expire=now --all .

  • Garbage-collect all unreferenced objects with git gc --prune=now (or, if your git gc is not new enough to support arguments to --prune , use git repack -ad; git prune instead).
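Put together, the three destructive steps look like the sketch below, demonstrated here on a throwaway repo that has refs/original/ backups left behind by a filter-branch run (the file name is again hypothetical):

```shell
#!/bin/sh
# Sketch: the three destructive cleanup steps, in order, on a throwaway
# repo. Only ever run these on a repository you have backed up.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
echo data > blob.txt
git add . && git commit -qm one
old_blob=$(git rev-parse HEAD:blob.txt)   # the blob we expect to disappear
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch blob.txt' -- --all
# 1. Remove the refs/original/ backups left by filter-branch.
git for-each-ref --format="%(refname)" refs/original/ |
  xargs -n 1 git update-ref -d
# 2. Expire every reflog.
git reflog expire --expire=now --all
# 3. Garbage-collect everything that is now unreferenced.
git gc --prune=now
```

Once the backup refs and reflogs are gone, gc --prune=now can finally drop the old blob; before those two steps it could not, which is exactly why people see no size reduction after filter-branch alone.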




 Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
 From: Linus Torvalds <torvalds at linux-foundation dot org>
 To: Daniel Berlin <dberlin at dberlin dot org>
 cc: David Miller <davem at davemloft dot net>, ismail at pardus dot org dot tr, gcc at gcc dot gnu dot org, git at vger dot kernel dot org
 Subject: Re: Git and GCC
 In-Reply-To: <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>
 Message-ID: <alpine.LFD.0.9999.0712052132450.13796@woody.linux-foundation.org>
 References: <4aca3dc20712051947t5fbbb383ua1727c652eb25d7e@mail.gmail.com> <20071205.202047.58135920.davem@davemloft.net> <4aca3dc20712052032n521c344cla07a5df1f2c26cb8@mail.gmail.com> <20071205.204848.227521641.davem@davemloft.net> <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>

On Thursday, December 6, 2007, Daniel Berlin wrote:

Actually, it turns out that git gc --aggressive sometimes does this dumb thing to pack files, regardless of whether you converted from an SVN repository or not.

Absolutely. git --aggressive is mostly dumb. It's really only useful for the case of "I know I have a really bad pack, and I want to throw away all the bad packing decisions I have made."

To explain this, it's worth explaining (you are probably aware of it, but let me go through the basics anyway) how git delta chains work, and how they are so different from most other systems.

In other SCMs, a delta chain is usually fixed. It might be "forwards" or "backwards", and it might evolve a bit as you work with the repository, but generally it's a chain of changes to a single file represented as one SCM entity. In CVS, it's obviously the *,v file, and lots of other systems do rather similar things.

Git also does delta chains, but it does them a lot more "loosely". There is no fixed entity. Deltas are generated against any random other version that git deems to be a good delta candidate (with various fairly successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It's good for various conceptual reasons (i.e., git internally never really even needs to care about the whole revision chain; it doesn't think in terms of deltas at all), but it's also great because getting rid of the inflexible delta rules means that git, for example, has no trouble at all merging two files together: there are simply no arbitrary *,v "revision files" that have some hidden meaning.

It also means that the choice of deltas is a much more open-ended question. If you limit the delta chain to just one file, you really don't have much choice about what to do with deltas, but in git it really can be a totally different issue.

And this is where the really badly named --aggressive comes in. While git generally tries to re-use delta information (because it's a good idea, and it doesn't waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say "let's start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas."

So --aggressive is not really about being aggressive, but about wasting CPU time re-doing a decision we already made earlier!

That can sometimes be a good thing. Some import tools in particular can generate really horribly bad deltas. Anything that uses git fast-import , for example, likely doesn't have much of a delta layout, so it's worth saying "I want to start from a clean slate."

But almost always, in other cases, it's a really bad thing. It's going to waste CPU time, and especially if you had actually done a good job at deltaing earlier, the end result isn't going to re-use all those good deltas you already found, so you'll actually end up with a much worse result too!

I'll send a patch to Junio to just remove the git gc --aggressive documentation. It can be useful, but it's generally only useful when you really understand at a very deep level what it's doing, and that documentation doesn't help you do that.

Generally, doing incremental git gc is the right approach, and better than doing git gc --aggressive . It's going to re-use old deltas, and when those old deltas can't be found (the reason for doing incremental GC in the first place!) it's going to create new ones.

On the other hand, it's definitely true that an "initial import of a long and involved history" is a point where it can be worth spending a lot of time finding the really good deltas. Then, every user ever after (unless they use git gc --aggressive to undo it!) will get the advantage of that one-time event. So especially for big projects with a long history, it's probably worth doing some extra work, telling the delta-finding code to go wild.

So the equivalent of git gc --aggressive , but done properly, is to run (overnight) something like

 git repack -a -d --depth=250 --window=250 

where that --depth is just about how deep the delta chains can be (making them longer for old history is worth it, it costs a bit of space overhead), and --window is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (that is, "drop all old deltas"), since you are now actually trying to make sure this one actually finds good candidates.

And then it's going to take forever and a day (i.e., a "do it overnight" thing). But the end result is that everybody downstream from that repository will get much better packs, without having to spend any effort on it themselves.

  Linus 
+63
Feb 25 '15 at 2:05

When should I use gc & repack?

As I mentioned in " Git Garbage collection doesn't seem to be fully working ", git gc --aggressive is neither sufficient, nor even adequate, on its own.
And, as I explain below , it is often not needed.

The most effective combination would be to add git repack , but also git prune :

 git gc
 git repack -Ad      # kills in-pack garbage
 git prune           # kills loose garbage 
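To see what each step reclaims, you can watch git count-objects -v before and after; a minimal sketch on a throwaway repo:

```shell
#!/bin/sh
# Sketch: run the combination above on a throwaway repo and watch the
# object counts change with "git count-objects -v".
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
for i in 1 2 3; do
  echo "$i" > f.txt
  git add f.txt && git commit -qm "rev $i"
done
git count-objects -v   # before: every object is loose
git gc
git repack -Ad         # kills in-pack garbage
git prune              # kills loose garbage
git count-objects -v   # after: everything lives in a single pack
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
```

In the "before" output the count: line reports the loose objects; after the three commands, count: drops to 0 and packs: shows a single consolidated pack.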



Note: Git 2.11 (Q4 2016) sets the gc aggressive depth to a default value of 50.

See commit 07e7dbf (August 11, 2016) by Jeff King ( peff ) .
(Merged by Junio C Hamano ( gitster ) in commit 0952ca8 , September 21, 2016)

gc : default aggressive depth to 50

" git gc --aggressive " is used to limit the length of the delta chain to 250, which is too deep for additional space savings and adversely affects runtime performance.
The limit has been reduced to 50.

Bottom line: the current default value of 250 does not save much space and costs the CPU. This is not a good compromise.

The --aggressive flag for git-gc does three things:

  1. use " -f " to discard existing deltas and recount
  2. use "--window = 250" to search harder for deltas
  3. use "--depth = 250" to create longer delta chains

Items (1) and (2) are good matches for an "aggressive" repack.
They ask the repack to do more computational work in the hope of getting a better pack. You pay the cost during the repack, and other operations see only the benefit.

Item (3) is not so clear.
Allowing longer chains means fewer restrictions on the deltas, which means potentially finding better ones and saving some space.
But it also means that operations which access the deltas have to follow longer chains, which affects their performance.
So it's a trade-off, and it's not clear that the trade-off is even a good one.

(See the commit for the study.)

You can see that the CPU savings for regular operations improve as we decrease the depth.
But we can also see that the space savings are not that great as the depth grows. Saving 5-10% between depth 10 and depth 50 is probably worth the CPU trade-off; saving another 1% by going from 50 to 100, or another 0.5% by going from 100 to 250, probably is not.




Speaking of saving CPU, git repack learned to accept --threads=<n> and pass it to pack-objects.

See commit 40bcf31 (April 26, 2017) by Junio C Hamano ( gitster ) .
(Merged by Junio C Hamano ( gitster ) in commit 31fb6f4 , May 29, 2017)

repack: accept --threads=<n> and pass it down to pack-objects

We already do so for --window=<n> and --depth=<n> ; this will help when the user wants to force --threads=1 for reproducible testing without being affected by multi-threading.
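A minimal sketch of that reproducible, single-threaded repack (assumes a Git recent enough, 2.14 or later, for repack to know --threads):

```shell
#!/bin/sh
# Sketch: force single-threaded packing for reproducible repack results.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
echo hello > f.txt
git add f.txt && git commit -qm one
# --threads=1 is forwarded to pack-objects, like --window and --depth.
git repack -a -d --threads=1 --window=250 --depth=50
git count-objects -v
```

With more than one thread, delta selection can vary from run to run; pinning --threads=1 makes the resulting pack deterministic at the cost of speed.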

+46
Feb 25 '15 at 13:36

The problem with git gc --aggressive is that the option name and the documentation are misleading.

As Linus himself explains in this mail, what git gc --aggressive basically does is this:

While git generally tries to re-use delta information (because it's a good idea, and it doesn't waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say "let's start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas."

So there is usually no need to recompute deltas in git, since git determines those deltas very flexibly. It only makes sense if you know you have really, really bad deltas. As Linus explains, this category mainly covers repositories produced by tools that use git fast-import .

Most of the time, git does a good job of finding useful deltas, and using git gc --aggressive will leave you with deltas that may well be worse, after spending a lot of CPU time.




Linus ends his mail with the conclusion that git repack with a large --depth and --window is the better choice in most cases; especially after you have imported a big project and want to make sure git finds good deltas.

So the equivalent of git gc --aggressive , but done properly, is to run (overnight) something like

git repack -a -d --depth=250 --window=250

where that --depth is just about how deep the delta chains can be (making them longer for old history is worth the space overhead), and --window is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (that is, "drop all old deltas"), since you are now actually trying to make sure this one actually finds good candidates.

+13
Feb 25 '15 at 13:41

Caution: do not run git gc --aggressive on a repository that is not in sync with its remote unless you have backups.

This operation recreates deltas from scratch and can lead to data loss in the event of an ungraceful interruption.

On my 8 GB machine, an aggressive gc ran out of memory on a 1 GB repository with 10k small commits. When the OOM killer stopped the git process, it left me with an almost empty repository: only the working tree and a few deltas survived.

Of course, it was not the only copy of the repository, so I just recreated it and cloned from the remote again (fetching into the broken repo did not work and got stuck several times at the "resolving deltas" step when I tried it), but if your repository is a single developer's local repo with no remotes at all, back it up first.
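One cheap way to take such a backup is a mirror clone before experimenting; a sketch (the paths are examples, and the file:// URL is used deliberately so the backup copies objects instead of hardlinking them):

```shell
#!/bin/sh
# Sketch: take a full backup before trying "git gc --aggressive" on a
# repository that has no remote.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config user.email you@example.com
git config user.name you
echo hello > f.txt
git add f.txt && git commit -qm one
backup=$(mktemp -d)/backup.git
git clone -q --mirror "file://$repo" "$backup"   # full copy of every ref
git gc --aggressive --prune=now                  # now safe(r) to experiment
```

If the aggressive gc is interrupted and corrupts the repository, the mirror still holds every ref and object and can be cloned from again.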

+6
Jun 03

Note: beware of using git gc --aggressive , as Git 2.22 (Q2 2019) clarifies its documentation.

See commit 0044f77 , commit daecbf2 , commit 7384504 , commit 22d4e3b , commit 080a448 , commit 54d56f5 , commit d257e0f , commit b6a8d09 (07 Apr 2019), and commit fc559fb , commit cf9cd77 (Mar 2019) by Ævar Arnfjörð Bjarmason ( avar ).
(Merged by Junio C Hamano ( gitster ) in commit ac70c53 , April 25, 2019)

gc docs: downplay the usefulness of --aggressive

The existing gc --aggressive docs come close to recommending that users run it regularly. I have personally talked to many users who have taken these docs as advice to use this option, and it is generally a (mostly) waste of time.

So let's clarify what it actually does and let users draw their own conclusions.

Let's also clarify that "the effects [...] are persistent", to paraphrase a brief version of Jeff King 's explanation .

This means that the git-gc documentation now includes :

--aggressive

When the --aggressive option is supplied, git-repack will be invoked with the -f flag, which in turn will pass --no-reuse-delta to git-pack-objects .
This will throw away any existing deltas and re-compute them, at the expense of spending much more time on the repacking.

The effects of this are mostly persistent, e.g. when packs and loose objects are coalesced into one another, the existing deltas in that pack may get re-used, but there are also various cases where we may pick a sub-optimal delta from a newer pack instead.

Furthermore, supplying --aggressive tweaks the --depth and --window options passed to git-repack .
See gc.aggressiveDepth and gc.aggressiveWindow below.
By using a larger window size we are more likely to find more optimal deltas.

It's probably not worth using this option on a given repository without running tailored performance benchmarks on it.
It takes a lot more time, and the resulting space/delta optimization may or may not be worth it. Not using it at all is the right trade-off for most users and their repositories.
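Those two knobs can also be set per repository, so that a plain git gc --aggressive picks them up; a minimal sketch:

```shell
#!/bin/sh
# Sketch: set the per-repo knobs that "git gc --aggressive" passes down
# to repack: gc.aggressiveWindow and gc.aggressiveDepth.
set -e
repo=$(mktemp -d)
git init -q "$repo" && cd "$repo"
git config gc.aggressiveWindow 250   # search harder for deltas
git config gc.aggressiveDepth 50     # keep chains short (the 2.11+ default)
git config --get gc.aggressiveDepth
```

This lets you keep the wide delta search of --aggressive while avoiding the deep chains that hurt every later read operation.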

And ( commit 080a448 ):

gc docs: note how --aggressive impacts --window and --depth

Since 07e7dbf ( gc : default aggressive depth to 50, 2016-08-11, Git v2.10.1) we somewhat confusingly use the same depth under --aggressive as we do by default.

As noted in that commit, a deeper default depth under "aggressive" was wrong: it would trade run-time performance for a little disk saving, which is usually the opposite of what someone running an "aggressive gc" wants.

0
Apr 27 '19 at 22:12


