The command that he proposes running right after importing a “long and complex history”,
Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Daniel Berlin <dberlin at dberlin dot org>
cc: David Miller <davem at davemloft dot net>, ismail at pardus dot org dot tr, gcc at gcc dot gnu dot org, git at vger dot kernel dot org
Subject: Re: Git and GCC
In-Reply-To: <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>
Message-ID: <alpine.LFD.0.9999.0712052132450.13796@woody.linux-foundation.org>
References: <4aca3dc20712051947t5fbbb383ua1727c652eb25d7e@mail.gmail.com>
    <20071205.202047.58135920.davem@davemloft.net>
    <4aca3dc20712052032n521c344cla07a5df1f2c26cb8@mail.gmail.com>
    <20071205.204848.227521641.davem@davemloft.net>
    <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>
On Thursday, December 6, 2007, Daniel Berlin wrote:
> Actually, it turns out that git-gc --aggressive does this dumb thing to pack files sometimes, regardless of whether you converted from an SVN repository or not.
Absolutely. git --aggressive is mostly dumb. It is really only useful for the case of "I know I have a really bad pack, and I want to throw away all the bad packing decisions I have made."
To explain this, it's worth explaining (you probably know about it, but let me go through the basics anyway) how git delta chains work, and how they are so different from most other systems.
In other SCMs, a delta chain is generally fixed. It might be “forwards” or “backwards”, and it might evolve a bit as you work with the repository, but generally it is a chain of changes to a single file, represented as some kind of single SCM entity. In CVS, it is obviously the *,v file, and a lot of other systems do rather similar things.
Git also does delta chains, but it does them a lot more "loosely". There is no fixed entity. Deltas are generated against any random other version that git deems to be a good delta candidate (with various fairly successful heuristics), and there are absolutely no hard grouping rules.
This is generally a very good thing. It is good for various conceptual reasons (that is, git internally never really even needs to care about the whole revision chain - it does not really think in terms of deltas at all), but it is also great because getting rid of the inflexible delta rules means that git has no problems at all with merging two files together, for example - there simply are no arbitrary *,v "revision files" that have some hidden meaning.
It also means that the choice of deltas is a much more open-ended question. If you limit the delta chain to just one file, you really don't have a lot of choices about what to do with deltas, but in git it really can be a totally different issue.
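One way to look at the delta chains git has actually built is to ask git verify-pack for its verbose summary. The following is a minimal sketch, assuming it is run inside an existing repository that already has at least one pack; the function name pack_delta_summary is made up for illustration and is not a git command.

```shell
#!/bin/sh
# pack_delta_summary (hypothetical helper name): for every pack in the
# current repository, print the delta-chain histogram that
# `git verify-pack -v` appends after its per-object listing.
pack_delta_summary() {
    git_dir=$(git rev-parse --git-dir) || return 1
    for idx in "$git_dir"/objects/pack/pack-*.idx; do
        [ -e "$idx" ] || continue    # repository has no packs yet
        echo "== $idx"
        # LC_ALL=C keeps the summary lines in English so grep can match them.
        # "non delta: N objects"        -> objects stored whole
        # "chain length = K: N objects" -> objects sitting K deltas deep
        LC_ALL=C git verify-pack -v "$idx" | grep -E '^(non delta|chain length)'
    done
}
```

Calling pack_delta_summary before and after a repack gives a rough picture of how deep the chains in each pack are.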
And this is where the really badly named --aggressive comes in. While git generally tries to reuse delta information (because it is a good idea, and it does not waste CPU time re-finding all the good deltas that we found earlier), sometimes you want to say: "let's start all over with a blank slate, ignore all the previous delta information, and try to generate a new set of deltas."
So --aggressive is not really about being aggressive, but about wasting CPU time redoing a decision we already made earlier!
Sometimes that is a good thing. Some import tools in particular could generate really horribly bad deltas. Anything that uses git fast-import, for example, likely doesn't have much of a great delta layout, so it might be worth saying “I want to start from a clean slate.”
But almost always, in other cases, it is actually a really bad thing to do. It is going to waste CPU time, and especially if you had actually done a good job at deltaing earlier, the end result is not going to reuse all those good deltas you already found, so you will actually end up with a much worse end result too!
I will send a patch to Junio to just remove the git gc --aggressive documentation. It can be useful, but it generally is useful only when you really understand at a very deep level what it is doing, and that documentation does not help you do that.
Generally, doing incremental git gc is the right approach, and better than doing git gc --aggressive. It is going to reuse old deltas, and when those old deltas can't be found (the reason for doing incremental GC in the first place!) it is going to create new ones.
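A plain incremental run can be wrapped so that it reports what it did to the pack size. This is only a sketch, assuming the current directory is inside a git repository; pack_size_kib and incremental_gc are illustrative names, not git commands, and the size comes from the "size-pack" field of git count-objects -v (reported in KiB).

```shell
#!/bin/sh
# pack_size_kib (hypothetical helper): total size of all packs in KiB,
# read from the "size-pack" field of `git count-objects -v`.
pack_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# incremental_gc (hypothetical helper): run a plain `git gc`, which reuses
# existing deltas, and print the pack size before and after.
incremental_gc() {
    before=$(pack_size_kib)
    git gc --quiet || return 1
    after=$(pack_size_kib)
    echo "size-pack: ${before} KiB -> ${after} KiB"
}
```

Note that no --aggressive is passed, so whatever good deltas are already in the packs are kept.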
On the other hand, it is definitely true that an “initial import of a long and complex history” is a point where it can be worth spending a lot of time finding the really good deltas. Then every user ever after (as long as they don't use git gc --aggressive to undo it!) will get the advantage of that one-time event. So especially for big projects with a long history, it is probably worth doing some extra work, telling the delta finding code to go wild.
So the equivalent of git gc --aggressive - but done properly - is to do (overnight) something like
git repack -a -d --depth=250 --window=250
where that depth thing is just about how deep the delta chains can be (make them longer for old history - it is worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.
And here, you might well want to add the -f flag (which is the “drop all old deltas” flag), since you are now actually trying to make sure that this one really finds good candidates.
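Put together, the command with and without -f could be wrapped roughly as follows. This is a sketch under assumptions: deep_repack is a made-up function name, the 250/250 defaults simply mirror the numbers above, and the "force" keyword is an invented convention for this wrapper rather than anything git defines.

```shell
#!/bin/sh
# deep_repack [depth] [window] [force]   (hypothetical helper, not a git command)
#   depth  - maximum delta chain length     (default 250, as in the text)
#   window - objects scanned per candidate  (default 250, as in the text)
#   force  - pass the literal word "force" to add -f and recompute all deltas
deep_repack() {
    depth=${1:-250}
    window=${2:-250}
    force=${3:-}
    # -a packs everything into one pack; -d deletes the packs made redundant.
    if [ "$force" = "force" ]; then
        # -f throws away existing deltas and searches for new ones from scratch.
        git repack -a -d -f --depth="$depth" --window="$window"
    else
        # Without -f, existing deltas are reused where possible.
        git repack -a -d --depth="$depth" --window="$window"
    fi
}
```

For a freshly imported repository one would presumably run `deep_repack 250 250 force` once, and leave later maintenance to plain git gc.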
And then it is going to take forever and a day (that is, a "do it overnight" thing). But the end result is that everybody downstream of that repository will get much better packs, without having to spend any effort on it themselves.
Linus