How do different version control systems handle binary files?

Question

I have heard some claims that SVN handles binaries better than Git / Mercurial. Is that true, and if so, why? As far as I can imagine, no version control system (VCS) can diff and merge changes between two revisions of the same binary resource.

So, aren't all VCSs bad at handling binary files? I don't really know the technical details behind specific VCS implementations, so maybe each has its own pros and cons.

+29

git version-control svn mercurial

Tower Jul 06 '11 at 15:09

5 answers

One clarification about git and binaries.

Git compresses binary files as well as text files. So git is not crap when handling binary files, as someone suggested.

Any file added by git will be compressed into loose objects. It doesn't matter whether it is binary or text. If you have a binary or text file and you commit it, the repository will grow. If you make a minor change to the file and commit again, the repository will grow again by roughly the same amount (depending on the compression ratio).

Then you run git gc. Git will find similarities in the binary or text files and compress them together. You will get good compression if the similarities are large. If, on the other hand, there is no similarity between the files, you won't gain much by compressing them together compared to compressing them separately.

Here is a bitmap (binary) test that I changed a bit:

    martin@martin-laptop:~/testing123$ git init
    Initialized empty Git repository in /home/martin/testing123/.git/
    martin@martin-laptop:~/testing123$ ls -l
    total 1252
    -rw------- 1 martin martin 1279322 Jan  8 22:42 pic.bmp
    martin@martin-laptop:~/testing123$ git add .
    martin@martin-laptop:~/testing123$ git commit -a -m first
    [master (root-commit) 53886cf] first
     1 files changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 pic.bmp

    // here is the size:
    martin@martin-laptop:~/testing123$ du -s .git
    1244    .git

    // Changed a few pixels in the picture
    martin@martin-laptop:~/testing123$ git add .
    martin@martin-laptop:~/testing123$ git commit -a -m second
    [master da025e1] second
     1 files changed, 0 insertions(+), 0 deletions(-)

    // here is the size:
    martin@martin-laptop:~/testing123$ du -s .git
    2364    .git

    // As you can see the repo is twice as large
    // Now we run git gc to compress
    martin@martin-laptop:~/testing123$ git gc
    Counting objects: 6, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (4/4), done.
    Writing objects: 100% (6/6), done.
    Total 6 (delta 1), reused 0 (delta 0)

    // here is the size after compression:
    martin@martin-laptop:~/testing123$ du -s .git
    1236    .git

    // we are back to a smaller size than ever...
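The same effect can be approximated outside Git. The sketch below is only an analogy, not Git's actual pack format: it uses Python's zlib and small random buffers as stand-ins for two nearly identical binary revisions, and compares compressing them separately (roughly what loose objects do) with compressing them together (roughly what a repack gains from similarity). The helper names, buffer size, and seed are illustrative assumptions.

```python
import random
import zlib


def make_versions(size=8192, changed_bytes=16, seed=0):
    """Build two 'binary' revisions that differ in only a few bytes.

    Random bytes stand in for an already dense binary file (assumption).
    """
    rng = random.Random(seed)
    v1 = bytes(rng.getrandbits(8) for _ in range(size))
    v2 = bytearray(v1)
    for pos in rng.sample(range(size), changed_bytes):
        v2[pos] ^= 0xFF  # flip every bit of the chosen byte
    return v1, bytes(v2)


def compressed_sizes(v1, v2):
    """Return (separate, together) compressed sizes in bytes."""
    # Each revision compressed on its own, like two loose objects.
    separate = len(zlib.compress(v1)) + len(zlib.compress(v2))
    # Both revisions in one stream, so the second can reference the first.
    together = len(zlib.compress(v1 + v2))
    return separate, together


v1, v2 = make_versions()
separate, together = compressed_sizes(v1, v2)
print(f"separate: {separate} bytes, together: {together} bytes")
```

The buffers are deliberately kept small so that both revisions fit inside zlib's 32 KB window; Git's delta compression does not have that limitation.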

+10

martin Jan 08 '12 at 16:23

Git and Mercurial both handle binaries with aplomb. They do not corrupt them, and you can check them in. The problem is size.

Source code usually takes up less space than binaries. You can have 100K of source files that build a 100Mb binary. Thus, storing a single build in my repository could make it grow to 30 times its size.

And this is even worse:

Version control systems usually store files using some form of diff format. Let's say I have a file of 100 lines and each line is about 40 characters long. That is a 4K file. If I change one line in that file and commit the change, I add only about 60 bytes to the size of my repository.
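A rough way to see the numbers, as a sketch only: the snippet below builds a synthetic 100-line file, changes one line, and measures a unified diff produced by Python's difflib. Real version control systems use their own delta formats, so the exact byte count will differ; the function name and file contents are made up for illustration.

```python
import difflib


def one_line_change_cost(lines=100, width=40, changed_line=49):
    """Return (full_size, diff_size) in bytes for a one-line edit."""
    # Synthetic file: `lines` lines of `width` characters plus a newline.
    old = [f"{i:03d} ".ljust(width, "x") + "\n" for i in range(lines)]
    new = list(old)
    new[changed_line] = f"{changed_line:03d} ".ljust(width, "y") + "\n"

    # n=0 drops context lines, so only the changed line is recorded.
    diff = "".join(
        difflib.unified_diff(old, new, fromfile="a", tofile="b", n=0)
    )
    full_size = len("".join(new).encode())
    diff_size = len(diff.encode())
    return full_size, diff_size


full_size, diff_size = one_line_change_cost()
print(f"full file: {full_size} bytes, diff: {diff_size} bytes")
```

The diff carries the old and new versions of one line plus a small header, which is why it stays in the range of tens of bytes while the file itself is about 4K.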

Now, let's say I compile and add that 100Mb binary. I make a change to my source (maybe 10K or so of changes), recompile, and store the new binary build. Well, binaries usually don't diff very well, so I will most likely add another 100Mb to the size of my repository. Do a few builds, and my repository grows to several gigabytes, while the source part of it is still only a few dozen kilobytes.
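The arithmetic can be sketched as follows, under the pessimistic assumption made above that every committed build adds its full size and every source change adds at most its own size. The function name and default values are illustrative, taken from the figures in this answer.

```python
def repo_growth_mb(builds, binary_mb=100.0, source_change_kb=10.0):
    """Estimate (source_mb, binaries_mb) added after `builds` commits.

    Assumes no delta savings on binaries and an upper bound of
    `source_change_kb` stored per source change (both assumptions).
    """
    source_mb = builds * source_change_kb / 1024
    binaries_mb = builds * binary_mb
    return source_mb, binaries_mb


source_mb, binaries_mb = repo_growth_mb(builds=30)
print(f"source: {source_mb:.2f} MB, binaries: {binaries_mb:.0f} MB")
```

With these assumptions, thirty builds add about 3000 MB of binaries against well under 1 MB of source changes.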

The problem with Git and Mercurial is that you normally check out the entire repository onto your system. Instead of downloading just a few dozen kilobytes that can be transferred in a few seconds, I am now downloading several gigabytes of builds along with those few dozen kilobytes of source.

Maybe people say Subversion is better because I can check out just the version I want and not download the whole repository. However, Subversion does not give you an easy way to remove obsolete binaries from your repository, so your repository will grow and grow anyway. I still do not recommend it. Heck, I would not recommend it even if the version control system lets you delete old versions of obsolete binaries (Perforce, ClearCase, and CVS all do). It just ends up being a big maintenance headache.

Now, this does not mean that you should not store any binary files. For example, if I am building a web page, I probably have some GIFs and JPEGs that I need. There is no problem storing those in Subversion or Git / Mercurial. They are relatively small and probably change much less often than my code itself.

What you should not store are built objects. These should be stored in a release repository and fetched as needed. Maven and Ant w/ Ivy do a great job of this. You can also use the Maven repository layout in C, C++, and C# projects.

+9

David W. Jul 06 2018-11-17T00:

In Subversion, you can lock binary files so that no one else can edit them. This basically ensures that no one else will modify the binary while you hold the lock. Distributed VCSs do not (and cannot) have locks: there is no central repository where they could be registered.

+2

robert Jul 06 '11 at 18:13

Text files have a natural line-oriented structure that binary files lack. That is why binary files are harder to compare using ordinary text tools (diff). While it should be possible, the benefit of readability (the reason we use text as our preferred format in the first place) would be lost when applying diff to binary files.

As for your suggestion that all version control systems are "crap at handling binary files", I don't know. In principle, there is no reason why a binary file should be slower to process. I would rather say that the benefits of using a VCS (tracking, diffing, review) are more obvious when handling text files.

0

jforberg Jul 06 2018-11-11T00:

VonC · Accepted Answer · 2011-07-06 15:48

The main pain point lies in the "distributed" aspect of any DVCS: you clone everything (the entire history of all files).

Since most DVCSs do not store binaries as deltas, and binaries do not compress as well as text files, storing rapidly evolving binaries quickly leaves you with a large repository that becomes very cumbersome to move around (push/pull).

For Git, for example, see "What are the limitations of Git?".

Binaries are not a good fit for the features a VCS can bring (diff, branch, merge) and are better managed in an artifact repository (Nexus, for example).
This is not necessarily the case for a CVCS (centralized VCS), where the repository can play that role and serve as storage for binaries (even if that is not its primary role).
