GitHub fork explanation and file storage method

I'm just wondering what happens when you fork a project on GitHub.

For example, when I fork a project, does GitHub make a copy of all that code on its servers, or does it simply create a link to it?

So, another question: in git, since it hashes all files, if you add the same file again, it does not need to store the contents of the file a second time, because the hash will already be in the object database, right?

Is GitHub like that? If I happen to push the same code fragment as another user, does GitHub just create a link to that file when it stores it, since it will have the same hash, or does it save all of its contents separately again?

Any enlightenment would be great, thanks!

+7
3 answers

github.com has exactly the same semantics as git, but with a web-based interface wrapped around it.

Storage: "Git stores each revision of a file as a unique blob."
Therefore, each version of a file is stored exactly once, and git uses the SHA-1 hash of the contents to identify it and detect changes from revision to revision.
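A quick local sketch of this claim (the repo and file names here are made up): two files with identical content hash to one and the same blob ID, so git only ever stores that content once.

```shell
#!/bin/sh
# Sketch: identical content hashes to the same object ID, so git
# stores the blob only once. All names below are invented for the demo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
echo "same content" > a.txt
cp a.txt b.txt
git add a.txt b.txt
h1=$(git hash-object a.txt)
h2=$(git hash-object b.txt)
echo "$h1"
echo "$h2"   # same hash as h1: one blob in the database for both paths
```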

As for GitHub, a fork is essentially a clone. This means the new fork is a new storage area on their servers with a link back to its origin. GitHub does not need any special mechanism to connect the two, because git is inherently capable of tracking clones: each fork knows its upstream.

When you say, "if I happen to download the same piece of code as another user", the term "download" is a little vague in git terms. If you are working in the same repository, git will happily let you commit the same file again; that just means it was different and was checked in at that revision. If you mean working on a clone/fork of another repo, the situation is the same, except there are no filesystem links from one repo to the other.

I can't claim any deep knowledge of what optimizations GitHub may perform under the hood in their internal systems. Perhaps they deduplicate storage across users to save disk space. But anything they do would be transparent to you and would make little difference, since it must always behave according to git's expected semantics.

A GitHub developer wrote a blog post about how they run their git workflow internally. Although it does not directly address your question about how they store things behind the service, I think this quote from the conclusion is quite informative:

Git itself is fairly complex to understand; making the workflow that you use with it more complex than necessary is simply adding more mental overhead to everybody's day. I would always advocate using the simplest possible system that will work for your team, and doing so until it doesn't work anymore, and then adding complexity only as absolutely necessary.

What I take away from this is that they recognize how complex git is in and of itself, so most likely they wrap the lightest layer they can around it to provide the service, and let git do what it does best.

+4

I don't know exactly how GitHub does it, but here is a possible way. It requires some knowledge of how git stores its data.

The short answer is that the repositories can share the object database, while each has its own refs.
We can even imitate this locally as a proof of concept.

There are three things in a bare repo directory (or in the .git/ subdirectory if it is not bare) that are the minimum needed for the repo to work:

  • the objects/ subdirectory, in which all objects are stored (commits, trees, blobs, ...). They are saved either as loose files whose names are the object's hash, or inside .pack files.
  • the refs/ subdirectory, which stores plain files, such as refs/heads/master , whose contents are the hash of the object they refer to.
  • a HEAD file that says what the current commit is. Its content is either a raw hash (which corresponds to a detached HEAD, that is, we are not on any named branch), or a symbolic reference to a ref where the actual hash can be found (for example ref: refs/heads/master , meaning we are on the master branch).
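We can peek at all three pieces in any freshly initialized repo (the default branch name you see in HEAD depends on your git version and init.defaultBranch setting):

```shell
#!/bin/sh
# Sketch: inspect the three minimal pieces of a repo right after init.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
cat .git/HEAD       # a symbolic ref, e.g. "ref: refs/heads/master"
ls .git/objects     # the object store (only info/ and pack/ so far)
ls .git/refs        # heads/ and tags/ live here
```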

Suppose someone creates their original (not forked) repo orig on GitHub.
To simulate, locally we do

 $ git init --bare github_orig 

We assume this happens on GitHub's servers. Now there is an empty "GitHub" repository. Next, we imagine that from our own computer we clone the "GitHub" repository:

 $ git clone github_orig local_orig 

Of course, in real life, instead of github_orig we would use https://github... Now we have cloned the "GitHub" repository into local_orig .

 $ cd local_orig/
 $ echo zzz > file
 $ git add file
 $ git commit -m initial
 $ git push
 $ cd ..

After that, the github_orig object directory will contain our pushed commit object, one blob for file , and one tree object. The refs/heads/master file will contain the commit's hash.
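We can verify this claim locally (assuming the push is small enough that the receiving repo keeps the objects loose rather than packed, which is git's default behavior for a handful of objects):

```shell
#!/bin/sh
# Sketch: after the push, the bare repo holds exactly three loose
# objects - one commit, one tree, one blob - plus the branch ref.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare github_orig
git clone -q github_orig local_orig
cd local_orig
git config user.email demo@example.com   # identity needed to commit
git config user.name "Demo User"
echo zzz > file
git add file
git commit -q -m initial
git push -q origin HEAD
cd ..
# Count loose objects on the "server" side: commit + tree + blob = 3
find github_orig/objects -type f -not -path '*pack*' -not -path '*info*' | wc -l
cat github_orig/refs/heads/*   # the commit hash
```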

Now let's build a picture of what might happen when someone presses the Fork button. We will create the git repository, but manually:

 $ mkdir github_fork
 $ cd github_fork/
 $ cp ../github_orig/HEAD .
 $ cp -r ../github_orig/refs .
 $ ln -s ../github_orig/objects
 $ cd ..

Note that we copy HEAD and refs , but make a symbolic link for objects . As we can see, making a fork is very cheap. Even if we have dozens of branches, each of them is just a file in the refs/heads directory containing a 40-character hexadecimal hash. For objects we merely reference the original object directory - we copy nothing!

Now we simulate the fork's owner cloning the forked repo locally:

 $ git clone github_fork local_fork
 $ cd local_fork
 $ ls
 file

We see that we have successfully cloned, even though the repo we cloned from does not have its own objects , only a link to the original repo's. The fork's user can now create branches, commit, and push to github_fork . The new objects will be placed in the objects directory, which is the same one github_orig uses! But refs and HEAD will change and will no longer match the values in github_orig .

So, the bottom line is that all repositories belonging to the same fork tree share a common pool of objects, while each repo keeps its own refs. Anyone pushing to their forked repo modifies their own refs, but puts objects into the shared pool.
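Putting the whole simulation together, we can check that a push to the fork really lands its objects in the original repo's object directory. Here the fork's user pushes a new feature branch; all repo and branch names are invented for the demo.

```shell
#!/bin/sh
# Sketch of the full simulation: objects pushed to the "fork" end up
# in the shared (symlinked) object pool of the original repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare github_orig
git clone -q github_orig local_orig
( cd local_orig \
  && git config user.email demo@example.com \
  && git config user.name "Demo User" \
  && echo zzz > file && git add file \
  && git commit -q -m initial && git push -q origin HEAD )
# The "fork" on the server side: copy HEAD and refs, symlink objects.
mkdir github_fork
( cd github_fork && cp ../github_orig/HEAD . \
  && cp -r ../github_orig/refs . && ln -s ../github_orig/objects )
git clone -q github_fork local_fork
( cd local_fork \
  && git config user.email demo@example.com \
  && git config user.name "Demo User" \
  && git checkout -qb feature && echo more >> file \
  && git commit -qam "fork change" && git push -q origin feature )
# The feature commit is readable from github_orig, even though only
# github_fork has a ref pointing at it:
feature=$(git -C github_fork rev-parse refs/heads/feature)
git -C github_orig cat-file -t "$feature"   # the object is in the shared pool
```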

Of course, for this to be really useful, one more important thing needs to be taken care of: the git garbage collector must not run unless the repo it runs in knows about all the refs, not just its own. Otherwise it could drop objects from the shared pool that are unreachable from its own refs but reachable from the refs of the other repos.

+1

According to https://enterprise.github.com/releases/2.2.0/notes , GitHub Enterprise (and, I assume, GitHub itself) somehow shares objects between forks to reduce disk space usage:

This release changes the way GitHub Enterprise repositories are stored, which reduces disk usage by sharing Git objects between forks and improves caching performance when reading repository data.

There is also more detailed information on how they do it at https://githubengineering.com/counting-objects .

0
