What is the mechanism behind Git Large File Storage?

GitHub recently introduced an extension to Git that stores large files in a different way. What exactly do they mean by an extension that replaces large files with text pointers inside Git?

+3
git github
Apr 09 '15 at 5:07
1 answer

You can see in the git-lfs sources how a "text pointer" is defined:

    type Pointer struct {
        Version string
        Oid     string
        Size    int64
        OidType string
    }

The smudge and clean sources show that git-lfs uses a content filter driver in order to:

  • download the actual files on checkout
  • store them in an external location on commit.

See the pointer specification:

The core idea of Git LFS is that, instead of writing large blobs to the Git repository, only a pointer file is written.

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345
    (ending \n)
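To make the format concrete, here is a minimal Go sketch that parses such a pointer file into the Pointer struct quoted earlier and renders it back. The names ParsePointer and Encode are hypothetical and are not taken from the git-lfs sources, and the parser is deliberately simplified: it does not enforce every rule of the real specification.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Pointer mirrors the struct quoted from the git-lfs sources.
type Pointer struct {
	Version string
	Oid     string
	Size    int64
	OidType string
}

// Encode renders the pointer as the text Git stores in place of the blob.
// (Hypothetical helper, not the git-lfs API.)
func (p Pointer) Encode() string {
	return fmt.Sprintf("version %s\noid %s:%s\nsize %d\n", p.Version, p.OidType, p.Oid, p.Size)
}

// ParsePointer reads "key value" lines. It is a simplified sketch that
// only checks for a version and an oid.
func ParsePointer(text string) (Pointer, error) {
	var p Pointer
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), " ")
		if !ok {
			return p, fmt.Errorf("malformed line %q", sc.Text())
		}
		switch key {
		case "version":
			p.Version = value
		case "oid":
			typ, hash, ok := strings.Cut(value, ":")
			if !ok {
				return p, fmt.Errorf("malformed oid %q", value)
			}
			p.OidType, p.Oid = typ, hash
		case "size":
			n, err := strconv.ParseInt(value, 10, 64)
			if err != nil {
				return p, err
			}
			p.Size = n
		}
	}
	if p.Version == "" || p.Oid == "" {
		return p, fmt.Errorf("not a pointer file")
	}
	return p, sc.Err()
}

func main() {
	text := "version https://git-lfs.github.com/spec/v1\n" +
		"oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393\n" +
		"size 12345\n"
	p, err := ParsePointer(text)
	if err != nil {
		panic(err)
	}
	// Prints the hash type, the size, and whether encoding round-trips.
	fmt.Println(p.OidType, p.Size, p.Encode() == text)
}
```
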

Git LFS requires an endpoint URL to communicate with a remote server.
A Git repository can have different Git LFS endpoints for different remotes.

The actual file is uploaded to or downloaded from a server that implements the Git LFS API.

This is confirmed by the git-lfs man page, which says:

The actual file gets pushed to a Git LFS API

You need a Git server that implements this API in order to support uploading and downloading binary content.




As for the content filter driver (which has existed in Git for a long time, well before LFS, and is used here by LFS to add this "large file management" feature), this is where the bulk of the work happens:

The smudge filter runs as files are checked out from the Git repository into the working directory.
Git sends the content of the Git blob as STDIN and expects the content to write to the working directory as STDOUT.

Read 100 bytes.

  • If the content is ASCII and matches the pointer file format:
    • Look for the file in .git/lfs/objects/{OID}.
    • If it is not there, download it from the server.
    • Write its contents to STDOUT.
  • Otherwise, simply pass STDIN through to STDOUT.

The clean filter runs as files are added to the repository.
Git sends the contents of the file being added as STDIN and expects the contents to be written to Git as STDOUT.

  • Stream the binary content from STDIN to a temporary file while calculating its SHA-256 signature.
  • Check for the file at .git/lfs/objects/{OID}.
  • If it does not exist:
    • Queue the OID to be uploaded.
    • Move the temp file to .git/lfs/objects/{OID}.
  • Delete the temporary file.
  • Write the pointer file to STDOUT.



Git 2.11 (November 2016) has a commit that details how this works: commit edcc858, helped by Martin-Louis Bright and signed off by Lars Schneider.

convert: add filter.<driver>.process option

Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs, then the startup time of the external filter processes can become a significant part of the overall Git execution time.

In a preliminary performance test, this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382

This patch adds the filter.<driver>.process option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output.
The full protocol is explained in detail in Documentation/gitattributes.txt.

A few key decisions:

  • The long-running filter process is referred to as filter protocol version 2, because the existing single-shot filter invocation is considered version 1.
  • Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user.
  • The status of a filter operation (for example "success" or "error") is set before the actual response and (if necessary!) reset after the response. The advantage of this two-step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response.
  • All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future.

This led to a warning being added to the documentation in Git 2.12 (Q1 2017).

See commit 7eeda8b (December 18, 2016) and commit c6b0831 (December 3, 2016) by Lars Schneider (larsxschneider).
(Merged by Junio C Hamano - gitster - in commit 08721a0, December 27, 2016)

docs: warn about possible '=' in clean/smudge filter process values

A pathname value in a clean/smudge filter process "key=value" pair can contain the '=' character (introduced in edcc858).
Make the user aware of this issue in the docs, add a corresponding test case, and fix the issue in the filter process example implementation under contrib.
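The pitfall is easy to reproduce in any filter implementation. The Go sketch below is hypothetical: parseKeyValue and naiveParse are made-up names, and "pathname" is assumed here as the key. It shows why a filter should split at the first '=' only.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKeyValue splits at the first '=' only, so any further '='
// characters remain part of the value.
func parseKeyValue(line string) (key, value string, ok bool) {
	return strings.Cut(line, "=")
}

// naiveParse shows the pitfall: splitting on every '=' truncates the value.
func naiveParse(line string) (key, value string) {
	parts := strings.Split(line, "=")
	if len(parts) < 2 {
		return parts[0], ""
	}
	return parts[0], parts[1]
}

func main() {
	line := "pathname=docs/a=b.txt" // hypothetical path containing '='
	k, v, _ := parseKeyValue(line)
	fmt.Printf("first '=' only: %s -> %q\n", k, v)
	_, bad := naiveParse(line)
	fmt.Printf("naive split:    %s -> %q\n", k, bad)
}
```
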

+9
Apr 09 '15 at 6:52