How does GitHub determine the language of a project?

I recently worked on a GitHub project written in both JavaScript and C++, and noticed that GitHub labeled the project as C++. If only one language can be chosen, this is probably the right label, since the C++ code was compiled into a JavaScript library, but it made me wonder: how does GitHub figure out which language to label each project with?

+72
github github-linguist
Mar 15 2018-11-11T00:
5 answers

Update April 2013, nuclearsandwich (GitHub support team or "supportocat"):

If your desired language does not get syntax highlighting, you can contribute to the Linguist library to add it.




(Original answer, October 2012)

This thread in GitHub support explains this:

It simply sums the file sizes for each extension. The largest "wins."

We would like to avoid opening files and parsing their contents, since both will slow down the process ... but this may be the only way to resolve conflicts like this.

Since this is not 100% accurate, this led to the addition of:

I will also vote for a simple manual switch for cases where the assumption is wrong.
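The size-based approach described in the quoted thread can be sketched roughly as follows. This is only an illustration, not GitHub's code: the `dominant_extension` name and the sample sizes are made up for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def dominant_extension(file_sizes):
    """Sum sizes per file extension and return the extension with the largest total.

    file_sizes maps a path to its size in bytes (hypothetical input).
    """
    totals = defaultdict(int)
    for path, size in file_sizes.items():
        ext = PurePosixPath(path).suffix.lower()
        if ext:  # skip files without an extension
            totals[ext] += size
    if not totals:
        return None
    return max(totals, key=totals.get)


sizes = {"src/core.cpp": 52_000, "src/core.h": 8_000, "web/app.js": 31_000}
print(dominant_extension(sizes))  # .cpp
```

Note that nothing here opens a file, which is why an ambiguous extension such as .h cannot be resolved by this approach alone.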




Note: as Mark Rushakoff mentions in his answer (upvoted), the guessing has improved since then thanks to the Linguist project (open source since June 2011).
Problems remain, as you can see in the GitHub Linguist issues.
For more details, see:

Once the language has been detected, it is passed to Albino, a Pygments wrapper, which performs the actual syntax highlighting.

You can also add Linguist directives to a .gitattributes file.

+75
Mar 15 2018-11-11T00:

Currently, GitHub's Linguist project is what determines the language statistics, as described in this GitHub blog post (which came out a few months after this question was originally asked).

+14
Apr 6 '12 at 18:23

First, be aware that you can override the detected language for files in your repository using Linguist overrides.

Now, in a nutshell,

  • Each repository is labeled with the first language in its language statistics.
  • The language statistics sum the total file size for each detected programming or markup language. Vendored files, documentation, and generated files are not counted.
  • The language of each file is determined by the open source Linguist project.
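The three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not Linguist's implementation: the `FileInfo` record and its flags are hypothetical stand-ins for what Linguist works out per file.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FileInfo:
    # All fields are hypothetical; Linguist derives this information itself.
    language: str
    size: int
    vendored: bool = False
    documentation: bool = False
    generated: bool = False


def language_stats(files):
    """Return (language, total_bytes) pairs, largest first, skipping excluded files."""
    totals = Counter()
    for f in files:
        if f.vendored or f.documentation or f.generated:
            continue
        totals[f.language] += f.size
    return totals.most_common()


files = [
    FileInfo("C++", 60_000),
    FileInfo("JavaScript", 25_000),
    FileInfo("JavaScript", 90_000, vendored=True),  # e.g. a bundled third-party library
    FileInfo("HTML", 5_000, documentation=True),
]
stats = language_stats(files)
primary = stats[0][0] if stats else None
print(primary)  # C++
```

In this made-up repository the vendored JavaScript file is the largest one, yet the repository is still labeled C++ because excluded files never reach the totals.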



How does Linguist detect languages?

Linguist relies on the following strategies, in order, and returns the language as soon as it finds a perfect match (a strategy that returns exactly one language).

  • Look for Emacs and Vim modelines.
  • Known file name. Some file names are associated with specific languages (think Makefile).
  • Look for a shebang. A file with a #!/bin/bash shebang will be classified as Shell.
  • Known file extension. Languages have a set of associated extensions. However, this strategy produces many conflicts. Conflicting results (think C++, C, and Objective-C for .h) are refined by the subsequent strategies.
  • A set of heuristic rules. They usually rely on regular expressions over the contents of files to try to identify the language (for example, ^[^#]+:- for Prolog).
  • A naive Bayes classifier trained on sample files. This is the last strategy and has low accuracy. The Bayesian classifier always takes a subset of languages as input; it is not meant to classify among all languages. The best match found by the classifier is returned.
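The ordered fallback described above can be sketched as a chain of strategies, each of which may return zero, one, or several candidate languages. This is a simplified sketch, not Linguist's code: the lookup tables are tiny made-up excerpts, all function names are hypothetical, and the modeline and classifier steps are left out.

```python
import re
from pathlib import PurePosixPath

# Tiny illustrative tables; the real mappings live in Linguist's languages.yml.
FILENAMES = {"Makefile": ["Makefile"]}
INTERPRETERS = {"bash": ["Shell"], "sh": ["Shell"], "python3": ["Python"]}
EXTENSIONS = {".h": ["C", "C++", "Objective-C"], ".pl": ["Perl", "Prolog"], ".py": ["Python"]}


def by_filename(name, content, candidates):
    return FILENAMES.get(PurePosixPath(name).name, [])


def by_shebang(name, content, candidates):
    first_line = content.split("\n", 1)[0]
    match = re.match(r"#!\s*(\S+)(?:[ \t]+(\S+))?", first_line)
    if not match:
        return []
    interpreter = PurePosixPath(match.group(1)).name
    if interpreter == "env" and match.group(2):
        interpreter = match.group(2)  # "#!/usr/bin/env python3" style
    return INTERPRETERS.get(interpreter, [])


def by_extension(name, content, candidates):
    return EXTENSIONS.get(PurePosixPath(name).suffix, [])


def by_heuristics(name, content, candidates):
    # One content rule, using the Prolog regex quoted in the list above.
    if "Prolog" in candidates and re.search(r"^[^#]+:-", content, re.MULTILINE):
        return ["Prolog"]
    return []


STRATEGIES = [by_filename, by_shebang, by_extension, by_heuristics]


def detect(name, content):
    """Run strategies in order; stop as soon as one returns exactly one language."""
    candidates = []
    for strategy in STRATEGIES:
        result = strategy(name, content, candidates)
        if len(result) == 1:
            return result[0]
        if len(result) > 1:
            candidates = result  # ambiguous: let later strategies narrow it down
    return None  # a fuller version would fall back to the classifier here


print(detect("facts.pl", "parent(X, Y) :- father(X, Y).\n"))  # Prolog
```

The .pl example shows the narrowing step: the extension alone leaves two candidates, and the content heuristic picks one of them.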

What are vendored files and documentation files?

Linguist treats some files as vendored, meaning they are not included in the language statistics. These include third-party libraries, such as jQuery, and are defined in the vendor.yml configuration file. You can also vendor or unvendor files in your repository using Linguist overrides.

Similarly, documentation files are defined in documentation.yml and can be changed using Linguist overrides.
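Path-based classification of this kind can be sketched with a list of regular expressions per category. The patterns below are illustrative assumptions written for this example; the authoritative lists are the ones in vendor.yml and documentation.yml, and the function names are hypothetical.

```python
import re

# Illustrative patterns only; see vendor.yml and documentation.yml for the real lists.
VENDOR_PATTERNS = [r"(^|/)node_modules/", r"(^|/)vendor/", r"(^|/)jquery([^/]*)\.js$"]
DOCUMENTATION_PATTERNS = [r"(^|/)[Dd]ocs?/", r"(^|/)README(\.|$)"]


def matches_any(path, patterns):
    """Return True if any pattern matches somewhere in the path."""
    return any(re.search(pattern, path) for pattern in patterns)


def is_vendored(path):
    return matches_any(path, VENDOR_PATTERNS)


def is_documentation(path):
    return matches_any(path, DOCUMENTATION_PATTERNS)


print(is_vendored("public/js/jquery-3.6.0.min.js"))  # True
print(is_documentation("docs/intro.md"))             # True
print(is_vendored("src/main.cpp"))                   # False
```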

How are generated files detected?

Linguist relies on simple rules to detect generated files, using both paths and file contents. Generated files are not counted in the language statistics and are not shown in diffs on github.com.
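A rule set combining path checks and content checks might look like the sketch below. The specific rules (minified bundles, source maps, a "generated by" marker in the first lines) are assumptions chosen for illustration, not a copy of Linguist's rules.

```python
import re


def is_generated(path, content):
    """Guess whether a file is generated, using hypothetical path and content rules."""
    # Path-based rules: minified bundles and source maps.
    if re.search(r"\.min\.(js|css)$", path) or path.endswith(".map"):
        return True
    # Content-based rule: an explicit marker within the first few lines.
    head = "\n".join(content.splitlines()[:5])
    return bool(re.search(r"generated by|do not edit", head, re.IGNORECASE))


print(is_generated("dist/app.min.js", ""))           # True
print(is_generated("main.go", "package main\n"))     # False
```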

What about programming and markup languages?

In Linguist, each language is assigned a type. These types can be found in the main configuration file, languages.yml. Only programming and markup languages are counted in the statistics.
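The effect of the type filter can be shown with a small excerpt of language types. The mapping below lists only a handful of languages, and the `counted_languages` helper is a hypothetical name used for this sketch.

```python
# A small excerpt of language types as listed in languages.yml.
LANGUAGE_TYPES = {
    "C++": "programming",
    "JavaScript": "programming",
    "HTML": "markup",
    "JSON": "data",
    "Markdown": "prose",
}

COUNTED_TYPES = {"programming", "markup"}


def counted_languages(byte_totals):
    """Keep only the languages whose type counts toward the statistics."""
    return {
        lang: size
        for lang, size in byte_totals.items()
        if LANGUAGE_TYPES.get(lang) in COUNTED_TYPES
    }


totals = {"C++": 100, "JSON": 500, "HTML": 20, "Markdown": 30}
print(counted_languages(totals))  # {'C++': 100, 'HTML': 20}
```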

+2
Aug 20 '17 at 10:59

After experimenting with Linguist for a while, I noticed the following.

For files with a shebang, the shebang is considered when determining the language, but it seems to be weighted the same as the other tokens. This seems like a big mistake, because the shebang should definitively determine the language of the file.

This can cause problems with syntax highlighting.

0
Dec 21 '12 at 2:45

File extensions are the first thing that comes to my mind.

-1
Mar 15 '11 at 10:01


