First, note that you can override the language detected for files in your repository using Linguist overrides.
Now, in a nutshell:
- Each repository is labeled with the first language in its language statistics.
- The language statistics add up the total file size for each detected programming or markup language. Vendored files, documentation, and generated files are not counted.
- The language of each file is determined by the open source Linguist project.
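
The three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Linguist's actual code: the `FileInfo` record, its field names, and the functions `language_stats` and `primary_language` are hypothetical, and each file's language and flags are assumed to have been determined already.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileInfo:
    """Hypothetical record for one file; not a Linguist data structure."""
    path: str
    language: str
    lang_type: str          # e.g. "programming", "markup", "data", "prose"
    size: int               # size in bytes
    vendored: bool = False
    documentation: bool = False
    generated: bool = False


# Only these language types are counted in the statistics.
COUNTED_TYPES = {"programming", "markup"}


def language_stats(files: List[FileInfo]) -> Dict[str, int]:
    """Sum file sizes per language, skipping excluded files and types."""
    totals: Dict[str, int] = defaultdict(int)
    for f in files:
        if f.vendored or f.documentation or f.generated:
            continue
        if f.lang_type not in COUNTED_TYPES:
            continue
        totals[f.language] += f.size
    # Largest total first, so the first key is the repository's language.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def primary_language(files: List[FileInfo]) -> Optional[str]:
    """Return the first language of the statistics, or None if empty."""
    return next(iter(language_stats(files)), None)
```

With a 4000-byte Ruby file, a 1500-byte HTML file, and a 90000-byte vendored copy of jQuery, `primary_language` would return `"Ruby"`, because the vendored file never enters the totals.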
How does Linguist detect languages?
Linguist applies the following strategies in order and returns the language as soon as it finds a perfect match (a strategy that returns exactly one language).
- Look for Emacs and Vim modelines.
- Known filename. Some filenames are associated with specific languages (think `Makefile`).
- Look for a shebang. A file with a `#!/bin/bash` shebang will be classified as Shell.
- Known file extension. Languages have a set of associated extensions. There are many conflicts with this strategy, however. Conflicting results (think C++, C, and Objective-C for `.h`) are refined by the subsequent strategies.
- A set of heuristic rules. These usually rely on regular expressions over the file contents to try to identify the language (for example, `^[^#]+:-` for Prolog).
- A naive Bayes classifier trained on sample files. This is the last strategy, with the lowest accuracy. The Bayesian classifier always takes a subset of languages as input; it is not meant to classify among all languages. The best match found by the classifier is returned.
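
The cascade described above can be sketched roughly as follows. This is a simplified Python illustration, not Linguist's implementation (Linguist itself is written in Ruby): the lookup tables are toy data, the function names are hypothetical, the modeline strategy is left out for brevity, and the classifier is a placeholder that just picks the first remaining candidate. The way candidates are narrowed between strategies is also an assumption based on the description above.

```python
import os
import re
from typing import Callable, List, Optional

# A strategy takes (filename, content, candidates) and returns languages.
Strategy = Callable[[str, str, List[str]], List[str]]

# Toy lookup tables for illustration only (not Linguist's real data).
FILENAMES = {"Makefile": ["Makefile"]}
INTERPRETERS = {"bash": ["Shell"], "sh": ["Shell"]}
EXTENSIONS = {
    ".h": ["C", "C++", "Objective-C"],
    ".pl": ["Perl", "Prolog"],
}


def by_filename(name: str, content: str, candidates: List[str]) -> List[str]:
    return FILENAMES.get(os.path.basename(name), [])


def by_shebang(name: str, content: str, candidates: List[str]) -> List[str]:
    first_line = content.split("\n", 1)[0]
    match = re.match(r"#!\s*\S*/(?:env\s+)?(\w+)", first_line)
    return INTERPRETERS.get(match.group(1), []) if match else []


def by_extension(name: str, content: str, candidates: List[str]) -> List[str]:
    return EXTENSIONS.get(os.path.splitext(name)[1], [])


def by_heuristics(name: str, content: str, candidates: List[str]) -> List[str]:
    # The Prolog rule quoted above, applied only when Prolog is in play.
    if "Prolog" in candidates and re.search(r"^[^#]+:-", content, re.MULTILINE):
        return ["Prolog"]
    return []


def by_classifier(name: str, content: str, candidates: List[str]) -> List[str]:
    # Placeholder for the Bayesian classifier: it only ever chooses
    # among the candidates it is given, never among all languages.
    return candidates[:1]


STRATEGIES: List[Strategy] = [
    by_filename, by_shebang, by_extension, by_heuristics, by_classifier,
]


def detect(name: str, content: str) -> Optional[str]:
    candidates: List[str] = []
    for strategy in STRATEGIES:
        found = strategy(name, content, candidates)
        if candidates:
            # Later strategies may only refine the earlier shortlist.
            found = [lang for lang in found if lang in candidates]
        if len(found) == 1:
            return found[0]      # perfect match: stop here
        if len(found) > 1:
            candidates = found   # conflict: pass it on to the next strategy
    return None
```

For instance, `detect("rules.pl", "parent(X, Y) :- father(X, Y).\n")` gets two candidates from the extension strategy (Perl and Prolog), and the heuristic then settles on Prolog.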
What are vendored files and documentation files?
Linguist treats some files as vendored, meaning they are not included in the language statistics. These include third-party libraries such as jQuery and are defined in the vendor.yml configuration file. You can also vendor or unvendor files in your repository using Linguist overrides.
Similarly, documentation files are defined in documentation.yml and can be changed using Linguist overrides.
How are generated files detected?
Linguist relies on simple rules to detect generated files, using both paths and file contents. Generated files are not counted in the language statistics and are not displayed in diffs on github.com.
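
To give a feel for what "paths and file contents" can mean, here is a small Python sketch. The specific rules are assumptions chosen for illustration (minified assets, a `node_modules/` directory, a "generated by ... do not edit" header) and are not taken from Linguist's actual rule set; the name `is_generated` is hypothetical.

```python
import re

# Illustrative path rules (assumed, not Linguist's real list).
GENERATED_PATH_PATTERNS = [
    re.compile(r"\.min\.(js|css)$"),
    re.compile(r"(^|/)node_modules/"),
]

# Illustrative content rule: a "generated by ... do not edit" header line.
GENERATED_HEADER = re.compile(r"generated by .*do not edit", re.IGNORECASE)


def is_generated(path: str, content: str) -> bool:
    """Return True if either the path or the file header looks generated."""
    if any(pattern.search(path) for pattern in GENERATED_PATH_PATTERNS):
        return True
    # Only inspect the first few lines, where such markers usually sit.
    header = "\n".join(content.splitlines()[:5])
    return bool(GENERATED_HEADER.search(header))
```

A file flagged this way would simply be skipped when the per-language sizes are added up.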
What about programming and markup languages?
In Linguist, each language is given a type. These types can be found in the main configuration file, languages.yml. Only programming and markup languages are counted in the statistics.
pchaigno, Aug 20 '17 at 10:59