Detecting a language with data in PostgreSQL

I have a table in PostgreSQL with a text column. I need a library or tool that can identify the language of each text, for testing purposes.

The solution does not have to run inside PostgreSQL (I have trouble installing procedural languages), so anything that can connect to the database, extract the texts and identify their language is welcome.
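For illustration, the whole workflow (connect, extract, identify) fits in a few lines of Python. The sketch below uses sqlite3 as a stand-in for the PostgreSQL connection (with psycopg2 and real credentials the loop is identical) and a deliberately crude stopword heuristic in place of a real detector:

```python
# Workflow sketch: pull texts from a database and tag each with a language.
# sqlite3 stands in for PostgreSQL here; with psycopg2 the loop is the same.
import sqlite3

# Toy detector: counts hits against tiny stopword lists. A real solution
# would call a proper library (Lingua::Identify, etc.) here instead.
STOPWORDS = {
    "pt": {"o", "a", "de", "que", "e", "não", "um", "uma"},
    "en": {"the", "of", "and", "to", "in", "a", "is", "that"},
}

def guess_language(text):
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (body) VALUES (?)", [
    ("o gato e o cachorro não estão em casa",),
    ("the cat and the dog are not in the house",),
])

for doc_id, body in conn.execute("SELECT id, body FROM docs"):
    print(doc_id, guess_language(body))
```

The table name, column name and stopword lists are made up for the demo; only the connect-select-classify loop is the point.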

I tried Lingua::Identify, suggested in the answers below, directly in a Perl script. It worked, but the results are inaccurate.

The texts I want to identify come from the Internet, and most of them are in Portuguese, but Lingua::Identify often classifies them as French, Italian or Spanish, which are similar languages.

I need something more accurate.

I added the java and r tags because those are the languages I use in the system, so a solution in them would be easy to integrate, but solutions in any language are welcome.

+7
6 answers

You can use PL/Perl ( CREATE FUNCTION langof(text) LANGUAGE plperlu AS ... ) with the Lingua::Identify CPAN module.

Perl script:

    #!/usr/bin/perl
    use Lingua::Identify qw(langof);

    undef $/;                        # slurp mode
    my $textstring = <>;             # warning - slurps whole file into memory
    my $lang = langof($textstring);  # gives the most probable language
    print "$lang\n";

And function:

    create or replace function langof( text ) returns varchar(2)
    immutable returns null on null input
    language plperlu
    as $perlcode$
        use Lingua::Identify qw(langof);
        return langof( shift );
    $perlcode$;

Works for me:

    filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
     langof
    --------
     pl
    (1 row)

    Time: 1.801 ms

PL/Perl on Windows

The PL/Perl language library (plperl.dll) ships with the prebuilt PostgreSQL Windows installer.

But to use PL/Perl you need a Perl interpreter, specifically Perl 5.14 at the time of this writing. The most popular installer is ActiveState Perl, but it is not free. A free alternative is Strawberry Perl. Make sure you end up with PERL514.DLL.

After installing Perl, log in to your PostgreSQL database and try running:

 CREATE LANGUAGE plperlu; 

Language identification library

If quality is a concern, you have a couple of options: you can improve Lingua::Identify yourself (it is open source) or try a different library. I found one that is commercial but looks promising.

+7

Naive Bayes classifiers are very good at language identification. You can find implementations in all major languages, or implement one yourself; it is not very difficult. The Wikipedia article is also worth a read: https://en.wikipedia.org/wiki/Naive_Bayes_classifier .
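To make "not very difficult" concrete, here is a minimal character-level Naive Bayes classifier in Python. It is an illustrative sketch only: the training texts are tiny made-up samples, and a real deployment would train on a proper corpus per language.

```python
# Minimal character-level Naive Bayes language classifier (sketch only).
import math
from collections import Counter

class NaiveBayesLangID:
    def __init__(self):
        self.char_counts = {}        # language -> Counter of characters
        self.doc_counts = Counter()  # language -> number of training docs

    def train(self, language, text):
        self.char_counts.setdefault(language, Counter()).update(text.lower())
        self.doc_counts[language] += 1

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab = set().union(*self.char_counts.values())
        best_lang, best_score = None, float("-inf")
        for lang, counts in self.char_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[lang] / total_docs)
            total = sum(counts.values())
            for ch in text.lower():
                score += math.log((counts[ch] + 1) / (total + len(vocab) + 1))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

clf = NaiveBayesLangID()
clf.train("pt", "o gato preto dorme na casa e o cachorro corre no jardim")
clf.train("en", "the black cat sleeps in the house and the dog runs here")
print(clf.classify("o cachorro dorme na casa"))  # pt
```

With two training documents the priors are equal, so the decision comes down entirely to the character likelihoods; with a realistic corpus the priors start to matter, which is exactly the effect discussed in the longer answer below.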

+4

The problem with language detection is that it will never be completely accurate. My browser quite often misidentifies the language, and it was built by Google, who presumably put some big minds on the problem.

However, consider a few points:

I'm not sure what Perl's Lingua::Identify module really uses internally, but these tasks are most often handled with Naive Bayes models, as another answer pointed out. Bayesian models use probabilities to classify items into several categories; in your case each category is a language. These probabilities combine conditional probabilities (how often a given feature appears in each category) with prior probabilities (how often each category appears overall).

Since both of these are used, you can get low-quality predictions when the priors are wrong. I believe Lingua::Identify was trained mainly on a corpus of online documents, so its strongest prior is probably English. That means Lingua::Identify will tend to classify your documents as English unless it has good reason to believe otherwise (in your case it apparently does have some reason, since you say your documents are misclassified as Italian, French and Spanish).

This means you should try to retrain the model if possible. Lingua::Identify may offer methods to help with that. If not, I would suggest writing your own Naive Bayes classifier (it is actually pretty simple).

Once you have a Naive Bayes classifier, you need to decide on a feature set. Letter frequencies are usually very characteristic of each language, so that would be my first guess: start by training your classifier on those frequencies. Naive Bayes classifiers are what spam filters use, so you can train yours the same way: run it over a sample, and whenever you get a misclassification, update the classifier with the correct label. Over time it will make fewer and fewer mistakes.

If single-letter frequencies do not give good enough results, you can try character n-grams instead (keeping in mind the combinatorial explosion this leads to); I would not suggest going beyond trigrams. If that still does not give good results, try manually identifying common words that are distinctive for each language and adding them to your feature set. I'm sure that once you start experimenting with this, you will come up with more feature ideas to try.
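As a sketch of the n-gram idea, extracting character trigrams (with padding to capture word boundaries, which carry a lot of language signal) is only a few lines of Python:

```python
# Character-trigram feature extraction, as suggested above (sketch).
from collections import Counter

def char_ngrams(text, n=3):
    """Return a frequency table of character n-grams, padding each word
    with spaces so that word-initial and word-final grams are captured."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "              # mark word boundaries
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

print(char_ngrams("não há").most_common(3))
```

These counts would then feed the same Naive Bayes machinery, with each trigram playing the role of one feature.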

Another nice thing about Bayesian classifiers is that you can always feed in new information if documents arrive that do not match the training data. In that case, you just manually classify a few of the new documents and, as with a spam filter, the classifier will adapt to the changing environment.

+3

I found a library called TextCat, which is available under the LGPL. I can't say how good its identification quality is, but it has an online demo form, so maybe you can throw some sample text at it before deciding whether to download it.

It is also written in Perl, so if you want to use it, filiprem's approach above would be a good starting point.

+2

There is also a language-detection web service, with both free and premium tiers, at http://detectlanguage.com

It has Ruby and PHP clients, but it can be accessed with a simple web request from any language. The output is JSON.
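As a sketch, calling such a service from Python is just an HTTP request plus JSON parsing. The endpoint URL, the "q"/"key" parameters and the response layout below are assumptions modeled on the service's documentation, not verified here; check the docs before relying on them. The demo at the bottom exercises only the parsing step, with a mocked response body:

```python
# Hypothetical call to the detection web service; the URL, parameters and
# JSON layout are assumptions - consult the service docs for the real API.
import json
from urllib import request, parse

def detect_language(text, api_key):
    url = "https://ws.detectlanguage.com/0.1/detect"  # assumed endpoint
    data = parse.urlencode({"q": text, "key": api_key}).encode()
    with request.urlopen(request.Request(url, data=data)) as resp:
        return parse_response(resp.read().decode())

def parse_response(body):
    """Extract the top detected language code from the assumed JSON shape."""
    payload = json.loads(body)
    detections = payload["data"]["detections"]
    return detections[0]["language"] if detections else None

# Offline demo with a mocked response body (no network, no API key needed):
sample = '{"data": {"detections": [{"language": "pt", "isReliable": true}]}}'
print(parse_response(sample))  # pt
```

Keeping the parsing in its own function makes the network-free part easy to test, and the same pattern works from Java or R with their respective HTTP and JSON libraries.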

0
