The problem with language detection is that it will never be completely accurate. My browser misidentifies the page language often enough, and it was built by Google, who presumably put some bright minds on the problem.
However, consider a few points:
I'm not sure what Perl's Lingua::Identify module actually uses, but such tasks are most often handled with Naive Bayes models, as someone pointed out in another answer. Bayesian models use probabilities to classify into several categories; in your case these would be the different languages. The probabilities involved are conditional probabilities, i.e. how often a given feature occurs in each category, as well as prior probabilities, i.e. how often each category occurs overall.
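Combining the two, a Naive Bayes classifier scores each candidate language in log space (sums of logs avoid numeric underflow) and picks the highest score. A minimal, self-contained sketch with made-up priors and likelihoods, just to show the arithmetic:

```perl
use strict;
use warnings;

# Purely illustrative numbers: priors and per-feature likelihoods.
my %prior = ( en => 0.7, it => 0.1 );
my %p     = ( en => { a => 0.08, e => 0.13 },
              it => { a => 0.12, e => 0.11 } );
my @document_features = qw(a e e);

for my $lang (keys %prior) {
    # log posterior = log prior + sum of log feature likelihoods
    my $score = log $prior{$lang};
    $score += log $p{$lang}{$_} for @document_features;
    print "$lang: $score\n";
}
```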
Since both kinds of probability are used, you are likely to get low-quality predictions when the priors are wrong (if, say, the prior for English is ten times the prior for Italian, the Italian likelihoods have to outweigh the English ones by a factor of ten before the classifier switches). I believe Lingua::Identify was trained mainly on a corpus of online documents, so its highest prior is most likely English. That means Lingua::Identify will tend to classify your documents as English unless it has good reason to believe otherwise (and in your case it apparently does, since you say your documents are mistakenly classified as Italian, French, and Spanish).
This means you should try to re-train your model, if possible. There may be methods in Lingua::Identify to help you with that. If not, I would suggest writing your own Naive Bayes classifier (it is actually quite simple).
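To show how little code that takes, here is a minimal sketch over single-letter features (my own illustration, not Lingua::Identify's internals), with add-one smoothing so letters unseen in training don't zero out a score:

```perl
use strict;
use warnings;

# %count holds per-language letter counts; %docs counts training
# documents per language, which gives the priors.
my (%count, %docs);

# Record one training document for a known language.
sub train {
    my ($lang, $text) = @_;
    $docs{$lang}++;
    $count{$lang}{$_}++ for grep { /\p{L}/ } split //, lc $text;
}

# Return the language with the highest log posterior for $text.
sub classify {
    my ($text) = @_;
    my @letters = grep { /\p{L}/ } split //, lc $text;

    my $total_docs = 0;
    $total_docs += $_ for values %docs;

    my %score;
    for my $lang (keys %docs) {
        my $s = log( $docs{$lang} / $total_docs );   # log prior

        my $lang_total = 0;
        $lang_total += $_ for values %{ $count{$lang} };
        my $seen = keys %{ $count{$lang} };          # distinct letters seen

        for my $letter (@letters) {
            my $c = $count{$lang}{$letter} // 0;
            # add-one smoothing so unseen letters don't zero the score
            $s += log( ($c + 1) / ($lang_total + $seen + 1) );
        }
        $score{$lang} = $s;
    }
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;
}
```

The priors come from the per-language document counts and the likelihoods from the letter counts, so both parts of the model update automatically as you feed in more training documents.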
Once you have your own Naive Bayes classifier, you need to decide on a set of features. Letter frequencies are usually very characteristic of each language, so they are the obvious first guess: try training your classifier on those frequencies first. Naive Bayes classifiers are also what spam filters use, so you can train yours the same way: run it over a sample, and whenever it misclassifies a document, update the classifier with the correct classification. Over time it will make fewer and fewer mistakes.
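With the train and classify sketch above, that feedback loop is only a few lines; @labelled here is a made-up list of [language, text] pairs, and some initial training is assumed:

```perl
# Re-train on every miss; the error rate drops as corrections accumulate.
for my $pair (@labelled) {
    my ($true_lang, $text) = @$pair;
    train($true_lang, $text) if classify($text) ne $true_lang;
}
```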
If single-letter frequencies do not give you good enough results, you can try n-grams instead (keep in mind the combinatorial explosion this leads to, though); I would not suggest going beyond 3-grams. If that still does not give good results, try manually identifying common words that are unique to each language and adding them to your feature set. I am sure that once you start experimenting, you will come up with more feature ideas to try.
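If you do go the n-gram route, only the feature extraction changes; a sketch for overlapping character n-grams, which would replace the single-letter split in the classifier above:

```perl
# Extract overlapping character n-grams (n <= 3 recommended) from a string.
sub ngrams {
    my ($text, $n) = @_;
    my $clean = lc $text;
    $clean =~ s/\P{L}//g;   # keep letters only (a simplification: drops word boundaries)
    return map { substr $clean, $_, $n }
           0 .. length($clean) - $n;
}
```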
Another nice thing about Bayesian classifiers is that you can always feed in new information if documents arrive that do not match the training data. In that case, simply reclassify a few of the new documents by hand and, much like a spam filter, the classifier will adapt to the changing environment.