Simple language recognition from words is easy. You do not need to understand the semantics of the text, and you do not need any computationally expensive algorithms, just a fast hash map. The catch is that you need a lot of data. Fortunately, you can find word dictionaries for every language you are interested in. Define a bitmask for each language, which lets you mark a word such as "the" as recognized in several languages at once. Then read each language's dictionary into your hash map. If the word is already present from another language, simply OR in the current language's bit as well.
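A minimal sketch of that loading step, assuming one bit per language (the enum values and function name here are my own, not from the original answer):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One bit per language; values are illustrative.
enum Lang : uint32_t { ENGLISH = 1, FRENCH = 2, GERMAN = 4 };

// Merge one language's word list into the shared map, OR-ing in its bit.
// A word already seen in another language simply accumulates another bit.
void load_dictionary(std::unordered_map<std::string, uint32_t>& dict,
                     const std::vector<std::string>& words,
                     uint32_t lang_bit) {
    for (const auto& w : words)
        dict[w] |= lang_bit;
}
```

After loading an English list containing "commercial" and a French list also containing "commercial", `dict["commercial"]` holds `ENGLISH | FRENCH`.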
Suppose a word occurs in both English and French, e.g. "commercial". With ENGLISH = 1, FRENCH = 2, ..., its entry will be ENGLISH | FRENCH, so when you look it up you will find the value 3. To check whether a word belongs to your language, you would test:
int langs = dict["the"]; if ((langs & mylang) != 0) /* the word exists in mylang */
Since there will be more than two languages, a more general approach is better: for each word, add 1 to the count of every language whose bit is set in the word's mask. Do this for n words. After roughly n = 10 words of typical text, you will have a count of about 10 for the dominant language and perhaps 2 for a closely related one (for example, French for an English text), so you can already determine with high confidence that the text is English. Remember that even a text in one language can still contain a quotation in another, so the mere presence of a foreign word does not mean the document is in that language. Pick a threshold and this will work very well (and very, very fast).
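The counting step can be sketched like this (again a hedged sketch: the fixed bit width and function name are assumptions, not part of the original answer):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

constexpr int NUM_LANGS = 32;  // one language per bit of a uint32_t

// Tally, over the given words, how many are known in each language.
// Unknown words cast no vote; a word known in k languages votes k times.
std::vector<int> score_text(const std::unordered_map<std::string, uint32_t>& dict,
                            const std::vector<std::string>& words) {
    std::vector<int> counts(NUM_LANGS, 0);
    for (const auto& w : words) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;  // unknown word: skip
        for (int bit = 0; bit < NUM_LANGS; ++bit)
            if (it->second & (uint32_t{1} << bit))
                ++counts[bit];
    }
    return counts;
}
```

The dominant language is then simply the index with the highest count, compared against whatever threshold you chose.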
Obviously, the hardest part of all this is reading in the dictionaries. That is not a coding problem, it is a data-collection problem. Fortunately, that is your problem, not mine.
To make this fast, you will want to preload the hash map rather than rebuild it from the raw dictionaries at every startup, otherwise the initial load will hurt. If that is a problem, you will need to write save and load routines for the hash map that persist and restore the whole thing efficiently.
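One simple way to persist the map, assuming a plain "word TAB bitmask" text format (the format and function names are my own sketch; a binary format would load faster still):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Write the map as "word<TAB>bitmask" lines.
void save_dict(const std::unordered_map<std::string, uint32_t>& dict,
               const std::string& path) {
    std::ofstream out(path);
    for (const auto& [word, bits] : dict)
        out << word << '\t' << bits << '\n';
}

// Read the same format back; startup then skips re-parsing every
// per-language dictionary file.
std::unordered_map<std::string, uint32_t> load_dict(const std::string& path) {
    std::unordered_map<std::string, uint32_t> dict;
    std::ifstream in(path);
    std::string word;
    uint32_t bits;
    while (in >> word >> bits)
        dict[word] = bits;
    return dict;
}
```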
Dov