Determining the language of a word in C++

After searching on Google, I couldn't find any standard method or library for determining which language a given word belongs to.

Suppose I have a word: how can I find out what language it is in — English, Japanese, Italian, German, etc.?

Is there a library for this in C++? Any suggestions would be greatly appreciated!

+6
c++
7 answers

I assume that you are working with text, not speech.

If you work with Unicode, it assigns each script its own block of code points.

So you can check whether all the characters of a particular word fall into one script's block.

More information about Unicode blocks is available in the Unicode code charts.
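To make this concrete, here is a minimal sketch (the function name and choice of blocks are mine) that maps a code point to its script using plain range checks taken from the Unicode code charts. Keep in mind that a block identifies a script, not a language: English, Italian, and German all use Latin letters, so this can separate, say, Japanese kana from Latin text, but not English from German.

```cpp
#include <cassert>
#include <string>

// Map a Unicode code point to the name of its script block.
// Ranges follow the Unicode code charts; only a handful of common
// blocks are covered here for illustration (the Latin range also
// spans a few non-letter code points, which is fine for a sketch).
std::string scriptOf(char32_t cp) {
    if ((cp >= 0x0041 && cp <= 0x005A) ||  // A-Z
        (cp >= 0x0061 && cp <= 0x007A) ||  // a-z
        (cp >= 0x00C0 && cp <= 0x024F))    // Latin-1 Supplement .. Latin Extended-B
        return "Latin";
    if (cp >= 0x0370 && cp <= 0x03FF) return "Greek";
    if (cp >= 0x0400 && cp <= 0x04FF) return "Cyrillic";
    if (cp >= 0x3040 && cp <= 0x309F) return "Hiragana";
    if (cp >= 0x30A0 && cp <= 0x30FF) return "Katakana";
    if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK";
    return "Unknown";
}
```

A word whose characters all map to "Hiragana" is almost certainly Japanese; a word mapping to "Latin" could still be any of dozens of languages, which is why the dictionary-based answers below go further.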

-3

Simple language recognition from words is easy. You do not need to understand the semantics of the text, and you do not need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can find word lists for each language you are interested in. Define a bitmask for each language; that lets you mark words such as "the" as recognized in several languages. Then read each language's word list into your hash map. If a word is already present from another language, simply OR in the current language's bit.

Suppose a word occurs in both English and French. When you look it up (e.g. "commercial"), it will map to ENGLISH | FRENCH. With ENGLISH = 1, FRENCH = 2, ..., you would find the value 3. If you want to know whether a word occurs only in your language, you would test:

int langs = dict["the"];
if ((langs | mylang) == mylang) { /* no language other than mylang */ }



Since words are shared between languages, a more general approach is better. For each bit set in a word's mask, add 1 to the corresponding language's counter. Do this for n words. After roughly n = 10 words of a typical text, you will have a count of about 10 for the dominant language and perhaps 2 for a closely related one (for example English/French), so you can determine with high confidence that the text is English. Remember that even a text in one language can contain a quote in another, so the mere presence of a foreign word does not mean the document is in that language. Pick a threshold and it will work very well (and very, very fast).
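The whole scheme fits in a few dozen lines. This is a self-contained sketch; the language bits, word lists, and voting loop follow the description above, but the enum values and sample words are placeholders, not real dictionaries:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// One bit per language; each word maps to a bitmask of the
// languages whose word list contains it.
enum Lang : uint32_t { ENGLISH = 1, FRENCH = 2, GERMAN = 4 };

using Dict = std::unordered_map<std::string, uint32_t>;

// Read one language's word list into the shared dictionary,
// OR-ing the language bit onto any existing entry.
void loadWordList(Dict& dict, uint32_t lang,
                  const std::vector<std::string>& words) {
    for (const auto& w : words) dict[w] |= lang;
}

// True if the word is known and appears ONLY in languages
// covered by the bits of `mylang`.
bool onlyIn(const Dict& dict, const std::string& word, uint32_t mylang) {
    auto it = dict.find(word);
    if (it == dict.end()) return false;      // unknown word
    return (it->second | mylang) == mylang;  // no other language bit set
}

// Vote over the whitespace-separated words of a text: every bit in a
// word's mask scores one point for that language. Returns the bitmask
// of the highest-scoring language, or 0 if nothing matched.
uint32_t dominantLanguage(const Dict& dict, const std::string& text) {
    int counts[32] = {};
    std::istringstream in(text);
    std::string w;
    while (in >> w) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;
        for (int b = 0; b < 32; ++b)
            if (it->second & (1u << b)) ++counts[b];
    }
    int best = -1, bestCount = 0;
    for (int b = 0; b < 32; ++b)
        if (counts[b] > bestCount) { bestCount = counts[b]; best = b; }
    return best < 0 ? 0 : (1u << best);
}
```

With real word lists loaded, `dominantLanguage` implements the thresholded voting described above, and shared words like "the" or "commercial" simply contribute a point to every language that claims them.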

Obviously, the hardest part of this is reading in all the dictionaries. That is not a coding problem, it is a data collection problem. Fortunately, that is your problem, not mine.

To make this fast, you would want to preload the hash map, otherwise the startup cost will hurt. If that is a problem, you will need to write store and load methods for the hash map that persist the whole thing efficiently.

+3

I found Google CLD very useful. It is written in C++, and from its website:

"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It is implemented in C++, with very basic Python bindings."

+3

Well,

Statistically trained language detectors work surprisingly well on single-word inputs, although there are obviously cases where they cannot, as other answers have noted.

For Java, I would point you to Apache Tika, which has an open source statistical language detector.

For C++, you could use JNI to call it. Now, time for a disclaimer: since you specifically asked about C++, and since I do not know of a free C++ alternative, I will point you to my employer's product, which is a statistical language detector written natively in C++.

http://www.basistech.com , the product name is RLI.

+2

This will not work well one word at a time, since many words are shared between languages; for example, the word for "tea" is written the same way in several languages.

Language processing libraries tend to be much more comprehensive than this one function, and since C++ is a "high-performance" language, a free one can be hard to find.

However, the problem may not be too hard to solve. See the Wikipedia article on the problem for ideas. A small support vector machine could also do the trick quite handily: just train it on the most common words of each language, and you could have a very effective "database" in just a kilobyte or so.
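A full SVM is beyond the scope of an answer, but the underlying idea, scoring a word against per-language statistics learned from a little training text, can be sketched with character bigram profiles. Everything here (function names, the toy training strings) is illustrative, not a real model:

```cpp
#include <map>
#include <string>

// Count the character bigrams occurring in a training string.
std::map<std::string, int> bigrams(const std::string& s) {
    std::map<std::string, int> counts;
    for (size_t i = 0; i + 1 < s.size(); ++i)
        ++counts[s.substr(i, 2)];
    return counts;
}

// Score a word against one language profile: the number of the
// word's bigrams that were seen in that language's training text.
int overlap(const std::map<std::string, int>& profile,
            const std::string& word) {
    int score = 0;
    for (size_t i = 0; i + 1 < word.size(); ++i)
        if (profile.count(word.substr(i, 2))) ++score;
    return score;
}

// Pick the best-scoring language profile for a word.
std::string classify(
    const std::map<std::string, std::map<std::string, int>>& profiles,
    const std::string& word) {
    std::string best = "unknown";
    int bestScore = 0;
    for (const auto& [lang, prof] : profiles) {
        int s = overlap(prof, word);
        if (s > bestScore) { bestScore = s; best = lang; }
    }
    return best;
}
```

Trained on real text instead of a sentence or two, and with frequencies rather than mere presence, such profiles stay tiny (a few kilobytes per language) while classifying even single words reasonably often, which is the point the answer makes.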

+2

I would not hold my breath. It is difficult to determine the language of a text automatically, and if all you have is a single word, without context, you would need a database of all the words of all the languages in the world, whose size would be prohibitive.

+1

Basically, you need a huge dictionary for every major language. To automatically detect the language of a piece of text, pick the language whose dictionary contains the most of the text's words. This is not something you would want to implement on your laptop.

+1
