Locale detection from a Unicode string in C++

I have a string and I want to check whether its content is in English or Hindi (my local language). I found out that the Unicode range for Hindi (Devanagari) characters is U+0900 to U+097F.

What is the easiest way to find out whether a string has any characters in this range?

I can use std::string or Glib::ustring, whichever is more convenient.

c++ unicode
3 answers

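Here's how you do it with Glib::ustring:

    #include <glibmm/ustring.h>

    using Glib::ustring;

    ustring x("सहस");  // Hindi string
    bool is_hindi = false;
    // ustring iterators yield whole code points, not bytes.
    for (ustring::iterator i = x.begin(); i != x.end(); ++i) {
        if (*i >= 0x0900 && *i <= 0x097F) {  // Devanagari block
            is_hindi = true;
            break;  // one match is enough
        }
    }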

The first step is to write a functor that determines whether a given wchar_t is Hindi. This will be derived from std::unary_function<wchar_t, bool>. The implementation is a one-liner: return c >= 0x0900 && c < 0x0980; The second step uses it: std::find_if(begin, end, is_hindi()).
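The two steps above can be sketched as follows. This is a minimal sketch, not the answerer's exact code: the helper name contains_hindi is hypothetical, and the functor deliberately does not inherit from std::unary_function, because that base class was deprecated in C++11 and removed in C++17. It also assumes wchar_t holds one code point per element for this range.

```cpp
#include <algorithm>
#include <string>

// Predicate: true when c lies in the Devanagari block U+0900..U+097F.
// (No std::unary_function base: it was removed in C++17.)
struct is_hindi {
    bool operator()(wchar_t c) const {
        return c >= 0x0900 && c < 0x0980;
    }
};

// Hypothetical helper: true if any character of s is in the Devanagari block.
bool contains_hindi(const std::wstring& s) {
    return std::find_if(s.begin(), s.end(), is_hindi()) != s.end();
}
```

Calling contains_hindi(L"\u0938\u0939\u0938") should report a match, while a purely ASCII string should not.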

Since you will need Unicode, you should probably use wchar_t and therefore std::wstring. Neither std::string nor Glib::ustring supports Unicode. On some systems (Windows in particular), the implementation of wchar_t is limited to Unicode 4 = 16 bits, but that should still be enough for 99.9% of the world's population.

You will need to convert from/to UTF-8 for I/O, but the advantage of "one character = one wchar_t" is big. For example, std::wstring::substr() will work reasonably. However, you may still have problems with "characters" such as U+094B (DEVANAGARI VOWEL SIGN O). When iterating over a std::wstring, it will show up as a character in its own right instead of as a modifier. That is still better than std::string with UTF-8, where you end up iterating over the individual bytes of U+094B. And to take just your original example: none of the bytes in UTF8(U+094B) are reserved for Hindi.
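To make the last point concrete, here is a small sketch that encodes a single code point as UTF-8. The function name utf8_encode is hypothetical, and the sketch assumes a valid code point (it does not reject surrogates or values above U+10FFFF). For U+094B it produces the three bytes E0 A5 8B, none of which identifies Hindi when looked at alone.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; no error checking is done.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The lead byte E0 and the continuation bytes are shared with many other scripts, which is why a byte-by-byte range check on a UTF-8 std::string cannot work.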


If the string is already encoded as UTF-8, I would not convert it to UTF-16 (which I assume is what MSalters calls "Unicode itself"), but would iterate over the UTF-8 encoded sequence and check whether it contains any Hindi characters.

With std::string, you can easily iterate using the UTF8-CPP library: take a look at utf8::next() or the iterator class.

Glib::ustring has an iterator that seems to support the same functionality (not tried).
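For readers who cannot pull in UTF8-CPP, the same idea can be sketched with a hand-rolled decoder. This is only a stand-in for what a library call such as utf8::next() does, not the library's implementation. The function name utf8_contains_hindi is hypothetical, and the sketch assumes the input is valid UTF-8 (malformed sequences are not diagnosed).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the UTF-8 text in s contains a code point
// in the Devanagari block U+0900..U+097F.
// Assumes s is valid UTF-8; malformed input is not diagnosed.
bool utf8_contains_hindi(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;        // sequence length implied by the lead byte
        unsigned long cp = lead;    // decoded code point
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Fold in the continuation bytes, 6 payload bits each.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        if (cp >= 0x0900 && cp <= 0x097F) {
            return true;
        }
        i += len;
    }
    return false;
}
```

The range test is applied to decoded code points rather than to raw bytes, so the string never has to be converted to a wide-character representation.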
