How do I convert (not necessarily programmatically) between Windows wchar_t and GCC / Linux wchar_t?

Suppose I have these Windows wchar_t string literals:

L"\x4f60\x597d" 

and

 L"\x00e4\x00a0\x597d" 

and would like to convert them (not necessarily programmatically, it is a one-time task) to the GCC / Linux wchar_t format, which is UTF-32 AFAIK. How can I do it? (A general explanation would be good, but an example based on this particular case would also be useful.)

Please do not direct me to character conversion sites. I want to convert from the L"\x(something)" escape form, not from the rendered characters.

+4
4 answers

One of the most commonly used libraries for character conversion is ICU (http://icu-project.org/). It is used, for example, by some Boost libraries (http://www.boost.org/).

0

Would converting from UTF-16 (the Visual C++ wchar_t form) to UTF-8, and then possibly from UTF-8 to UCS-4 (the GCC wchar_t form), be an acceptable answer?

If so, then on Windows you can use the WideCharToMultiByte function (with CP_UTF8 as the CodePage parameter) for the first part of the conversion. Then you can paste the resulting UTF-8 strings directly into your program or convert them further. Here is a post showing how one person did it; you can also write your own code or do it by hand (the official specification, with a section on how to convert UTF-8 to UCS-4, can be found here). There may be an easier way; I am not too familiar with the conversion facilities on Linux.
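To make the UTF-8 intermediate form concrete, here is a minimal portable sketch of the encoding step. It is not the Windows API; `append_utf8` and `to_utf8` are hypothetical helper names, and the code assumes the input is already a sequence of valid Unicode code points. For the BMP characters in the question, WideCharToMultiByte with CP_UTF8 should give the same bytes.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point.
// Assumes cp is a valid Unicode scalar value (no range checks here).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encodes a whole UTF-32 string as UTF-8.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        append_utf8(out, cp);
    }
    return out;
}
```

With this sketch, `to_utf8(U"\x4f60\x597d")` yields the six bytes E4 BD A0 E5 A5 BD, which could be pasted into a source file as "\xE4\xBD\xA0\xE5\xA5\xBD".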

+2

You only need to worry about code units between \xD800 and \xDFFF inclusive. Every other character maps to the same value from UTF-16 to UCS-4, just zero-extended.
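Since neither string in the question contains a value in that range, the literals L"\x4f60\x597d" and L"\x00e4\x00a0\x597d" can be pasted unchanged into code compiled with GCC on Linux. If you do want to do it in code, a minimal sketch of the zero-extension could look like this; `is_surrogate` and `widen_bmp` are hypothetical names, and the function assumes the caller has no surrogates in the input:

```cpp
#include <cassert>
#include <string>

// True if the UTF-16 code unit is one half of a surrogate pair.
inline bool is_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDFFF;
}

// Hypothetical helper: widens a surrogate-free UTF-16 string to UTF-32.
// Each 16-bit code unit becomes one 32-bit code point with the same value.
std::u32string widen_bmp(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char16_t u : in) {
        // Precondition of this sketch: surrogates need the pair-decoding step.
        assert(!is_surrogate(u) && "input must not contain surrogates");
        out.push_back(static_cast<char32_t>(u));
    }
    return out;
}
```

The numeric values do not change at all; only the width of each element does.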

+2

Ignacio is right: unless you use some rare Chinese characters (or some extinct scripts), the mapping is one-to-one. (The official jargon is "if you have no characters outside the BMP".)

Here is the algorithm, just in case: http://unicode.org/faq/utf_bom.html#utf16-3 But then again, you most likely will not need it for your actual case.

You can also use the free conversion sources from Unicode (ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF).

0