Convert C++ UTF-16 to char (Linux / Ubuntu)

I am trying to help a friend with a project that was supposed to take one day and has now dragged on for three. Needless to say, I feel very upset and angry ;-) ooooouuuu ... I'm breathing.

So, a program written in C++ simply reads a bunch of files and processes them. The problem is that the program has to read files that use UTF-16 encoding (because the files contain words written in different languages), and simply using ifstream just doesn't work (it reads and displays garbage). It took me a while to realize that this was because the files are in UTF-16.

I have now spent literally all day on the Internet trying to find information on READING UTF-16 files and converting the contents of a UTF-16 line to char, and I just can't find it! It's a nightmare. I am trying to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I specialize in graphics applications, not desktop applications). I just can't get it.

This is what I have so far (but it does not work):

std::wifstream file2(fileFullPath);
std::locale loc(std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);

while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

This is the best I could come up with, but it doesn't even work, and it doesn't do anything useful. The real problem is that I don't understand what I am doing in the first place.

PLEASE PLEASE HELP! It's driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++), and this code requires -stdlib=libc++, which does not seem to be supported by gcc on his side (even though he uses a fairly recent version of gcc, 4.6.3 I believe). So I'm not even sure that using codecvt and locale is a good idea (as in "possible"). There may be a better (different) option.

If I just convert all the files to UTF-8 from the command line (using a Linux command), am I potentially going to lose information?

Thank you very much, I will be grateful if you help me with this.

3 answers

If I just convert all the files to UTF-8 from the command line (using a Linux command), am I potentially going to lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.
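If you would rather do the conversion from your own code than from a shell command, here is a minimal sketch of the same lossless UTF-16 to UTF-8 conversion using C++11's std::wstring_convert. The file names are placeholders, and it assumes the input file starts with a BOM (as Windows-generated UTF-16 files usually do):

#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

int main() {
    // Read the raw UTF-16 bytes from disk.
    std::ifstream in("input-utf16.txt", std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Decode UTF-16 into UCS-4/UTF-32 code points; consume_header reads the
    // BOM and uses it to pick the right endianness.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::consume_header>,
        char32_t> utf16conv;
    std::u32string codepoints = utf16conv.from_bytes(bytes);

    // Re-encode the code points as UTF-8 and write them out; no data is lost.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf8conv;
    std::ofstream out("output-utf8.txt", std::ios::binary);
    out << utf8conv.to_bytes(codepoints);
}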


When wide characters were introduced, they were intended to be a text representation used purely inside a program, never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out into narrow characters in the output file, and converting the narrow characters in a file into wide characters in memory when reading.

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.

Of course, the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use codecvt to convert from wchar_t to char with that facet when writing, and to convert from char to wchar_t when reading.
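As an illustration (this example is mine, not part of the original post), you can swap that facet out: imbue a wofstream with a locale whose codecvt facet is codecvt_utf8, and the same wide stream will write UTF-8 bytes instead of the locale's default narrow encoding:

#include <codecvt>
#include <fstream>
#include <locale>

int main() {
    std::wofstream wout("utf8-output.txt");
    // Replace the stream's codecvt facet; the locale takes ownership of it.
    wout.imbue(std::locale(wout.getloc(), new std::codecvt_utf8<wchar_t>));
    wout << L"H\u00e9llo"; // U+00E9 is written to disk as the two UTF-8 bytes C3 A9
}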


However, as some people started writing files in UTF-16, other people had to deal with that. The way they do it with C++ streams is by creating a codecvt facet that treats char as holding half of a UTF-16 code unit, which is what codecvt_utf16 does.

So, with this explanation, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc(std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.

while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here is one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work.

std::wstring line;
while (std::getline(file2, line)) {
    std::wcout << line << std::endl;
}

I adapted, fixed and tested Mats Peterson's impressive solution.

// UTF16/UTF32 and the U16_LEAD/U16_TRAIL macros are assumed to come from the
// project's Unicode headers (e.g. ICU's unicode/utf16.h).
#include <cstdio>
#include <vector>

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    if ((t & 0xFC00) != 0xD800) {
        return t;
    }
    int charcode = (coded[1] & 0x3FF); // | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}

#ifdef __cplusplus   // If used by C++ code,
extern "C" {         // we need to export the C interface
#endif

void convert_utf16_to_utf32(UTF16 *input, size_t input_size, UTF32 *output)
{
    const UTF16 * const end = input + 1 * input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        std::vector<int> vec; // endianness
        vec.push_back(U16_LEAD(uc) & 0xFF);
        printf("LEAD + %.4x\n", U16_LEAD(uc) & 0x00FF);
        vec.push_back(U16_TRAIL(uc) & 0xFF);
        printf("TRAIL + %.4x\n", U16_TRAIL(uc) & 0x00FF);
        *output++ = utf16_to_utf32(vec);
    }
}

#ifdef __cplusplus
}
#endif

UTF-8 is capable of representing all valid Unicode characters (code points), which is better than UTF-16 (which covers the first 1.1 million code points). [Although, as the comments explain, there are no valid Unicode code points beyond that 1.1 million, so UTF-16 is "safe" for all currently available code points, and probably will be for a long time to come, unless we get extraterrestrial visitors with a very complicated written language...]

It does this, when necessary, by using several bytes/words to store a single code point (which we will call a character). In UTF-8, this is signalled by the top bits being set: in the first byte of a "multibyte" character the top two bits are set, and in the following byte(s) the top bit is set and the next one down is zero.
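To make those bit patterns concrete, here is a minimal sketch (mine, not the code from the answer linked below) of encoding a single code point into UTF-8 bytes using exactly those markers:

#include <vector>

std::vector<unsigned char> utf32_to_utf8(unsigned int cp) {
    std::vector<unsigned char> out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out.push_back(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(0xC0 | (cp >> 6));
        out.push_back(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(0xE0 | (cp >> 12));
        out.push_back(0x80 | ((cp >> 6) & 0x3F));
        out.push_back(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(0xF0 | (cp >> 18));
        out.push_back(0x80 | ((cp >> 12) & 0x3F));
        out.push_back(0x80 | ((cp >> 6) & 0x3F));
        out.push_back(0x80 | (cp & 0x3F));
    }
    return out;
}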

To convert an arbitrary code point to UTF-8, you can use the code from a previous answer of mine. (Yes, that question asks for the opposite of what you are asking, but the code in my answer covers both directions of conversion.)

Converting from UTF-16 to an "integer" code point is a similar method, except for the length of the input. If you're lucky, you might even be able to get away without doing it at all...

UTF-16 uses the range D800-DBFF as the first part, which holds 10 bits of data, and then the following element in the range DC00-DFFF holds the next 10 bits of data.

Code for the 16-bit to 32-bit conversion (and back) follows; I have only tested it a little, but it seems to work fine:

#include <cstdlib>
#include <iostream>
#include <vector>

std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000) {
        // A code point in the BMP must not be a lone surrogate.
        if ((charcode & 0xFC00) == 0xD800) {
            std::cerr << "Error bad character code" << std::endl;
            exit(1);
        }
        r.push_back(charcode);
        return r;
    }
    charcode -= 0x10000;
    if (charcode > 0xFFFFF) {
        std::cerr << "Error bad character code" << std::endl;
        exit(1);
    }
    // Split the remaining 20 bits into a high and a low surrogate.
    int coded = 0xD800 | ((charcode >> 10) & 0x3FF);
    r.push_back(coded);
    coded = 0xDC00 | (charcode & 0x3FF);
    r.push_back(coded);
    return r;
}

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    if ((t & 0xFC00) != 0xD800) {
        return t;
    }
    // Recombine the 10+10 bits from the surrogate pair.
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}
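As a quick sanity check (my addition, not part of the original answer), a round trip through the two functions above on a supplementary code point such as U+1F600 should produce the surrogate pair D83D DE00 and then recover the original value:

#include <cstdio>

int main() {
    std::vector<int> units = utf32_to_utf16(0x1F600);
    std::printf("%04X %04X\n", units[0], units[1]); // prints: D83D DE00
    std::printf("%X\n", utf16_to_utf32(units));     // prints: 1F600
}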
