Correct reading of utf-16 text file into a string without external libraries?

Question

I have been using StackOverflow since the very beginning and have sometimes been tempted to ask questions, but I have always either figured them out on my own or eventually found answers... until now. This feels like it should be fairly simple, but I have been wandering around the internet for hours without success, so I turn here:

I have a pretty standard UTF-16 text file with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I have seen many related questions answered (here and elsewhere), but they are either trying to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are simply confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read: it will always be UTF-16, it has a BOM and everything, and it can stay that way.

I used the solution described here, which worked for text files that were all in English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I would prefer not to include an entire large library in an application for distribution just to read one text file in one place. I do not care about system independence; I only need it to compile and work on Windows. A solution that did not rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about the Windows architecture, or even a solution that involved Win32 or ATL functions; I just don't want to include another large third-party library such as ICU. Am I still out of luck unless I want to reimplement all of this myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code unfortunately won't help.

edit 2: I realized that the code I borrowed earlier did not fail on non-English characters in general, as I had thought. Rather, it fails on specific characters in my test document, including ： (FULLWIDTH COLON, U+FF1A) and ） (FULLWIDTH RIGHT PARENTHESIS, U+FF09). The solution bames53 posted also mostly works, but is tripped up by those same characters?

edit 3 (and the answer!): The original code I was using did basically work; as bames53 helped me discover, the ifstream just needed to be opened in binary mode.

+6

c++ winapi unicode utf-16

neminem May 08 '12 at 18:08

3 answers

C++11 solution (supported on your platform by Visual Studio since 2010, as far as I know):

    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <codecvt>

    int main()
    {
        // open as a byte stream
        std::wifstream fin("text.txt", std::ios::binary);
        // apply BOM-sensitive UTF-16 facet
        fin.imbue(std::locale(fin.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
        // read
        for (wchar_t c; fin.get(c); )
            std::cout << std::showbase << std::hex << c << '\n';
    }

+10

Cubbi May 08 '12 at 18:25

Edit:

So it looks like the problem was that Windows treats certain magic byte sequences as the end of the file in text mode. This can be solved by reading the file in binary mode, std::ifstream fin("filename", std::ios::binary);, and then copying the data into the wstring as you already do.

The simplest, non-portable solution would be to simply copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes wide and uses UTF-16 as its encoding.

You will have a bit of trouble converting UTF-16 to a locale-specific wchar_t encoding in a fully portable way.

Here are the Unicode conversion facilities available in the C++ standard library (although VS 10 and 11 implement only items 3, 4, and 5):

1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb / mbrtoc32
7. c16rtomb / mbrtoc16

And what each one does:

1. A codecvt facet that always converts between UTF-8 and UTF-32
2. converts between UTF-8 and UTF-16
3. converts between UTF-8 and UCS-2 or UCS-4, depending on the size of the target element (characters outside the BMP are probably truncated)
4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. converts between UTF-8 and UTF-16
6. If the macro __STDC_UTF_32__ is defined, these functions convert between the current locale's char encoding and UTF-32
7. If the macro __STDC_UTF_16__ is defined, these functions convert between the current locale's char encoding and UTF-16

If __STDC_ISO_10646__ is defined, then converting directly with codecvt_utf16<wchar_t> should be fine, since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and therefore implies that wchar_t is large enough to hold any such value).

Unfortunately, nothing is defined that goes directly from UTF-16 to wchar_t. You can go UTF-16 → UCS-4 → mb (if __STDC_UTF_32__ is defined) → wc, but you will lose anything that cannot be represented in the locale's multibyte encoding. And, of course, no matter what, a conversion from UTF-16 to wchar_t will lose anything that cannot be represented in the locale's wchar_t encoding.

So it's probably not worth being portable; instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode on files.

This should build and run anywhere, but it makes a bunch of assumptions in order to actually work:

    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::stringstream ss;
        // binary mode, so that no byte value is treated specially
        std::ifstream fin("filename", std::ios::binary);
        ss << fin.rdbuf(); // dump file contents into a stringstream
        std::string const &s = ss.str();
        if (s.size() % sizeof(wchar_t) != 0) {
            // must be even, two bytes per code unit
            std::cerr << "file not the right size\n";
            return 1;
        }
        std::wstring ws;
        ws.resize(s.size() / sizeof(wchar_t));
        if (!ws.empty())
            std::memcpy(&ws[0], s.c_str(), s.size()); // copy data into wstring
    }

You should probably at least add code to handle endianness and the BOM. In addition, Windows newlines are not converted automatically, so you will need to do that manually.
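As a rough illustration of that last point, here is a minimal sketch of a decoder that works on the raw bytes read in binary mode. The function name decode_utf16 is hypothetical, and the sketch assumes that a file without a BOM is little-endian, that a trailing odd byte can be ignored, and that unpaired surrogates can be dropped. It avoids C++11 features so it should also build with older compilers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode raw UTF-16 bytes into a wstring.
// Assumption: no BOM means little-endian (the usual case on Windows).
std::wstring decode_utf16(const std::string& bytes)
{
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; pos = 2; } // UTF-16BE BOM
        else if (b0 == 0xFF && b1 == 0xFE) { pos = 2; }               // UTF-16LE BOM
    }

    std::wstring out;
    unsigned long pending_high = 0; // high surrogate waiting for its partner
    for (; pos + 1 < bytes.size(); pos += 2) {
        unsigned long hi = static_cast<unsigned char>(bytes[big_endian ? pos : pos + 1]);
        unsigned long lo = static_cast<unsigned char>(bytes[big_endian ? pos + 1 : pos]);
        unsigned long unit = (hi << 8) | lo;

        // Where wchar_t is wider than 16 bits, combine surrogate pairs;
        // where it is 16 bits (Windows), code units are stored as they are.
        if (sizeof(wchar_t) > 2 && unit >= 0xD800 && unit <= 0xDBFF) {
            pending_high = unit;
            continue;
        }
        if (pending_high != 0) {
            if (unit >= 0xDC00 && unit <= 0xDFFF)
                unit = 0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00);
            pending_high = 0;
        }

        // Manual newline handling: collapse CR LF into a single LF.
        if (unit == 0x0A && !out.empty() && out[out.size() - 1] == L'\r')
            out[out.size() - 1] = L'\n';
        else
            out += static_cast<wchar_t>(unit);
    }
    return out;
}
```

The input would be the std::string filled from the binary-mode stream, so decode_utf16(ss.str()) could stand in for the memcpy step if these assumptions are acceptable.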

+4

bames53 May 08 '12 at 20:04

Mark Ransom · Accepted Answer · 2012-05-09T03:30:52+0000

When you open a UTF-16 file, you must open it in binary mode. This is because in text mode certain characters are interpreted specially: in particular, 0x0d can be dropped as part of newline translation, and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of their character code, and they will disrupt the reading of the file. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of the file, see this blog post.
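To make the byte-level problem concrete, here is a small sketch that checks whether a UTF-16 code unit, written out as little-endian bytes, contains one of the two byte values named above. The helper names utf16le_bytes and has_text_mode_hazard are made up for illustration, and little-endian byte order is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the two UTF-16LE bytes of a single 16-bit code unit.
std::string utf16le_bytes(unsigned int code_unit)
{
    std::string bytes;
    bytes += static_cast<char>(code_unit & 0xFF);        // low byte first
    bytes += static_cast<char>((code_unit >> 8) & 0xFF); // then high byte
    return bytes;
}

// Hypothetical helper: true if any byte is 0x1A (end-of-file marker in
// text mode) or 0x0D (carriage return, subject to newline translation).
bool has_text_mode_hazard(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b == 0x1A || b == 0x0D)
            return true;
    }
    return false;
}
```

For example, has_text_mode_hazard(utf16le_bytes(0xFF1A)) is true because the low byte of FULLWIDTH COLON is 0x1A, whereas the same check for 'A' (U+0041) is false, which is consistent with an all-English file reading correctly in text mode.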
