Edit:
So it looks like the problem was that Windows treats certain magic byte sequences (for example the byte 0x1A, Ctrl+Z) as the end of the file in text mode. This is solved by opening the file in binary mode, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.
The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes wide and uses UTF-16 as its encoding.
You will have a bit of trouble converting UTF-16 to the locale-specific wchar_t encoding in a fully portable way.
Here is the Unicode conversion functionality available in the C++ standard library (although VS 10 and 11 only implement items 3, 4 and 5):
1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb / mbrtoc32
7. c16rtomb / mbrtoc16
And what each one does:
1. A codecvt facet that always converts between UTF-8 and UTF-32
2. converts between UTF-8 and UTF-16
3. converts between UTF-8 and UCS-2 or UCS-4, depending on the size of the target element (characters outside the BMP are probably truncated)
4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. converts between UTF-8 and UTF-16
6. If the __STDC_UTF_32__ macro is defined, these functions convert between the current locale's char encoding and UTF-32
7. If the __STDC_UTF_16__ macro is defined, these functions convert between the current locale's char encoding and UTF-16
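As a small illustration of how one of these facets is typically driven, here is a sketch that uses codecvt_utf8_utf16 through std::wstring_convert. The helper name utf8_to_utf16 is made up for this example, and it assumes a toolchain that still ships <codecvt> (the header has been deprecated since C++17).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper, not part of the original answer:
// converts a UTF-8 byte string into UTF-16 code units using
// the codecvt_utf8_utf16 facet.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    // from_bytes throws std::range_error if the input is not valid UTF-8.
    return conv.from_bytes(utf8);
}
```

A character outside the BMP comes back as a surrogate pair, which is the difference from codecvt_utf8 with a 16-bit element type, where the target is UCS-2.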
If __STDC_ISO_10646__ is defined, then converting directly with codecvt_utf16<wchar_t> should be fine, as this macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and therefore implies that wchar_t is large enough to hold any such value).
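A minimal sketch of that direct conversion might look like this. It assumes the input is little-endian UTF-16 without a BOM and that <codecvt> is still available (deprecated since C++17). The names wide_is_unicode and utf16le_to_wide are invented for the example.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Records whether the implementation promises that wchar_t values
// are Unicode code points. Callers can check this before trusting
// the result of the conversion below.
#ifdef __STDC_ISO_10646__
constexpr bool wide_is_unicode = true;
#else
constexpr bool wide_is_unicode = false;
#endif

// Hypothetical helper: decode little-endian UTF-16 bytes into a wstring.
// Assumption: no BOM is present, so the byte order is fixed here.
std::wstring utf16le_to_wide(const std::string& bytes) {
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>,
        wchar_t> conv;
    // from_bytes throws std::range_error on malformed input.
    return conv.from_bytes(bytes);
}
```

Where wchar_t is only 16 bits wide, this facet targets UCS-2, so characters outside the BMP cannot be expected to survive.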
Unfortunately, nothing is defined that goes directly from UTF-16 to wchar_t. You can go UTF-16 → UCS-4 → mb (if __STDC_UTF_32__) → wc, but you will lose anything that is not representable in the locale's multibyte encoding. And, of course, no matter what, the conversion from UTF-16 to wchar_t will lose anything that cannot be represented in the locale's wchar_t encoding.
So it's probably not worth being portable; instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode for files.
This should build and run anywhere, but it makes a bunch of assumptions in order to actually work:
    #include <cstring>   // std::memcpy
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::stringstream ss;
        // binary mode, so that no byte sequence is treated as end of file
        std::ifstream fin("filename", std::ios::binary);
        ss << fin.rdbuf(); // dump file contents into a stringstream
        std::string const &s = ss.str();
        if (s.size() % sizeof(wchar_t) != 0) {
            // must be a whole number of code units (two bytes each on Windows)
            std::cerr << "file not the right size\n";
            return 1;
        }
        std::wstring ws;
        ws.resize(s.size() / sizeof(wchar_t));
        std::memcpy(&ws[0], s.c_str(), s.size()); // copy data into wstring
    }
You should probably at least add code to handle endianness and the BOM. In addition, Windows newlines are not converted automatically, so you need to do that manually.
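One way those two extra steps could be handled is sketched below. The helper decode_utf16_bytes is hypothetical, and its policy is an assumption rather than anything Windows mandates: a leading BOM selects the byte order, input without a BOM is treated as little-endian, a trailing odd byte is ignored, and every "\r\n" pair is collapsed to "\n".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw UTF-16 file bytes into UTF-16 code units,
// honoring a BOM and converting Windows newlines.
std::u16string decode_utf16_bytes(const std::string& bytes) {
    bool big_endian = false;  // assumption: little-endian unless a BOM says otherwise
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) {         // big-endian BOM
            big_endian = true;
            pos = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {  // little-endian BOM
            pos = 2;
        }
    }

    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned char first = static_cast<unsigned char>(bytes[pos]);
        const unsigned char second = static_cast<unsigned char>(bytes[pos + 1]);
        const char16_t unit = big_endian
            ? static_cast<char16_t>((first << 8) | second)
            : static_cast<char16_t>((second << 8) | first);
        if (unit == u'\n' && !out.empty() && out.back() == u'\r') {
            out.back() = unit;  // replace the '\r' of a "\r\n" pair with '\n'
        } else {
            out.push_back(unit);
        }
    }
    return out;
}
```

On Windows, where wchar_t is a 16-bit UTF-16 code unit, the result can be copied element by element into a std::wstring. Elsewhere the code units would still need a UTF-16 decoding step.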