Problem with getline and "weird characters"

I have a strange problem. I use:

    wifstream a("a.txt");
    wstring line;
    while (a.good()) // !a.eof() not helping
    {
        getline(a, line);
        //...
        wcout << line << endl;
    }

and it works fine for a txt file like this one: http://www.speedyshare.com/files/29833132/a.txt (sorry for the link, but it is only 80 bytes, so it should not be a problem to fetch; if I copy and paste it on SO the newlines get lost). BUT when I add, for example, 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples ) to any line, that is the line where the reading stops. I was under the wrong impression that getline, which takes a wstring as one argument and a wifstream as the other, can chew through any text input... Is there a way to read every line in a file, even if it contains funky characters?

+4
3 answers

The not-so-satisfactory answer is that you need to imbue the input stream with a locale that understands the specific character encoding. If you don't know which locale to choose, you can use the empty-named locale, which selects the user's preferred locale.

For example (untested):

    std::wifstream a("a.txt");
    std::locale loc("");
    a.imbue(loc);

Unfortunately, there is no standard way to determine which locales are available on a given platform, let alone to choose one based on character encoding.

The above code puts the choice of locale in the hands of the user, and if they have it set to something plausible (for example, en_AU.UTF-8), it may all just work.
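To make the idea above concrete, here is a minimal sketch that imbues the stream before opening it and loops on the result of getline itself rather than on good(). The helper name `read_lines` and its `locale_name` parameter are hypothetical, introduced only for illustration; whether a given file decodes correctly still depends on which locales are installed on the machine.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): reads every line of
// `path` as a std::wstring, decoding bytes with the locale `locale_name`.
// The default "" selects the user's preferred locale from the environment.
std::vector<std::wstring> read_lines(const std::string& path,
                                     const std::string& locale_name = "")
{
    std::wifstream in;
    // Imbue before opening so the locale's codecvt facet is used from the
    // first byte. std::locale throws std::runtime_error for an unknown name.
    in.imbue(std::locale(locale_name.c_str()));
    in.open(path.c_str());
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<std::wstring> lines;
    std::wstring line;
    // Test getline's own result, so a failed read never yields a stale line.
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

A call such as `read_lines("a.txt")` uses the environment's locale, while `read_lines("a.txt", "en_AU.UTF-8")` asks for a specific one and assumes it is installed. If the bytes cannot be decoded by the chosen locale, the loop simply ends early, which is the symptom described in the question.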

Otherwise, you may have to resort to third-party libraries such as iconv or ICU.

This blog post is also relevant (apologies for the self-promotion).

+6

The problem is calling the global function getline(a, line). It takes std::string. Use the std::wistream::getline method instead of the getline function.

+3

C++ fstreams delegate I/O to their filebufs. filebufs always read "raw bytes" from disk and then use the stream locale's codecvt facet to convert these raw bytes into their "internal encoding".

A wfstream is a basic_fstream<wchar_t> and therefore has a basic_filebuf<wchar_t>, which uses the locale's codecvt<wchar_t, char> facet to convert the bytes read from disk into wchar_t values. If you are reading a UCS-2 encoded file, the conversion must be done by a codecvt facet that "knows" the external encoding is UCS-2. So you need a locale with such a facet (see, for example, this SO question).
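As one possible illustration of a locale that carries such a facet, the sketch below builds a locale from the classic one plus std::codecvt_utf16 and reads a UTF-16 file line by line. The helper name `read_utf16_lines` is hypothetical, and the sketch assumes a little-endian file unless a byte order mark says otherwise. Note that the <codecvt> header is deprecated since C++17 and removed in C++26, so this is an option for older toolchains rather than a recommendation.

```cpp
#include <codecvt>   // deprecated since C++17, removed in C++26
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): reads a UTF-16 file
// into wide strings. A leading byte order mark is consumed if present;
// without one, the bytes are assumed to be little-endian.
std::vector<std::wstring> read_utf16_lines(const std::string& path)
{
    typedef std::codecvt_utf16<
        wchar_t, 0x10ffff,
        std::codecvt_mode(std::consume_header | std::little_endian)>
        utf16_facet;

    std::wifstream in;
    // The locale takes ownership of the facet allocated here.
    in.imbue(std::locale(std::locale::classic(), new utf16_facet));
    // Binary mode keeps the platform from translating newline bytes
    // before the facet sees them.
    in.open(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

The same pattern should work with std::codecvt_utf8<wchar_t> for UTF-8 files; only the facet changes, the filebuf machinery described above stays the same.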

By default, a stream's locale is the global locale as it was when the stream was constructed. To use a specific locale, it has to be imbue()-d on the stream.
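The point about construction time can be checked with a small sketch. Both function names are hypothetical, and the second one temporarily replaces the global locale with an unnamed one (the classic locale plus a fresh facet) purely so that the two locales compare unequal.

```cpp
#include <fstream>
#include <locale>

// A default-constructed std::locale is a copy of the current global locale,
// so a fresh stream should report an equal locale.
bool new_stream_uses_global_locale()
{
    std::wifstream stream;
    return stream.getloc() == std::locale();
}

// A stream constructed before the global locale changes keeps the old one.
bool existing_stream_keeps_its_locale()
{
    const std::locale original;   // copy of the current global locale
    std::wifstream stream;        // captures `original` at construction

    // Unnamed locale used only for this demonstration; it compares unequal
    // to any locale that is not a copy of it.
    const std::locale custom(std::locale::classic(),
                             new std::numpunct<wchar_t>);
    std::locale::global(custom);

    const bool kept = stream.getloc() == original
                   && stream.getloc() != std::locale();

    std::locale::global(original);  // restore the previous global locale
    return kept;
}
```

This is why changing the global locale after the stream exists has no effect on it, and why imbue() is needed to switch the locale of an existing stream.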

+3
