UTF-16 truncated read in C ++

Question

UTF-16 truncated read in C ++

My goal is to convert external input sources to a common internal UTF-8 encoding, as it is compatible with many libraries that I use (like RE2) and is compact. Since I don’t need to do line cutting except for pure ASCII, UTF-8 is the perfect format for me. Now from the external input formats that I have to decode includes UTF-16.

To test the reading of UTF-16 (either high or low) in C ++, I converted the test file UTF-8 to UTF-16 LE and UTF-16 BE. The file is a simple gibberish in CSV format with many different source languages (English, French, Japanese, Korean, Arabic, Spanish, Thai) to create a fairly complex file:

"This","佐藤 幹夫","Mêmes","친구" "ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

UTF-8 Example

Now parsing this file encoded in UTF-8 using the following code gives the expected result (I understand that this example is mostly artificial, since my system encoding is UTF-8, and therefore no actual conversion to wide characters, and then back for bytes):

 #include <sstream> #include <locale> #include <iostream> #include <fstream> #include <codecvt> std::wstring readFile(const char* filename) { std::wifstream wif(filename, std::ios::binary); wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff>)); std::wstringstream wss; wss << wif.rdbuf(); return wss.str(); } int main() { std::wstring read = readFile("utf-8.csv"); std::cout << read.size() << std::endl; using convert_type = std::codecvt_utf8<wchar_t>; std::wstring_convert<convert_type, wchar_t> converter; std::string converted_str = converter.to_bytes( read ); std::cout << converted_str; return 0; }

When the file compiles and runs (on Linux, therefore the UTF-8 system encoding), I get the following output:

 $ g++ utf8.cpp -o utf8 -std=c++14 $ ./utf8 73 "This","佐藤 幹夫","Mêmes","친구" "ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

UTF-16 Example

However, when I try to use a similar example with UTF-16, I get a truncated file, despite loading the files correctly in text editors, Python, etc.

 #include <fstream> #include <sstream> #include <iostream> #include <locale> #include <codecvt> #include <string> std::wstring readFile(const char* filename) { std::wifstream wif(filename, std::ios::binary); wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>)); std::wstringstream wss; wss << wif.rdbuf(); return wss.str(); } int main() { std::wstring read = readFile("utf-16.csv"); std::cout << read.size() << std::endl; using convert_type = std::codecvt_utf8<wchar_t>; std::wstring_convert<convert_type, wchar_t> converter; std::string converted_str = converter.to_bytes( read ); std::cout << converted_str; return 0; }

When the file compiles and runs (on Linux, so the system encoding is UTF-8), I get the following output for the small endian format:

 $ g++ utf16.cpp -o utf16 -std=c++14 $ ./utf16 19 "This","PO

For the big-endian format, I get the following:

 $ g++ utf16.cpp -o utf16 -std=c++14 $ ./utf16 19 "This","OP

Interestingly, the CJK character must be part of the Basic Multilingual Plane, but obviously not properly converted, and the file has been truncated before. The same problem occurs with line building.

Other resources

I have checked the following resources before, this answer is most noteworthy, as well as this answer . None of their decisions turned out to be fruitful for me.

Other features

 LANG = en_US.UTF-8 gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)

Any other details, and I will be happy to provide them. Thanks.

edits

Adrian mentioned in the comments should provide the hexdump that is shown for "utf-16le", a UTF-16 restricted encoding file:

 0000000 0022 0054 0068 0069 0073 0022 002c 0022 0000010 4f50 85e4 0020 5e79 592b 0022 002c 0022 0000020 004d 00ea 006d 0065 0073 0022 002c 0022 0000030 ce5c ad6c 0022 000a 0022 0e20 0e04 0e27 0000040 0e32 0022 002c 0022 0020 0643 064a 0628 0000050 0648 0631 062f 0020 0644 0644 0643 062a 0000060 0627 0628 0629 0020 0628 0627 0644 0639 0000070 0631 0628 064a 0022 002c 0022 30a6 30a5 0000080 30ad 30e5 002c 0022 002c 0022 d83d dec2 0000090 0022 000a 0000094

qexyn mentioned removing the std::ios::binary flag, which I tried but didn't change anything.

Finally, I tried using iconv to make sure these are valid files using both the command line utility and the C module.

$ iconv -f="UTF-16BE" -t="UTF-8" utf-16be.csv "This","佐藤幹夫","Mêmes","친구" "ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

Obviously, iconv has no problems with the source files. This leads me to use iconv as it is cross-platform, easy to use and tested, but if anyone has an answer with a standard library, I will gladly accept it.

+5

c ++ encoding c ++ 11 utf-8 utf-16

Alexander Huszagh 12 sept '16 at 0:02

source share

1 answer

Alexander Huszagh · Accepted Answer · 2016-09-13T02:46:02+0000

So I'm still waiting for a potential answer using the standard C ++ library, but I had no success, so I wrote an implementation that works with Boost and iconv (which are pretty common dependencies). It consists of a header and a source file, it works all of the above situations, is quite effective, can take any pair of iconv encodings and wrap a stream object to provide easy integration into existing code. Since I'm pretty new to C ++, I would check the code if you decide to implement it yourself: I am far from expert.

encoding.hpp

 #pragma once #include <iostream> #if defined(_MSC_VER) && (_MSC_VER >= 1020) # pragma once #endif #include <cassert> #include <iosfwd> // streamsize. #include <memory> // allocator, bad_alloc. #include <new> #include <string> #include <boost/config.hpp> #include <boost/cstdint.hpp> #include <boost/detail/workaround.hpp> #include <boost/iostreams/constants.hpp> #include <boost/iostreams/detail/config/auto_link.hpp> #include <boost/iostreams/detail/config/dyn_link.hpp> #include <boost/iostreams/detail/config/wide_streams.hpp> #include <boost/iostreams/detail/config/zlib.hpp> #include <boost/iostreams/detail/ios.hpp> #include <boost/iostreams/filter/symmetric.hpp> #include <boost/iostreams/pipeline.hpp> #include <boost/type_traits/is_same.hpp> #include <boost/iostreams/filter/zlib.hpp> #include <iconv.h> // Must come last. #ifdef BOOST_MSVC # pragma warning(push) # pragma warning(disable:4251 4231 4660) // Dependencies not exported. #endif #include <boost/config/abi_prefix.hpp> #undef small namespace boost { namespace iostreams { // CONSTANTS // --------- extern const size_t maxUnicodeWidth; // OBJECTS // ------- /** @brief Parameters for input and output encodings to pass to iconv. */ struct encoded_params { std::string input; std::string output; encoded_params(const std::string &input = "UTF-8", const std::string &output = "UTF-8"): input(input), output(output) {} }; namespace detail { // DETAILS // ------- /** @brief Base class for the character set conversion filter. * Contains a core process function which converts the source * encoding to the destination encoding. */ class BOOST_IOSTREAMS_DECL encoded_base { public: typedef char char_type; protected: encoded_base(const encoded_params & params = encoded_params()); ~encoded_base(); int convert(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end); int copy(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end); int process(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end, int /* flushLevel */); public: int total_in(); int total_out(); private: iconv_t conv; bool differentCharset; }; /** @brief Template implementation for the encoded writer. * * Model of a C-style file filter for character set conversions, via * iconv. */ template<typename Alloc = std::allocator<char> > class encoded_writer_impl : public encoded_base { public: encoded_writer_impl(const encoded_params &params = encoded_params()); ~encoded_writer_impl(); bool filter(const char*& src_begin, const char* src_end, char*& dest_begin, char* dest_end, bool flush); void close(); }; /** @brief Template implementation for the encoded reader. * * Model of a C-style file filter for character set conversions, via * iconv. */ template<typename Alloc = std::allocator<char> > class encoded_reader_impl : public encoded_base { public: encoded_reader_impl(const encoded_params &params = encoded_params()); ~encoded_reader_impl(); bool filter(const char*& begin_in, const char* end_in, char*& begin_out, char* end_out, bool flush); void close(); bool eof() const { return eof_; } private: bool eof_; }; } /* detail */ // FILTERS // ------- /** @brief Model of InputFilter and OutputFilter implementing * character set conversion via iconv. */ template<typename Alloc = std::allocator<char> > struct basic_encoded_writer : symmetric_filter<detail::encoded_writer_impl<Alloc>, Alloc> { private: typedef detail::encoded_writer_impl<Alloc> impl_type; typedef symmetric_filter<impl_type, Alloc> base_type; public: typedef typename base_type::char_type char_type; typedef typename base_type::category category; basic_encoded_writer(const encoded_params &params = encoded_params(), int buffer_size = default_device_buffer_size); int total_in() { return this->filter().total_in(); } }; BOOST_IOSTREAMS_PIPABLE(basic_encoded_writer, 1) typedef basic_encoded_writer<> encoded_writer; /** @brief Model of InputFilter and OutputFilter implementing * character set conversion via iconv. */ template<typename Alloc = std::allocator<char> > struct basic_encoded_reader : symmetric_filter<detail::encoded_reader_impl<Alloc>, Alloc> { private: typedef detail::encoded_reader_impl<Alloc> impl_type; typedef symmetric_filter<impl_type, Alloc> base_type; public: typedef typename base_type::char_type char_type; typedef typename base_type::category category; basic_encoded_reader(const encoded_params &params = encoded_params(), int buffer_size = default_device_buffer_size); int total_out() { return this->filter().total_out(); } bool eof() { return this->filter().eof(); } }; BOOST_IOSTREAMS_PIPABLE(basic_encoded_reader, 1) typedef basic_encoded_reader<> encoded_reader; namespace detail { // IMPLEMENTATION // -------------- /** @brief Initialize the encoded writer with the iconv parameters. */ template<typename Alloc> encoded_writer_impl<Alloc>::encoded_writer_impl(const encoded_params& p): encoded_base(p) {} /** @brief Close the encoded writer. */ template<typename Alloc> encoded_writer_impl<Alloc>::~encoded_writer_impl() {} /** @brief Implementation of the symmetric, character set encoding filter * for the writer. */ template<typename Alloc> bool encoded_writer_impl<Alloc>::filter (const char*& src_begin, const char* src_end, char*& dest_begin, char* dest_end, bool flush) { int result = process(src_begin, src_end, dest_begin, dest_end, flush); return result == -1; } /** @brief Close the encoded writer. */ template<typename Alloc> void encoded_writer_impl<Alloc>::close() {} /** @brief Close the encoded reader. */ template<typename Alloc> encoded_reader_impl<Alloc>::~encoded_reader_impl() {} /** @brief Initialize the encoded reader with the iconv parameters. */ template<typename Alloc> encoded_reader_impl<Alloc>::encoded_reader_impl(const encoded_params& p): encoded_base(p), eof_(false) {} /** @brief Implementation of the symmetric, character set encoding filter * for the reader. */ template<typename Alloc> bool encoded_reader_impl<Alloc>::filter (const char*& src_begin, const char* src_end, char*& dest_begin, char* dest_end, bool /* flush */) { int result = process(src_begin, src_end, dest_begin, dest_end, true); return result; } /** @brief Close the encoded reader. */ template<typename Alloc> void encoded_reader_impl<Alloc>::close() { // cannot re-open, not a true stream //eof_ = false; //reset(false, true); } } /* detail */ /** @brief Initializer for the symmetric write filter, which initializes * the iconv base from the parameters and the buffer size. */ template<typename Alloc> basic_encoded_writer<Alloc>::basic_encoded_writer (const encoded_params& p, int buffer_size): base_type(buffer_size, p) {} /** @brief Initializer for the symmetric read filter, which initializes * the iconv base from the parameters and the buffer size. */ template<typename Alloc> basic_encoded_reader<Alloc>::basic_encoded_reader(const encoded_params &p, int buffer_size): base_type(buffer_size, p) {} } /* iostreams */ } /* boost */ #include <boost/config/abi_suffix.hpp> // Pops abi_suffix.hpp pragmas. #ifdef BOOST_MSVC # pragma warning(pop) #endif

encoding.cpp

 #include "encoding.hpp" #include <iconv.h> #include <algorithm> #include <cstring> #include <string> namespace boost { namespace iostreams { namespace detail { // CONSTANTS // --------- const size_t maxUnicodeWidth = 4; // DETAILS // ------- /** @brief Initialize the iconv converter with the source and * destination encoding. */ encoded_base::encoded_base(const encoded_params &params) { if (params.output != params.input) { conv = iconv_open(params.output.data(), params.input.data()); differentCharset = true; } else { differentCharset = false; } } /** @brief Cleanup the iconv converter. */ encoded_base::~encoded_base() { if (differentCharset) { iconv_close(conv); } } /** C-style stream converter, which converts the source * character array to the destination character array, calling iconv * recursively to skip invalid characters. */ int encoded_base::convert(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end) { char *end = dest_end - maxUnicodeWidth; size_t srclen, dstlen; while (src_begin < src_end && dest_begin < end) { srclen = src_end - src_begin; dstlen = dest_end - dest_begin; char *pIn = const_cast<char *>(src_begin); iconv(conv, &pIn, &srclen, &dest_begin, &dstlen); if (src_begin == pIn) { src_begin++; } else { src_begin = pIn; } } return 0; } /** C-style stream converter, which copies source bytes to output * bytes. */ int encoded_base::copy(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end) { size_t srclen = src_end - src_begin; size_t dstlen = dest_end - dest_begin; size_t length = std::min(srclen, dstlen); memmove((void*) dest_begin, (void *) src_begin, length); src_begin += length; dest_begin += length; return 0; } /** @brief Processes the input stream through the stream filter. */ int encoded_base::process(const char * & src_begin, const char * & src_end, char * & dest_begin, char * & dest_end, int /* flushLevel */) { if (differentCharset) { return convert(src_begin, src_end, dest_begin, dest_end); } else { return copy(src_begin, src_end, dest_begin, dest_end); } } } /* detail */ } /* iostreams */ } /* boost */

Program example

 #include "encoding.hpp" #include <boost/iostreams/filtering_streambuf.hpp> #include <fstream> #include <string> int main() { std::ifstream fin("utf8.csv", std::ios::binary); std::ofstream fout("utf16le.csv", std::ios::binary); // encoding boost::iostreams::filtering_streambuf<boost::iostreams::input> streambuf; streambuf.push(boost::iostreams::encoded_reader({"UTF-8", "UTF-16LE"})); streambuf.push(fin); std::istream stream(&streambuf); std::string line; while (std::getline(stream, line)) { fout << line << std::endl; } fout.close(); }

In the above example, we write a copy of the UTF-8 encoded file to UTF-16LE, using streambuffer to convert the text of UTF-8 to UTF-16LE, which we write as bytes for output, only adding 4 lines of (readable) code for all of our process.

UTF-16 truncated read in C ++

More articles: