C++ Unicode file I/O

I need a file I/O library that gives my program a UTF-16 (little-endian) interface but can handle files in other encodings: mainly ASCII (input only), UTF-8, UTF-16, and UTF-32/UCS-4, in both little- and big-endian byte orders.

Looking around, the only library I found was ICU's ustdio.h.

I tried it, but I can't get it to work even with a very simple piece of text, and there is next to no documentation on how to use it: only the ICU file reference page, which has no examples and very little detail (for example, after making a UFILE from an existing FILE, is it safe to use other functions that take a FILE *? Along with several other questions...).

Also, I'd much prefer a C++ library that gives me a wide-stream interface over a C-style interface.

 std::wstring str = L"Hello World in UTF-16!\nAnother line.\n";
 UFILE *ufile = u_fopen("out2.txt", "w", 0, "utf-16");
 u_file_write(str.c_str(), str.size(), ufile);
 u_fclose(ufile);

Output

 Hello World in UTF-16!਍䄀渀漀琀栀攀爀 氀椀渀攀⸀ഀ

hex

 FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00 55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 0A 00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00 20 00 6C 00 69 00 6E 00 65 00 2E 00 0D 0A 00 

EDIT: The correct output on Windows would look like this:

 FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 20 00 69 00 6E 00 20 00 55 00 54 00 46 00 2D 00 31 00 36 00 21 00 0D 00 0A 00 41 00 6E 00 6F 00 74 00 68 00 65 00 72 00 20 00 6C 00 69 00 6E 00 65 00 2E 00 0D 00 0A 00 
+4
5 answers

The problem you are seeing comes from the line feed conversion. Unfortunately, it is done at the byte level (after the code conversion) and knows nothing about the encoding. In other words, you need to turn off the automatic conversion (by opening the file in binary mode, with the "b" flag), and if you want 0A 00 to be expanded to 0D 00 0A 00, you have to do it yourself.
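A minimal sketch of doing that expansion yourself, in plain standard C++ rather than through ICU. The helper names `to_utf16le_crlf` and `write_bytes` are made up for illustration, and the sketch assumes the text uses a bare `\n` for line breaks (an existing `\r\n` would end up with a doubled CR).

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: encode UTF-16 code units as little-endian bytes,
// expanding each LF to CR LF at the code-unit level (0D 00 0A 00).
std::string to_utf16le_crlf(const std::u16string& text) {
    std::string bytes = "\xFF\xFE";  // UTF-16LE byte order mark
    auto put = [&bytes](char16_t unit) {
        bytes.push_back(static_cast<char>(unit & 0xFF));  // low byte first
        bytes.push_back(static_cast<char>(unit >> 8));    // then high byte
    };
    for (char16_t unit : text) {
        if (unit == u'\n') put(u'\r');  // insert CR before every LF
        put(unit);
    }
    return bytes;
}

// Hypothetical helper: write the bytes in binary mode so the runtime
// performs no newline translation of its own.
bool write_bytes(const std::string& path, const std::string& bytes) {
    std::ofstream out(path, std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

Because the CR is inserted before the text is turned into bytes, each line break is written as two complete 16-bit units and the following characters stay aligned.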

You mentioned that you would prefer a C++ wide-stream interface, so I will outline what I did to achieve that in our software:

  • Write a std::codecvt facet that uses an ICU UConverter to perform the conversions.
  • Use a std::wfstream to open the file.
  • imbue() your custom codecvt into the wfstream.
  • Open the wfstream with the binary flag to turn off the automatic (and erroneous) line feed conversion.
  • Write a "WNewlineFilter" to perform line feed conversion on wchars. Use boost::iostreams::newline_filter for inspiration.
  • Use a boost::iostreams::filtering_wstream to tie the wfstream and the WNewlineFilter together as a stream.
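The first four steps above can be sketched roughly as follows. This is not the code from our software: `Utf16LeFacet` and `write_utf16le` are hypothetical names, the facet hard-codes UTF-16LE output where a real one would delegate to an ICU UConverter, and the CR insertion in the loop is only a stand-in for the Boost-based WNewlineFilter.

```cpp
#include <cassert>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical stand-in for an ICU-backed facet: encodes wchar_t as UTF-16LE.
class Utf16LeFacet : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            unsigned long cp = static_cast<unsigned long>(*from_next);
            unsigned long units[2] = {cp, 0};
            int count = 1;
            if (cp > 0xFFFF) {  // only reachable where wchar_t is 32 bits wide
                cp -= 0x10000;
                units[0] = 0xD800 + (cp >> 10);    // high surrogate
                units[1] = 0xDC00 + (cp & 0x3FF);  // low surrogate
                count = 2;
            }
            if (to_end - to_next < 2 * count) return partial;  // buffer full
            for (int i = 0; i < count; ++i) {
                *to_next++ = static_cast<char>(units[i] & 0xFF);
                *to_next++ = static_cast<char>((units[i] >> 8) & 0xFF);
            }
            ++from_next;
        }
        return ok;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }    // variable width
    int do_max_length() const noexcept override { return 4; }  // surrogate pair
};

// Hypothetical helper: imbue the facet, open in binary mode, write BOM + text.
bool write_utf16le(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // imbue before open; the locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new Utf16LeFacet));
    out.open(path, std::ios::binary);  // binary: no byte-level CR/LF rewriting
    if (!out) return false;
    out.put(static_cast<wchar_t>(0xFEFF));  // byte order mark
    for (wchar_t ch : text) {
        if (ch == L'\n') out.put(L'\r');  // stand-in for the newline filter
        out.put(ch);
    }
    out.close();
    return !out.fail();
}
```

Doing the newline expansion on wide characters, before the facet runs, is what keeps the CR and LF as whole 16-bit units; the filtering_wstream in the last step serves the same purpose while keeping the expansion out of the calling code.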
+4

I have had success with the EZUTF library hosted on CodeProject: High-performance Unicode text file I/O for C++

+4

UTF8-CPP gives you conversion between UTF-8, UTF-16 and UTF-32. It is a very nice, lightweight library.

About ICU, some comments from the creator of UTF8-CPP:

ICU library. It is very powerful, complete, multi-functional, mature and widely used. Also large, annoying, not generic, and doesn't play well with the standard library. I definitely recommend looking at the ICU even if you do not plan to use it.

:)

+2

I think the problems come from the 0D 0A 00 line breaks. You could check whether other line breaks work, such as \r\n, or LF or CR alone (the best bet would be \r, I suppose).

EDIT: It seems that 0D 00 0A 00 is what you want, so you can try:

 std::wstring str = L"Hello World in UTF-16!\15\12Another line.\15\12"; 
+1

You can try the iconv library ( libiconv ).
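A rough sketch of what a conversion through iconv might look like, assuming a platform whose iconv knows the encoding names "UTF-8" and "UTF-16LE" (glibc does; some platforms need to link against libiconv explicitly). The wrapper name `convert_encoding` is made up for this example, and it converts the whole buffer in a single call rather than streaming.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: convert a whole buffer from one encoding to another.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    // Generous output buffer: up to 4 bytes per input byte, plus room for a BOM.
    std::string output(input.size() * 4 + 4, '\0');
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();
    char* out = &output[0];
    std::size_t out_left = output.size();

    std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    output.resize(output.size() - out_left);  // keep only the bytes written
    return output;
}
```

iconv only converts buffers; reading and writing the file (in binary mode) and any CR LF handling would still be up to you.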

+1
