Set UTF-8 pathname header in libarchive

ESSENCE

How can I write a zip file using libarchive in C++ so that the path names are encoded in UTF-8? With UTF-8 path names, special characters are decoded correctly by OS X / Linux / Windows 8 / 7-Zip / WinZip.

DETAILS

I am trying to write a zip archive using libarchive, compiling with Visual C++ 2013 on Windows.

I would like to be able to add files with non-ASCII characters (for example, äöü.txt) to the zip archive.

There are four functions for setting the path header in libarchive:

 void archive_entry_set_pathname(struct archive_entry *, const char *);
 void archive_entry_copy_pathname(struct archive_entry *, const char *);
 void archive_entry_copy_pathname_w(struct archive_entry *, const wchar_t *);
 int archive_entry_update_pathname_utf8(struct archive_entry *, const char *);

Unfortunately, none of them seems to work.

In particular, I tried:

 const char* myUtf8Str = ...
 archive_entry_update_pathname_utf8(entry, myUtf8Str); // this sounded like the most straightforward solution

and

 const wchar_t* myUtf16Str = ...
 archive_entry_copy_pathname_w(entry, myUtf16Str); // UTF-16 encoded strings seem to be the default on Windows

In both cases, the resulting zip archive does not show the file names correctly in either Windows Explorer or 7-Zip.

I am sure my input strings are encoded correctly, as I convert them from Qt QString instances that work fine in other parts of my code:

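 const QByteArray utf8Bytes = filename.toUtf8(); // named so the buffer outlives the pointer
 const char* myUtf8Str = utf8Bytes.constData();
 const std::wstring wideName = filename.toStdWString(); // same for the wide string
 const wchar_t* myUtf16Str = wideName.c_str();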

For example, the same conversion works for another libarchive call, the one that creates the zip file:

 // creates a zip archive file where the non-ASCII chars are encoded correctly, e.g. äöü.zip
 archive_write_open_filename_w(archive, zipFile.toStdWString().c_str());

I also tried changing the libarchive options as suggested by this example:

 archive_write_set_options(a, "hdrcharset=UTF-8"); 

But this call fails, so I guess I need to set some other option, but I am running out of ideas.

UPDATE 2

I read about the zip format. It allows file names to be stored in UTF-8, so that OS X / Linux / Windows 8 / 7-Zip / WinZip will always decode them correctly; see, for example, here.

This is what I want to achieve with libarchive, i.e. I would like to pass it a UTF-8-encoded pathname and have it stored in the zip file without any conversion.
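One way to check whether a writer actually marked an entry name as UTF-8 is to inspect the general purpose bit flag in the entry's local file header: in the ZIP format, bit 11 (0x0800, the "language encoding flag") signals that the file name is UTF-8. The following is a minimal sketch under stated assumptions, not libarchive code. The helper name `localHeaderHasUtf8Flag` is hypothetical, and it only looks at a local file header located at the very start of the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if `bytes` starts with a ZIP local file
// header whose general purpose bit flag has bit 11 (UTF-8 names) set.
bool localHeaderHasUtf8Flag(const std::vector<std::uint8_t>& bytes)
{
    // Only the first 8 bytes of the header are needed:
    // signature (4), version needed to extract (2), general purpose flag (2).
    if (bytes.size() < 8)
        return false;

    // Local file header signature "PK\x03\x04" (0x04034b50, little-endian).
    if (bytes[0] != 0x50 || bytes[1] != 0x4b ||
        bytes[2] != 0x03 || bytes[3] != 0x04)
        return false;

    // The flag is a little-endian 16-bit value at offset 6.
    const std::uint16_t flags =
        static_cast<std::uint16_t>(bytes[6] | (bytes[7] << 8));

    return (flags & 0x0800u) != 0;
}
```

Feeding it the first bytes of a zip file produced by libarchive and of one produced by a tool that is known to write UTF-8 names shows whether the two differ in this flag.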

I added the "set locale" approach as an (unsatisfactory) answer.

+7
c++ utf-8 zip wstring libarchive
3 answers

This is a workaround that stores the path names using the system's locale settings; that is, the resulting zip file can be decoded correctly on the same system, but it is not portable.

This is not satisfactory; I am just posting it to show that this is not what I am looking for.

Set the global locale to "" as described here:

 std::locale::global(std::locale("")); 

and then read it:

 std::locale loc;
 std::cout << loc.name() << std::endl;
 // output: English_United States.1252
 // may of course be different depending on system settings

Then set the pathname using archive_entry_update_pathname_utf8.

The zip file now contains file names encoded using Windows-1252, so my Windows system can read them, but they show up as garbage on, for example, Linux.
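The difference is visible at the byte level: in Windows-1252, "äöü" is the three bytes E4 F6 FC, whereas in UTF-8 it is C3 A4 C3 B6 C3 BC, so a reader that expects UTF-8 cannot decode the former. As a rough way to test a name extracted from an archive, here is a minimal sketch of a UTF-8 well-formedness check. `isValidUtf8` is a hypothetical helper and not part of libarchive.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` is well-formed UTF-8 (correct lead and
// continuation bytes, no overlong forms, no surrogates, at most U+10FFFF).
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point

        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation or invalid lead byte

        if (extra > n - i - 1)
            return false;        // sequence is cut off at the end

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;

        i += extra + 1;
    }
    return true;
}
```

A name that passes this check is not guaranteed to be the intended text, but a name that fails it was certainly not stored as UTF-8.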

Future

There is a libarchive issue about UTF-8 file names. The whole story is rather complicated, but it looks like better UTF-8 support may be added in libarchive 4.0.

+2

I am adding this as an answer because it exceeds the length limit for a comment.

When the program starts, the global locale is the classic "C" locale. The classic "C" locale is the US-English ASCII locale of the C standard library, which is used implicitly by programs that are not internationalized. As this source suggests:

... If you plan to localize your program, a suitable strategy might be to retrieve the native locale once at the beginning of your program, and never, ever change this setting again. This way, your application adapts to one particular locale and uses it throughout its run time. Users of such applications can explicitly set their favorite locale before starting the application. On UNIX systems, they do this by setting environment variables such as LANG; other operating systems may use other methods.

In your program, you can specify that you want to use the user's preferred native locale by calling std::setlocale(LC_ALL, "") at startup, passing an empty string as the locale name. An empty string tells setlocale to use the locale specified by the user in the environment.
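To make the quoted behaviour concrete, here is a minimal sketch using only the standard `<clocale>` header. The function names `currentCLocaleName` and `adoptUserLocale` are hypothetical. Note that `std::setlocale` takes a category as its first argument; passing a null pointer as the name only queries the current setting.

```cpp
#include <clocale>
#include <string>

// Returns the name of the current C locale without changing it.
std::string currentCLocaleName()
{
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Switches to the locale named by the user's environment. Returns the new
// locale name, or an empty string if the environment names a locale the
// system does not support (the current locale is then left unchanged).
std::string adoptUserLocale()
{
    const char* name = std::setlocale(LC_ALL, "");
    return name ? name : "";
}
```

Called before any other locale change, `currentCLocaleName()` reports "C"; what `adoptUserLocale()` reports afterwards depends entirely on the system and environment settings.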

0

I got UTF-8 file names working in ZIP archives with libarchive-3.3.3 using exactly this sequence (the order matters!):

 entry = archive_entry_new();
 archive_entry_set_pathname_utf8(entry, utf8Filename);
 archive_entry_set_pathname(entry, utf8Filename);

If you swap the order of archive_entry_set_pathname_utf8 and archive_entry_set_pathname, the entries are garbled in Windows Explorer's ZIP viewer. This worked for me with German umlauts (and should work for every UTF-8 character). It even worked for 2-byte and 3-byte UTF-8 characters (NFC / NFD).

Addition: The process must be started in an environment whose LANG variable is set to a locale with UTF-8 support (i.e. LANG=de_DE.UTF-8 in my case). Without this environment, the process will not generate the correct UTF-8 characters.
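A program can verify this precondition before it starts writing entries. The sketch below is only an illustration: `localeNameIsUtf8` and `langIsUtf8` are hypothetical helpers, and the check is a simple substring match on the locale name. It looks only at LANG because that is the variable mentioned above; on POSIX systems LC_ALL and LC_CTYPE take precedence over LANG when they are set.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if a locale name such as "de_DE.UTF-8"
// names a UTF-8 codeset (case-insensitive match on "utf-8" or "utf8").
bool localeNameIsUtf8(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char ch) {
                       return static_cast<char>(std::tolower(ch));
                   });
    return name.find("utf-8") != std::string::npos ||
           name.find("utf8") != std::string::npos;
}

// Reads LANG from the environment; an unset variable counts as "not UTF-8".
bool langIsUtf8()
{
    const char* lang = std::getenv("LANG");
    return lang != nullptr && localeNameIsUtf8(lang);
}
```

If `langIsUtf8()` returns false, the program could warn the user instead of silently producing an archive with garbled names.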

0
