Correctly decode zip record file names - CP437, UTF-8 or?

Question

I recently wrote a zip file I/O library called zipzap, but I'm struggling to correctly decode zip entry file names from arbitrary zip files.

The PKWARE spec states:

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437 ...
D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification ...

which means that conforming zip files encode file names as CP437 if the EFS bit (bit 11) is not set, and as UTF-8 if it is.
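As a minimal sketch of what the spec-conforming rule looks like in code (the function name and the `EFS_FLAG` constant are illustrative, not taken from zipzap or any particular library):

```python
EFS_FLAG = 0x0800  # general purpose bit 11 (language encoding flag)


def decode_name_per_spec(raw_name: bytes, general_purpose_flags: int) -> str:
    """Decode a zip entry name strictly as the PKWARE spec describes.

    raw_name is the undecoded file name field from the entry header;
    general_purpose_flags is the 16-bit general purpose bit flag field.
    """
    if general_purpose_flags & EFS_FLAG:
        # Bit 11 set: the name is declared to be UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 clear: the name is declared to be IBM Code Page 437.
    return raw_name.decode("cp437")
```

This only gives the right answer for archives written by tools that actually follow the spec.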

Unfortunately, it seems that many zip tools either do not set the EFS bit correctly (e.g. the Mac CLI and GUI zip) or use some other encoding, typically the system default (e.g. WinZip?). If you know how WinZip, 7-Zip, Info-Zip, PKZIP, Java JAR/ZIP, .NET zip, dotnetzip, etc. encode file names, and what they put in the "version made by" field when zipping, please tell me.

In particular, Info-Zip tries the following when unzipping:

File System = MS-DOS (0) => CP437
- except: version = 2.5, 2.6, 4.0 => ISO 8859-1
File System = HPFS (6) => CP437
File System = NTFS (10) and Version = 5.0 => CP437
otherwise, ISO 8859-1
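The rules listed above could be sketched roughly as follows. This assumes the usual layout of the "version made by" field (upper byte is the host file system, lower byte is the version as major * 10 + minor); the function name is hypothetical and this is not Info-Zip's actual code:

```python
FS_MSDOS, FS_HPFS, FS_NTFS = 0, 6, 10  # host system codes from the list above


def infozip_style_codec(version_made_by: int) -> str:
    """Pick a codec for a name without the EFS bit, following the listed rules."""
    host = version_made_by >> 8       # upper byte: host file system
    version = version_made_by & 0xFF  # lower byte: e.g. 25 means version 2.5

    if host == FS_MSDOS:
        # MS-DOS is CP437, except for versions 2.5, 2.6 and 4.0.
        return "iso-8859-1" if version in (25, 26, 40) else "cp437"
    if host == FS_HPFS:
        return "cp437"
    if host == FS_NTFS and version == 50:
        return "cp437"
    return "iso-8859-1"
```

The returned string can be passed straight to `bytes.decode`, for example `raw_name.decode(infozip_style_codec(version_made_by))`.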

If I want to support inspecting or extracting arbitrary zip files and make a reasonable attempt at determining the file name encoding when the EFS flag is absent, what can I look for?

+7

jar zip zipfile 7zip winzip

Glen Low Nov 07 '12 at 0:01

2 answers

Currently the situation is as follows:

Most Windows implementations use the DOS (OEM) encoding
The Mac OS zip utility uses UTF-8 but does not set the UTF-8 bit flag
*nix zip utilities silently use the system encoding

Thus, the only way is to check whether the file name looks like UTF-8 (see the description of the UTF-8 encoding: for a 2-byte character the first byte must be 110xxxxx and the second 10xxxxxx). If the name is a valid UTF-8 string, decode it as UTF-8. If not, fall back to the OEM/DOS encoding.
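A minimal sketch of that check in Python, relying on the strict UTF-8 decoder to validate the lead and continuation byte patterns instead of testing bits by hand (the function name and the `fallback` parameter are illustrative):

```python
def decode_name_guess(raw_name: bytes, fallback: str = "cp437") -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, else use the fallback codec."""
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences and overlong forms, so success means well-formed UTF-8.
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat it as the OEM/DOS code page.
        return raw_name.decode(fallback)
```

Pure ASCII names decode identically either way, so the guess only matters for names containing bytes above 0x7F.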

+2

Nickolay Olshevsky Nov 11 '12 at 12:09

Nathan Moinvaziri · Accepted Answer · 2012-11-07T00:31:46+0000

The only way to determine whether a file name is encoded as UTF-8 without relying on the EFS flag is to check whether the high-order bit is set in any of its bytes. That could mean the name is UTF-8 encoded. However, it could just as well not be, since CP437 also has characters with the high-order bit set that are not meant to be decoded as UTF-8.
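To illustrate the ambiguity, here is a small sketch (the helper name is hypothetical) showing a byte string that has high bits set and decodes without error under both encodings, so neither check settles the question by itself:

```python
def has_high_bit(raw_name: bytes) -> bool:
    """True if any byte is outside the 7-bit ASCII range."""
    return any(b & 0x80 for b in raw_name)


raw = b"\xc3\xa9.txt"

as_utf8 = raw.decode("utf-8")    # one accented letter followed by ".txt"
as_cp437 = raw.decode("cp437")   # two CP437 symbols followed by ".txt"

# Both decodes succeed, but they produce different names.
ambiguous = has_high_bit(raw) and as_utf8 != as_cp437
```

Every byte value maps to some character in CP437, so a CP437 decode never fails; only the UTF-8 decode can rule anything out.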

I would stick to the PKWARE application note specification rather than hack together a solution that tries to accommodate every known zip application.
