I recently wrote a zip file I/O library called zipzap, but I'm struggling to correctly decode zip entry file names from arbitrary zip files.
Now the PKWARE spec states:
D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437 ...
D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification ...
which means that conforming zip files encode file names as CP437 unless the EFS bit (general purpose bit 11) is set, in which case the file names are UTF-8.
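For a conforming archive, the rule above can be sketched roughly as follows. This is only an illustration in Python, not code from zipzap; `decode_name` and `EFS_FLAG` are hypothetical names, and the function assumes the caller has already read the raw file name bytes and the general purpose bit flag from the record.

```python
# Bit 11 of the general purpose bit flag (the "EFS" / language encoding flag).
EFS_FLAG = 0x0800


def decode_name(raw: bytes, flags: int) -> str:
    """Decode a raw zip file name the way the spec prescribes.

    Hypothetical helper: UTF-8 when bit 11 is set, CP437 otherwise.
    """
    encoding = "utf-8" if flags & EFS_FLAG else "cp437"
    return raw.decode(encoding)
```

For example, `decode_name(b"caf\x82.txt", 0)` treats byte 0x82 as the CP437 "é", while the same name stored as UTF-8 needs `flags` to have bit 11 set.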
Unfortunately, it seems that many zip tools either do not set the EFS bit correctly (e.g. the Mac CLI and GUI zip) or use some other encoding, typically the system default (e.g. WinZip?). If you know how WinZip, 7-Zip, Info-ZIP, PKZIP, Java JAR/Zip, .NET zip, DotNetZip, etc. encode file names, and what they put in the "version made by" field when zipping, please tell me.
In particular, Info-ZIP tries the following when unzipping:
- File System = MS-DOS (0) => CP437
- except: version = 2.5, 2.6, 4.0 => ISO 8859-1
- File System = HPFS (6) => CP437
- File System = NTFS (10) and Version = 5.0 => CP437
- otherwise, ISO 8859-1
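The list above could be written down as a small lookup, shown here as a Python sketch. It is only a transcription of the rules as I listed them, not Info-ZIP's actual code. `guess_legacy_encoding` is a hypothetical name, and the sketch assumes the usual layout of "version made by": host file system in the upper byte, version in the lower byte stored as major * 10 + minor (so 2.5 is 25).

```python
def guess_legacy_encoding(version_made_by: int) -> str:
    """Guess the file name encoding when the EFS bit is not set.

    Hypothetical helper mirroring the rules listed above; the host codes
    (0, 6, 10) are taken from that list as-is.
    """
    host = version_made_by >> 8       # upper byte: host file system
    version = version_made_by & 0xFF  # lower byte: major * 10 + minor

    if host == 0:  # MS-DOS
        # Versions 2.5, 2.6 and 4.0 are the exceptions.
        return "iso-8859-1" if version in (25, 26, 40) else "cp437"
    if host == 6:  # HPFS
        return "cp437"
    if host == 10 and version == 50:  # NTFS, version 5.0
        return "cp437"
    return "iso-8859-1"
```

The result can be passed straight to `bytes.decode`, e.g. `raw_name.decode(guess_legacy_encoding(version_made_by))`.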
If I want to support inspecting or extracting arbitrary zip files and make a reasonable attempt at determining the file name encoding when the EFS flag is absent, what can I look for?
Glen Low