Retrieving files with invalid characters in a file name using Python

Question

I am using Python's zipfile module to extract a zip archive (take this file from http://img.dafont.com/dl/?f=akvaleir, for example):

 import zipfile

 f = zipfile.ZipFile('akvaleir.zip', 'r')
 for fileinfo in f.infolist():
     print fileinfo.filename
     f.extract(fileinfo, '.')

Its output:

 Akval ir_Normal_v2007.ttf
 Akval ir, La police - The Font - Fr - En.pdf

Both files are inaccessible after extraction because their names contain invalidly encoded characters. The problem is that the zipfile module offers no way to specify the names of the output files.

However, "unzip akvaleir.zip" successfully escapes the file names:

 root@host:~# unzip akvaleir.zip
 Archive:  akvaleir.zip
   inflating: Akvalir_Normal_v2007.ttf
   inflating: Akvalir, La police - The Font - Fr - En.pdf

I tried capturing the output of "unzip -l akvaleir.zip" in my Python program, and got these two file names:

 Akval\xd0\x92ir_Normal_v2007.ttf
 Akval\xd0\x92ir, La police - The Font - Fr - En.pdf

How can I get the correct file names, the way the unzip command does, without capturing the output of "unzip -l akvaleir.zip"?

+3

python encoding filenames unicode zipfile

jack Nov 27 '09 at 6:15

3 answers

It took some time, but I think I found the answer.

I guessed that the word should be Akvaléir, and I found a page describing it in French. When I ran your code snippet, I got a string like

 >>> fileinfo.filename
 'Akval\x82ir, La police - The Font - Fr - En.pdf'
 >>>

The byte \x82 does not decode to "é" as UTF-8, Latin-1, CP-1251 or CP-1252. Then I found that CP863 is a French Canadian code page, so maybe the file came from French Canada.

 >>> print unicode(fileinfo.filename, "cp863").encode("utf8")
 Akvaléir, La police - The Font - Fr - En.pdf
 >>>
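The trial-and-error above can be sketched as a small loop. This is only an illustration in Python 3 (the snippets in this thread are Python 2); `RAW_NAME`, `CANDIDATES` and `try_encodings` are names made up for the example, and the byte string is the one shown in the interpreter session above.

```python
# Raw member name as it appeared in the session above (0x82 is the mystery byte).
RAW_NAME = b"Akval\x82ir, La police - The Font - Fr - En.pdf"

# Encodings mentioned in this answer; the list itself is just an assumption
# about what is worth trying.
CANDIDATES = ["utf-8", "latin-1", "cp1251", "cp1252", "cp863", "cp437"]


def try_encodings(raw, encodings):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


if __name__ == "__main__":
    for enc, text in try_encodings(RAW_NAME, CANDIDATES).items():
        print(f"{enc:8} {text!r}")
```

An encoding that decodes without error is not necessarily the right one; the result still has to be checked by eye against the expected name.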

However, I then read the Zip file format specification, which says:

The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437.
...
If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification.
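The rule quoted above can be sketched as a tiny helper. This is a Python 3 illustration, not code from the specification: `UTF8_FLAG` and `decode_member_name` are hypothetical names, and the function assumes you already have the raw name bytes and the general purpose flags of an entry (zipfile exposes the flags as `ZipInfo.flag_bits`).

```python
UTF8_FLAG = 0x800  # general purpose bit 11 (1 << 11)


def decode_member_name(raw_name: bytes, flag_bits: int) -> str:
    """Decode a stored ZIP member name according to the bit-11 rule."""
    if flag_bits & UTF8_FLAG:
        # Bit 11 set: the name is stored as UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 unset: fall back to the historical IBM Code Page 437.
    return raw_name.decode("cp437")
```

As far as I know, Python 3's zipfile applies the same rule on its own when it reads an archive, so a helper like this mainly matters in Python 2 or when the bytes come from somewhere else.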

Decoding with CP437 gives me the same result as the Canadian code page:

 >>> print unicode(fileinfo.filename, "cp437").encode("utf8")
 Akvaléir, La police - The Font - Fr - En.pdf
 >>>

I don't have a Unicode-encoded zip file and I'm not going to create one, so I'll just assume that all zip files are CP437-encoded.

 import shutil
 import zipfile

 f = zipfile.ZipFile('akvaleir.zip', 'r')
 for fileinfo in f.infolist():
     # Python 2: the stored name is a byte string; decode it as CP437
     filename = unicode(fileinfo.filename, "cp437")
     outputfile = open(filename, "wb")
     shutil.copyfileobj(f.open(fileinfo.filename), outputfile)
     outputfile.close()

On my Mac, this gives

 109936 Nov 27 01:46 Akvale??ir_Normal_v2007.ttf
  25244 Nov 27 01:46 Akvale??ir, La police - The Font - Fr - En.pdf

which tab-completes to

 ls Akvale\314\201ir

and is displayed with a proper "é" in my file browser.
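The \314\201 in the tab-completed name are the octal-escaped UTF-8 bytes of U+0301 (combining acute accent), which suggests the Mac file system stored the "é" in decomposed form, as "e" followed by the accent. A minimal Python 3 sketch of the difference, assuming standard NFD is close enough to what the file system does (`utf8_bytes` is a name made up for the example):

```python
import unicodedata


def utf8_bytes(text: str, form: str) -> bytes:
    """Normalize `text` to the given Unicode form and encode it as UTF-8."""
    return unicodedata.normalize(form, text).encode("utf-8")


if __name__ == "__main__":
    name = "Akval\u00e9ir"  # "é" as the single code point U+00E9
    print(utf8_bytes(name, "NFC"))  # composed: one code point for "é"
    print(utf8_bytes(name, "NFD"))  # decomposed: "e" plus U+0301
```

So a name read back from the directory listing may not compare equal to the string that was passed to open() unless both are normalized to the same form first.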

+8

Andrew Dalke Nov 27 '09 at 9:49

I had a similar problem when running my application in Docker. Adding these lines to the Dockerfile fixed everything for me:

 RUN locale-gen en_US.UTF-8
 ENV LANG en_US.UTF-8
 ENV LANGUAGE en_US:en
 ENV LC_ALL en_US.UTF-8

So if you are not using Docker, I think you should check that the locales are correctly generated and installed.
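One way to check what Python actually sees is to print the encodings it derived from the environment. This is just a diagnostic sketch in Python 3; `describe_encodings` is a made-up helper, and which values count as "correct" depends on your system.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding suggested by the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to standard output, if known.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key:10} {value}")
```

If these report an ASCII-only encoding inside the container, that would be consistent with the locale not being set up.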

0

Yoanis gil Jan 30 '17 at 2:02

Alex martelli · Accepted Answer · 2009-11-27T06:33:02+0000

Instead of the extract method, use the open method and save the resulting pseudo-file to disk under any name you want, for example with shutil.copyfileobj.
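A minimal sketch of that suggestion, written for Python 3. The function name `extract_renamed` and the `rename` callback are hypothetical, introduced only for this example; `rename` receives the member name as stored in the archive and returns the name to use on disk.

```python
import os
import shutil
import zipfile


def extract_renamed(archive, dest_dir, rename):
    """Extract every file in `archive`, naming each output via `rename`.

    `archive` may be a path or a file-like object, as accepted by ZipFile.
    Returns the list of paths that were written.
    """
    written = []
    with zipfile.ZipFile(archive, "r") as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # this sketch only handles regular files
            # Keep only the last path component so output stays in dest_dir.
            safe_name = os.path.basename(rename(info.filename))
            target = os.path.join(dest_dir, safe_name)
            # open() gives a readable pseudo-file; copy it to the chosen name.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

For example, `extract_renamed('akvaleir.zip', '.', lambda name: name)` would write the members under their decoded names, and a different callback could replace any characters the file system rejects.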
