UnicodeDecodeError when joining a file name

It throws "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)" when executing the following code:

    import os

    filename = 'Spywaj.ttf'
    print repr(filename)
    >> 'Sp\xc2\x88ywaj.ttf'
    filepath = os.path.join('/dirname', filename)

But the file is valid and exists on disk. The file name was extracted from the output of the "unzip -l" command. How do I join such names?

OS and file system

    Filesystem: ext3 relatime,errors=remount-ro 0 0
    Locale: en_US.UTF-8

With Alex's suggestion, os.path.join now works, but I still can't access the file on disk using the joined file name.

    filename = filename.decode('utf-8')
    filepath = os.path.join('/dirname', filename)
    print filepath
    >> /dirname/u'Sp\xc2\x88ywaj.ttf'
    print os.path.isfile(filepath)
    >> False
    new_filepath = filepath.encode('Latin-1').encode('utf-8')
    print new_filepath
    >> /dirname/u'Sp\xc2\x88ywaj.ttf'
    print type(filepath)
    >> <type 'unicode'>
    print os.path.isfile(new_filepath)
    >> False
    valid_filepath = glob.glob('/dirname/*.ttf')[0]
    print valid_filepath
    >> /dirname/Spywaj.ttf (SO cannot display the chars in filename)
    print type(valid_filepath)
    >> <type 'str'>
    print os.path.isfile(valid_filepath)
    >> True
+7
python filenames unicode
4 answers

In both Latin-1 (ISO-8859-1) and Windows-1252, 0xc2 is a capital A with a circumflex accent (Â)... which appears nowhere in the code that you show! Could you add

 print repr(filename) 

before calling os.path.join (and also put '/dirname' in a variable and print its repr, for completeness)? I suspect that this stray character is there but for some reason you do not see it; repr will show it.

If you have a non-ASCII Latin-1 (or Windows-1252) character in your file name, you should use Unicode and/or, depending on your OS and file system, some specific encoding.

Edit: the OP confirms, thanks to repr, that there are actually two bytes that cannot be ASCII: 0xc2 followed by 0x88, corresponding to what the OP thinks is a single lowercase L. In the popular UTF-8 encoding that sequence decodes to the single code point U+0088, which is a C1 control character rather than a printable letter (the capital A with circumflex would be U+00C2). How that could look like a lowercase L to the OP beggars explanation, but I guess some fonts could be crazy enough to allow such confusion.
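As a quick check of how those two bytes decode under each encoding, here is a minimal sketch. It uses explicit byte literals so that it also runs on Python 3; the variable names are illustrative and do not come from the question.

```python
import unicodedata

raw = b"\xc2\x88"  # the two non-ASCII bytes reported by repr(filename)

# Read as UTF-8, the pair forms a single code point.
as_utf8 = raw.decode("utf-8")
utf8_code_points = [hex(ord(ch)) for ch in as_utf8]

# Read as Latin-1, every byte is its own character.
as_latin1 = raw.decode("latin-1")
latin1_code_points = [hex(ord(ch)) for ch in as_latin1]

# Unicode general category of the UTF-8 result ("Cc" means control character).
utf8_category = unicodedata.category(as_utf8)

print(utf8_code_points, utf8_category)
print(latin1_code_points, unicodedata.name(as_latin1[0]))
```

Read as UTF-8, the pair is one code point in the C1 control range, which may be why the name does not display; read as Latin-1, the first byte on its own is the capital A with circumflex.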

So I would first try filename = filename.decode('utf-8'), which should allow os.path.join to work. If open then balks at the resulting Unicode string (it might work, depending on the file system and OS), the next step is to try that Unicode object's .encode('Latin-1') and then, separately, its .encode('utf-8'). If neither encoding works, information about the operating system and file system in use, which I believe the OP has not given yet, becomes crucial.
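The "try each encoding in turn" idea can be sketched roughly as follows. This is only an illustration: `find_existing_path` is a hypothetical helper, it assumes the raw name is valid UTF-8, and it works on byte paths so that it also runs on Python 3.

```python
import os


def find_existing_path(dirname, raw_name, encodings=("utf-8", "latin-1")):
    """Return the first candidate byte path that exists on disk, or None.

    dirname and raw_name are byte strings; raw_name is assumed to be UTF-8.
    """
    text_name = raw_name.decode("utf-8")
    for encoding in encodings:
        try:
            # Re-encode the decoded name with one candidate encoding at a time.
            candidate = os.path.join(dirname, text_name.encode(encoding))
        except UnicodeEncodeError:
            # The name cannot be represented in this encoding; skip it.
            continue
        if os.path.isfile(candidate):
            return candidate
    return None
```

A call such as find_existing_path(b'/dirname', b'Sp\xc2\x88ywaj.ttf') would then return whichever byte path actually exists, or None if neither encoding matches the name on disk.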

+8

I fixed UnicodeDecodeError by adding these lines to /etc/apache2/envvars and restarting Apache.

    export LANG='en_US.UTF-8'
    export LC_ALL='en_US.UTF-8'

as described here: https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/modwsgi/#if-you-get-a-unicodeencodeerror

I spent some time debugging.
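To check whether a change like this has taken effect, one option is to log the encodings Python reports from inside the running process. A small sketch; `describe_encodings` is a hypothetical helper, and the values it returns depend entirely on the environment of the process.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derives from the process environment."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(),
        "default": sys.getdefaultencoding(),
    }


for key, value in sorted(describe_encodings().items()):
    print(key, value)
```

If the file system or preferred encoding still reports an ASCII-based value after the restart, the exported variables are probably not reaching the process.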

+6
 filename = filename.decode('utf-8').encode("latin-1") 

This works for me with the Splywaj.zip file:

    >>> os.path.isfile(filename.decode("utf8").encode("latin-1"))
    True
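To see what this conversion does to the name, here is a short sketch using the bytes from the question (`raw_name` and `latin1_name` are illustrative names). It uses a byte literal so that it also runs on Python 3.

```python
raw_name = b"Sp\xc2\x88ywaj.ttf"  # bytes as shown by repr() in the question

# Decode the UTF-8 pair into one character, then store it as one Latin-1 byte.
latin1_name = raw_name.decode("utf-8").encode("latin-1")

print(repr(latin1_name))
```

The two bytes 0xc2 0x88 collapse into the single byte 0x88, which suggests that in this setup the name on disk was stored with that single byte.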
+2

Problem with evidence 1

It throws "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)" when executing the following code:

    filename = 'Spywaj.ttf'
    print repr(filename)
    >> 'Sp\xc2\x88ywaj.ttf'
    filepath = os.path.join('/dirname', filename)

I don't see how you could get this exception: both arguments to os.path.join are str objects, so nothing should be trying to convert anything to unicode. Are you sure the code above is exactly what you ran?
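For what it is worth, the reported message is what an ASCII decode of the file name produces, and on Python 2 that decode happens implicitly when a str is combined with a unicode object. A minimal sketch that triggers the same decode explicitly, so that it also runs on Python 3 (`raw_name` is an illustrative variable):

```python
raw_name = b"Sp\xc2\x88ywaj.ttf"  # bytes as shown by repr() in the question

try:
    # Python 2 performs this decode implicitly when str meets unicode.
    raw_name.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

One possible explanation, then, is that one of the arguments was a unicode object in the code that was actually run.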

Problem with evidence 2

With Alex's suggestion, os.path.join now works, but I still can't access the file on disk using the joined file name.

    filename = filename.decode('utf-8')
    filepath = os.path.join('/dirname', filename)
    print filepath
    >> /dirname/u'Sp\xc2\x88ywaj.ttf'

Sorry, but assuming filename has not changed since the previous snippet, this is definitely not possible. It looks like the result of os.path.join('/dirname', repr(filename)). Please make sure you publish the code that you actually ran, along with the actual output (and the actual traceback, if any).

Confusion

 new_filepath = filepath.encode('Latin-1').encode('utf-8') 

Alex meant that you should try twice, once with each of these encodings, not once with both encodings chained! Since all characters in your file path are in the ASCII range (see the second evidence problem above), the effect was simply that of filepath.encode('ascii').
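The point about the ASCII range can be checked directly. In this sketch the path is written exactly as it was printed in the question, that is, as the text of a repr; the variable names are illustrative.

```python
# The path as printed in the question: the text of a repr, so it contains a
# literal backslash followed by "xc2" and "x88", and no non-ASCII character.
filepath = u"/dirname/u'Sp\\xc2\\x88ywaj.ttf'"

is_ascii_only = all(ord(ch) < 128 for ch in filepath)

# For ASCII-only text these three encodings produce identical bytes.
as_latin1 = filepath.encode("latin-1")
as_utf8 = filepath.encode("utf-8")
as_ascii = filepath.encode("ascii")

print(is_ascii_only, as_latin1 == as_utf8 == as_ascii)
```

Because every character is ASCII, the choice of encoding makes no difference to the resulting bytes, so neither attempt could have changed which file was being looked up.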

A simple solution

You know how to find the file name that interests you:

 valid_filepath = glob.glob('/dirname/*.ttf')[0] 

If you need to hard-code this name in a script, you can use the repr() function to get a representation that you can type into your script without worrying about utf8, unicode, encode, decode and all that noise:

 print repr(valid_filepath) 

Suppose it prints '/dirname/Sp\xc2\x88ywaj.ttf' ... then all you have to do is carefully copy and paste it into your script:

 file_path = '/dirname/Sp\xc2\x88ywaj.ttf' 
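The glob-then-repr approach could be wrapped up roughly like this. It is a sketch only: `first_match_repr` is a hypothetical helper, and it passes byte strings to glob so that the raw file name bytes are also preserved on Python 3.

```python
import glob
import os


def first_match_repr(dirname, pattern=b"*.ttf"):
    """Return (path, repr(path)) for the first match, or None if nothing matches.

    dirname and pattern are byte strings, so the result keeps the raw bytes.
    """
    matches = sorted(glob.glob(os.path.join(dirname, pattern)))
    if not matches:
        return None
    return matches[0], repr(matches[0])
```

Printing the second element of first_match_repr(b'/dirname') gives an escaped literal that can be pasted into a script as is.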
0