Python, UnicodeDecodeError

I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128) 

I tried installing many different codecs (in the header, for example # -*- coding: utf8 -*-), or even using u"string", but the error still appears.

How to fix it?

Edit: I don't know the actual character that causes this, but since this is a program that recursively scans folders, it must have found a file with strange characters in its name.

the code:

    # -*- coding: utf8 -*-

    # by TerabyteST

    ###########################

    # Explores given path recursively
    # and finds file which size is bigger than the set treshold

    import sys
    import os

    class Explore():
        def __init__(self):
            self._filelist = []

        def exploreRec(self, folder, treshold):
            print folder
            generator = os.walk(folder + "/")
            try:
                content = generator.next()
            except:
                return
            folders = content[1]
            files = content[2]
            for n in folders:
                if "$" in n:
                    folders.remove(n)
            for f in folders:
                self.exploreRec(u"%s/%s"%(folder, f), treshold)
            for f in files:
                try:
                    rawsize = os.path.getsize(u"%s/%s"%(folder, f))
                except:
                    print "Error reading file %s"%u"%s/%s"%(folder, f)
                    continue
                mbsize = rawsize / (1024 * 1024.0)
                if mbsize >= treshold:
                    print "File %s is %d MBs!"%(u"%s/%s"%(folder, f), mbsize)

The error:

    Traceback (most recent call last):
      File "<pyshell#19>", line 1, in <module>
        a.exploreRec("C:", 100)
      File "D:/Python/Explorator/shitfinder.py", line 35, in exploreRec
        print "Error reading file %s"%u"%s/%s"%(folder, f)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

This is what print repr("Error reading file %s"%u"%s/%s"%(folder.decode('utf-8','ignore'), f.decode('utf-8','ignore'))) outputs:

    >>> a = Explore()
    >>> a.exploreRec("C:", 100)
    File C:/Program Files/Ableton/Live 8.0.4/Resources/DefaultPackages/Live8Library_v8.2.alp is 258 MBs!
    File C:/Program Files/Adobe/Reader 9.0/Setup Files/{AC76BA86-7AD7-1040-7B44-A90000000001}/Data1.cab is 114 MBs!
    File C:/Program Files/Microsoft Games/Age of Empires III/art/Art1.bar is 393 MBs!
    File C:/Program Files/Microsoft Games/Age of Empires III/art/art2.bar is 396 MBs!
    File C:/Program Files/Microsoft Games/Age of Empires III/art/art3.bar is 228 MBs!
    File C:/Program Files/Microsoft Games/Age of Empires III/Sound/Sound.bar is 273 MBs!
    File C:/ProgramData/Microsoft/Search/Data/Applications/Windows/Windows.edb is 162 MBs!
    REPR: u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror Edge.lnk"
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror Edge.lnk
    REPR: u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror Edge.lnk"
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror Edge.lnk
    REPR: u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk'
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk
    REPR: u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk'
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk
    REPR: u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk'
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk
    REPR: u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk'
    END REPR: Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk
    Traceback (most recent call last):
      File "<pyshell#21>", line 1, in <module>
        a.exploreRec("C:", 100)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
      File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
        self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 78: ordinal not in range(128)
    >>>
+7
python unicode
8 answers

We cannot guess what you are trying to do, nor what is in your code, nor what it means to “install many different codecs,” nor what u"string" is supposed to do for you.

Please change your code back to its original state so that it reflects as closely as possible what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get, (2) a snippet covering the last statement of your script that appears in the traceback, (3) a brief description of what you want the code to do, and (4) which version of Python you are using.

Edit after adding details to the question:

(0) Let's try some transformations on the failing expression:

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces to reduce illegibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasize the order of evaluation:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)

Is this really what you intended? Suggestion: build the path ONCE, using a better method (see point (2) below).
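The collapse can be checked directly. This is only a small sketch written for Python 3 (the question uses Python 2, where the u prefix and the print statement differ); the folder and f values are made up for illustration.

```python
# The % operator is left-associative, so the first % is applied
# before the (folder, f) tuple is ever looked at.
step1 = "Error reading file %s" % "%s/%s"

folder, f = "C:/ProgramData", "example.lnk"  # hypothetical values
message = step1 % (folder, f)

# The unparenthesized chain from the question gives the same result.
chained = "Error reading file %s" % "%s/%s" % (folder, f)
print(message)
```

So the nested formatting buys nothing over a single template, which is one more reason to build the path once and format it once.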

(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way, your diagnostic code is much less likely to raise an exception itself (as is happening here), and you avoid ambiguity. Insert the statement print repr(folder), repr(f) before trying to get the size, rerun, and report back.
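As a rough illustration of why repr is safer for diagnostics, here is a Python 3 sketch. The byte string is a hypothetical raw file name; in the Python 2 of the question the same idea applies to str objects.

```python
# Hypothetical raw file name containing one non-ASCII byte (0xE0).
name = b"Modalit\xe0 provvisoria.lnk"

# repr() escapes anything non-printable, so the result is plain ASCII
# and shows exactly which byte is present.
shown = repr(name)
also_shown = "%r" % (name,)
print(shown)
```

Because the output is pure ASCII, printing it cannot trigger another encoding error, and the offending byte is visible instead of being silently dropped or mangled.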

(2) Do not create paths using u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename)

(3) Do not use bare excepts; check for known problems. So that unknown problems do not remain unknown, do something like this:

    try:
        some_code()
    except ReasonForBaleOutError:
        continue
    except:
        # something's gone wrong, so get diagnostic info
        print repr(interesting_datum_1), repr(interesting_datum_2)
        # ... and get traceback and error message
        raise

A more sophisticated way would involve logging instead of printing, but the above is much better than not knowing what is going on.
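A minimal Python 3 sketch of that pattern, applied to the size lookup from the question. The helper name size_or_none is made up for this example, and treating OSError as the "known" failure is an assumption about what the asker wants to skip.

```python
import os
import tempfile

def size_or_none(folder, name):
    """Return the file size in bytes, or None if the file cannot be read."""
    path = os.path.join(folder, name)
    try:
        return os.path.getsize(path)
    except OSError:
        # Known, expected problem: missing or inaccessible file.
        return None
    except Exception:
        # Unexpected problem: show the raw evidence, then re-raise
        # so the traceback and error message are not lost.
        print("folder: %r, name: %r" % (folder, name))
        raise

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.bin"), "wb") as fh:
        fh.write(b"x" * 10)
    present = size_or_none(tmp, "a.bin")
    missing = size_or_none(tmp, "no_such_file.bin")
```

The point is that only the anticipated failure is swallowed; anything else still surfaces with its full traceback.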

Further edits after reading the documentation for os.walk, recalling some old lore, and re-reading your code:

(4) os.walk() walks the whole tree; you do not need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, file names) are reported as unicode. You don't need all that u"blah" stuff. Then you just have to choose how you display the unicode results.

(6) Removing folders with "$" in their names: you must change the list in place, but your method (removing items while iterating over the same list) is dangerous. Try something like this:

    for i in xrange(len(folders) - 1, -1, -1):
        if '$' in folders[i]:
            del folders[i]
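To see why the original loop is dangerous, here is a small Python 3 comparison (range instead of xrange). The folder names are hypothetical.

```python
folders = ["$Recycle.Bin", "$WinREAgent", "Program Files", "Users"]  # hypothetical names

# Removing while iterating forwards: after each removal the remaining
# items shift left, so the item right after a removed one is skipped.
broken = list(folders)
for n in broken:
    if "$" in n:
        broken.remove(n)

# Walking the indexes backwards is safe: deleting an item only shifts
# positions that have already been visited.
fixed = list(folders)
for i in range(len(fixed) - 1, -1, -1):
    if "$" in fixed[i]:
        del fixed[i]

print(broken)
print(fixed)
```

An alternative that also modifies the list in place is slice assignment, e.g. folders[:] = [d for d in folders if "$" not in d].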

(7) You access the files by joining a folder name and a file name. You are using the ORIGINAL folder name; once you rip out the recursion, this will not work; you will need to use the current content[0] value reported by os.walk.

(8) You should find yourself using something very simple, like:

    for folder, subfolders, filenames in os.walk(unicoded_top_folder):

There is no need for generator = os.walk(...); try: content = generator.next(), etc. BTW, if you ever do need generator.next() in the future, use except StopIteration instead of a bare except.

(9) If the caller provides a non-existent folder, an exception is not thrown, it just does nothing. If the provided folder exists but is empty, the same thing. If you need to make a distinction between these two scenarios, you will need to conduct additional testing yourself.
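Putting points (2) and (4) to (8) together, the whole job might look roughly like the following. This is a Python 3 sketch, not the asker's code: the function name find_big_files is invented here, and the temporary directory only exists to make the example self-contained.

```python
import os
import tempfile

def find_big_files(top, threshold_mb):
    """Yield (path, size_in_mb) for files at or above threshold_mb under top."""
    for folder, subfolders, filenames in os.walk(top):
        # Prune in place so os.walk does not descend into "$" folders.
        subfolders[:] = [d for d in subfolders if "$" not in d]
        for name in filenames:
            # Join with the CURRENT folder reported by os.walk.
            path = os.path.join(folder, name)
            try:
                size_mb = os.path.getsize(path) / (1024 * 1024.0)
            except OSError:
                print("cannot read %r" % path)
                continue
            if size_mb >= threshold_mb:
                yield path, size_mb

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "$skip"))
    os.mkdir(os.path.join(tmp, "keep"))
    for sub in ("$skip", "keep"):
        with open(os.path.join(tmp, sub, "big.bin"), "wb") as fh:
            fh.write(b"\0" * (2 * 1024 * 1024))
    with open(os.path.join(tmp, "keep", "small.bin"), "wb") as fh:
        fh.write(b"\0")
    found = [os.path.relpath(p, tmp) for p, _ in find_big_files(tmp, 1)]
```

In Python 3, passing a str to os.walk already yields str (Unicode) names, which corresponds to passing a unicode object in Python 2.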

Response to this comment from the OP: "Thanks, please read the repr() info I have shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. And the common thing between them is that they are .ink files. Might that be the problem? Also, in the last ones, the Firefox ones, it prints (Modalitrovvisoria), while the real file name in Explorer contains (Modalità provvisoria)."

(10) Umm, that's not ".INK".lower(), it's ".LNK".lower() ... perhaps you need to change the font in whatever you are reading that with.

(11) The fact that the "problem" file names all end in ".lnk" /may/ have something to do with os.walk() and/or Windows doing something special with the names of those files.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced:

    print repr(
        "Error reading file %s" \
        % u"%s/%s" % (
            folder.decode('utf-8','ignore'),
            f.decode('utf-8','ignore')
            )
        )

It seems that you have not read, or not understood, or simply ignored the advice I gave you in a comment on another answer (and that answerer's reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.

We are interested in exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "(Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.
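The difference between the two error handlers is easy to demonstrate. This is a Python 3 sketch; the byte string is an assumed cp1252 encoding of "Modalità provvisoria", not data taken from the asker's machine.

```python
# Assumed raw bytes: "Modalità provvisoria" encoded as cp1252 (à is 0xE0).
raw = b"Modalit\xe0 provvisoria"

ignored = raw.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")  # marks the damage with U+FFFD
correct = raw.decode("cp1252")             # the encoding that actually fits

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Note that current Python 3 discards only the offending byte, whereas the Python 2.6 output in the question shows the two following characters being swallowed as well. Either way, "ignore" leaves no trace that anything went wrong, while "replace" at least shows where.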

In any case, it is suspicious that some of the file names had some kind of error but appeared NOT to lose characters with the "ignore" option (or appeared NOT to be mangled).

Which part of "Insert the statement print repr(folder), repr(f)" did you not understand? All you need to do is something like this:

    print "Some meaningful text" # "error reading file" isn't
    print "folder:", repr(folder)
    print "f:", repr(f)

(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely str objects, so why introduce blahblah.encode()??

A more general point: try to understand what your problem or problems are BEFORE changing your script. Thrashing about trying every suggestion, coupled with near-zero effective debugging technique, is not the way forward.

(14) When you run your script again, you may want to reduce the volume of output by running it over some subset of C:\ ... especially if you follow my original suggestion and print debug output for ALL file names, not just the erroneous ones (knowing what the non-error ones look like may help in understanding the problem).

Response to Brian McLemore's “cleanup” function:

(15) Here's an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names:

    C:\junk\terabytest>dir
    [snip]
     Directory of C:\junk\terabytest

    20/11/2009  01:28 PM    <DIR>          .
    20/11/2009  01:28 PM    <DIR>          ..
    20/11/2009  11:48 AM    <DIR>          empty
    20/11/2009  01:26 PM                11 Hašek.txt
    20/11/2009  01:31 PM             1,419 tbyte1.py
    29/12/2007  09:33 AM                 9 Ð.txt
                   3 File(s)          1,439 bytes
    [snip]

    C:\junk\terabytest>\python26\python
    Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pprint import pprint as pp
    >>> import os

os.walk(unicode_string) -> results are unicode objects

    >>> pp(list(os.walk(ur"c:\junk\terabytest")))
    [(u'c:\\junk\\terabytest',
      [u'empty'],
      [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
     (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results are str objects

    >>> pp(list(os.walk(r"c:\junk\terabytest")))
    [('c:\\junk\\terabytest',
      ['empty'],
      ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
     ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I would expect to be used on my system ...

    >>> u'\u0161'.encode('cp1252')
    '\x9a'
    >>> 'Ha\x9aek'.decode('cp1252')
    u'Ha\u0161ek'

decoding the str using UTF-8 does not work, as expected

    >>> 'Ha\x9aek'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

    >>> 'Ha\x9aek'.decode('latin1')
    u'Ha\x9aek'

BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer

    >>> unicodedata.name(u'\u0161')
    'LATIN SMALL LETTER S WITH CARON'
    >>>
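For readers on Python 3, where str is already Unicode and raw names would be bytes, the same experiment can be sketched as follows. The byte string is copied from the session above; everything else is just illustration.

```python
import unicodedata

raw = b"Ha\x9aek.txt"  # the str result from the session above, as bytes

as_cp1252 = raw.decode("cp1252")   # the encoding that matches the system
as_latin1 = raw.decode("latin-1")  # never fails, but can be wrong

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The third character differs: a real letter versus a C1 control code.
letter_name = unicodedata.name(as_cp1252[2])
latin1_category = unicodedata.category(as_latin1[2])
print(letter_name, latin1_category, utf8_ok)
```

The latin-1 decode "succeeds" only in the sense that it raises nothing; the category Cc shows that the resulting character is a control code rather than a letter.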

(16) That example shows what happens when the character is representable in the default character set; what happens if it is not? Here is an example (using IDLE this time) of a file name containing CJK ideographs, which definitely cannot be represented in my default character set:

    IDLE 2.6.4
    >>> import os
    >>> from pprint import pprint as pp

repr(unicode results) looks fine

    >>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
    [(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

    >>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
    nihao你好.txt

The str result is evidently produced using .encode(whatever, "replace"), which is not very useful; e.g. you cannot open the file by passing that as the file name.

    >>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
    [('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So, the conclusion is that for best results, you should pass a unicode string to os.walk() and work around any display problems.
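The lossy conversion behind 'nihao??.txt' can be reproduced by hand. This Python 3 sketch assumes cp1252 as the default character set, as in the session above.

```python
name = "nihao\u4f60\u597d.txt"

# Encoding to a character set that lacks the ideographs, with "replace",
# substitutes "?" for each character that cannot be represented.
lossy = name.encode("cp1252", "replace")

# The damage is permanent: decoding again does not restore the name.
round_trip = lossy.decode("cp1252")
print(lossy, round_trip == name)
```

Since the original characters are gone, no amount of later decoding can recover a name that the file system will recognize.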

+13

Python uses ASCII as its default encoding, which is annoying. If you want to change it permanently, find and edit the site.py file, search for def setencoding() and, a few lines below, change encoding = "ascii" to encoding = "utf-8". Goodbye, annoying ASCII default encoding.

+6

You are trying to perform some action (for example, printing) on a Unicode string that contains non-ASCII characters, and the string is being converted to ASCII by default. You will need to specify an encoding in order to represent the string correctly.
It would help greatly if you posted some sample code showing what you are trying to do.

The easiest way to do this:
s = u'ma\xf1ana'
print s.encode('latin-1')

Edited after details were added to the question:

In your case, you need to decode the string that you read at the beginning:
f.decode()
so try changing
u"%s/%s" % (folder, f)
to
os.path.join(folder, f.decode())

Please note that decoding the file name may require specifying an encoding such as "latin-1".

PS: John Machin mentioned very useful ways to improve and clean the code. +1

+2

Are you running this program in a cmd.exe window? If so, try running it in IDLE and see whether you get the same errors. cmd.exe does not do Unicode, only ASCII.

+1

Some Unicode pointers:

  • Putting # encoding: utf-8 at the top of the file sometimes helps (if your editor saves the file as UTF-8 ...)
  • s = "i'm a string"
  • u = u"i'm unicode, at least in python < 3"
  • If you work with files, try looking into the codecs module.


+1
 u"%s" % f 

In various places you are doing something like the code above. This is simply the wrong way to convert a str object to a unicode object, because the conversion is done using sys.getdefaultencoding() (ascii), which is almost guaranteed to fail for non-ASCII input.

You should use the encode / decode methods to convert to and from a unicode object. This requires knowing how your input (the strings returned from os.walk) is encoded. For example, if the file names are encoded in UTF-8,

 uf = f.decode('utf-8') 

interprets f as a UTF-8 encoded sequence of bytes and returns the corresponding unicode object. Similarly, when you need to output a unicode object, you must convert it back to str, specifying a valid encoding that you want to output as.

 print uf.encode('utf-8') 
+1

I have had the misfortune of working on some code that was inconsistent about its encodings.

This is the function we used to clean it:

    def to_unicode(value):
        if isinstance(value, unicode):
            return value
        elif isinstance(value, str):
            try:
                if value.startswith('\xff\xfe'):
                    return value.decode('utf-16-le')
                elif value.startswith('\xfe\xff'):
                    return value.decode('utf-16-be')
                else:
                    return value.decode('utf-8')
            except UnicodeDecodeError:
                return value.decode('latin-1')
        else:
            try:
                return unicode(value)
            except UnicodeError:
                return to_unicode(str(value))
            except TypeError:
                if hasattr(value, '__unicode__'):
                    return value.__unicode__()

Thus, using this function, you can use:

 print u"Error reading file %s/%s" % (to_unicode(folder), to_unicode(f)) 
0

instead of doing:

 print "Error reading file %s"%u"%s/%s"%(folder, f) 

Try the following:

 print "Error reading file %s"%u"%s/%s"%(folder.encode('ascii','ignore'), f.encode('ascii','ignore')) 

Since the console cannot print Unicode characters, you may not see the exact name. 'ignore' tells the codec to skip those characters. You can also use 'replace' (prints '?'), 'xmlcharrefreplace' (replaces the character with an XML character reference of the form &#NNNN;), or 'backslashreplace' (replaces the character with a backslash escape such as \xNN or \uNNNN).

You will need to encode each Unicode string as you print it.
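For comparison, here is what each of those error handlers produces. This is a Python 3 sketch using a made-up name with a single non-ASCII character.

```python
name = "Modalit\xe0 provvisoria"  # hypothetical name containing "à"

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
# Encode the same string to ASCII once per error handler.
results = {h: name.encode("ascii", h) for h in handlers}

for handler in handlers:
    print(handler, results[handler])
```

Only the last two keep enough information to work out which character was there originally.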

-1