We cannot guess what you are trying to do, nor what is in your code, nor what "setting many different codecs" means, nor what u"string" is supposed to do for you.
Please revert your code to its original state so that it reflects as closely as possible what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get, (2) a snippet covering the last statement of your script that appears in the traceback, (3) a brief description of what you want the code to do, and (4) which version of Python you are running.
Update after details were added to the question:
(0) Let's try some transformations on the failing expression:
Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces to improve legibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasize the order of evaluation:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in the parentheses:
print u"Error reading file %s/%s" % (folder, f)
Is this really what you intended? Suggestion: build the path ONCE, using a better method (see point (2) below).
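The equivalence of those forms can be checked directly. This is a minimal sketch written for Python 3 (so print is a function and the u prefix is optional); the values of folder and f are made up for illustration.

```python
# Hypothetical sample values; any two strings will do.
folder = "C:/junk"
f = "example.txt"

# % is left-associative, so the first % is applied before the second one.
original = "Error reading file %s" % u"%s/%s" % (folder, f)
grouped = ("Error reading file %s" % u"%s/%s") % (folder, f)
simplified = u"Error reading file %s/%s" % (folder, f)

print(original)
print(original == grouped == simplified)
```

All three spellings build the same string, which is why the nested formatting buys nothing.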
(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way your diagnostic code is much less likely to raise an exception itself (as is happening here), and you avoid ambiguity. Insert the statement print repr(folder), repr(f) before trying to get the size, rerun, and report back.
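To see why repr() is the safer choice, here is a small sketch (Python 3 syntax; the file names are invented for the demonstration). Only the repr() output is printed, so it cannot fail on a console that cannot display the raw characters.

```python
# Hypothetical file names: one has a trailing space, one a control character.
names = ["report.txt", "report.txt ", "Ha\x9aek.txt"]

# repr() quotes the value and escapes anything non-printable,
# so differences that are invisible in plain output become explicit.
diagnostics = [repr(name) for name in names]

for line in diagnostics:
    print(line)

# "%r" formatting produces the same text as repr().
same_as_percent_r = ["%r" % (name,) for name in names] == diagnostics
```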
(2) Do not build paths using u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename) instead.
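A minimal sketch of building the path once and reusing it (Python 3 syntax; the folder and file names are placeholders, not taken from your script):

```python
import os.path

# Hypothetical inputs, standing in for what os.walk would hand you.
folder = "top"
filename = "example.txt"

# Build the path exactly once, with the platform's own separator ...
path = os.path.join(folder, filename)

# ... and reuse that one value for both the work and the diagnostics.
print(repr(path))
```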
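(3) Do not use bare excepts; check for known problems. To stop unknown problems from staying unknown, do something like this:
try:
    some_code()
except ReasonForBaleOutError:
    continue
except:
    # Something unexpected went wrong: print diagnostics, then re-raise
    # so that the traceback and error message are not lost.
    print repr(folder), repr(f)
    raise
A more sophisticated approach would use logging instead of printing, but the above is much better than not knowing what is going on.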
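For what the logging variant might look like, here is a sketch in Python 3. Everything in it except the pattern itself is an assumption for illustration: the logger name, the process() worker and the sample items are hypothetical. It also catches Exception rather than using a bare except, so that KeyboardInterrupt still gets through.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("walker")  # hypothetical logger name


class ReasonForBaleOutError(Exception):
    """Stand-in for a known, expected problem."""


def process(item):
    # Hypothetical worker: one known failure mode, one unexpected one.
    if item == "skip-me":
        raise ReasonForBaleOutError(item)
    if item == "boom":
        raise ValueError("unexpected problem with %r" % (item,))
    return item.upper()


def process_all(items):
    results = []
    for item in items:
        try:
            results.append(process(item))
        except ReasonForBaleOutError:
            continue  # known issue: skip this item
        except Exception:
            # Unknown issue: log the data plus the traceback, then re-raise.
            log.exception("failed on item %r", item)
            raise
    return results
```

log.exception() records the message at ERROR level together with the current traceback, so nothing is lost even if the caller swallows the re-raised exception.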
Further updates after reading the os.walk documentation, recalling old legends, and re-reading your code:
(4) os.walk() walks the whole tree; you do not need to call it recursively.
(5) If you pass a unicode string to os.walk(), the results (paths, file names) are returned as unicode. You do not need all that u"blah" stuff. Then you only have to choose how to display the unicode results.
(6) Removing paths with "$" in them: you must modify the list in place, but your method is dangerous. Try something like this:
# Iterate backwards over valid indexes so deletions do not shift unvisited items.
for i in xrange(len(folders) - 1, -1, -1):
    if '$' in folders[i]:
        del folders[i]
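An alternative to index-based deletion is slice assignment, which also changes the list in place and so still prunes the walk when os.walk runs top-down. A sketch with invented folder names (works in Python 2 and 3):

```python
# Hypothetical directory names, as os.walk might supply them.
folders = ["Documents", "$Recycle.Bin", "Music", "$WinREAgent"]

# os.walk keeps its own reference to the same list object;
# 'alias' plays that role here.
alias = folders

# Slice assignment replaces the contents without creating a new list object,
# so every holder of a reference sees the pruned result.
folders[:] = [name for name in folders if "$" not in name]

print(folders)
```

Writing folders = [...] without the slice would rebind the name to a new list and leave the list that os.walk is using untouched.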
(7) You access files by joining a folder name and a file name. You are using the ORIGINAL folder name; once you rip out the recursion, that will not work. You need to use the content[0] value currently yielded by os.walk.
(8) You should end up with something very simple:
for folder, subfolders, filenames in os.walk(unicoded_top_folder):
There is no need for generator = os.walk(...); try: content = generator.next() and so on. BTW, if you ever do need generator.next() in the future, use except StopIteration instead of a bare except.
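Putting points (2), (3), (7) and (8) together, the whole traversal could look roughly like this. It is a sketch in Python 3, not your script: list_sizes is a hypothetical name, and collecting (path, size) pairs is only an assumption about what you want to do with each file.

```python
import os


def list_sizes(top):
    """Return (path, size) pairs for every file below top."""
    sizes = []
    for folder, subfolders, filenames in os.walk(top):
        for filename in filenames:
            # Join with the folder os.walk is currently visiting, not with top.
            path = os.path.join(folder, filename)
            try:
                sizes.append((path, os.path.getsize(path)))
            except OSError as exc:
                # Known failure mode: report it unambiguously and carry on.
                print("cannot stat", repr(path), repr(exc))
    return sizes
```

In Python 2 you would pass a unicode string as top so that folder and filename come back as unicode objects.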
(9) If the caller supplies a non-existent folder, no exception is raised; the loop simply does nothing. If the supplied folder exists but is empty, ditto. If you need to distinguish between these two scenarios, you will have to do the extra testing yourself.
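One way to do that extra testing, as a sketch (Python 3 syntax; describe_top and its return values are made up for this example):

```python
import os


def describe_top(top):
    """Classify the starting folder before handing it to os.walk."""
    if not os.path.isdir(top):
        return "missing"      # os.walk would silently yield nothing
    if not os.listdir(top):
        return "empty"        # os.walk would yield one entry with empty lists
    return "has entries"
```

Calling this once before the loop lets you report a mistyped folder name instead of silently producing no output.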
Reply to this comment from the OP: "Thanks, please read the repr() info shown in the first post. I don't know why it printed so many different items, but it looks like all of them have problems. And the thing they have in common is that they are .ink files. Maybe that is the problem? Also, in the last ones, firefox, it prints (Modalitrovvisoria) while the real file name in Explorer contains (Modalità provvisoria)"
(10) Umm, that is not ".INK".lower(), it is ".LNK".lower() ... maybe you need to change the font in whatever you are reading it with.
(11) The fact that the "problem" file names all end in ".lnk" /may/ have something to do with os.walk() and/or with Windows doing something special with the names of those files.
(12) I repeat here the Python statement that you used to produce that output, with some whitespace added:
print repr(
    "Error reading file %s" \
    % u"%s/%s" % (
        folder.decode('utf-8','ignore'),
        f.decode('utf-8','ignore')
        )
    )
It seems that you have not read, or not understood, or simply ignored the advice I gave you in a comment on another answer (and that answerer's reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.
We are interested in exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "(Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.
In any case, it is suspicious that some of the file names had some kind of error yet appeared NOT to lose characters under the "ignore" option (or appeared NOT to be mangled).
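The difference between the error handlers is easy to reproduce. The sketch below is Python 3 and assumes the name was stored as cp1252 bytes, which is a guess about your system. Note that how many following characters get swallowed along with the bad byte differs between Python versions, so the exact output may not match what you saw.

```python
# Assumed raw bytes: "(Modalità provvisoria)" encoded as cp1252.
raw = b"(Modalit\xe0 provvisoria)"

as_cp1252 = raw.decode("cp1252")          # the right codec: text survives
ignored = raw.decode("utf-8", "ignore")   # bad byte vanishes without a trace
replaced = raw.decode("utf-8", "replace") # bad byte becomes U+FFFD, so it shows

# ascii() keeps the output printable on any console.
print(ascii(as_cp1252))
print(ascii(ignored))
print(ascii(replaced))
```

With "replace" the damage is at least visible in the output; with "ignore" you cannot tell that anything was lost.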
What part of "Insert the statement print repr(folder), repr(f)" did you not understand? All you need to do is something like this:
print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)
(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback:
self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely str objects, so why introduce blahblah.encode()??
A more general point: try to understand what your problem(s) is/are BEFORE changing your script. Thrashing about trying every suggestion, coupled with near-zero effective debugging technique, is not the way forward.
(14) When you run your script again, you might like to reduce the volume of output by running it over some subset of C:\ ... especially if you follow my original suggestion and debug-print ALL file names, not just the erroneous ones (knowing what the non-error ones look like may help in understanding the problem).
Response to Brian McLemore's “cleanup” function:
(15) Here is an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names:
C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest

20/11/2009  01:28 PM    <DIR>          .
20/11/2009  01:28 PM    <DIR>          ..
20/11/2009  11:48 AM    <DIR>          empty
20/11/2009  01:26 PM                11 Hašek.txt
20/11/2009  01:31 PM             1,419 tbyte1.py
29/12/2007  09:33 AM                 9 Ð.txt
               3 File(s)          1,439 bytes
[snip]

C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os
os.walk(unicode_string) -> results in unicode objects
>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
  [u'empty'],
  [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]
os.walk(str_string) -> results in str objects
>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
  ['empty'],
  ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]
cp1252 is the encoding I would expect to be used on my system ...
>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'
decoding the str with UTF-8 does not work, as expected
>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte
ANY random string of bytes can be decoded without error using latin1
>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'
BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer
>>> import unicodedata
>>> unicodedata.name(u'\u0161')
'LATIN SMALL LETTER S WITH CARON'
>>>
(16) That example shows what happens when the character is representable in the default character set; what happens if it is not? Here is an example (using IDLE this time) of a file name containing CJK ideographs, which definitely cannot be represented in my default character set:
IDLE 2.6.4
>>> import os
>>> from pprint import pprint as pp
repr(unicode results) looks fine
>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]
and the unicode displays just fine in IDLE:
>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt
The str result is evidently produced using .encode(whatever, "replace"), which is not very useful; for example, you cannot open the file by passing that as the file name.
>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]
So the conclusion is that, for best results, you should pass a unicode string to os.walk() and then deal with any display problems.
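One possible way of dealing with the display problems is to escape whatever the output encoding cannot represent, rather than letting print fail or substituting "?". This is a sketch in Python 3; displayable is a hypothetical helper, and the default of "ascii" is only a conservative assumption about your console.

```python
def displayable(name, encoding="ascii"):
    """Return name with unencodable characters shown as backslash escapes."""
    # backslashreplace keeps the information (unlike "replace" or "ignore"),
    # and the round trip leaves only characters the console can show.
    return name.encode(encoding, "backslashreplace").decode(encoding)


print(displayable(u"nihao\u4f60\u597d.txt"))
```

The escaped form is for display only; keep using the original unicode name when opening the file.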