Python Unicode file name handling

I've already read a lot on this topic, including what seems to be the definitive guide: http://docs.python.org/howto/unicode.html

Perhaps for a more experienced developer this guide might be enough. However, in my case, I am more confused than when I started and still have not solved my problem.

I am trying to read file names using os.walk() and get certain information about the files (e.g. file size) before writing this information to a text file. This works until I come across files with non-ASCII (Unicode) names. When the script reaches such a file, I get an error similar to this:

WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'Documents\\??.txt' 

In this case, the file was named 唽 咿 .txt.

Here's how I tried to do it so far:

 for (root, dirs, files) in os.walk(dirpath):
     for filename in files:
         filepath = os.path.join(root, filename)
         filesize = os.stat(filepath).st_size
         file = open(filepath, 'rb')
         stuff = get_stuff(filesize, file)
         file.close()

In case it matters, dirpath is set earlier in the code with dirpath = raw_input().

I tried various things, such as changing the file path line:

 filepath = unicode(os.path.join(unicode(root), unicode(filename))) 

But nothing I tried worked.

Here are my two questions:

  • How can I pass the correct file name to os.stat() so that I get a correct result from it?

  • My script needs to write some file names to a text file that it may have to read back later. At that point, it should be able to find each file based on what it just read from the text file. How do I write such file names to a text file and then read them back later?

2 answers

Pass a unicode path to os.walk().

From the Python 2 documentation for os.listdir(): "Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects."
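A minimal sketch of what this looks like in practice. The helper name list_file_sizes is made up for illustration and is not part of the original script; the only assumption is that dirpath is already a unicode object (in Python 3, any ordinary str).

```python
import os


def list_file_sizes(dirpath):
    """Return (path, size) pairs for every file below dirpath.

    Hypothetical helper for illustration. dirpath is assumed to be a
    unicode string, so os.walk() yields unicode names and os.stat()
    receives a path it can resolve even for non-ASCII file names.
    """
    results = []
    for root, dirs, files in os.walk(dirpath):
        for filename in files:
            # root and filename are both unicode here, so the joined
            # path is unicode as well.
            filepath = os.path.join(root, filename)
            results.append((filepath, os.stat(filepath).st_size))
    return results
```

In Python 2 you could call it as list_file_sizes(u"Documents"); a byte-string argument such as "Documents" would bring back the '??' names from the question.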


For those interested in a complete solution:

 dirpath = raw_input() 

has been changed to:

 dirpath = raw_input().decode(sys.stdin.encoding) 

This made the argument passed to os.walk() a unicode object, so the file names it returned were unicode as well.
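For anyone who wants the decoding step in one place, here is a small sketch. The helper name to_unicode and the UTF-8 fallback are my own assumptions, not something from the script above; sys.stdin.encoding can be None when input is piped, which is why a fallback is included.

```python
import sys


def to_unicode(value, encoding=None):
    """Decode a byte string to unicode; return unicode input unchanged.

    Hypothetical helper. In Python 2, raw_input() returns bytes, so it
    gets decoded; in Python 3, input() already returns text and is
    passed through as-is.
    """
    if isinstance(value, bytes):
        # Fall back to UTF-8 if no encoding is known (an assumption).
        return value.decode(encoding or sys.stdin.encoding or "utf-8")
    return value
```

With this, dirpath = to_unicode(raw_input()) behaves like the one-liner above whenever sys.stdin.encoding is set.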

To write them to a file and read them back (my second question), I used the codecs.open() function.
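Roughly, the write and read sides can look like the sketch below. The function names save_names and load_names, the UTF-8 encoding, and the one-name-per-line layout are assumptions made for illustration; a file name containing a newline would not survive this format.

```python
import codecs


def save_names(names, list_path):
    """Write unicode file names to list_path, one per line, as UTF-8."""
    with codecs.open(list_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + u"\n")


def load_names(list_path):
    """Read the names back as unicode strings, skipping empty lines."""
    with codecs.open(list_path, "r", encoding="utf-8") as src:
        return [name for name in src.read().split(u"\n") if name]
```

The strings returned by load_names() are unicode again, so they can go straight to os.stat() or open().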
