Python glob module and Unix 'find' command do not recognize non-ASCII characters

I am on Mac OS X 10.8.2

When I try to find files whose names contain non-ASCII characters, I get no results, although I know for sure that they exist. Take, for example, this console input:

> find */Bärlauch* 

I am not getting any results. But if I try without umlaut, I will get

 > find */B*rlauch*
 images/Bärlauch1.JPG

So the file definitely exists. If I rename a file replacing 'ä' with 'ae', the file will be found.

Similarly, the Python glob cannot find the file:

 >>> glob.glob('*/B*rlauch*')
 ['images/Bärlauch1.JPG']
 >>> glob.glob('*/Bärlauch*')
 []

I realize that this probably has something to do with encoding, but my terminal is configured for UTF-8, and I am using Python 3.3.0, which uses Unicode strings.

2 answers

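Mac OS X always stores file names on HFS+ in decomposed form (a variant of Unicode normalization form NFD), so 'ä' is stored as 'a' followed by a combining diaeresis (U+0308). A pattern typed with the single precomposed character 'ä' (U+00E4) therefore does not match. Use unicodedata.normalize('NFD', pattern) to decompose the glob pattern.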

 import glob
 import unicodedata
 glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
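
To see why the literal pattern fails, the two forms can be compared directly. This is only a minimal sketch using the standard library; the helper name `glob_nfd` is hypothetical (it is not part of glob), and it assumes the file system stores decomposed names, as HFS+ does.

```python
import glob
import unicodedata

# 'ä' written as one precomposed code point (NFC), as it is usually typed.
composed = "B\u00e4rlauch"
# The same text decomposed: 'a' followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", composed)

# Both print identically, but they are different code point sequences.
print(len(composed), len(decomposed))   # 8 9
print(composed == decomposed)           # False
print(ascii(decomposed))                # 'Ba\u0308rlauch'


def glob_nfd(pattern):
    """Hypothetical helper: decompose the pattern to NFD, then glob."""
    return glob.glob(unicodedata.normalize("NFD", pattern))
```

With such a helper, `glob_nfd('*/Bärlauch*')` would search for the decomposed spelling. On file systems that keep names in composed form, the pattern would presumably need 'NFC' instead.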

Python programs are basically text files. Usually people write them using only characters from the ASCII character set, and therefore they don't need to think about the encoding the file is saved in: all common character encodings agree on how to decode ASCII characters.

You wrote a Python program using a non-ASCII character. So your program has an implicit encoding (which you did not mention): to save such a file, you need to decide how you are going to represent a-umlaut on disk. I suspect that your editor may have chosen a non-Unicode encoding for you.

In any case, there are two ways to deal with such a problem: either you limit yourself to using only ASCII characters in the source code of your program, or you declare to Python that you want it to read the source file with a specific encoding.

To do the first, you must replace the a-umlaut with its Unicode escape sequence (\xe4 or \u00e4; the code point is 228 in decimal). To do the latter, you must add an encoding declaration at the top of the file:

 # -*- coding: <your encoding> -*- 