Python sorts "u11-Phrase 1000.wav" before "u11-Phrase 101.wav"; how can i overcome this?

Question

I am running Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32.

When I ask Python

    >>> "u11-Phrase 099.wav" < "u11-Phrase 1000.wav"
    True

That is fine. When I ask

    >>> "u11-Phrase 100.wav" < "u11-Phrase 1000.wav"
    True

That is also fine. But when I ask

    >>> "u11-Phrase 101.wav" < "u11-Phrase 1000.wav"
    False

So, according to Python, "u11-Phrase 100.wav" comes before "u11-Phrase 1000.wav", but "u11-Phrase 101.wav" comes after "u11-Phrase 1000.wav"! This is a problem for me, because I am trying to write a file renaming program, and this sort order breaks it.

What can I do to overcome this? Should I write my own cmp function and check for edge cases, or is there a much simpler shortcut to give me the order I want?

On the other hand, if I zero-pad the number, the comparison gives the result I want:

    >>> "u11-Phrase 0101.wav" < "u11-Phrase 1000.wav"
    True

However, these strings come from a directory listing, for example:

    files = glob.glob('*.wav')
    files.sort()
    for file in files:
        ...

Therefore, I would prefer not to do string manipulation on the names after glob has returned them. And no, I do not want to change the original file names in that folder.

Any clues?

python sorting

Emre Sevinç Dec 21 '09 at 13:17

2 answers

You need to create the correct sort key for each file name. Something like this should do what you want:

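    import re

    def k(s):
        # Split into digit and non-digit runs; convert the digit runs to int.
        # A conditional expression is used so that "0" is converted as well.
        return [int(w) if w.isdigit() else w for w in re.split(r'(\d+)', s)]

    files = ["u11-Phrase 099.wav", "u11-Phrase 1000.wav", "u11-Phrase 100.wav"]
    print(files)
    print(sorted(files, key=k))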

It outputs this result:

    ['u11-Phrase 099.wav', 'u11-Phrase 1000.wav', 'u11-Phrase 100.wav']
    ['u11-Phrase 099.wav', 'u11-Phrase 100.wav', 'u11-Phrase 1000.wav']

The k function splits each file name into runs of digits and non-digits and (more importantly) converts the digit runs into integers:

    >>> k('u11-Phrase 099.wav')
    ['u', 11, '-Phrase ', 99, '.wav']

We then rely on the fact that Python knows how to sort lists: it compares two lists element by element, and the first pair that differs decides the order. The end result is that

    >>> k('u11-Phrase 99.wav') < k('u11-Phrase 100.wav')
    True

whereas

    >>> 'u11-Phrase 99.wav' < 'u11-Phrase 100.wav'
    False

as you already know.
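As a rough sketch of how this could be applied to the list from the question, the same kind of key can be passed to `list.sort()` so the names themselves are never modified. The name `natural_key` is just an illustrative choice, the list below is a stand-in for the result of `glob.glob('*.wav')`, and the code is written for a current Python rather than tested on 2.5:

```python
import re

def natural_key(name):
    # Split into digit and non-digit runs; convert the digit runs to int
    # so that they compare numerically instead of character by character.
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', name)]

# Stand-in for glob.glob('*.wav'); these names are only for illustration.
files = ["u11-Phrase 1000.wav", "u11-Phrase 101.wav",
         "u11-Phrase 099.wav", "u11-Phrase 100.wav"]

# Sort in place using the key; the strings in the list stay unchanged.
files.sort(key=natural_key)
```

After the sort, "u11-Phrase 101.wav" comes before "u11-Phrase 1000.wav". One caveat of this sketch: it assumes all names share the same text/number layout, since a current Python refuses to compare an integer with a string.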

Martin Geisler Dec 21 '09 at 13:29

Ned Batchelder · Accepted Answer · 2009-12-21T13:23:39+0000

You are looking for human sorting.

The reason 101.wav is not less than 1000.wav is that computers (not just Python) compare strings character by character, and the first difference between these two strings is where the first string has "1" and the second string has "0". "1" is not less than "0", so the strings compare the way you saw.

People naturally parse these strings into their components and interpret the numbers numerically, not lexically. The code I linked to above will do the same parsing.
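To make the character-by-character comparison concrete, here is a small sketch that locates the first position where the two names differ. `first_difference` is a hypothetical helper written for this illustration; it is not part of Python or of the linked code:

```python
def first_difference(a, b):
    """Return (index, char_a, char_b) for the first differing position.

    Returns None when no position differs within the shorter string,
    i.e. the strings are equal or one is a prefix of the other.
    """
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    return None

a = "u11-Phrase 101.wav"
b = "u11-Phrase 1000.wav"

# The names agree up to "u11-Phrase 10"; the next characters decide.
index, char_a, char_b = first_difference(a, b)
```

Here the differing characters are "1" (from "101") and "0" (from "1000"), and since "1" sorts after "0", the whole string "u11-Phrase 101.wav" sorts after "u11-Phrase 1000.wav".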
