Why are spaces ignored in natsort / strnatcmp / strnatcasecmp?

I use strnatcmp in my comparison function to sort user names in a table. For our Belgian client, we get some strange results. They have names such as "Van der Brocke" and "Vander Weir", and strnatcasecmp("Van der", "Vander") returns 0!

Since a natural comparison is meant to sort the way a person would, I don't understand why spaces are completely ignored.

For example:

    $names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker", "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen", "Van der Luizen");
    natcasesort($names);
    print_r($names);

gives:

    Array
    (
        [0] => Van de broecke
        [5] => vande Huizen
        [6] => vande Kluizen
        [2] => Vande Muizen
        [8] => Van der Luizen
        [7] => vander Muizen
        [4] => Van der Programma
        [1] => Vander Veere
        [3] => Vander Zoeker
    )

But a human would sort them as:

    Array
    (
        [0] => Van de broecke
        [4] => Van der Programma
        [8] => Van der Luizen
        [5] => vande Huizen
        [6] => vande Kluizen
        [2] => Vande Muizen
        [7] => vander Muizen
        [1] => Vander Veere
        [3] => Vander Zoeker
    )

My current workaround is to replace all spaces with underscores, which are handled appropriately. Two questions: Why does natsort work like this? Is there a better solution?

php natural-sort
3 answers

If you look at the source code, you can see this, which definitely looks like a bug: http://gcov.php.net/PHP_5_3/lcov_html/ext/standard/strnatcmp.c.gcov.php (scroll down to line 130):

    // inside a while loop...
    /* Skip consecutive whitespace */
    while (isspace((int)(unsigned char)ca)) {
        ca = *++ap;
    }

    while (isspace((int)(unsigned char)cb)) {
        cb = *++bp;
    }

Please note that the link is for 5.3, but the same code still exists in 5.5 ( http://gcov.php.net/PHP_5_5/lcov_html/ext/standard/strnatcmp.c.gcov.php ). Admittedly, my knowledge of C is limited, but basically this seems to advance the pointer in each string whenever the current character is whitespace, essentially ignoring that character when sorting. The comment implies that it does this only when the whitespace is consecutive; however, there is no check that the previous character was actually whitespace in the first place. Fixing that would require something like:

    // declare these outside the loop
    short prevAIsSpace = 0;
    short prevBIsSpace = 0;

    // ....in the loop
    while (prevAIsSpace && isspace((int)(unsigned char)ca)) {
        // won't get here the first time since prevAIsSpace == 0
        ca = *++ap;
    }
    // now if the character is a space, flag it for the next iteration
    prevAIsSpace = isspace((int)(unsigned char)ca);

    // repeat with string b
    while (prevBIsSpace && isspace((int)(unsigned char)cb)) {
        cb = *++bp;
    }
    prevBIsSpace = isspace((int)(unsigned char)cb);

Someone who really knows C could probably write this better, but that’s the general idea.
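To see the effect of that loop from PHP, here is a rough userland emulation. It is only a sketch: `compareIgnoringWhitespace` is a hypothetical helper, not part of PHP, and it simply deletes all whitespace before a case-insensitive comparison. It does not reproduce the number handling of strnatcasecmp().

```php
<?php
// Hypothetical helper (an assumption for illustration): removes every
// whitespace character, then compares the strings case-insensitively.
function compareIgnoringWhitespace($a, $b)
{
    $a = preg_replace('/\s+/', '', $a);
    $b = preg_replace('/\s+/', '', $b);
    return strcasecmp($a, $b);
}

// Once the space is dropped, "Van der" and "Vander" are the same string.
var_dump(compareIgnoringWhitespace("Van der", "Vander") === 0);

// For comparison, the built-in function (the question reports that it returns 0).
var_dump(strnatcasecmp("Van der", "Vander"));
```

This is why the "Van de ..." and "vande ..." names end up interleaved in the natcasesort() output from the question.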

As another potentially interesting option for your example: if you use PHP >= 5.4, the following gives the same result as the usort() approach mentioned by Aaron Saray (it also loses key/value associations):

    sort($names, SORT_FLAG_CASE | SORT_STRING);
    print_r($names);

    Array
    (
        [0] => Van de broecke
        [1] => Van der Luizen
        [2] => Van der Programma
        [3] => vande Huizen
        [4] => vande Kluizen
        [5] => Vande Muizen
        [6] => vander Muizen
        [7] => Vander Veere
        [8] => Vander Zoeker
    )
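If the original keys matter, asort() accepts the same flags and keeps the key/value associations. A minimal sketch, assuming PHP >= 5.4 and the `$names` array from the question:

```php
<?php
$names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker",
    "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen",
    "Van der Luizen");

// Case-insensitive string sort; unlike sort(), asort() preserves the keys.
asort($names, SORT_FLAG_CASE | SORT_STRING);

print_r(array_keys($names)); // keys in sorted order: 0, 8, 4, 5, 6, 2, 7, 1, 3
```

Note that this is a plain string comparison, so embedded numbers are not ordered naturally ("item 10" sorts before "item 2").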

Take a look at bugs.php.net #26412 (natsort() compresses multiple spaces to 1 space). Apparently, with this behavior, "aa", "a" and "a" (note 2 spaces) are not sorted as identical strings.


As other answers and comments have noted, this is a known issue. However, you can write your own sort with usort(). Try this and see if it works:

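    usort($names2, function($first, $second) {
        // lowercase both sides once so the whole comparison is case-insensitive
        $first = strtolower($first);
        $second = strtolower($second);
        if ($first == $second) {
            return 0;
        } else {
            return ($first < $second) ? -1 : 1;
        }
    });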

I noticed that the result is slightly different from your suggested answer:

You suggested:

    [4] => Van der Programma
    [8] => Van der Luizen

But I'm sure it was a typo - they need to be swapped. :)
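If you want to keep the natural ordering of embedded numbers while still treating spaces as significant, one option is to compare the names word by word. This is only a sketch under that assumption: `compareWordsNaturally` is a hypothetical helper name, and it treats any run of whitespace as a single separator.

```php
<?php
// Hypothetical comparator: splits each name on whitespace and compares the
// words pairwise with strnatcasecmp(), so "Van" sorts before "Vander".
function compareWordsNaturally($a, $b)
{
    $wordsA = preg_split('/\s+/', trim($a));
    $wordsB = preg_split('/\s+/', trim($b));
    $shared = min(count($wordsA), count($wordsB));

    for ($i = 0; $i < $shared; $i++) {
        $result = strnatcasecmp($wordsA[$i], $wordsB[$i]);
        if ($result !== 0) {
            return $result;
        }
    }

    // All shared words are equal: the name with fewer words sorts first.
    return count($wordsA) - count($wordsB);
}

$names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker",
    "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen",
    "Van der Luizen");

uasort($names, 'compareWordsNaturally'); // uasort() keeps the original keys
print_r($names);
```

Because each word is compared on its own, the whitespace skipping inside strnatcasecmp() never gets a chance to merge "Van der" into "Vander".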

