UCA + Natural Sort

I recently found out that PHP already supports the Unicode Collation Algorithm (UCA) through the intl extension:

    $array = array(
        'al', 'be', 'Alpha', 'Beta', 'Álpha', 'Àlpha', 'Älpha', 'かたかな',
        'img10.png', 'img12.png', 'img1.png', 'img2.png',
    );

    if (extension_loaded('intl') === true) {
        collator_asort(collator_create('root'), $array);
    }

    Array
    (
        [0] => al
        [2] => Alpha
        [4] => Álpha
        [5] => Àlpha
        [6] => Älpha
        [1] => be
        [3] => Beta
        [11] => img1.png
        [9] => img10.png
        [8] => img12.png
        [10] => img2.png
        [7] => かたかな
    )

As you can see, this seems to work just fine, even with mixed strings! The only drawback I have encountered so far is that there is no support for natural sorting, and I wonder what the best way to work around this would be, so that I can get the best of both worlds.

I tried specifying the Collator::SORT_NUMERIC sort flag, but the result became even messier:

    collator_asort(collator_create('root'), $array, Collator::SORT_NUMERIC);

    Array
    (
        [8] => img12.png
        [7] => かたかな
        [9] => img10.png
        [10] => img2.png
        [11] => img1.png
        [6] => Älpha
        [5] => Àlpha
        [1] => be
        [2] => Alpha
        [3] => Beta
        [4] => Álpha
        [0] => al
    )

However, if I run the same test with only the img*.png values, I get the perfect output:

    Array
    (
        [3] => img1.png
        [2] => img2.png
        [1] => img10.png
        [0] => img12.png
    )

Can anyone think of a way to keep the Unicode-aware sorting while adding natural sorting capabilities?

+6
sorting php natural-sort unicode
3 answers

After digging a bit more into the documentation, I found a solution:

    if (extension_loaded('intl') === true) {
        if (is_object($collator = collator_create('root')) === true) {
            $collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
            $collator->asort($array);
        }
    }

Output:

    Array
    (
        [0] => al
        [3] => Alpha
        [5] => Álpha
        [6] => Àlpha
        [7] => Älpha
        [1] => be
        [4] => Beta
        [10] => img1.png
        [11] => img2.png
        [8] => img10.png
        [9] => img12.png
        [2] => かたかな
    )
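
For reuse, the same idea can be wrapped in a small function. This is only a sketch under assumptions: the helper name `sortNaturalUca` and its `$locale` parameter are hypothetical and not part of the intl extension. Only `Collator`, `setAttribute()`, `Collator::NUMERIC_COLLATION`, `Collator::ON` and `asort()` come from intl.

```php
<?php
// Hypothetical helper (not part of intl): returns a copy of $items sorted with
// UCA rules, with runs of digits compared by their numeric value.
function sortNaturalUca(array $items, string $locale = 'root'): array
{
    if (!extension_loaded('intl')) {
        throw new RuntimeException('The intl extension is required.');
    }

    $collator = new Collator($locale);
    // Treat digit sequences as numbers instead of comparing digit by digit.
    $collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
    // asort() sorts in place and preserves the original keys.
    $collator->asort($items);

    return $items;
}

// Usage, guarded so the sketch does nothing when intl is unavailable.
if (extension_loaded('intl')) {
    print_r(sortNaturalUca(['img10.png', 'img2.png', 'img1.png']));
}
```

Returning a sorted copy rather than sorting by reference keeps the caller's array untouched, at the cost of one extra array copy.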
+4

This is trivial. You simply preprocess the list so that the numbers are zero-padded. For example, running my ucsort script, which supports the UCA, on this list of file names:

    % cat /tmp/numfiles
    img4.png
    img1.png
    img2.png
    img12.png
    img21.png
    img10.png
    img20.png
    img3.png
    img22.png

produces the desired result, using the Unicode::Collate module's --preprocess hook to zero-pad the runs of digits:

    % ucsort --preprocess='s/(\d+)/sprintf "%020d", $1/ge' /tmp/numfiles
    img1.png
    img2.png
    img3.png
    img4.png
    img10.png
    img12.png
    img20.png
    img21.png
    img22.png

Looking at the PHP documentation that you cite, it does not seem that this PHP library supports the full UCA tailoring capabilities that Perl's Unicode::Collate module supports. Actually, it looks more like Perl's Unicode::Collate::Locale, except that the PHP library code does not seem to support the inherited tailoring parameters that the Perl code does.

I suppose that, if all else fails, you could call out to Perl code to do what you need.
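
The zero-padding step itself needs nothing Perl-specific. As a rough sketch in PHP, assuming a hypothetical helper named `padDigitRuns`, the padded string can be used as a comparison key only, so the original names are left intact. The byte-wise `strcmp()` here is a stand-in that keeps the sketch free of dependencies; with intl loaded, `Collator::compare()` on the padded keys would give UCA ordering instead.

```php
<?php
// Hypothetical helper: left-pad every run of ASCII digits to a fixed width,
// so that comparing the strings character by character orders the numbers
// by value.
function padDigitRuns(string $s, int $width = 20): string
{
    return preg_replace_callback(
        '/[0-9]+/',
        static function (array $m) use ($width): string {
            // str_pad() avoids the integer overflow sprintf('%d') would hit
            // on very long digit runs.
            return str_pad($m[0], $width, '0', STR_PAD_LEFT);
        },
        $s
    );
}

$names = ['img4.png', 'img12.png', 'img1.png', 'img10.png'];

// Compare the padded keys, but keep the original strings as the values.
usort($names, static function (string $a, string $b): int {
    return strcmp(padDigitRuns($a), padDigitRuns($b));
});

print_r($names); // img1.png, img4.png, img10.png, img12.png
```

A width of 20 mirrors the `%020d` used above. Digit runs longer than the chosen width are left unpadded and would no longer compare by value.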

+1

Based on @tchrist's answer, I came up with the following:

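    function sortIntl($array, $natural = true)
    {
        $data = $array;

        if ($natural === true) {
            // Zero-pad every run of digits so that they collate by numeric value.
            $data = preg_replace_callback('~([0-9]+)~', 'natsortIntl', $data);
        }

        // Sort the padded copy, preserving keys.
        collator_asort(collator_create('root'), $data);

        // Keep the sorted key order of $data but restore the original values
        // (array_intersect_key() would return them in their original order).
        return array_replace($data, $array);
    }

    // preg_replace_callback() passes the array of matches, not the bare string.
    function natsortIntl($matches)
    {
        return sprintf('%020d', $matches[0]);
    }

Output:

    Array
    (
        [0] => 1
        [1] => 100
        [2] => al
        [4] => Alpha
        [6] => Álpha
        [7] => Àlpha
        [8] => Älpha
        [3] => be
        [5] => Beta
        [10] => img1.png
        [11] => img2.png
        [12] => img10.png
        [13] => img20.png
        [9] => かたかな
    )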

Still hoping for a better solution.

0
