Java CollationKey sorts incorrectly

Question

I have a problem comparing strings. I want to compare two French strings, "éd" and "ef", like this:

    Collator localeSpecificCollator = Collator.getInstance(Locale.FRANCE);
    CollationKey a = localeSpecificCollator.getCollationKey("éd");
    CollationKey b = localeSpecificCollator.getCollationKey("ef");
    System.out.println(a.compareTo(b));

This prints -1, but in the French alphabet e comes before é. Yet when I compare only é and e, like this:

    Collator localeSpecificCollator = Collator.getInstance(Locale.FRANCE);
    CollationKey a = localeSpecificCollator.getCollationKey("é");
    CollationKey b = localeSpecificCollator.getCollationKey("e");
    System.out.println(a.compareTo(b));

the result is 1. Can you tell me what is wrong with the first snippet?

java compare locale

Ashot Aug 10 '12 at 10:09

2 answers

From the Javadoc:

You can set a Collator's strength property to determine the level of difference considered significant in comparisons. Four strengths are provided: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The exact assignment of strengths to language features is locale dependent. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical.

Try different strengths:

 localeSpecificCollator.setStrength(Collator.PRIMARY);

and see what happens.
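
As a rough sketch of what changing the strength does: the class and method names below (CollationStrengthDemo, compareAt) are made up for illustration, and the accented letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.Locale;

class CollationStrengthDemo {

    // Compares two strings with a French collator set to the given strength.
    static int compareAt(int strength, String left, String right) {
        Collator collator = Collator.getInstance(Locale.FRANCE);
        collator.setStrength(strength);
        return collator.compare(left, right);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // é

        // PRIMARY looks at base letters only, so é and e compare as equal (0).
        System.out.println(compareAt(Collator.PRIMARY, eAcute, "e"));

        // SECONDARY takes accents into account, so é sorts after e (positive).
        System.out.println(compareAt(Collator.SECONDARY, eAcute, "e"));

        // SECONDARY still ignores case (0); TERTIARY does not (non-zero).
        System.out.println(compareAt(Collator.SECONDARY, "e", "E"));
        System.out.println(compareAt(Collator.TERTIARY, "e", "E"));

        // "éd" vs "ef" already differs at the primary level (d vs f),
        // so the result is negative whatever the strength.
        System.out.println(compareAt(Collator.PRIMARY, eAcute + "d", "ef"));
    }
}
```

Note that lowering the strength makes é and e compare as equal, but it does not change the order of "éd" and "ef", because those two strings already differ in their base letters.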

user647772 Aug 10 '12 at 10:15

assylias · Accepted Answer · 2012-08-10T13:17:59+0000

This is apparently the expected behavior, and also seems to be the correct way to sort alphabetically in French.

The Android Javadoc gives a hint as to why it behaves this way. I believe the implementation details in Android are similar, if not identical, to those of the standard JDK:

A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings.

In other words, since your two strings can be ordered by looking at the primary differences alone (ignoring accents), the collator does not check for other differences.

This seems consistent with the Unicode Collation Algorithm (UCA):

Accent differences are typically ignored, if the base letters differ.

And this also seems to be the right way to sort alphabetically in French, according to the French Wikipedia article "Ordre alphabétique":

En première analyse, les caractères accentués, de même que les majuscules, ont le même rang alphabétique que le caractère fondamental.
Si plusieurs mots ont le même rang alphabétique, on tâche de les distinguer entre eux grâce aux majuscules et aux accents (pour le e, on a l'ordre e, é, è, ê, ë).

In English: the ordering first ignores accents and case. Only if two words cannot be ordered that way are accents and case taken into account.
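
To see the rule in action, here is a minimal sketch that sorts a few strings with the French collator and, for contrast, in plain String order. The class and method names (FrenchSortDemo, sortFrench, sortByCodePoint) are hypothetical, and é is again written as a Unicode escape.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class FrenchSortDemo {

    // Sorts a copy of the list with the French collator (default strength).
    static List<String> sortFrench(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Collator.getInstance(Locale.FRANCE));
        return sorted;
    }

    // Sorts a copy of the list with String.compareTo, i.e. by char value.
    static List<String> sortByCodePoint(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ef", "\u00e9d", "\u00e9", "e");

        // Collator: base letters decide first, accents only break ties.
        // Order: e, é, éd, ef
        System.out.println(sortFrench(words));

        // Plain String order: é (U+00E9) sorts after every ASCII letter.
        // Order: e, ef, é, éd
        System.out.println(sortByCodePoint(words));
    }
}
```

With the collator, "e" and "é" are tied on base letters, so the accent decides and "e" comes first. "éd" and "ef" are not tied (d comes before f), so the accent is never consulted, which is exactly the -1 from the question.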
