Does ICU support sorting a list of strings in different languages?

My application can have strings in different alphabets / languages in one list. I cannot find any information about what the correct method for sorting them should be, or any sign that ICU supports this.

Example list:

  • Apple
  • an Apple
  • μήλο
  • Child
  • βρέφος
  • child
+4
4 answers

Setting aside all the caveats above, there is one “standard universal multilingual sort”: the Unicode Collation Algorithm (UCA), which is NOT code point order. From a quick look at this page, ICU appears to implement the UCA combined with locale-specific preferences (tailorings).
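To illustrate what a collation-based sort looks like compared with code point order, here is a minimal sketch. It uses the JDK's built-in `java.text.Collator` with `Locale.ROOT` as a stand-in, which is an assumption made to keep the example dependency-free: the JDK collator is rule-based and is not ICU. ICU4J's `com.ibm.icu.text.Collator` offers a similar `getInstance` / `compare` API, but the exact ordering may differ. The class and method names are made up for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CollatorSortSketch {
    // Sorts a copy of the input with a locale-aware collator instead of code unit order.
    static List<String> collatedSort(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // TERTIARY strength keeps case differences significant ("child" vs "Child").
        collator.setStrength(Collator.TERTIARY);
        // Canonical decomposition so precomposed accented letters compare consistently.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Apple", "an Apple", "μήλο", "Child", "βρέφος", "child");
        System.out.println(collatedSort(words, Locale.ROOT));
    }
}
```

With this approach "child" and "Child" end up next to each other and the Latin entries come before the Greek ones, instead of all uppercase Latin words sorting ahead of all lowercase ones.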

+5

There is no reasonable way to do this well. No universal sort order exists for all languages, even within the same alphabet. Different languages (cultures, mainly) have developed different rules for how words are sorted.

The only way to do this consistently, in my opinion, is to use a plain old ordinal sort (for example, in Java, String.compareTo).
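A minimal sketch of that ordinal sort, applied to the example list from the question. The class and method names are made up for illustration. Note that `String.compareTo` compares UTF-16 code units, which matches code point order for these strings but can differ for characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class OrdinalSortSketch {
    // Sorts a copy of the input by String's natural ordering (String.compareTo).
    static List<String> ordinalSort(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Apple", "an Apple", "μήλο", "Child", "βρέφος", "child");
        // prints [Apple, Child, an Apple, child, βρέφος, μήλο]
        System.out.println(ordinalSort(words));
    }
}
```

The result is stable and predictable, but uppercase Latin letters sort before all lowercase ones, so "Child" lands ahead of "an Apple".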

You could come up with some heuristics, depending on what your data represents. You could group the strings based on guesses about their alphabet and language, and then use a locale-specific sort for each group. But I think you would have to do it the hard way (writing the code yourself), because your guess will vary depending on the term (for example, is "mar" an English verb or a Spanish noun?). Arguably, the unpredictable “errors” would leave you with a worse result than a naive Unicode ordinal sort.
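The grouping heuristic could be sketched roughly as follows. This is only an illustration under strong assumptions: it guesses the script from the first letter of each string using `Character.UnicodeScript`, it does not attempt language detection at all (so it cannot resolve the "mar" ambiguity), and the script-to-locale mapping is a hypothetical parameter supplied by the caller.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ScriptGroupingSketch {
    // Crude guess: the script of the first letter, or COMMON if there is no letter.
    static Character.UnicodeScript guessScript(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .mapToObj(Character.UnicodeScript::of)
                .findFirst()
                .orElse(Character.UnicodeScript.COMMON);
    }

    // Groups strings by guessed script, then sorts each group with a collator
    // for the locale the caller associates with that script (root locale otherwise).
    static Map<Character.UnicodeScript, List<String>> groupAndSort(
            List<String> input, Map<Character.UnicodeScript, Locale> localeForScript) {
        // TreeMap orders the groups by the enum's declaration order.
        Map<Character.UnicodeScript, List<String>> groups = new TreeMap<>();
        for (String s : input) {
            groups.computeIfAbsent(guessScript(s), k -> new ArrayList<>()).add(s);
        }
        for (Map.Entry<Character.UnicodeScript, List<String>> e : groups.entrySet()) {
            Locale locale = localeForScript.getOrDefault(e.getKey(), Locale.ROOT);
            e.getValue().sort(Collator.getInstance(locale));
        }
        return groups;
    }
}
```

For the example list, a call such as `groupAndSort(words, Map.of(Character.UnicodeScript.GREEK, new Locale("el")))` yields one Latin group and one Greek group, each sorted on its own.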

As with anything else, it depends on how much effort you can afford to put into the solution and what kind of performance you need.

This suggestion may not be the answer you are looking for: if there is any way to identify the locale when the strings are first stored, you should do so and record it as part of the string's metadata. Then you will not have this problem.
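One way the stored locale could be used at sort time is sketched below. The `Tagged` type, its fields, and the choice to order groups by language tag are all assumptions made for illustration; the JDK's `java.text.Collator` again stands in for whichever collation library is actually in use.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TaggedStringSketch {
    // Hypothetical value type: the text plus the locale recorded when it was stored.
    static final class Tagged {
        final String text;
        final Locale locale;

        Tagged(String text, Locale locale) {
            this.text = text;
            this.locale = locale;
        }
    }

    // Orders items by language tag first, then by that locale's collation rules.
    static List<Tagged> sortByLocaleThenText(List<Tagged> items) {
        Map<Locale, Collator> collators = new HashMap<>(); // one collator per locale
        List<Tagged> copy = new ArrayList<>(items);
        copy.sort((a, b) -> {
            int byLocale = a.locale.toLanguageTag().compareTo(b.locale.toLanguageTag());
            if (byLocale != 0) {
                return byLocale;
            }
            // Same locale here, so a's collator applies to both strings.
            return collators.computeIfAbsent(a.locale, Collator::getInstance)
                    .compare(a.text, b.text);
        });
        return copy;
    }
}
```

Because the locale comes from metadata rather than a guess, an ambiguous term such as "mar" sorts under whichever locale it was stored with.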

+5

As @Zac mentioned, there is no universal sort order. Sorting by code point will be consistent, but it may not be what the user expects.

Therefore, you should probably use the sort order of the user's selected language. Any code points not covered by that sort order will be grouped together.

+2

You could transliterate everything into your “target” language (so that it is all in one script) and then sort. But languages have conflicting sorting rules.

0