Sort and compare strings by locale in Haskell?

Is it possible to sort strings containing national characters correctly in Haskell (GHC), that is, to order them according to the current locale settings?

I found only the ICU module, but it requires installing an additional library that is not a standard part of Linux distributions. I would like a solution based on the POSIX C library (glibc or similar), so that there is no additional dependency to manage.

1 answer

Recommended Method: text-icu

The recommended approach to robust, locale-sensitive string handling is text together with text-icu, as you have seen. The text library is part of the standard set of libraries, the Haskell Platform.

Example, sorting Turkish strings:

    {-# LANGUAGE OverloadedStrings #-}

    import Data.Text.IO  as T
    import Data.Text.ICU as T
    import Data.List (sortBy)

    main = do
        let trLocale = T.Locale "tr-TR"
            str      = "ÇIİĞÖŞÜ"
            strs     = take 10 (cycle $ T.toLower trLocale str : str : [])
        mapM_ T.putStrLn (sortBy (T.compare [T.FoldCaseExcludeSpecialI]) strs)

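This appears to sort the strings lexicographically, after correctly lowercasing the Turkish string (note the dotless ı). Be aware that T.compare takes no locale argument: it compares strings for canonical equivalence, so for ordering that follows a locale's collation rules text-icu's collator and collate functions are the better fit.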

    *Main> main
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    çıiğöşü
    çıiğöşü
    çıiğöşü
    çıiğöşü
    çıiğöşü

Without the text-icu package

In your question you asked to avoid solutions that need libraries beyond what POSIX provides. Although text-icu is easy to install from Hackage ( cabal install text-icu ), it depends on the ICU C library, which is not available everywhere. However, there is no POSIX-only alternative that is as reliable or comprehensive, and text-icu is the only package that correctly handles conversions involving multi-character mappings.
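For reference, the POSIX-only route the question has in mind would go through the C library's strcoll via the FFI. The following is a minimal sketch, not part of the original answer. The names c_setlocale, c_strcoll, lcCollate, initCollation and collate are hypothetical, and the numeric value used for LC_COLLATE is an assumption that holds for glibc on Linux (check locale.h on other platforms). It also assumes every string can be encoded in the locale's character encoding, since withCString marshals with that encoding.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Monad (unless)
import Data.List (sortBy)
import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CInt (..))
import Foreign.Ptr (nullPtr)
import System.IO (hPutStrLn, stderr)
import System.IO.Unsafe (unsafePerformIO)

foreign import ccall unsafe "locale.h setlocale"
  c_setlocale :: CInt -> CString -> IO CString

foreign import ccall unsafe "string.h strcoll"
  c_strcoll :: CString -> CString -> IO CInt

-- ASSUMPTION: LC_COLLATE is 3 on glibc/Linux; other platforms differ.
lcCollate :: CInt
lcCollate = 3

-- Select the collation rules named by the environment (LANG, LC_ALL, ...).
-- Returns False if the C library rejects the request.
initCollation :: IO Bool
initCollation = withCString "" $ \empty -> do
  r <- c_setlocale lcCollate empty
  return (r /= nullPtr)

-- Compare two strings with strcoll. The result depends on the process-wide
-- locale, so this is only safe to treat as pure while the locale stays fixed.
collate :: String -> String -> Ordering
collate a b = unsafePerformIO $
  withCString a $ \pa ->
    withCString b $ \pb -> do
      r <- c_strcoll pa pb
      return (compare r 0)

main :: IO ()
main = do
  ok <- initCollation
  unless ok $ hPutStrLn stderr "setlocale failed; keeping the current collation"
  -- The order of mixed-case words depends on the active locale.
  mapM_ putStrLn (sortBy collate ["zebra", "Apple", "apple", "Zoo"])
```

The comparison goes through strcoll on every call, which is simple but slow for large lists, and the quality of the result depends entirely on the locale definitions installed on the system.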

Staying within base, Haskell's built-in Char and String types hold Unicode code points, and Data.Char provides functions for Unicode-aware but locale-insensitive case conversion, using the wchar_t functions defined by the Open Group. In addition, we can do IO on handles in a locale-sensitive (textual) fashion.

    import System.IO
    import Data.Char
    import Data.List (sort)

    main = do
        t <- mkTextEncoding "UTF-8"
        hSetEncoding stdout t
        let str  = "ÇIİĞÖŞÜ"
            strs = take 10 (cycle $ map toLower str : str : [])
        mapM_ putStrLn (sort strs)

By default, GHC uses your system's text locale for IO (e.g. UTF-8), and for many problems this will probably give the correct answer. Be aware, though, that it will also be wrong in many cases, because correct results require processing text in bulk (whole strings rather than one character at a time), with rich support for conversions and comparisons.

    *Main> main
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    ÇIİĞÖŞÜ
    çiiğöşü
    çiiğöşü
    çiiğöşü
    çiiğöşü
    çiiğöşü
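The lowercased lines in this output read "çiiğöşü" rather than the Turkish "çıiğöşü". As a small sketch of why, using only base: Data.Char.toLower works one code point at a time and never consults a locale, and sort on String compares code points. The helper names lowerSimple and codePoints are made up for this illustration.

```haskell
import Data.Char (ord, toLower)
import Data.List (sort)

-- Lowercase a String one Char at a time; no locale is consulted.
lowerSimple :: String -> String
lowerSimple = map toLower

-- Code points of a string, to make the ordering used by `sort` visible.
codePoints :: String -> [Int]
codePoints = map ord

main :: IO ()
main = do
  -- Turkish expects 'I' -> 'ı' (U+0131); Data.Char maps both 'I' and 'İ' to 'i'.
  print (codePoints (lowerSimple "Iİ"))
  -- 'ç' (U+00E7) lands after 'z' (U+007A) because only code points are compared.
  print (map codePoints (sort ["z", "ç", "a"]))
```

Printing code points instead of the strings themselves keeps the demonstration independent of the terminal's encoding.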
