In Perl, recognize that é is a variant of e, E

I process UTF-8 encoded strings in Perl. One of my tasks is to recognize that words starting with a letter with diacritics, such as écrit, begin with the same letter as elephant and England. I need a general solution, because I will work in several languages. I need this because I am creating letter headings for an index: each of the words I just mentioned will be stored in the "E" section.

Is there an easy way to do this?

3 answers

I assume that you are sorting according to English collation rules and that your text is alphabetic. The code below is a good start, but the real world is more complex. (For example, Chinese text has different lexicographic rules depending on the context: a general dictionary, lists of karaoke songs, an electronic list of doorbell names, ...) I cannot offer an ideal solution, because the question provides so little information.

    use 5.010;
    use utf8;
    use Unicode::Collate::Locale 0.96;
    use Unicode::Normalize qw(normalize);

    # Encode output as UTF-8 so the accented words print correctly
    binmode STDOUT, ':encoding(UTF-8)';

    my $c = Unicode::Collate::Locale->new(locale => 'en');

    # Sort with English collation rules
    say for $c->sort(qw(
        eye egg estate etc. eleven eg England ensure educate each
        equipment elephant ex- ending écrit
    ));
    say '-' x 40;

    # Decompose each word (NFD) and take the first character as the heading
    for my $word (qw(écrit Ëmëhntëhtt-Rê Ênio ècole Ēadƿeard Ėmma Ędward Ẽfini)) {
        say sprintf '%s should be stored under the heading %s',
            $word, ucfirst(substr(normalize('D', $word), 0, 1));
    }

    __END__
    each
    écrit
    educate
    eg
    egg
    elephant
    eleven
    ending
    England
    ensure
    equipment
    estate
    etc.
    ex-
    eye
    ----------------------------------------
    écrit should be stored under the heading E
    Ëmëhntëhtt-Rê should be stored under the heading E
    Ênio should be stored under the heading E
    ècole should be stored under the heading E
    Ēadƿeard should be stored under the heading E
    Ėmma should be stored under the heading E
    Ędward should be stored under the heading E
    Ẽfini should be stored under the heading E
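The decomposition idea can also be wrapped in a small helper that groups words under their headings. This is only a minimal sketch: the names index_heading and %index are hypothetical, and it assumes the words are already decoded character strings (here they are literals under use utf8).

```perl
use strict;
use warnings;
use utf8;                            # the source contains literal UTF-8
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: map a word to its index heading.
sub index_heading {
    my ($word) = @_;

    # NFD splits "é" into "e" followed by a combining acute accent.
    my $decomposed = NFD($word);

    # Take the first letter; \p{L} skips leading digits and punctuation.
    return '#' unless $decomposed =~ /(\p{L})/;
    return uc $1;
}

# Group some sample words under their headings.
my %index;
for my $word (qw(écrit elephant England Ênio apple Ångström)) {
    push @{ $index{ index_heading($word) } }, $word;
}

for my $heading (sort keys %index) {
    print "$heading: @{ $index{$heading} }\n";
}
```

Note that this relies on canonical decomposition only. Letters that have no decomposition, such as ø or ł, keep their own heading under this approach, which may or may not be what a given language expects.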

Text::Unidecode can help you. It transliterates Unicode to ASCII.

    $ perl -Mutf8 -e 'use Text::Unidecode; print unidecode("écrit")'
    ecrit

String equality and ordering are determined by things called collations. The hard part is that they depend on language and culture (the technical term is "locale"). For example, you may consider ø and o equivalent, but to Danes they are different letters and must be ordered differently.

The Perl module for working with collations is Unicode::Collate.
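As a rough illustration of how a collation can answer the original question, Unicode::Collate can compare strings at primary strength (level => 1), where accents and case are ignored. The helper heading_for and the A to Z heading list below are assumptions made for this sketch, not part of the module.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# level => 1 compares primary weights only: base letters count,
# accents and case do not.
my $primary = Unicode::Collate->new(level => 1);

# Hypothetical helper: return the heading that is primary-equal to the
# first character of the word, or undef if none matches.
sub heading_for {
    my ($collator, $word, @headings) = @_;

    # \X matches one extended grapheme cluster (a letter plus its marks).
    return undef unless $word =~ /^(\X)/;
    my $first = $1;

    for my $heading (@headings) {
        return $heading if $collator->eq($heading, $first);
    }
    return undef;
}

for my $word (qw(écrit elephant England)) {
    my $heading = heading_for($primary, $word, 'A' .. 'Z');
    print "$word -> ", (defined $heading ? $heading : '?'), "\n";
}
```

Passing a Unicode::Collate::Locale object (also created with level => 1) instead of $primary should apply that language's rules, with a heading list chosen to match the language's alphabet.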

Update: You can also use Perl's native locale support with use locale:

    use locale;
    use POSIX qw(setlocale LC_ALL);
    setlocale(LC_ALL, '');  # Set default locale from environment variables

This makes built-in functions and operators, such as sort and cmp, use locale rules to order strings. But be careful: changing the locale of a program can have unexpected consequences, such as changing the decimal point to a comma in printf output.

Update 2: POSIX locales appear to be broken in various ways. You are better off using Unicode::Collate and Unicode::Collate::Locale.
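To see the locale dependence without touching the process locale, the same word list can be sorted with two Unicode::Collate::Locale collators. This is a small sketch; the sample words are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Sample words chosen for illustration only.
my @words = qw(zebra øl okse);

# Sort the same list under English and Danish collation rules.
for my $locale (qw(en da)) {
    my $collator = Unicode::Collate::Locale->new(locale => $locale);
    print "$locale: ", join(' ', $collator->sort(@words)), "\n";
}
```

With the Danish tailoring, øl should sort after zebra, because ø is a separate letter near the end of the Danish alphabet. With the English rules it stays next to the words beginning with o.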
