PostgreSQL 9.1 using collate in select statements

Question

I have a table in a PostgreSQL 9.1 database that uses the "en_US.UTF-8" locale:

 CREATE TABLE branch_language (
     id serial NOT NULL,
     name_language character varying(128) NOT NULL,
     branch_id integer NOT NULL,
     language_id integer NOT NULL,
     ....
 )

The name_language attribute contains names in different languages. The language is specified by the foreign key language_id.

I created several indexes:

 /* us english */
 CREATE INDEX idx_branch_language_2
     ON branch_language
     USING btree (name_language COLLATE pg_catalog."en_US");

 /* catalan */
 CREATE INDEX idx_branch_language_5
     ON branch_language
     USING btree (name_language COLLATE pg_catalog."ca_ES");

 /* portuguese */
 CREATE INDEX idx_branch_language_6
     ON branch_language
     USING btree (name_language COLLATE pg_catalog."pt_PT");

Now, when I run a select, I do not get the results I expect.

 select name_language
 from branch_language
 where language_id=42                    -- id of catalan language
 order by name_language collate "ca_ES"  -- use ca_ES collation

This generates a list of names, but not in the expected order:

 Aficions i Joguines
 Agència de viatges
 Aliments i Subministraments
 Aparells elèctrics i il luminació
 Art i Antiguitats
 Articles de la llar
 Bars i Restaurants
 ...
 Tabac
 Àudio, Vídeo, CD i DVD
 Òptica

I expected the last two entries to appear at different positions in the list.

Creating the indexes works. I do not think they are really necessary unless I want to optimize performance.

The select statement, however, seems to ignore the collate "ca_ES" part.

The problem also occurs when I select other collations. I tried "es_ES" and "pt_PT", but the results are similar.

postgresql collate

Henri Oct 17 '11 at 14:25

1 answer

Erwin Brandstetter · Answer 1 · 2011-10-20T01:24:59+0000

I cannot find a flaw in your design. I have tried.

Locales and collation

I revisited this question. Consider this test case on sqlfiddle. Everything seems to work just fine. I even created the locale ca_ES.utf8 on my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the collation to my DB cluster:

 CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as in sqlfiddle above.
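
A minimal self-contained test along those lines might look like the sketch below. It is not the original fiddle: the sample values are a few names taken from the question, and it assumes that a "ca_ES" collation exists in the database. The second query uses the built-in "C" collation, which compares byte by byte, so accented capitals end up after all plain ASCII letters, the same pattern as the list in the question.

```sql
-- Sketch only: sample names from the question, no real table involved.
-- Assumes the "ca_ES" collation exists (see pg_collation).
SELECT name_language
FROM  (VALUES ('Tabac'::text)
            , ('Àudio, Vídeo, CD i DVD')
            , ('Òptica')
            , ('Art i Antiguitats')
            , ('Bars i Restaurants')
      ) AS t(name_language)
ORDER  BY name_language COLLATE "ca_ES";

-- For comparison: byte-wise ordering, accented capitals sort last.
SELECT name_language
FROM  (VALUES ('Tabac'::text)
            , ('Àudio, Vídeo, CD i DVD')
            , ('Òptica')
            , ('Art i Antiguitats')
            , ('Bars i Restaurants')
      ) AS t(name_language)
ORDER  BY name_language COLLATE "C";
```

If both queries return the same order, the "ca_ES" collation is probably not doing what its name suggests on that system.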

Note that collation names are identifiers and must be double-quoted to preserve their mixed-case spelling, for example "ca_ES". Maybe there was some confusion with other locales on your system? Check the available collations:

 SELECT * FROM pg_collation;
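
To narrow that down to the collations used in the question, something like the following should do. The column names are those of the pg_collation catalog in 9.1; the list of names is just the three from the question's indexes.

```sql
-- Show which locale each collation is actually bound to.
-- collencoding = -1 means the collation works with any encoding.
SELECT collname, collencoding, collcollate, collctype
FROM   pg_collation
WHERE  collname IN ('en_US', 'ca_ES', 'pt_PT')
ORDER  BY collname;
```

A collation that is missing from the result was not imported from the operating system when the cluster was initialized.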

Collation rules are usually derived from the system locales. Read more in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". On Debian (and related Linux distributions) this can be done with:

 dpkg-reconfigure locales

NFC

I have one other idea: unnormalized Unicode strings.

Maybe your 'Àudio' is actually 'Audio' plus a separate combining accent '̀', rather than the single precomposed character 'À'? That would be this character:

 SELECT U&'\0300A';
 SELECT ascii(U&'\0300A');
 SELECT chr(768);

Read more about the combining grave accent (U+0300) on Wikipedia.
You have to SET standard_conforming_strings = TRUE to use Unicode strings like in the first line.

Note that some browsers cannot display unnormalized Unicode characters correctly, and many fonts have no proper glyph for the special characters, so you may see nothing here, or gibberish. But Unicode allows this nonsense. Test to see what you got:

 SELECT octet_length('̀A');  -- returns 3 (!)
 SELECT octet_length('À');   -- returns 2
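
To check the actual table rather than a literal, a query along these lines should list the suspicious rows. It is only a sketch: it assumes a UTF8 database (so that chr(768) yields U+0300) and looks for the combining grave accent only, so other combining marks would each need their own check.

```sql
-- Rows of the question's table that contain a combining grave accent.
-- chars vs. bytes helps to spot strings that are longer than they look.
SELECT id
     , name_language
     , char_length(name_language)  AS chars
     , octet_length(name_language) AS bytes
FROM   branch_language
WHERE  strpos(name_language, chr(768)) > 0;
```

No rows means this particular accent is not the culprit.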

If that is what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior Unicode-foo; you can use its libraries in a plperlu function to do it in PostgreSQL. I have done that to save me from madness.

Read the installation instructions in this great post on Unicode normalization in PostgreSQL by David Wheeler.
Read all the details about Unicode Normalization Forms at unicode.org.
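
For completeness: on a much newer server the same clean-up can be done without Perl. The sketch below assumes PostgreSQL 13 or later and a UTF8 database, where normalize() and IS NORMALIZED are built in. None of it is available in 9.1, so for the version in the question the plperlu route still applies.

```sql
-- Assumption: PostgreSQL 13+ with server encoding UTF8. Not available in 9.1.

-- In decomposed form the combining mark follows its base letter:
-- 'A' + U+0300 composes to the single character U+00C0 ('À').
SELECT normalize(U&'A\0300', NFC) = U&'\00C0' AS composed;

-- Rewrite only the rows of the question's table that are not yet in NFC.
UPDATE branch_language
SET    name_language = normalize(name_language, NFC)
WHERE  name_language IS NOT NFC NORMALIZED;
```

Run the UPDATE inside a transaction first and compare the sort order before committing.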
