Binary Sort PostgreSQL UTF-8

Question

Binary Sort PostgreSQL UTF-8

I would like to have a sort that orders UTF-8 encoded 0x1234 below 0x1235 regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently (http://msdn.microsoft.com/en-us/library/ms143350.aspx) has BIN and BIN2 collations. Although those were easy to find, I can't even find a list of the collations PostgreSQL supports, much less an answer to this particular question.

+7

postgresql utf-8 collation

chx Oct 15 '11 at 15:23


3 answers

The sort order of text depends on lc_collate (and not on the system locale!). The system locale only serves as the default when creating a db cluster, unless you provide another locale.

The behavior you expect only works with the C locale. Read all about it in the manual:

The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

The emphasis is mine. PostgreSQL 9.1 has a couple of new features for collation. They may be exactly what you are looking for.

+5

Erwin Brandstetter Oct 15 '11 at 15:58


Postgres uses the collation defined by the system locale at the time the cluster is created.

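You can try ORDER BY encode(column, 'hex'). Note that encode() expects a bytea argument, so a text column would first need converting, for example with convert_to(column, 'UTF8').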
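Why sorting on the hex form could work: each byte becomes exactly two hex digits, and the digits 0-9 and a-f are themselves in ascending order, so comparing the hex strings character by character should give the same result as comparing the raw bytes. The following is a minimal Python sketch of that idea, not PostgreSQL itself; the sample strings and variable names are made up for illustration.

```python
# Sketch: ordering by the hex form of the UTF-8 bytes matches ordering
# by the raw bytes. The sample strings are arbitrary illustrations.
words = ["z", "\u00e9", "a", "\u1235", "\u1234", "Z", "ab"]

# Roughly what ORDER BY encode(..., 'hex') would compare.
by_hex = sorted(words, key=lambda s: s.encode("utf-8").hex())

# A plain bytewise comparison, as the C locale performs.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_hex == by_bytes
print(same_order)
```

The hex string is twice as long as the original data, so this is likely to be slower than a real bytewise collation on large tables.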

+1

Ramon Poca Oct 15 '11 at 15:45


chx · Accepted Answer · 2011-10-15T18:51:01+0000

The C locale will do. UTF-8 is designed so that byte order is also code point order. This is not trivial, but consider how UTF-8 works:

    Number range  Byte 1    Byte 2    Byte 3
    0000-007F     0xxxxxxx
    0080-07FF     110xxxxx  10xxxxxx
    0800-FFFF     1110xxxx  10xxxxxx  10xxxxxx

When sorting binary data, which is what the C locale does, the first differing byte determines the order. What we want to show is that if two numbers encoded in UTF-8 differ, the first differing byte is lower for the lower number. If the numbers are in different ranges, the first byte is indeed lower for the lower number, because the lead-byte prefixes 0, 110 and 1110 increase with the range. Within the same range, the order is determined by literally the same bits as without encoding.
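The argument above can be checked empirically. The following Python sketch is only an illustration of the encoding property, not of PostgreSQL's own sorting code; the helper name utf8_key and the random sample are assumptions made for the example. It encodes code points from the three ranges in the table and compares bytewise order with code point order.

```python
import random

def utf8_key(s: str) -> bytes:
    """Sort key that mimics a bytewise (C locale style) comparison."""
    return s.encode("utf-8")

# Sample distinct code points from the 1-, 2- and 3-byte ranges,
# skipping surrogates (D800-DFFF), which cannot be encoded in UTF-8.
rng = random.Random(0)
code_points = [cp for cp in rng.sample(range(0x0000, 0x10000), 2000)
               if not 0xD800 <= cp <= 0xDFFF]
chars = [chr(cp) for cp in code_points]

by_bytes = sorted(chars, key=utf8_key)   # order of the encoded bytes
by_code_point = sorted(chars, key=ord)   # order of the numeric values

orders_match = by_bytes == by_code_point
print(orders_match)

# The pair from the question: U+1234 encodes as E1 88 B4, U+1235 as E1 88 B5.
print(utf8_key("\u1234").hex(), utf8_key("\u1235").hex())
```

A random sample is not a proof, but it agrees with the bit-level reasoning: the two encodings from the question differ only in their last byte, and the lower code point has the lower byte.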
