Postgres encoding error in Sidekiq application

I am working on an application in which a Ruby Sidekiq process calls a third party, parses the response, and saves the data to a database.

I use Sequel as my ORM.

I get some weird characters in the results, for example:

"Tweets en Ingl \ xE9s y en Espa \ xF1ol"

When it tries to save to Postgres, the following error occurs:

Sequel::DatabaseError: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x73 0x20

The strange thing is that the string claims to be UTF-8; if I check the encoding name, it says:

 name.encoding.name # => "UTF-8"
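
For what it's worth, a quick check of whether the bytes are actually valid for the claimed encoding (a sketch using the example string above) shows the mismatch:

 name = "Tweets en Ingl\xE9s y en Espa\xF1ol"
 name.encoding.name    # => "UTF-8"
 name.valid_encoding?  # => false (the bytes do not form valid UTF-8)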

What can I do to get the data into the correct format for Postgres?

ruby encoding postgresql sequel
1 answer

Just because the string claims to be UTF-8 does not mean it is UTF-8. \xE9 is é in ISO-8859-1 (a.k.a. Latin-1), but it is not valid UTF-8; similarly, \xF1 is ñ in ISO-8859-1, but is not valid UTF-8. That suggests the string is actually encoded in ISO-8859-1, not UTF-8. You can fix it with force_encoding to correct Ruby's idea of what the current encoding is, and then with encode to re-encode it as UTF-8:

 > "Tweets en Ingl\xE9s y en Espa\xF1ol".force_encoding('iso-8859-1').encode('utf-8') => "Tweets en Inglés y en Español" 

So, before sending the string to the database, you want to do:

 name = name.force_encoding('iso-8859-1').encode('utf-8') 
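
If it helps to see it in context, here is a minimal sketch of where that conversion might live in a Sidekiq worker that saves through Sequel; the worker, table, and column names are made up for illustration and are not from your question:

 require 'sidekiq'
 require 'sequel'

 DB = Sequel.connect(ENV['DATABASE_URL'])  # your Postgres connection

 # Hypothetical worker: class, table, and column names are placeholders.
 class ImportTweetWorker
   include Sidekiq::Worker

   def perform(raw_title)
     # The third party appears to send ISO-8859-1 bytes mislabelled as UTF-8,
     # so relabel and re-encode before writing to Postgres.
     title = raw_title.force_encoding('iso-8859-1').encode('utf-8')
     DB[:tweets].insert(title: title)
   end
 end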

Unfortunately, there is no way to reliably determine a string's real encoding. The byte ranges of different encodings overlap, and there is no way to tell whether \xE8 should be è (ISO-8859-1) or č (ISO-8859-2) without inspecting the data by hand.
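
If you cannot be sure every payload really is ISO-8859-1, a slightly more defensive sketch (my own guess at a reasonable fallback, not a guaranteed fix) is to re-encode only when the string is not already valid UTF-8:

 def to_utf8(str)
   return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?

   # Guess ISO-8859-1: every byte value is defined there, so the conversion
   # never raises, but it may produce the wrong characters if the real
   # source encoding was something else (e.g. ISO-8859-2).
   str.dup.force_encoding('iso-8859-1').encode('utf-8')
 end

 to_utf8("Tweets en Ingl\xE9s y en Espa\xF1ol")  # => "Tweets en Inglés y en Español"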

