Rails ActiveRecord String Field Encoding and Ruby String Encoding

Context: Recoding a string from an external source to save to the database

From a gem, I get a string s that contains Latin-1 encoded content, which I want to store in a Rails model.

    r = MyRecord.new(mystring: s)
    # ...
    r.save

Since my PostgreSQL database uses UTF-8 encoding, saving the model after assigning this string to its string field raises an error whenever the string contains certain non-ASCII characters:

 ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65 ... 

I can easily solve this by recoding the string:

    r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
    # ...
    r.save

(Since s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I pass the source encoding as the second argument. The gem that produced s probably does not know that the file it reads the string from is Latin-1 encoded.)
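To make the role of that second argument concrete, here is a minimal sketch. The method name latin1_to_utf8 and the sample bytes are made up for illustration; the bytes 0xdf 0x65 from the error message above would read "ße" in Latin-1, so the sample uses the word "Straße".

```ruby
# Hypothetical helper: interpret raw bytes as Latin-1 and transcode to UTF-8.
# Passing the source encoding explicitly is what makes this work for a
# string that is tagged as binary (ASCII-8BIT).
def latin1_to_utf8(bytes)
  bytes.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
end

# Sample input: Latin-1 bytes tagged as binary, as the gem seems to deliver them.
raw = "Stra\xDFe".b

utf8 = latin1_to_utf8(raw)
puts utf8.encoding
puts utf8.valid_encoding?

# Without the source encoding, Ruby has no mapping for the binary byte 0xDF
# and raises instead of guessing.
begin
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  puts "conversion failed: #{e.class}"
end
```

The design point is that the binary tag tells Ruby nothing about what the bytes mean, so the caller has to supply that knowledge.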

Challenge: Avoid Hard-Coding the Target Encoding

It occurred to me that knowledge of the database's string encoding does not belong in the part of the code where I do the saving and, therefore, the transcoding.

I can ask the model class for the database encoding:

 MyRecord.connection.encoding 

This does not return a Ruby Encoding object, but a string containing the encoding name. Fortunately, the Encoding class can look up encodings by name (and by some aliases):

 Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8 

Unfortunately, the naming conventions differ: MyRecord.connection.encoding returns 'UTF8' (no hyphen), whereas Encoding.find(...) must be passed 'UTF-8' (with a hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.

Sooooo close.

Question: Is there a clean and/or recommended way to avoid hard-coding the target encoding, and instead detect the database encoding dynamically and use it for the transcoding?

Discarded ideas

I don't feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving the hard-coded values in the code.

Changing the return value of Encoding.aliases() has no effect:

    Encoding.aliases['UTF8'] = 'UTF-8'
    Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

(and it wouldn't feel right in any case). Modifying the return value of #names has no effect either:

    Encoding::UTF_8.names.push('UTF8')
    Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

I think that both methods return dynamically built collections or copies of the underlying collections, and for good reason.
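That suspicion can be checked directly. The following is a small sketch with two hypothetical helper methods; each one mutates the collection it gets back and then asks again. On the Ruby versions I am aware of, both calls print false, which would be consistent with fresh copies being returned on every call.

```ruby
# Hypothetical check: add an entry to the hash returned by Encoding.aliases,
# then ask Encoding.aliases again and see whether the entry is there.
def alias_mutation_visible?(name, target)
  Encoding.aliases[name] = target
  Encoding.aliases.key?(name)
end

# Hypothetical check: same idea for the array returned by Encoding#names.
def name_mutation_visible?(encoding, name)
  encoding.names.push(name)
  encoding.names.include?(name)
end

puts alias_mutation_visible?('UTF8', 'UTF-8')
puts name_mutation_visible?(Encoding::UTF_8, 'UTF8')
```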

1 answer

The simplest and perhaps the cleanest solution to this problem would be to not call Encoding.find directly, but to have a utility method (perhaps in a module located in lib/yourapp) that knows about the naming differences for the encodings you care about and falls back to Encoding.find for all other inputs:

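    module YourApp
      module DatabaseStringEncoding
        # Translate a database encoding name into a Ruby Encoding.
        def self.find(name)
          case name
          when 'UTF8' then Encoding::UTF_8
          # ... (other names that differ between the database and Ruby)
          else Encoding.find(name)
          end
        end
      end
    end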

This is easy to understand and easy to find (as opposed to changing Encoding directly, which would be invisible to anyone reading the code that does the transcoding). Building on this find method, you can go on to implement a method that automatically transcodes a string into the database's encoding using YourRecord.connection.encoding.
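A minimal sketch of that follow-up step might look like this. The names DB_NAME_TO_RUBY and transcode are hypothetical, and the database encoding name is passed in as a plain argument so the example runs without a database connection; in the application you would pass YourRecord.connection.encoding instead.

```ruby
module YourApp
  module DatabaseStringEncoding
    # Hypothetical lookup table: database encoding names that Ruby's
    # Encoding.find does not recognize. Only 'UTF8' comes from the question;
    # extend it with whatever other names your database reports.
    DB_NAME_TO_RUBY = { 'UTF8' => Encoding::UTF_8 }.freeze

    module_function

    # Resolve a database encoding name, falling back to Encoding.find.
    def find(name)
      DB_NAME_TO_RUBY.fetch(name) { Encoding.find(name) }
    end

    # Transcode string from source_encoding into the encoding the database
    # reports under db_encoding_name.
    def transcode(string, db_encoding_name, source_encoding)
      string.encode(find(db_encoding_name), source_encoding)
    end
  end
end

# Usage with sample Latin-1 bytes tagged as binary; 'UTF8' stands in for
# the value of YourRecord.connection.encoding.
raw = "Stra\xDFe".b
puts YourApp::DatabaseStringEncoding.transcode(raw, 'UTF8', Encoding::ISO_8859_1)
```

With this in place, the calling code only needs to know the source encoding of the data it receives, while the mapping between database and Ruby names lives in one visible spot.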

I know that it would be more satisfying to get Encoding.find to do exactly what you want, but I would say that this "dumb" approach is actually the better one. :-)
