Context: Re-encoding a string from an external source before saving it to the database
From a gem, I get a string s that has Latin-1 encoded content, which I want to store in a Rails model.
r = MyRecord.new(mystring: s)
Since my PostgreSQL database uses UTF-8 encoding, saving the model after assigning this string to its string field raises an error whenever the string contains certain non-ASCII characters:
ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65 ...
I can easily solve this by re-encoding the string:
r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
(Since s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I pass the source encoding as the second argument. The gem that produced s probably does not know that the file it reads the string from is Latin-1 encoded.)
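To illustrate why the second argument matters, here is a minimal sketch. The byte string is a made-up stand-in for what the gem returns (the two bytes from the error message above, tagged as binary); it is an assumption, not actual gem output.

```ruby
# Hypothetical stand-in for the gem's output: Latin-1 bytes tagged as binary.
s = "\xDF\x65".b
s.encoding # ASCII-8BIT

# Without a source encoding, Ruby cannot map the binary byte 0xDF to UTF-8.
without_source =
  begin
    s.encode(Encoding::UTF_8)
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Naming the real source encoding makes the transcoding succeed.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
utf8.encoding        # UTF-8
utf8.valid_encoding? # true
```

Here 0xDF is read as the Latin-1 character "ß", so the result is the two-character UTF-8 string "ße".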
Challenge: Avoid hard-coding the target encoding
It occurred to me that knowledge of the database's encoding does not belong in the part of the code where I do the saving and, therefore, the transcoding.
I can ask the model class for the database encoding:
MyRecord.connection.encoding
This does not return a Ruby Encoding object, but a string containing the encoding name. Fortunately, the Encoding class can be queried by name (and by some aliases) to look up encodings:
Encoding.find 'UTF-8'
Unfortunately, different naming conventions are used: MyRecord.connection.encoding returns 'UTF8' (no minus sign), whereas Encoding.find(...) must be passed 'UTF-8' (with a minus sign) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
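The mismatch can be reproduced without a database. In this small sketch, the variable db_name is a hypothetical placeholder holding the value that MyRecord.connection.encoding is reported to return above.

```ruby
# Names Ruby recognizes for UTF-8.
by_name  = Encoding.find('UTF-8')   # #<Encoding:UTF-8>
by_alias = Encoding.find('CP65001') # same Encoding object, found via an alias

# Hypothetical placeholder for the value of MyRecord.connection.encoding.
db_name = 'UTF8'

# Ruby does not know this spelling, so the lookup raises ArgumentError.
lookup =
  begin
    Encoding.find(db_name)
  rescue ArgumentError => e
    e.message
  end
```

After this runs, lookup holds the error message rather than an Encoding object.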
Sooooo close.
Question: Is there a clean and/or recommended way
to avoid hard-coding the target encoding and instead dynamically detect and use the database encoding for this?
Discarded ideas
I don't feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving hard-coded values in the code.
Modifying the return value of Encoding.aliases() has no effect:
Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8'
(and, in any case, it doesn't feel right), and neither does modifying the return value of #names:
Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8'
I think that both methods return only dynamically created collections, or copies of the underlying collections, and for good reason.
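A quick check that appears consistent with this guess, at least on the Ruby versions I am aware of: each call hands back a new object, so mutations never reach the encoding registry. The variable names below are only illustrative.

```ruby
# Two calls to Encoding.aliases yield two distinct Hash objects.
a = Encoding.aliases
b = Encoding.aliases
fresh_hash = !a.equal?(b)

# Mutating one copy is not visible in the next call.
a['UTF8'] = 'UTF-8'
alias_persisted = Encoding.aliases.key?('UTF8')

# The same holds for the Array returned by Encoding#names.
names = Encoding::UTF_8.names
names.push('UTF8')
name_persisted = Encoding::UTF_8.names.include?('UTF8')
```

If the guess holds, fresh_hash is true while alias_persisted and name_persisted are both false.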