Rails ActiveRecord string field encoding vs Ruby String encoding

Context: Transcoding a string from an external source for saving in the database

A gem gives me a string s whose content is Latin-1 (ISO-8859-1) encoded, and I want to store it in a Rails model.

r = MyRecord.new(mystring: s)
# ...
r.save

Because my PostgreSQL database uses UTF-8 encoding, saving the model after setting its string field to the string causes an error when that string contains certain non-ASCII characters:

ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR:  invalid byte sequence for encoding "UTF8": 0xdf 0x65
...
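
To illustrate why the database rejects the value, here is a minimal sketch in plain Ruby (no Rails needed). The sample value "Straße" is my own assumption, chosen because its Latin-1 bytes end in 0xdf 0x65, the same pair the error message complains about:

```ruby
# Hypothetical sample: "Straße" as Latin-1 bytes, tagged as binary (ASCII-8BIT).
s = "Stra\xDFe".b

# Relabel a copy as UTF-8 without changing any bytes.
as_utf8 = s.dup.force_encoding(Encoding::UTF_8)

# 0xDF announces a two-byte UTF-8 sequence, but 0x65 ("e") is not a
# continuation byte, so the string is not valid UTF-8.
valid = as_utf8.valid_encoding?

# The trailing bytes, formatted the way PostgreSQL prints them.
offending = s.bytes.last(2).map { |b| format('0x%02x', b) }
```

Here valid is false and offending is ["0xdf", "0x65"].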

I can solve this easily by transcoding the string:

r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save

(Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I’m passing the source encoding as the second argument. The gem that produced s probably isn’t aware that the file it read the string from is Latin-1 encoded.)
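
A small sketch of why the second argument matters, again using a made-up sample string rather than the gem's real output. Without a source encoding, Ruby has no mapping from a binary byte to a UTF-8 character:

```ruby
# Hypothetical sample: "Straße" as Latin-1 bytes, tagged as binary (ASCII-8BIT).
s = "Stra\xDFe".b

# Transcoding straight from ASCII-8BIT fails on the byte 0xDF.
failure = nil
begin
  s.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  failure = e.class
end

# Naming the real source encoding makes the conversion well defined.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

# Equivalent alternative: relabel a copy as Latin-1 first, then transcode.
relabelled = s.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

Both utf8 and relabelled end up as valid UTF-8 strings with the same content, while s itself is left untouched.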

Challenge: Avoid hard-coding the destination encoding

It occurred to me that knowledge about the database’s string encoding does not belong in the part of the code that does the persisting, and therefore also the transcoding.

I can ask the model’s class for the database’s encoding:

MyRecord.connection.encoding

This doesn’t return a Ruby Encoding object, though; it returns a string containing the encoding’s name. Fortunately, the Encoding class can be queried by name (and by some aliases) to look up encodings:

Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8

Unfortunately, the two use different naming conventions: MyRecord.connection.encoding returns 'UTF8' (no hyphen), while Encoding.find(...) needs to be passed 'UTF-8' (with hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
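
The mismatch can be reproduced without a database. In this sketch, find_encoding_or_nil is a hypothetical helper of my own (not part of Ruby or Rails) that only wraps Encoding.find so the three lookups can be compared side by side:

```ruby
# Hypothetical helper: look up a Ruby Encoding by name,
# returning nil instead of raising when the name is unknown.
def find_encoding_or_nil(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

by_name    = find_encoding_or_nil('UTF-8')    # Ruby's canonical name
by_alias   = find_encoding_or_nil('CP65001')  # an alias Ruby knows
by_pg_name = find_encoding_or_nil('UTF8')     # the name the connection reports
```

The first two lookups return Encoding::UTF_8; the third returns nil because Encoding.find raises ArgumentError for 'UTF8'.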

Sooooo close.

Question: Is there a clean and/or recommended way

to avoid hard-coding the destination encoding and instead dynamically determine and use the database’s encoding?

Discarded ideas

I don’t feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding, or on the contents of Encoding.aliases(), would be any better than just leaving the hard-coded values in the code.

Modifying Encoding.aliases()’s return value doesn’t have any effect:

Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

(and doesn’t feel right either, anyway), nor does modifying the return value of #names:

Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
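
A quick check that supports this guess, at least for Encoding.aliases. It is only a sketch of what I observe in plain Ruby, not a statement about documented guarantees:

```ruby
# Mutate the hash returned by one call...
aliases = Encoding.aliases
aliases['UTF8'] = 'UTF-8'

# ...and then ask again.
fresh = Encoding.aliases

same_object = aliases.equal?(fresh)  # is it the very same Hash object?
leaked      = fresh.key?('UTF8')     # did the added alias survive?
```

Both same_object and leaked are false: each call builds a new Hash, so the mutation never reaches the table that Encoding.find consults.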


