Context: Transcoding a string from an external source for saving in the database
From a gem, I get a string s that has latin-1-encoded content and that I want to store in a Rails model.
r = MyRecord.new(mystring: s)
# ...
r.save
Because my PostgreSQL database uses UTF-8 encoding, saving the model after setting its string field to the string causes an error when that string contains certain non-ASCII characters:
ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65 ...
I can solve this easily by transcoding the string:
r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save
Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I’m passing the source encoding as the second argument. The gem that produced s probably isn’t aware that the file it read the string from is latin-1-encoded.
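To make the transcoding step concrete, here is a minimal sketch that needs neither the gem nor Rails. The string is a made-up stand-in built from the two bytes in the error message (0xdf 0x65), and the variable names are only for illustration.

```ruby
# Hypothetical stand-in for the gem's output: latin-1 bytes tagged as binary.
s = "\xDFe".b # bytes 0xdf 0x65, encoding ASCII-8BIT

# Passing the source encoding explicitly tells encode how to read the bytes.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

# Alternative with the same result: retag a copy as latin-1, then transcode.
utf8_alt = s.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8.bytes.inspect    # 0xdf (ß in latin-1) becomes the two bytes 0xc3 0x9f
puts utf8.valid_encoding?
```

Without the second argument, encode would have to convert from ASCII-8BIT, which has no mapping for 0xdf, so the source encoding has to come from somewhere.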
Challenge: Avoid hard-coding the destination encoding
It occurred to me that knowledge of the database’s string encoding does not belong in the part of the code that does the persisting, and thus the transcoding.
I can ask the model’s class for the database’s encoding:
MyRecord.connection.encoding # returns 'UTF8'
This doesn’t return a Ruby Encoding object though; it returns a string containing the encoding’s name. Fortunately, the Encoding class can be queried with names (and some aliases) to look up encodings:
Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8
Unfortunately, different naming conventions are used: MyRecord.connection.encoding returns 'UTF8' (no minus sign) while Encoding.find(...) needs to be passed 'UTF-8' (with minus sign) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
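The mismatch can be reproduced without a database connection. In this sketch the name reported by the connection is simply hard-coded as 'UTF8' (an assumption standing in for the real MyRecord.connection.encoding call); everything else is plain Ruby.

```ruby
# Hard-coded stand-in for what MyRecord.connection.encoding reports.
pg_name = 'UTF8'

by_name  = Encoding.find('UTF-8')   # Ruby's canonical name
by_alias = Encoding.find('CP65001') # one of Ruby's aliases for UTF-8

lookup_error = nil
begin
  Encoding.find(pg_name)
rescue ArgumentError => e
  lookup_error = e.message          # Ruby does not know the PostgreSQL spelling
end

puts by_name.equal?(by_alias)
puts lookup_error
```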
Question: Is there a clean and/or recommended way
to avoid hard-coding the destination encoding and instead dynamically determine and use the database’s encoding?
I don’t feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving the hard-coded values in the code.
Modifying Encoding.aliases()’s return value doesn’t have any effect:
Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
(and doesn’t feel right either, anyway), nor does modifying the return value of Encoding::UTF_8.names:
Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
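A quick check that seems to support this guess: each call hands back a fresh object, so changes to one return value are not visible in the next. This is only a sketch of the observed behavior, not a statement about how Ruby implements it internally.

```ruby
first  = Encoding.aliases
second = Encoding.aliases
fresh_hash = !first.equal?(second) # two calls, two distinct Hash objects

first['UTF8'] = 'UTF-8'            # changes only the local copy
alias_visible = Encoding.aliases.key?('UTF8')

names = Encoding::UTF_8.names
names.push('UTF8')                 # likewise only changes the local Array
name_visible = Encoding::UTF_8.names.include?('UTF8')

puts fresh_hash
puts alias_visible
puts name_visible
```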