Reading ASCII-encoded files with Ruby 1.9 in UTF-8

I just upgraded from Ruby 1.8 to 1.9, and most of my text-processing scripts now fail with the error `invalid byte sequence in UTF-8`. I need to either strip out the invalid characters or tell Ruby to read the files as ASCII (or whatever encoding the C stdio functions that created them used). How can I do either of these things?

The latter is preferred, because as far as I can tell there is nothing wrong with the files on disk: if there are strange, invalid characters in them, they don't show up in my editor ...

1 answer

What is your locale set to in the shell? On Linux-based systems you can check by running the `locale` command, and change it like so:

$ export LANG=en_US

I assume you are using locale settings whose encoding is UTF-8, which leads Ruby to assume that text files were created according to UTF-8 encoding rules. You can see this by trying:

$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
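
If changing the shell locale isn't convenient, the same effect can be had from inside Ruby by naming the encoding explicitly. A minimal sketch, assuming Ruby 1.9+ (where `File.open` mode strings accept an encoding); the file name is made up:

```ruby
# Write a byte (0xE9, Latin-1 "é") that is valid ISO-8859-1 but invalid UTF-8.
File.open("latin1.txt", "wb") do |f|
  f.write("caf\xE9".force_encoding("BINARY"))
end

# Per-file: the mode string names the external encoding to read with.
text = File.open("latin1.txt", "r:ISO-8859-1") { |f| f.read }
puts text.encoding         # ISO-8859-1
puts text.valid_encoding?  # true

# Or process-wide, for every file opened afterwards:
Encoding.default_external = Encoding::ISO_8859_1
```

Setting `Encoding.default_external` overrides whatever Ruby inferred from the locale, so scripts behave the same regardless of the shell they run in.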

For a more general discussion of how string encoding changed in Ruby 1.9, I highly recommend http://blog.grayproductions.net/articles/ruby_19s_string

(The code examples assume bash or a similar shell; C-shell derivatives differ.)
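
If you instead want the question's first option, stripping the invalid characters, `String#encode` can do it. A sketch: on 1.9 the trick is to transcode through a second encoding, since encoding a string to its own encoding is a no-op, so the converter never sees the bad bytes.

```ruby
raw = "caf\xE9 latte"       # 0xE9 is a bare Latin-1 byte, invalid in UTF-8
puts raw.valid_encoding?    # false

# Round-trip through UTF-16 so the converter actually runs, dropping
# (replace: "") anything that is not valid in the source encoding.
clean = raw.encode("UTF-16BE", invalid: :replace, undef: :replace, replace: "").encode("UTF-8")
puts clean                  # caf latte
puts clean.valid_encoding?  # true
```

On Ruby 2.1 or later, `raw.scrub("")` does the same thing in one call.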
