Ruby: read a CSV file as UTF-8 and/or convert ASCII-8BIT encoding to UTF-8

Question

I am using Ruby 1.9.2.

I am trying to parse a CSV file containing several French words (e.g. spécifié) and put the contents in a MySQL database.

When I read lines from a CSV file,

file_contents = CSV.read("csvfile.csv", col_sep: "$")

the fields come back as ASCII-8BIT encoded strings (spécifié becomes "sp\xE9cifi\xE9"), and strings like "spécifié" are then NOT stored properly in my MySQL database.

Yehuda Katz says that ASCII-8BIT is really “binary” data, which means that CSV does not know how to read the corresponding encoding.

So, if I try to make CSV use a specific encoding, as follows:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")

I get the following error

 ArgumentError: invalid byte sequence in UTF-8:

If I go back to the original ASCII-8BIT encoded strings and look at a line that CSV read as ASCII-8BIT, it shows up as "Non sp\xE9cifi\xE9" instead of "Non spécifié".

I cannot convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this:

 "Non sp\xE9cifi\xE9".encode("UTF-8")

because I get this error:

 Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8

which, as Katz pointed out, happens because ASCII-8BIT is not really a proper string "encoding".
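The two errors above can be reproduced without a file. This is only a diagnostic sketch: it assumes the bytes are Latin-1 (a guess at this point in the question), and the variable names are made up for illustration.

```ruby
# The bytes from the question, tagged ASCII-8BIT (binary), as CSV returned them.
raw = "Non sp\xE9cifi\xE9".b

# Relabelled as UTF-8, the bytes are not a valid sequence: 0xE9 announces a
# three-byte UTF-8 character, but it is not followed by continuation bytes.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?

# Relabelled as ISO-8859-1 (an assumption about the file), every byte is a
# valid character.
as_latin1 = raw.dup.force_encoding('ISO-8859-1')
puts as_latin1.valid_encoding?

# Transcoding straight from ASCII-8BIT raises, because binary bytes above
# 0x7F have no defined mapping to UTF-8.
begin
  raw.encode('UTF-8')
  conversion_failed = false
rescue Encoding::UndefinedConversionError
  conversion_failed = true
end
puts conversion_failed
```

Checking valid_encoding? against a few candidate encodings is a cheap way to rule encodings out, although it cannot prove which single-byte encoding a file really uses.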

Questions:

Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?

+44

string ruby encoding csv utf-8

user141146 Aug 13 '11 at 1:27

3 answers

With Ruby >= 1.9 you can use

 file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")

The value ISO8859-1:utf-8 means: the CSV file is encoded in ISO8859-1, and its contents are converted to UTF-8 as they are read.

If you prefer more verbose code, you can use:

 file_contents = CSV.read("csvfile.csv", col_sep: "$", external_encoding: "ISO8859-1", internal_encoding: "utf-8" )
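A minimal sketch of the transcoding read, for anyone who wants to try it without the original file. The sample row and the temporary file are made up for illustration; only the CSV.read call comes from the answer above.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: one "$"-separated row, written as ISO-8859-1 bytes.
latin1_bytes = "Non sp\xE9cifi\xE9$42\n".b

rows = nil
Tempfile.create(['sample', '.csv']) do |f|
  # Write the bytes untouched, then close so CSV.read can reopen the path.
  f.binmode
  f.write(latin1_bytes)
  f.close

  # External encoding ISO8859-1, internal encoding UTF-8.
  rows = CSV.read(f.path, col_sep: "$", encoding: "ISO8859-1:utf-8")
end

first_field = rows[0][0]
puts first_field.encoding
puts first_field
```

The fields come back already tagged and encoded as UTF-8, so no per-string conversion should be needed before handing them to the database.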

+21

knut Nov 20 '15 at 19:15

I have been dealing with this problem for a while, and none of the other solutions worked for me.

What did the trick was to store the conflictive string in a file opened in binary mode, then read the file back in normal mode and feed that string to the CSV module:

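 require 'csv'
 require 'tempfile'
 # Write the raw bytes untouched, then read them back in normal (text) mode.
 tempfile = Tempfile.new("conflictive_string")
 tempfile.binmode
 tempfile.write(conflictive_string)
 tempfile.close
 cleaned_string = File.read(tempfile.path)
 File.delete(tempfile.path)
 csv = CSV.new(cleaned_string)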

0

fguillen Nov 20 '15 at 16:03

mu is too short · Accepted Answer · 2011-08-13 02:20

deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try the following:

 file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")

And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:

 require 'iconv'
 utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". In addition, Iconv.iconv can convert entire arrays at once:

 utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies you can do things like this:

 utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
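Since Iconv was removed from the standard library in Ruby 2.0, the force_encoding/encode pair is the one that still runs on current Rubies. A rough sketch of it, for a single string and for an array, with made-up sample strings:

```ruby
# Hypothetical input: Latin-1 bytes that Ruby has tagged as ASCII-8BIT.
latin1_string = "Non sp\xE9cifi\xE9".b

# force_encoding only changes the label on the bytes (done on a copy here);
# encode then transcodes those bytes into UTF-8.
utf8_string = latin1_string.dup.force_encoding('iso-8859-1').encode('utf-8')

# The same conversion applied to a whole array of strings.
latin1_strings = ["sp\xE9cifi\xE9".b, "fran\xE7ais".b]
utf8_strings = latin1_strings.map do |s|
  s.dup.force_encoding('iso-8859-1').encode('utf-8')
end

puts utf8_string
puts utf8_strings.inspect
```

The dup calls are there because force_encoding modifies its receiver in place; drop them if relabelling the original strings is acceptable.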
