Ruby: read a CSV file as UTF-8 and/or convert ASCII-8BIT encoded strings to UTF-8

I am using Ruby 1.9.2.

I am trying to parse a CSV file containing several French words (e.g. spécifié) and put the contents in a MySQL database.

When I read the rows from the CSV file with

file_contents = CSV.read("csvfile.csv", col_sep: "$") 

the fields come back as strings encoded as ASCII-8BIT (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT stored properly in my MySQL database.

Yehuda Katz says that ASCII-8BIT is really “binary” data, which means that CSV does not know how to read the corresponding encoding.

So, if I try to force CSV to read the file as UTF-8, as follows:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")

I get the following error:

 ArgumentError: invalid byte sequence in UTF-8: 

If I go back to the original ASCII-8BIT encoded strings and inspect the string that CSV read as ASCII-8BIT, it looks like "Non sp\xE9cifi\xE9" instead of "Non spécifié".

I cannot convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this:

 "Non sp\xE9cifi\xE9".encode("UTF-8")

because I get this error:

 Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8

which, as Katz pointed out, happens because ASCII-8BIT is not really a proper string "encoding".
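
Both failures can be reproduced without the CSV file. The snippet below is a minimal sketch, assuming the file's bytes are what the question shows (0xE9 for "é"); the variable names are made up for illustration. It shows that relabelling the bytes as UTF-8 leaves an invalid string, and that transcoding from ASCII-8BIT raises because binary data has no character mapping.

```ruby
# Bytes as CSV returned them in the question, tagged as binary (ASCII-8BIT).
raw = "Non sp\xE9cifi\xE9".dup.force_encoding("ASCII-8BIT")

# 1) Relabelling as UTF-8 does not change the bytes. 0xE9 starts a
#    multi-byte UTF-8 sequence, but the bytes after it are not
#    continuation bytes, so the string is not valid UTF-8.
as_utf8 = raw.dup.force_encoding("UTF-8")
utf8_valid = as_utf8.valid_encoding?

# 2) Transcoding from binary fails: Ruby has no character to map 0xE9 to.
encode_error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  encode_error = e.class
end

puts "valid as UTF-8? #{utf8_valid}"
puts "encode raised:  #{encode_error}"
```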

Questions:

  • Can I get CSV to read my file in the appropriate encoding? If so, how?
  • How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
+44
string ruby encoding csv utf-8
Aug 13 '11 at 1:27
3 answers

deceze is right: that is ISO-8859-1 (AKA Latin-1) encoded text. Try this:

 file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1") 

And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:

 require 'iconv'
 utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Iconv.iconv can also convert whole arrays at once:

 utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings) 

With newer Rubies you can do things like this:

 utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8') 

where latin1_string is tagged as ASCII-8BIT but really contains ISO-8859-1 bytes.
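
Iconv is no longer bundled with Ruby from 2.0 on, so the force_encoding/encode pair is the one to reach for on current versions. Here is a minimal sketch of it for a single string and for an array; the helper name latin1_to_utf8 and the sample strings are made up for illustration, and it assumes the input really is ISO-8859-1.

```ruby
# Hypothetical helper: relabel binary-tagged bytes as ISO-8859-1, then
# transcode them to UTF-8. dup keeps the caller's string untouched.
def latin1_to_utf8(str)
  str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Sample inputs, tagged ASCII-8BIT the way CSV returned them in the question.
latin1_string  = "Non sp\xE9cifi\xE9".dup.force_encoding("ASCII-8BIT")
latin1_strings = [latin1_string, "d\xE9fini".dup.force_encoding("ASCII-8BIT")]

utf8_string  = latin1_to_utf8(latin1_string)
utf8_strings = latin1_strings.map { |s| latin1_to_utf8(s) }

puts utf8_string
puts utf8_strings.inspect
```

Note that force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves, which is why both calls are needed.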

+50
Aug 13 '11 at 2:20

With Ruby >= 1.9 you can use

 file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8") 

The value ISO8859-1:utf-8 means: the CSV file is encoded as ISO8859-1, but its contents are converted to UTF-8 as they are read.

If you prefer more verbose code, you can use:

 file_contents = CSV.read("csvfile.csv", col_sep: "$", external_encoding: "ISO8859-1", internal_encoding: "utf-8" ) 
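
To try the "external:internal" encoding form end to end, the sketch below writes a small Latin-1 file and reads it back. The temporary file and its contents are made up for illustration and stand in for csvfile.csv.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny Latin-1 sample file in binary mode so the 0xE9 bytes
# reach the disk unchanged.
sample = Tempfile.new(["latin1_sample", ".csv"])
sample.binmode
sample.write("id$label\n1$Non sp\xE9cifi\xE9\n".dup.force_encoding("ASCII-8BIT"))
sample.close

# External encoding ISO8859-1 (what is in the file), internal encoding
# UTF-8 (what the parsed strings should be).
rows = CSV.read(sample.path, col_sep: "$", encoding: "ISO8859-1:utf-8")
label = rows[1][1]

puts label
puts label.encoding

sample.unlink
```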
+21
Nov 20 '15 at 19:15

I have been dealing with this problem for a while, and none of the other solutions worked for me.

What did the trick was to write the conflictive string to a file in binary mode, then read the file back in normal mode and feed that string to the CSV module:

 require 'csv'
 require 'tempfile'

 # Write the raw bytes to disk unchanged
 tempfile = Tempfile.new("conflictive_string")
 tempfile.binmode
 tempfile.write(conflictive_string)
 tempfile.close
 # Read them back in normal (text) mode, then clean up
 cleaned_string = File.read(tempfile.path)
 File.delete(tempfile.path)
 csv = CSV.new(cleaned_string)
0
Nov 20 '15 at 16:03


