Ruby CSV: UTF-8 encoding error while reading

This is what I did:

csv = CSV.open(file_name, "r") 

I used this for testing:

 line = csv.shift
 while not line.nil?
   puts line
   line = csv.shift
 end

And I came across this:

 ArgumentError: invalid byte sequence in UTF-8 

I read the answer here and tried this:

 csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8") 

I encountered the following error:

 Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8 

Then I came across a Ruby gem, charlock_holmes, and decided to use it to detect the source encoding.

 CharlockHolmes::EncodingDetector.detect(File.read(file_name))
 # => {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}

So, I did this:

 csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8") 

And still got this:

 Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8 
1 answer

It looks like you are having trouble finding the correct encoding for your file. CharlockHolmes gives you a useful hint, :confidence=>37, which means the detected encoding may well be incorrect.

Based on the error messages and test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb I found an encoding that can convert both of the bytes from your error messages. This is easy to verify with String#encode:

 "\x8F\x98".encode("UTF-8","cp1256") # => "ڈک" 
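
The same check can be run against several candidates at once. This is only a minimal sketch: the helper name convertible_encodings and the candidate list are assumptions made for illustration, not part of Ruby or of the CSV library. It keeps the encodings that can convert every byte to UTF-8 without raising.

```ruby
# Hypothetical helper (not a Ruby built-in): returns the candidate
# encodings that can convert every byte of `bytes` to UTF-8.
def convertible_encodings(bytes, candidates)
  candidates.select do |name|
    begin
      # .b gives a binary copy; the bytes are then interpreted as `name`.
      bytes.b.encode("UTF-8", name)
      true
    rescue EncodingError
      # Covers UndefinedConversionError and InvalidByteSequenceError.
      false
    end
  end
end

# The two bytes taken from the error messages in the question.
candidates = %w[Windows-1250 Windows-1251 Windows-1252 Windows-1256]
puts convertible_encodings("\x8F\x98", candidates).inspect
```

Passing this check does not prove that an encoding is the right one. ISO-8859-1, for example, accepts every byte value, so the converted text still has to be read by a person.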

Your problem appears to lie with the file itself, not with Ruby.

If we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef options of String#encode, in this case:

 "\x8F\x98".encode("UTF-8", "CP1250", :invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
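
Applied to a whole CSV file, the lossy conversion might look like the sketch below. The helper name read_csv_lossy and the sample file contents are assumptions made for illustration. It reads the raw bytes, converts them with the replacement options, and only then hands the text to CSV.parse, so the parser never sees an invalid byte.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: reads a CSV file whose source encoding is only a
# guess, replacing any byte that cannot be converted with "?".
def read_csv_lossy(path, source_encoding)
  raw = File.binread(path) # raw bytes, no transcoding yet
  text = raw.encode("UTF-8", source_encoding,
                    invalid: :replace, undef: :replace, replace: "?")
  CSV.parse(text)
end

# Demo with a throwaway file containing the two bytes from the question.
Tempfile.create("sample") do |f|
  f.binmode
  f.write("id,name\n1,\x8F\x98\n".b)
  f.flush
  read_csv_lossy(f.path, "CP1250").each { |row| puts row.inspect }
end
```

This reads the whole file into memory, which is fine for small files. The replaced characters are lost for good, so it only makes sense when the correct encoding cannot be determined.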

Another way is to use Iconv's //IGNORE option on the target encoding:

 Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98") 

As a source of encoding suggestions, CharlockHolmes should be pretty good.

P.S. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.
