Character Encoding Problem in Rails v3 / Ruby 1.9.2

Sometimes I get this "invalid byte sequence in UTF-8" error when I read the contents of a file. Note: this only happens when there are special characters in the string. I also tried opening the file without "r:UTF-8", but I still get the same error.

open(file, "r:UTF-8").each_line { |line| puts line.split(",") } # line.split generates the error

File contents:

 # encoding: UTF-8
 290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
 290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
 290958,"NO","02","Svaland","",58.4000,8.0500,, # this works

This is a CSV file that I received from an outside source, and I'm trying to import it into my database. It did not come with "# encoding: UTF-8" at the top; I added that because I read somewhere that it would fix this problem, but it did not. :(

Environment:

  • Rails v3.0.3
  • ruby 1.9.2p0 (2010-08-18 version 29036) [x86_64-darwin10.5.0]
ruby ruby-on-rails character-encoding
Jan 15 2018-11-11T00:
2 answers

Ruby has the notion of an external encoding and an internal encoding for each file. This lets you work with a file's contents as UTF-8 in your code, even if the file is stored in a more esoteric encoding. If your default external encoding is UTF-8 (which it probably is if you are on Mac OS X), all of your file I/O will be in UTF-8 as well. You can check this using File.open('file').external_encoding. By opening the file and passing "r:UTF-8", you are only forcing the same external encoding that Ruby already uses by default.
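To make the check above concrete, here is a minimal sketch. The temporary file and its contents are made up purely so the snippet runs on its own; substitute your real file path.

```ruby
require 'tempfile'

# Hypothetical sample file, created only so this sketch is self-contained.
sample = Tempfile.new('encoding_demo')
sample.write("290958,\"NO\",\"02\",\"Svaland\"\n")
sample.close

# The encoding Ruby assumes for file contents when none is specified.
default_encoding = Encoding.default_external

# Passing "r:UTF-8" sets the external encoding explicitly.
explicit_encoding = File.open(sample.path, 'r:UTF-8') { |f| f.external_encoding }

puts "default external:  #{default_encoding}"
puts "explicit external: #{explicit_encoding}"
```

If both lines print the same encoding, then adding "r:UTF-8" to the open call changes nothing about how the bytes are interpreted.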

Most likely, your source document is not in UTF-8, and those non-ASCII characters do not map cleanly to UTF-8. If they did map cleanly, you would get the correct characters and no error; if they mapped only by coincidence, you would get incorrect characters and still no error. What you should do is try to determine the encoding of the source document and then have Ruby transcode the document as it reads it, for example:

 File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.split(",") }
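As a rough, self-contained illustration of the difference, the sketch below writes a made-up CSV row as Windows-1251 bytes, then reads it back both ways. Windows-1251 is only the encoding used in the example above; the real encoding of your file is unknown, and the sample row is hypothetical.

```ruby
require 'tempfile'

# Hypothetical sample data: a CSV row with a Cyrillic place name,
# saved as Windows-1251 bytes (not UTF-8).
row = "1,\"CZ\",\"\u041F\u0440\u0430\u0433\u0430\"\n"
sample = Tempfile.new('cp1251_demo')
sample.binmode
sample.write(row.encode('Windows-1251'))
sample.close

# Reading those bytes as UTF-8 gives a string that is not valid UTF-8,
# which is the condition behind the "invalid byte sequence" error.
misread = File.open(sample.path, 'r:UTF-8') { |f| f.read }
puts "valid when read as UTF-8? #{misread.valid_encoding?}"

# Naming the real external encoding plus an internal encoding of UTF-8
# makes Ruby transcode while reading, so string operations work.
fields = File.open(sample.path, 'r:windows-1251:utf-8') { |f| f.read }.strip.split(',')
puts "encoding after transcoding: #{fields.last.encoding}"
```

Calling valid_encoding? on a line before processing it is a cheap way to find out which rows of a file are affected.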

If you need help determining the source encoding, give this Python library a whirl. It is based on the fallback automatic charset detection that was in Seamonkey / Mozilla (and is possibly still in Firefox).

Jan 15 '11 at 1:13

If you want to change the encoding of the file, you can use the gem 'charlock_holmes':

https://github.com/brianmario/charlock_holmes

 require 'charlock_holmes/string'

 content = File.read('test2.txt')
 if !content.is_utf8?
   detection = CharlockHolmes::EncodingDetector.detect(content)
   utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
 end

You can then save your new contents in a temporary file and overwrite the original file.
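A minimal sketch of that save-and-overwrite step follows. To keep it dependency-free, it uses Ruby's built-in String#encode in place of charlock_holmes and assumes you already know the source encoding. The helper name rewrite_as_utf8, the ".utf8.tmp" suffix, and the sample data are all made up for illustration.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of charlock_holmes): rewrites a file as UTF-8,
# assuming the caller already knows the file's current encoding.
def rewrite_as_utf8(path, source_encoding)
  raw = File.open(path, 'rb') { |f| f.read }
  converted = raw.force_encoding(source_encoding).encode('UTF-8')

  # Write the converted content next to the original, then swap it in.
  tmp_path = "#{path}.utf8.tmp"
  File.open(tmp_path, 'wb') { |f| f.write(converted) }
  File.rename(tmp_path, path)
end

# Usage with made-up data: a file holding Windows-1251 bytes.
dir = Dir.mktmpdir
path = File.join(dir, 'test2.txt')
File.open(path, 'wb') { |f| f.write("\u041F\u0440\u0430\u0433\u0430\n".encode('Windows-1251')) }

rewrite_as_utf8(path, 'Windows-1251')
result = File.open(path, 'r:UTF-8') { |f| f.read }
puts "valid UTF-8 after rewrite? #{result.valid_encoding?}"
```

Writing to a sibling file and renaming it means the original is only replaced once the conversion has finished without raising.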
I hope this helps.

Feb 20 2018-12-12T00:


