Ruby conversion of string encoding from ISO-8859-1 to UTF-8 does not work

I am trying to convert a string from ISO-8859-1 to UTF-8, but I cannot get it to work. Here is an example of what I did in irb.

    irb(main):050:0> string = 'Norrlandsvägen'
    => "Norrlandsvägen"
    irb(main):051:0> string.force_encoding('iso-8859-1')
    => "Norrlandsv\xC3\xA4gen"
    irb(main):052:0> string = string.encode('utf-8')
    => "NorrlandsvÃ¤gen"

I am not sure why Norrlandsvägen in ISO-8859-1 gets converted to NorrlandsvÃ¤gen in UTF-8.

I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and every weird workaround I could think of, but nothing works. Can someone help me or point me in the right direction?

A Ruby newbie, still pulling her hair out like crazy, but grateful for all the answers here... :)

Background: I am writing a gem that downloads an XML file from some sites (which use the ISO-8859-1 encoding) and saves it to the repository, and I would like to convert it to UTF-8 first. But words like Norrlandsvägen keep tripping me up. Any help would be greatly appreciated!

[UPDATE]: I realized that running tests like this in the irb console might give me different behavior, so here is what I have in my actual code:

    def convert_encoding(string, originalEncoding)
      puts "#{string.encoding}" # ASCII-8BIT
      string.encode(originalEncoding)
      puts "#{string.encoding}" # still ASCII-8BIT
      string.encode!('utf-8')
    end

but the last line gives me the following error:

 Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8 

Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:

    irb(main):001:0> string = 'ä'
    => "ä"
    irb(main):002:0> string.force_encoding('iso-8859-1')
    => "\xC3\xA4"

I also tried assigning the result of string.encode(originalEncoding) to a new variable, but got an even stranger error:

    newString = string.encode(originalEncoding)
    puts "#{newString.encoding}" # can't even get to this line...
    newString.encode!('utf-8')

and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1

I am still completely lost in all this encoding business, but I am very grateful for all the answers and the help everyone has given me! Thanks a ton! :)

2 answers

You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.

    string = 'ä'
    string.encoding # => #<Encoding:UTF-8>
    string.length   # 1
    string.bytes    # [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. The string no longer contains ä. It contains two characters, Ã and ¤.

    string.force_encoding('iso-8859-1') # => "\xC3\xA4"
    string.length                       # 2
    string.bytes                        # [195, 164]

Then you translate this to UTF-8. Since this is a translation and not a reinterpretation, you keep the two characters, but they are now encoded in UTF-8:

    string = string.encode('utf-8') # => "Ã¤"
    string.length                   # 2
    string.bytes                    # [195, 131, 194, 164]

What you are missing is that you did not start with an ISO-8859-1 string, as you would get from your web service. You started with gibberish. Fortunately, this only affects your console tests; if you read the website's response using the correct input encoding, everything should work fine.

For your console test, here is a demonstration that if you start with a proper ISO-8859-1 string, it all works:

    string = 'Norrlandsvägen'.encode('iso-8859-1') # => "Norrlandsv\xE4gen"
    string = string.encode('utf-8')                # => "Norrlandsvägen"
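As a cross-check of the explanation above, the gibberish can usually be undone by reversing the two steps: transcode the garbled text back to ISO-8859-1 (which restores the original bytes), then relabel those bytes as UTF-8. This is only a sketch. The helper name `repair_mojibake` is made up for illustration, and it assumes every garbled character exists in ISO-8859-1; otherwise `encode` raises an error.

```ruby
# Hypothetical helper, not part of the original question or answer.
# Reverses "UTF-8 bytes mislabeled as ISO-8859-1, then encoded to UTF-8".
def repair_mojibake(garbled)
  # encode: turn each character back into its single ISO-8859-1 byte.
  # force_encoding: relabel those bytes as UTF-8 without changing them.
  garbled.encode('ISO-8859-1').force_encoding('UTF-8')
end

garbled = "Norrlandsv\u00C3\u00A4gen" # displays as "NorrlandsvÃ¤gen"
repaired = repair_mojibake(garbled)

puts repaired                 # Norrlandsvägen
puts repaired.valid_encoding? # true
```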

EDIT: For your specific problem, this should work:

    require 'net/https'

    uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
    options = {
      :use_ssl => uri.scheme == 'https',
      # Note: VERIFY_NONE disables certificate verification.
      :verify_mode => OpenSSL::SSL::VERIFY_NONE
    }
    response = Net::HTTP.start(uri.host, uri.port, options) do |https|
      https.request(Net::HTTP::Get.new(uri.path))
    end
    # The body arrives as raw bytes; label them as ISO-8859-1, then transcode.
    body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
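The same idea can be tried without a network connection. The sketch below assumes the raw data is a binary (ASCII-8BIT) string whose bytes are really ISO-8859-1, which matches what the question's `puts` output suggests. The helper name `to_utf8` and its default argument are my own, not from the question's code.

```ruby
# Hypothetical helper: convert raw bytes known to be in `source_encoding`
# into a UTF-8 string, leaving the caller's string untouched.
def to_utf8(raw, source_encoding = 'ISO-8859-1')
  labeled = raw.dup                       # work on a copy
  labeled.force_encoding(source_encoding) # relabel: same bytes, correct tag
  labeled.encode('UTF-8')                 # transcode: new bytes, same characters
end

# Simulated download: a binary string with "ä" stored as the single byte 0xE4.
raw = "Norrlandsv\xE4gen".b

converted = to_utf8(raw)
puts converted          # Norrlandsvägen
puts converted.encoding # UTF-8
```

Calling `encode('utf-8')` directly on the binary string fails because Ruby has no character mapping for bytes above 0x7F in ASCII-8BIT, which is where the `"\xC3" from ASCII-8BIT to UTF-8` error comes from.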

There is a difference between force_encoding and encode. The former only sets the encoding label of the string and leaves its bytes untouched, while the latter actually transcodes the contents of the string into the new encoding. Therefore, the following code causes your problem:

    string = "Norrlandsvägen"
    string.force_encoding('iso-8859-1')
    puts string.encode('utf-8') # NorrlandsvÃ¤gen

The following code, on the other hand, converts your content correctly:

    string = "Norrlandsvägen".encode('iso-8859-1')
    string.encode!('utf-8')

Here's an example running in irb:

    irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
    => "Norrlandsv\xE4gen"
    irb(main):024:0> string.encoding
    => #<Encoding:ISO-8859-1>
    irb(main):025:0> string.encode!('utf-8')
    => "Norrlandsvägen"
    irb(main):026:0> string.encoding
    => #<Encoding:UTF-8>
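The difference is easiest to see at the byte level. Here is a small sketch; the helper names `relabel` and `transcode` are invented for this example and simply wrap the two String methods.

```ruby
# Hypothetical wrappers, only here to name the two operations.
def relabel(str, encoding)
  str.dup.force_encoding(encoding) # same bytes, different encoding label
end

def transcode(str, encoding)
  str.encode(encoding)             # same characters, different bytes
end

original = "\u00E4"                # "ä" as a UTF-8 string

p original.bytes                          # [195, 164]
p relabel(original, 'ISO-8859-1').bytes   # [195, 164]
p relabel(original, 'ISO-8859-1').length  # 2
p transcode(original, 'ISO-8859-1').bytes # [228]
p transcode(original, 'ISO-8859-1').length # 1
```

Relabeling keeps both bytes and reads them as two ISO-8859-1 characters, while transcoding produces the single byte that ISO-8859-1 uses for ä.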