Why are two strings with the same bytes and encoding not identical in Ruby 1.9?

In Ruby 1.9.2, I found a way to make two strings that have the same bytes and the same encoding and compare as equal, yet have different lengths and return different characters from [].

Is this a bug? If it is not a bug, I would like to fully understand it. What information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?

The code below reproduces this behavior. Comments that start with #=> show the output I get from the script, and my opinion of that output is in parentheses.

    #!/usr/bin/ruby1.9
    # coding: utf-8
    string1 = "\xC2\xA2"         # A well-behaved string with one character (¢)
    string2 = "".concat(0xA2)    # A bizarre string very similar to string1.
    p    string1.bytes.to_a      #=> [194, 162]  (good)
    p    string2.bytes.to_a      #=> [194, 162]  (good)
    puts string1.encoding.name   #=> UTF-8  (good)
    puts string2.encoding.name   #=> UTF-8  (good)
    puts string1 == string2      #=> true   (good)
    puts string1.length          #=> 1      (good)
    puts string2.length          #=> 2      (weird!)
    p    string1[0]              #=> "¢"    (good)
    p    string2[0]              #=> "\xC2" (weird!)

I run Ubuntu and compiled Ruby from the source code. My version of Ruby:

 ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux] 
3 answers

This is a Ruby bug and is fixed by r29848.
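A quick way to see whether a given Ruby build still shows the discrepancy is to compare the two constructions property by property. This is only a sketch: the `checks` hash is a name made up for illustration, and `String.new(encoding: ...)` stands in for the question's `""` literal so the example also runs where string literals are frozen (an assumption about newer Rubies, not something from the original report).

```ruby
# coding: utf-8
# Sketch: on a Ruby that includes the fix, both constructions should agree.
string1 = "\xC2\xA2"
string2 = String.new(encoding: Encoding::UTF_8).concat(0xA2)

# Hypothetical helper hash: one boolean per property the question compared.
checks = {
  bytes:    string1.bytes.to_a == string2.bytes.to_a,
  encoding: string1.encoding == string2.encoding,
  length:   string1.length == string2.length,
  first:    string1[0] == string2[0]
}

checks.each { |name, ok| puts "#{name}: #{ok ? 'same' : 'DIFFERENT'}" }
```

On the affected 1.9.2p0 build from the question, the `length` and `first` lines would be the ones expected to report a difference.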


Matz mentioned this question via Twitter:

http://twitter.com/matz_translator/status/6597021662187520

http://twitter.com/matz_translator/status/6597055132733440

"It's hard to define as a mistake, but it's not acceptable to leave it as it is. I would rather fix this problem."


I think the problem is in the string encoding. Take a look at the article "Ruby 1.9's String" on James Gray's blog Shades of Gray, which covers Unicode encoding.
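To make the role of the encoding concrete, here is a minimal sketch of how the same two bytes are counted and indexed under two different encoding tags. It assumes a Ruby with `String#b` (2.0 or later), and the variable names are illustrative only.

```ruby
# coding: utf-8
# Sketch: identical bytes, different encoding tags, different "characters".
utf8   = "\xC2\xA2"  # tagged UTF-8: the two bytes form one character
binary = utf8.b      # copy of the same bytes, tagged as binary (ASCII-8BIT)

[utf8, binary].each do |s|
  puts "bytesize=#{s.bytesize} length=#{s.length} valid=#{s.valid_encoding?}"
  p s[0]             # indexing works in characters of the tagged encoding
end
```

Length and `[]` are computed per character of the tagged encoding, while `bytesize` and `bytes` ignore the tag. That explains why string3 in the listing below has length 2, but not why string2, which is tagged UTF-8, does.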


Extra odd behavior:

    # coding: utf-8
    string1 = "\xC2\xA2"
    string2 = "".concat(0xA2)
    string3 = 0xC2.chr + 0xA2.chr

    string1.bytes.to_a     # => [194, 162]
    string2.bytes.to_a     # => [194, 162]
    string3.bytes.to_a     # => [194, 162]

    string1.encoding.name  # => "UTF-8"
    string2.encoding.name  # => "UTF-8"
    string3.encoding.name  # => "ASCII-8BIT"

    string1 == string2     # => true
    string1 == string3     # => false
    string2 == string3     # => true

    string1.length         # => 1
    string2.length         # => 2
    string3.length         # => 2

    string1[0]             # => "¢"
    string2[0]             # => "\xC2"
    string3[0]             # => "\xC2"

    string3.unpack('C*')                         # => [194, 162]
    string4 = string3.unpack('C*').pack('C*')    # => "\xC2\xA2"
    string4.encoding.name                        # => "ASCII-8BIT"
    string4.force_encoding('UTF-8')              # => "¢"

    string3.force_encoding('UTF-8')              # => "¢"
    string3.encoding.name                        # => "UTF-8"
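The string3 results can be reproduced in isolation. The sketch below shows why a byte-for-byte identical binary string does not compare equal to the UTF-8 one, and two ways to end up with a proper one-character UTF-8 string. The variable names are illustrative, and the comments describe the behavior I would expect from a Ruby without the bug.

```ruby
# coding: utf-8
utf8   = "\xC2\xA2"           # UTF-8 string, one character
binary = 0xC2.chr + 0xA2.chr  # Integer#chr without an argument yields binary strings for bytes >= 0x80

puts utf8.bytes.to_a == binary.bytes.to_a  # the bytes match
puts utf8 == binary                        # but the encodings differ and the content is non-ASCII

# force_encoding only relabels the string; the bytes are left untouched.
retagged = binary.dup.force_encoding(Encoding::UTF_8)
puts utf8 == retagged

# Passing an encoding to Integer#chr treats the integer as a code point.
direct = 0xA2.chr(Encoding::UTF_8)
puts utf8 == direct
```

Building the character with `0xA2.chr(Encoding::UTF_8)` avoids the relabeling step, since the result is tagged UTF-8 from the start.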