In Ruby 1.9.2, I found a way to make two strings that have the same bytes, the same encoding, and compare as equal, yet have different lengths and return different characters from [].
Is this a bug? If it is not, I would like to fully understand it. What information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
The code below reproduces this behavior. The comments that start with #=> show the output I get from this script, and the words in parentheses give my opinion of that output.
    #!/usr/bin/ruby1.9
    # coding: utf-8
    string1 = "\xC2\xA2"         # A well-behaved string with one character (¢)
    string2 = "".concat(0xA2)    # A bizarre string very similar to string1.
    p string1.bytes.to_a         #=> [194, 162]  (good)
    p string2.bytes.to_a         #=> [194, 162]  (good)
    puts string1.encoding.name   #=> UTF-8  (good)
    puts string2.encoding.name   #=> UTF-8  (good)
    puts string1 == string2      #=> true   (good)
    puts string1.length          #=> 1      (good)
    puts string2.length          #=> 2      (weird!)
    p string1[0]                 #=> "¢"    (good)
    p string2[0]                 #=> "\xC2" (weird!)
I run Ubuntu and compiled Ruby from source. My version of Ruby:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
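To narrow this down, a side-by-side dump of everything each string reports about itself may help. The following is only a diagnostic sketch: `describe` is a hypothetical helper name of my own, and I use `"".dup` instead of a bare literal only so the script also runs on Rubies where string literals may be frozen. It does not explain the behavior; it just shows which of the public String methods agree and which do not.

```ruby
# coding: utf-8

# Hypothetical helper: collect what a string reports about itself
# through the public String API.
def describe(s)
  {
    :bytes      => s.bytes.to_a,       # raw bytes
    :encoding   => s.encoding.name,    # declared encoding
    :valid      => s.valid_encoding?,  # are the bytes valid in that encoding?
    :ascii_only => s.ascii_only?,      # does Ruby think it is 7-bit only?
    :length     => s.length,           # length in characters
    :chars      => s.chars.to_a,       # characters as Ruby splits them
    :codepoints => s.codepoints.to_a   # code points as Ruby decodes them
  }
end

string1 = "\xC2\xA2"            # the well-behaved string from the question
string2 = "".dup.concat(0xA2)   # built by appending code point 0xA2

report1 = describe(string1)
report2 = describe(string2)

# Print only the keys on which the two strings disagree.
report1.each_key do |key|
  next if report1[key] == report2[key]
  puts "#{key}: #{report1[key].inspect} vs #{report2[key].inspect}"
end
```

If nothing is printed, the two strings agree on every property checked, which is what I would expect from a correct implementation.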