String Encoding in Ruby

I recently started working with Ruby encoding and am confused by some behavior.

I am using Ruby 2.2.3p173 and see the following:

    __ENCODING__                 #=> #<Encoding:UTF-8>  (default source encoding in 2.2.3)
    "my_string".encoding         #=> #<Encoding:UTF-8>
    Object.to_s.encoding         #=> #<Encoding:US-ASCII>
    Object.new.to_s.encoding     #=> #<Encoding:ASCII-8BIT>

What is the reason for this discrepancy in encoding?

2 answers

Nice find!

The short answer: it is fairly arbitrary, and it depends on how Ruby internally constructs the returned strings.

There are a number of internal C functions that build empty strings or US-ASCII encoded literals: rb_usascii_str_new and the like. They are often used to build strings by appending small string fragments. Almost every to_s method does this:

    [].to_s.encoding         #<Encoding:US-ASCII>
    {}.to_s.encoding         #<Encoding:US-ASCII>
    $/.to_s.encoding         #<Encoding:US-ASCII>
    1.to_s.encoding          #<Encoding:US-ASCII>
    true.to_s.encoding       #<Encoding:US-ASCII>
    Object.to_s.encoding     #<Encoding:US-ASCII>

So why not Object.new.to_s? The key here is that Object#to_s is the fallback to_s method for every class, so to keep it generic yet still informative, it was written to show the value of the object's internal pointer. The easiest way to do that is with sprintf and the %p specifier. BUT whoever wrote rb_sprintf, Ruby's sprintf wrapper, got lazy and just set the encoding to NULL, which falls back to ASCII-8BIT. As a result, anything that returns a formatted string will usually have this encoding:

    Object.new.to_s.encoding              #<Encoding:ASCII-8BIT>
    nil.sort rescue $!.to_s.encoding      #<Encoding:ASCII-8BIT>
    [].each.to_s.encoding                 #<Encoding:ASCII-8BIT>
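The values above are from Ruby 2.2.3 and may differ on other versions, so it can help to check them on whatever interpreter you are running. A minimal sketch follows; the helper names `sample_strings` and `encoding_report` are made up for illustration and are not part of Ruby:

```ruby
# Hypothetical helper: the strings discussed above, keyed by the
# expression that produced them.
def sample_strings
  {
    "[].to_s"         => [].to_s,
    "1.to_s"          => 1.to_s,
    "true.to_s"       => true.to_s,
    "Object.to_s"     => Object.to_s,
    "Object.new.to_s" => Object.new.to_s,
    "[].each.to_s"    => [].each.to_s,
  }
end

# Hypothetical helper: map each label to the Encoding object that the
# running interpreter attached to the string. Nothing is hardcoded,
# because the labels vary between Ruby versions.
def encoding_report(samples)
  samples.each_with_object({}) do |(label, string), report|
    report[label] = string.encoding
  end
end

encoding_report(sample_strings).each do |label, encoding|
  puts format("%-16s %s", label, encoding)
end
```

Whatever labels get printed, every one of these strings contains only ASCII bytes, so the difference lies in the label rather than the content.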

As for the strings defined by the script, they get the default UTF-8 encoding, as you would expect.


Object is defined in C. If you try the following:

    String(123456).encoding     #=> #<Encoding:ASCII-8BIT>
    "123456".encoding           #=> #<Encoding:UTF-8>

I did not dig deep into the Ruby source code, but it looks like the encoding is hardcoded for to_s (for example via rb_usascii_str_new2).
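In practice the differing labels rarely matter for ASCII-only strings like these, because Ruby treats them as compatible. Here is a minimal sketch of that, assuming the content really is pure ASCII; `labeled_copies` and `relabel_utf8` are hypothetical helpers, not Ruby built-ins:

```ruby
# Hypothetical helper: the same bytes under three encoding labels.
def labeled_copies(text)
  {
    utf8:     text.dup.force_encoding(Encoding::UTF_8),
    us_ascii: text.dup.force_encoding(Encoding::US_ASCII),
    binary:   text.dup.force_encoding(Encoding::ASCII_8BIT),
  }
end

# Hypothetical helper: return a copy labeled UTF-8 without touching the
# bytes. force_encoding only changes the label, so check that the bytes
# are actually valid UTF-8 (always true for pure ASCII).
def relabel_utf8(string)
  copy = string.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless copy.valid_encoding?
  copy
end

copies = labeled_copies("plain ascii")

# ASCII-only strings compare equal regardless of their encoding label.
puts copies[:utf8] == copies[:binary]

# Encoding.compatible? returns an Encoding (truthy) when two strings can
# be combined, and nil when they cannot.
puts Encoding.compatible?(copies[:utf8], copies[:us_ascii]).inspect

# Relabel the result of Object#to_s if a UTF-8 label is needed.
puts relabel_utf8(Object.new.to_s).encoding
```

If a uniform label is needed, say before handing the string to a library that checks encodings strictly, relabeling with force_encoding is enough here because no bytes have to change.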

