Character encoding used by Ruby's Base64.encode64

Looking at the source of Ruby's Base64.encode64, I cannot tell which character encoding the string is converted to, if any, before the data is encoded in Base64. A Base64-encoded UTF-8 string will be very different from a Base64-encoded UTF-16 string. Does Ruby make any promises about this operation?

2 answers

The documentation has this to say:

encode64(bin)
Returns a Base64 encoded version of bin. This method complies with RFC 2045.

Section 6.8 of RFC 2045 states:

6.8. Base64 Content-Transfer-Encoding

The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. [...]

A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, "=", is used to signify a special processing function.)

So Base64 encodes bytes as ASCII. If those bytes happen to be a UTF-8 encoded string, the string is simply treated as its sequence of bytes, and those bytes are converted to Base64. For example, the UTF-8 string 'µ' consists of the bytes 0xc2 and 0xb5 (in that order), and its Base64 representation is "wrU=\n". If you start with the binary string "\xc2\xb5" (which happens to match the UTF-8 encoding of 'µ'), you get the same output, "wrU=\n".
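
The point above can be checked with a minimal sketch. The variable names are illustrative only, and "\u00B5" is used so that the string is UTF-8 regardless of how the source file is saved:

```ruby
require "base64"

utf8_text = "\u00B5"        # the character 'µ' as a UTF-8 string
raw_bytes = "\xC2\xB5".b    # the same two bytes, tagged as binary (ASCII-8BIT)

# Both strings hold the bytes 0xc2 0xb5; only their encoding tags differ.
p utf8_text.bytes.map { |b| format("0x%02x", b) }  # => ["0xc2", "0xb5"]
p raw_bytes.bytes.map { |b| format("0x%02x", b) }  # => ["0xc2", "0xb5"]

# encode64 only looks at the bytes, so the output is identical.
p Base64.encode64(utf8_text)  # => "wrU=\n"
p Base64.encode64(raw_bytes)  # => "wrU=\n"
```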

When you decode "wrU=\n", you get back the bytes "\xc2\xb5", and you need to know that those bytes are supposed to be UTF-8 encoded text rather than some arbitrary blob of bits. That is why separate content-type and character-set metadata usually accompanies Base64 data.

Similarly, a UTF-16 string is just a sequence of bytes, and those bytes are encoded the same way as any other byte string. This case is a little more complicated because of byte-order issues, but that is why we have content-type headers, charset headers, and specifications.
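
As a rough illustration of the byte-order issue, the sketch below encodes the same character in UTF-8, UTF-16LE, and UTF-16BE. The variable names are made up for this example; String#encode and String#force_encoding are standard Ruby:

```ruby
require "base64"

text    = "\u00B5"                  # 'µ' as UTF-8: bytes c2 b5
utf16le = text.encode("UTF-16LE")   # bytes b5 00
utf16be = text.encode("UTF-16BE")   # bytes 00 b5

# Same character, three different byte sequences, three different outputs.
p Base64.encode64(text)     # => "wrU=\n"
p Base64.encode64(utf16le)  # => "tQA=\n"
p Base64.encode64(utf16be)  # => "ALU=\n"

# Decoding only recovers bytes; the caller has to know they are UTF-16LE.
roundtrip = Base64.decode64("tQA=\n").force_encoding("UTF-16LE").encode("UTF-8")
p roundtrip == text         # => true
```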

The bottom line is that Base64 works with bytes, not characters. What those bytes represent (UTF-8 text, UTF-16 text, a PNG image, ...) is a separate matter. Base64 just converts a byte stream to a subset of US-ASCII and back to bytes; the format of those bytes must be specified separately.
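
To show that no text encoding is involved at all, here is a small sketch that round-trips bytes that are not valid UTF-8. The 8-byte PNG file signature is used as sample data, and the variable names are illustrative:

```ruby
require "base64"

# The 8-byte PNG file signature; the leading 0x89 makes it invalid as UTF-8.
png_signature = "\x89PNG\r\n\x1A\n".b

encoded = Base64.encode64(png_signature)
decoded = Base64.decode64(encoded)

p encoded.ascii_only?                   # => true, only the Base64 alphabet plus "\n"
p decoded == png_signature              # => true, the bytes survive unchanged
p decoded.encoding == Encoding::BINARY  # => true, decode64 returns untagged bytes
```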


I dug into the source a little, and the results may be of interest even if they are not entirely relevant. The encode64 method looks like this:

    def encode64(bin)
      [bin].pack("m")
    end

Then, if you look at the implementation of Array#pack:

    static VALUE
    pack_pack(VALUE ary, VALUE fmt)
    {
        /*...*/
        int enc_info = 1;    /* 0 - BINARY, 1 - US-ASCII, 2 - UTF-8 */

and keep an eye on enc_info, you will see that the 'm' format leaves enc_info alone, so the packed string comes out as US-ASCII, and therefore encode64 produces US-ASCII output, as expected.
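
This can be observed from Ruby without reading the C source. The following is only a sketch with made-up sample inputs; it checks the encoding tag of the result for inputs with different encodings:

```ruby
require "base64"

utf8_input   = "h\u00E9llo"   # UTF-8 text containing a non-ASCII character
binary_input = "\xFF\x00".b   # arbitrary binary bytes

# Whatever the encoding of the input, the result is tagged US-ASCII.
p Base64.encode64(utf8_input).encoding    # => #<Encoding:US-ASCII>
p Base64.encode64(binary_input).encoding  # => #<Encoding:US-ASCII>

# encode64 gives the same result as calling pack("m") directly.
p [utf8_input].pack("m") == Base64.encode64(utf8_input)  # => true
```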


Example of encoding and decoding a UTF-8 string with Base64:

    text = "intérnalionálização"
    => "intérnalionálização"
    text.encoding
    => #<Encoding:UTF-8>
    encoded = Base64.encode64(text)
    => "aW50w6lybmFsaW9uw6FsaXphw6fDo28=\n"
    encoded.encoding
    => #<Encoding:US-ASCII>
    decoded = Base64.decode64(encoded)
    => "int\xC3\xA9rnalion\xC3\xA1liza\xC3\xA7\xC3\xA3o"
    decoded.encoding
    => #<Encoding:ASCII-8BIT>
    decoded = decoded.force_encoding('UTF-8')
    => "intérnalionálização"
    decoded.encoding
    => #<Encoding:UTF-8>