Ruby base64 encode / decode / unpack ('m') problems

I have run into some strange Ruby encoding behavior:

 ruby-1.9.2-p180 :618 > s = "a8dnsjg8aiw8jq".ljust(16,'=')
  => "a8dnsjg8aiw8jq=="
 ruby-1.9.2-p180 :619 > s.size
  => 16
 ruby-1.9.2-p180 :620 > s.unpack('m0')
 ArgumentError: invalid base64
         from (irb):631:in `unpack'
 ruby-1.9.2-p180 :621 > s.unpack('m')
  => ["k\xC7g\xB28<j,<\x8E"]
 ruby-1.9.2-p180 :622 > s.unpack('m').first.size
  => 10
 ruby-1.9.2-p180 :623 > s.unpack('m').pack('m')
  => "a8dnsjg8aiw8jg==\n"
 ruby-1.9.2-p180 :624 > s.unpack('m').pack('m') == s
  => false

Any idea why this is not symmetrical? And why does 'm0' (decode64_strict) not work at all? The input string is padded to a multiple of 4 characters from the base64 alphabet. It contains 14 x 6 bits = 84 bits, which is 10 1/2 8-bit bytes, i.e. 11 bytes. But the decoded string seems to drop the last nybble?

Am I missing something obvious, or is this a bug? Is there a workaround? Cf. http://www.ietf.org/rfc/rfc4648.txt

+4
4 answers

There is no symmetry because Base64 is not a one-to-one mapping for padded strings. Let's start with the actual decoded content. If you look at your decoded string in hexadecimal (using, for example, unpack('H*') on the decoded string), it looks like this:

 6B C7 67 | B2 38 3C | 6A 2C 3C | 8E 

I added borders for each input block to the Base64 algorithm: it takes 3 octets of input and returns 4 characters of output. Thus, our last block contains only one input octet, so the result will be 4 characters, which ends in "==" in accordance with the standard.
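The grouping above can be reproduced with a short sketch. It assumes the lenient 'm' directive to decode the question's string, then splits the decoded bytes into the 3-octet input blocks the encoder works on; the variable names are only illustrative.

```ruby
s = "a8dnsjg8aiw8jq=="

# Lenient decode, as in the question.
decoded = s.unpack('m').first

# The whole decoded string as upper-case hex.
hex = decoded.unpack('H*').first.upcase

# Split into the 3-octet input blocks of the Base64 algorithm.
blocks = decoded.bytes.each_slice(3).map do |block|
  block.map { |byte| format('%02X', byte) }.join(' ')
end

puts blocks.join(' | ')
```

The last element of `blocks` holds a single octet, which is why the encoded form has to end in "==".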

Let's see what the canonical encoding of this last block will be. In the binary representation, 8E is 10001110 . The RFC tells us to fill in the missing bits with zeros until we reach the required 24 bits:

 100011 100000 000000 000000 

I made groups of 6 bits, because that is what we need to look up the corresponding characters in the Base64 alphabet. The first group (100011) is 35 in decimal and is therefore j in the Base64 alphabet. The second (100000) is 32 in decimal and is therefore g. The two remaining characters must be the padding "==" in accordance with the rules. So the canonical encoding is

 jg== 
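As a rough check of that arithmetic, the sketch below encodes the single octet 8E by hand and compares the result with what pack('m0') produces. `ALPHABET` and the other names are introduced here purely for illustration; the alphabet order is the one from RFC 4648.

```ruby
# Base64 alphabet from RFC 4648, indexed by 6-bit value.
ALPHABET = [*'A'..'Z', *'a'..'z', *'0'..'9', '+', '/'].join

octet = 0x8E

# 8 data bits, zero-filled on the right up to two full 6-bit groups.
bits   = octet.to_s(2).rjust(8, '0').ljust(12, '0')
groups = bits.scan(/.{6}/)

# Look up each group and append the two padding characters.
by_hand = groups.map { |g| ALPHABET[g.to_i(2)] }.join + '=='

# Ruby's own encoder for comparison ('m0' means no trailing newline).
by_pack = [[octet].pack('C')].pack('m0')

puts by_hand
puts by_pack
```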

If you now look at jq==, in binary it is

 100011 101010 000000 000000 

So the difference is in the second group. But since we already know that we are only interested in the first 8 bits (the "==" tells us so: we will extract only one decoded octet from these four characters), we actually only care about the first two bits of the second group, because the 6 bits of group 1 and the first 2 bits of group 2 form our decoded octet. 100011 10 together again form our initial byte value 8E. The remaining 16 bits are irrelevant to us and can be dropped.
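The same reasoning can be written out with bit operations. This is a minimal sketch, and `decode_final_quantum` is a hypothetical helper, not part of Ruby: it returns the decoded octet together with the four leftover bits of the second character.

```ruby
# Base64 alphabet from RFC 4648, indexed by 6-bit value.
ALPHABET = [*'A'..'Z', *'a'..'z', *'0'..'9', '+', '/'].join

# Hypothetical helper: decode a final quantum of the form "xx==".
def decode_final_quantum(quad)
  first  = ALPHABET.index(quad[0])
  second = ALPHABET.index(quad[1])
  octet    = (first << 2) | (second >> 4) # 6 bits + top 2 bits
  leftover = second & 0b1111              # the 4 bits that get dropped
  [octet, leftover]
end

canonical     = decode_final_quantum('jg==')
non_canonical = decode_final_quantum('jq==')

puts format('%02X leftover=%04b', *canonical)
puts format('%02X leftover=%04b', *non_canonical)
```

Both inputs yield the same octet; only the leftover bits differ.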

This also explains why the concept of "strict" Base64 decoding makes sense: lenient decoding discards any garbage at the end, while strict decoding checks that the leftover bits in the final group of 6 are zero. That is why your non-canonical encoding is rejected by strict decoding.
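A small sketch of the difference, using only the 'm' (lenient) and 'm0' (strict) unpack directives from the question on the final four characters:

```ruby
canonical     = 'jg=='
non_canonical = 'jq=='

# Lenient decoding accepts both and yields the same single byte.
lax_a = canonical.unpack('m').first
lax_b = non_canonical.unpack('m').first

# Strict decoding accepts the canonical form ...
strict_ok = canonical.unpack('m0').first

# ... and raises ArgumentError for the non-canonical one.
strict_error =
  begin
    non_canonical.unpack('m0')
    nil
  rescue ArgumentError => e
    e
  end

puts lax_a.bytes.inspect
puts lax_b.bytes.inspect
puts strict_error.class
```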

+3

The RFC you linked clearly states that a final quantum of the form xx== corresponds to one octet of the input sequence. You cannot make 16 bits of information (two arbitrary octets) out of 12, so rounding up is not valid here.

Your string is rejected in strict mode because jq== cannot be the result of a correct Base64 encoding process. An input sequence whose length is not a multiple of 3 is zero-padded, and your string has non-zero bits where they cannot appear:

    j      q      =      =
 |100011|101010|000000|000000|
 |10001110|10100000|00000000|
           ^^^
+2

From section 3.5, "Canonical Encoding", of RFC 4648:

For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders ...

and

In some environments, the alteration is critical and therefore decoders MAY choose to reject an encoding if the pad bits have not been set to zero.

Your last four characters (jq==) decode to these binary values:

 100011 101010
 ------ --****

The bits marked with dashes are used to form the last byte (hex 8E). The bits marked with asterisks must be zero (which would be encoded as jg==, not jq==).

Unpacking with m forgives pad bits that should be zero but are not. Unpacking with m0 is not as forgiving, as it is allowed to be (see the "MAY" in the quoted part of the RFC). Packing the unpacked result is not symmetric because your encoded value is non-canonical, whereas the pack method produces the canonical encoding (pad bits are zero).
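One way to use this observation is a round-trip test: decode leniently, re-encode, and compare. The sketch below assumes a single-line input without newlines, and `canonical_base64?` is a hypothetical helper name, not an existing Ruby method.

```ruby
# Hypothetical helper: true if the string is exactly what pack('m0')
# would produce for the bytes it decodes to.
def canonical_base64?(str)
  str.unpack('m').pack('m0') == str
end

from_question = "a8dnsjg8aiw8jq=="
repacked      = from_question.unpack('m').pack('m0')

puts canonical_base64?(from_question) # non-canonical pad bits
puts canonical_base64?(repacked)      # output of pack is canonical
puts repacked
```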

+2

Thanks for the good explanations of b64. I upvoted all of you and accepted @emboss's answer.

However, this is not the answer I was looking for. A better way to phrase the question would be:

How do I pad a string of b64 characters so that it can be decoded to zero-padded 8-bit bytes by unpack('m0')?

From your explanations, now I see that this will work for our purposes:

 ruby-1.9.2-p180 :858 > s = "a8dnsjg8aiw8jq".ljust(16,'A')
  => "a8dnsjg8aiw8jqAA"
 ruby-1.9.2-p180 :859 > s.unpack('m0')
  => ["k\xC7g\xB28<j,<\x8E\xA0\x00"]
 ruby-1.9.2-p180 :861 > s.unpack('m0').pack('m0') == s
  => true

The only problem is that the length of the decoded string is not preserved, but we can work around this.
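To make the length difference concrete, here is a small sketch comparing the 'A'-padded string with the original '='-padded one. It only measures the sizes and does not attempt the workaround itself.

```ruby
raw = "a8dnsjg8aiw8jq"

padded_with_a  = raw.ljust(16, 'A')
padded_with_eq = raw.ljust(16, '=')

# 'A' is the zero value of the alphabet, so strict decoding succeeds
# and the round trip is symmetric ...
decoded  = padded_with_a.unpack('m0').first
repacked = [decoded].pack('m0')

# ... but the result is two bytes longer than the lenient decode
# of the '='-padded string.
lenient = padded_with_eq.unpack('m').first

puts decoded.bytesize
puts lenient.bytesize
puts repacked == padded_with_a
```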

0
