What is the reason for ZigZag encoding in protocol buffers and Avro?

ZigZag encoding seems to add a lot of overhead when writing / reading numbers. In fact, I was stunned to see that it does not just write int / long values as they are, but does a lot of extra scrambling. There is even a loop involved: https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90

I can't seem to find in the Protocol Buffers docs, in the Avro docs, or by reasoning it out myself, what the advantage of scrambling the numbers like that is. Why is it better to have positive and negative numbers alternate after encoding?

Why aren't they just written in little-endian, big-endian, or network order, which would only require reading them off memory and possibly reversing the byte order? What do we buy by paying with performance?
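For context, the "scrambling" in the linked writeLong boils down to roughly the following (a simplified sketch of the idea, not the actual Avro source; the class and method names here are my own):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ZigZagWriteSketch {
    // Simplified sketch of what the linked writeLong does: a zigzag transform
    // followed by a variable-length (varint) write loop.
    static void writeLong(long n, OutputStream out) throws IOException {
        n = (n << 1) ^ (n >> 63);                  // zigzag: small magnitudes stay small
        while ((n & ~0x7FL) != 0) {                // more than 7 significant bits left?
            out.write((int) ((n & 0x7F) | 0x80));  // emit 7 bits, set the "continue" bit
            n >>>= 7;
        }
        out.write((int) n);                        // last byte: high bit is 0
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLong(-3, buf);                        // zigzag(-3) = 5, fits in one byte
        System.out.println(buf.size());            // prints 1
    }
}
```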

1 answer

This is a variable-length encoding that stores 7 bits of the value per byte. Every byte except the last has its high bit set to 1; the last byte has it set to 0. That is how the decoder can tell how many bytes were used to encode the value. The byte order is fixed by the encoding itself (least-significant 7-bit group first), regardless of machine architecture.
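A minimal decoding sketch, assuming the same scheme (not the actual Avro or protobuf reader API; names are illustrative), shows how the high bit drives the loop and how the zigzag step is undone at the end:

```java
import java.io.IOException;
import java.io.InputStream;

public class VarintReadSketch {
    // Reads one zigzag/varint-encoded long: keep consuming bytes while the high
    // bit is 1; the byte whose high bit is 0 is the last one.
    static long readLong(InputStream in) throws IOException {
        long result = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new IOException("unexpected end of stream");
            result |= (long) (b & 0x7F) << shift;  // low 7 bits carry the data
            shift += 7;
        } while ((b & 0x80) != 0);                 // high bit set => more bytes follow
        return (result >>> 1) ^ -(result & 1);     // undo the zigzag transform
    }
}
```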

It is an encoding trick that lets a value be written in as few bytes as possible. So an 8-byte long with a value between -64 and 63 takes only one byte. And that is the common case: the full range offered by a long is very rarely needed in practice.
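A tiny demo of the zigzag mapping (names are illustrative) shows why -64 through 63 fit in one byte: they map to 0..127, which fits in the 7 data bits of a single varint byte:

```java
public class ZigZagMappingDemo {
    // ZigZag interleaves positive and negative values: 0->0, -1->1, 1->2, -2->3, ...
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Bytes the varint step will need afterwards (7 data bits per byte).
    static int varintSize(long z) {
        int bytes = 1;
        while ((z & ~0x7FL) != 0) { z >>>= 7; bytes++; }
        return bytes;
    }

    public static void main(String[] args) {
        for (long n : new long[] { 0, -1, 1, -2, 63, -64, 64, -65 }) {
            long z = zigzag(n);
            System.out.println(n + " -> " + z + " (" + varintSize(z) + " byte(s))");
        }
        // 63 -> 126 and -64 -> 127 still fit in one byte; 64 and -65 need two.
    }
}
```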

Getting gzip-style dense data packing without the overhead was the design goal. The same scheme is used in the .NET Framework. The processor overhead required to encode / decode it is insignificant: it is already much lower than that of a compression scheme, and it is a very small fraction of the I/O cost.
