Efficient serialization of a struct to disk

I was tasked with replacing C++ code with Go, and I am completely new to Go. I use gob to encode hundreds of key/value entries to disk pages, but gob encoding carries too much overhead that I don't need.

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
    )

    type Entry struct {
        Key string
        Val string
    }

    func main() {
        var buf bytes.Buffer
        enc := gob.NewEncoder(&buf)
        e := Entry{"k1", "v1"}
        enc.Encode(e)
        fmt.Println(buf.Bytes())
    }

This produces a lot of overhead that I don't need:

 [35 255 129 3 1 1 5 69 110 116 114 121 1 255 130 0 1 2 1 3 75 101 121 1 12 0 1 3 86 97 108 1 12 0 0 0 11 255 130 1 2 107 49 1 2 118 49 0] 

I want each string serialized as a 4-byte length followed by the raw bytes:

 [0 0 0 2 107 49 0 0 0 2 118 49] 

I save millions of records, so the extra encoding overhead increases the file size by roughly 10x.

How can I serialize it compactly like this without coding it by hand?

3 answers

Use protobuf to efficiently encode your data.

https://github.com/golang/protobuf

Your main function will look like this:

    package main

    import (
        "fmt"
        "log"

        "github.com/golang/protobuf/proto"
    )

    func main() {
        e := &Entry{
            Key: proto.String("k1"),
            Val: proto.String("v1"),
        }
        data, err := proto.Marshal(e)
        if err != nil {
            log.Fatal("marshaling error: ", err)
        }
        fmt.Println(data)
    }

You create an example.proto file as follows:

    package main;

    message Entry {
        required string Key = 1;
        required string Val = 2;
    }

You generate the go code from the proto file by running:

 $ protoc --go_out=. *.proto 

You can check the generated file if you want.

You can run it and see the result:

    $ go run *.go
    [10 2 107 49 18 2 118 49]
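
For completeness, reading the record back is symmetric. A minimal sketch, assuming the Entry type that protoc generates from the example.proto above:

    package main

    import (
        "fmt"
        "log"

        "github.com/golang/protobuf/proto"
    )

    func main() {
        // The wire bytes produced by the encoding example above.
        in := []byte{10, 2, 107, 49, 18, 2, 118, 49}
        e := &Entry{}
        if err := proto.Unmarshal(in, e); err != nil {
            log.Fatal("unmarshaling error: ", err)
        }
        // Get* accessors are generated for proto2 fields.
        fmt.Println(e.GetKey(), e.GetVal()) // k1 v1
    }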

If you zip a file called a.txt containing the text "hello" (which is 5 characters), the zipped file will be around 115 bytes. Does this mean the zip format is inefficient for compressing text files? Of course not: there is overhead. If the file contained "hello" a hundred times (500 bytes), the zipped file would be around 120 bytes! 1x "hello" => 115 bytes, 100x "hello" => 120 bytes! We added 495 bytes, yet the compressed size grew by only 5 bytes.

Something similar happens with encoding/gob:

The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.

When you serialize the first value of a type, the type definition also has to be included/transmitted, so the decoder can correctly interpret and decode the stream:

A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types.
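
A practical consequence, shown with a small sketch of my own (not part of the quoted docs): if you create a fresh encoder for every value, the type definition is re-sent each time, so the cost is never amortized:

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
    )

    type Entry struct {
        Key string
        Val string
    }

    func main() {
        var buf bytes.Buffer
        for i := 0; i < 3; i++ {
            // A new encoder per value re-transmits the Entry type definition.
            gob.NewEncoder(&buf).Encode(Entry{"k1", "v1"})
            fmt.Println(buf.Len()) // 48, 96, 144: no amortization
        }
    }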

Back to your example:

    var buf bytes.Buffer
    enc := gob.NewEncoder(&buf)
    e := Entry{"k1", "v1"}
    enc.Encode(e)
    fmt.Println(buf.Len())

It prints:

 48 

Now let's encode more values of the same type:

    enc.Encode(e)
    fmt.Println(buf.Len())
    enc.Encode(e)
    fmt.Println(buf.Len())

Now the output is:

    60
    72

Try it on the Go Playground.

Analysis of the results:

Additional values of the same Entry type cost only 12 bytes each; the first one costs 48 bytes because the type definition is also included (that's ~36 bytes), but that is a one-time overhead.

So basically you are transmitting 2 strings, "k1" and "v1", which are 4 bytes of data, and the lengths of the strings also have to be included; at 4 bytes per length (the size of int on 32-bit architectures) that gives you the 12 bytes, which is the "minimum". (Yes, you could use a smaller type for the length, but that would have its limitations. Variable-length encoding would be the best choice for small numbers; see encoding/binary.)
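
For illustration, here is a minimal sketch of that variable-length encoding using encoding/binary (my example, not part of this answer's measurements); small lengths fit in a single byte:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        buf := make([]byte, binary.MaxVarintLen64)
        // A small length such as 2 fits in one byte.
        n := binary.PutUvarint(buf, uint64(len("k1")))
        fmt.Println(buf[:n]) // [2]
        // Larger values use more bytes only as needed.
        n = binary.PutUvarint(buf, 300)
        fmt.Println(buf[:n]) // [172 2]
    }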

All in all, encoding/gob does a pretty good job for your needs. Don't be misled by first impressions.

If those 12 bytes per Entry are still too much for you, you can always wrap the stream in compress/flate or compress/gzip to reduce the size further (in exchange for slower encoding/decoding and a slightly higher memory requirement for the process).

Demonstration:

Let me test 3 solutions:

  • Using bare output (no compression)
  • Using compress/flate to compress the encoding/gob output
  • Using compress/gzip to compress the encoding/gob output

We will write a thousand entries, varying the keys and values of each: "k000", "v000", "k001", "v001", and so on. This means the uncompressed size of an Entry is 4 bytes + 4 bytes + 4 bytes + 4 bytes = 16 bytes (two 4-byte strings plus two 4-byte lengths).

The code is as follows:

    names := []string{"Naked", "flate", "gzip"}
    for _, name := range names {
        buf := &bytes.Buffer{}
        var out io.Writer
        switch name {
        case "Naked":
            out = buf
        case "flate":
            out, _ = flate.NewWriter(buf, flate.DefaultCompression)
        case "gzip":
            out = gzip.NewWriter(buf)
        }
        enc := gob.NewEncoder(out)
        e := Entry{}
        for i := 0; i < 1000; i++ {
            e.Key = fmt.Sprintf("k%3d", i)
            e.Val = fmt.Sprintf("v%3d", i)
            enc.Encode(e)
        }
        // Close the compressors to flush any buffered data.
        if c, ok := out.(io.Closer); ok {
            c.Close()
        }
        fmt.Printf("[%5s] Length: %5d, average: %5.2f / Entry\n",
            name, buf.Len(), float64(buf.Len())/1000)
    }

Output:

    [Naked] Length: 16036, average: 16.04 / Entry
    [flate] Length:  4123, average:  4.12 / Entry
    [ gzip] Length:  4141, average:  4.14 / Entry

Try it on the Go Playground.

As you can see, the bare output is 16.04 bytes/Entry, only slightly above the estimated size (due to the tiny one-time type-definition overhead in the stream, discussed above).

When you use flate or gzip to compress the output, you can shrink it to about 4.13 bytes/Entry, roughly ~26% of the uncompressed size, which I'm sure will satisfy you. (Note that with "real" data the compression ratio would likely be worse, since the keys and values I used in the test are very similar and therefore compress extremely well; even so, the ratio should be around 50% with real data.)
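
Reading such a compressed stream back is just the reverse wiring. A sketch, assuming buf holds the bytes written in the "flate" case above:

    // Decompress, then decode gobs until the stream is exhausted.
    r := flate.NewReader(bytes.NewReader(buf.Bytes()))
    dec := gob.NewDecoder(r)
    var e Entry
    for {
        if err := dec.Decode(&e); err != nil {
            break // io.EOF marks the end of the stream
        }
        // use e here
    }
    r.Close()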


The "manual coding" you are so afraid of is trivially done in Go using the standard encoding/binary package .

It seems you want to save the string lengths as 32-bit integers in big-endian byte order, so you can do exactly that in Go:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
    )

    // encode writes s as a 4-byte big-endian length followed by the raw bytes.
    func encode(w io.Writer, s string) (n int, err error) {
        var hdr [4]byte
        binary.BigEndian.PutUint32(hdr[:], uint32(len(s)))
        n, err = w.Write(hdr[:])
        if err != nil {
            return
        }
        n2, err := io.WriteString(w, s)
        n += n2
        return
    }

    func main() {
        var buf bytes.Buffer
        for _, s := range []string{"ab", "cd", "de"} {
            if _, err := encode(&buf, s); err != nil {
                panic(err)
            }
        }
        fmt.Printf("%v\n", buf.Bytes())
    }

Link to the Go Playground.

Note that in this example I write to a bytes.Buffer, but that is only for demonstration purposes: since encode() writes to an io.Writer, you can pass it an open file, a network socket, or anything else that implements that interface.
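
A matching decoder is just as short. A sketch of my own, assuming the same 4-byte big-endian length prefix:

    // decode reads one length-prefixed string written by encode.
    func decode(r io.Reader) (string, error) {
        var hdr [4]byte
        if _, err := io.ReadFull(r, hdr[:]); err != nil {
            return "", err
        }
        b := make([]byte, binary.BigEndian.Uint32(hdr[:]))
        if _, err := io.ReadFull(r, b); err != nil {
            return "", err
        }
        return string(b), nil
    }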

