Improving Protocol Buffers performance

I am writing an application that needs to quickly deserialize millions of messages from a single file.

What the application does is essentially read one message from the file, do some work, and then throw the message away. Each message has ~100 fields (not all of them are always read, but I need them all, because the user of the application can choose which fields to work with).

At the moment, the application consists of a loop that in each iteration just reads one message by calling readDelimitedFrom().

Is there a way to optimize for this case (splitting into multiple files, etc.)? In addition, because of the number of messages and the size of each message, I currently need to gzip the file (which is quite effective at reducing the size, since the field values are quite repetitive), although this reduces performance.

1 answer

If CPU time is your bottleneck (which is unlikely if you are loading directly from a hard drive with a cold cache, but could be the case in other scenarios), then here are some ways you can improve throughput:

  • If possible, use C++ rather than Java, and reuse the same message object for each iteration of the loop. This reduces the time spent on memory management, as the same memory will be reused each time.

  • Instead of using readDelimitedFrom(), create a single CodedInputStream and use it to read multiple messages:

     // Do this once:
     CodedInputStream cis = CodedInputStream.newInstance(input);

     // Then read each message like so:
     int limit = cis.pushLimit(cis.readRawVarint32());
     builder.mergeFrom(cis);
     cis.popLimit(limit);
     cis.resetSizeCounter();

    (A similar approach works in C++.)

  • Use Snappy or LZ4 compression rather than gzip. These algorithms still get reasonable compression ratios but are optimized for speed. (LZ4 is probably better, though Snappy was developed at Google with Protobufs in mind, so you may want to test both on your data set.)

  • Consider using Cap'n Proto rather than Protocol Buffers. (Edit: an earlier version of this answer said there was no Java version yet, but there is now capnproto-java, as well as implementations in many other languages.) In the languages it supports, it has been shown to be faster. (Disclosure: I am the author of Cap'n Proto. I am also the author of Protocol Buffers v2, which is the version Google released as open source.)
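
As a rough illustration of the loop structure behind the first two points, the sketch below uses only the Java standard library, not the protobuf API. It assumes the file uses the framing that writeDelimitedTo() produces and readDelimitedFrom() consumes (a varint length prefix followed by the message bytes), and it reuses one buffered stream and one byte buffer across all messages. The names DelimitedFramingSketch, FrameHandler, readAll, gzipFrames, and readBack are hypothetical, and the payloads are plain strings standing in for serialized messages.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DelimitedFramingSketch {

    /** Hypothetical callback: receives the shared buffer and the valid length. */
    interface FrameHandler {
        void handle(byte[] buf, int length);
    }

    /** Writes a base-128 varint length prefix followed by the payload bytes. */
    static void writeFrame(OutputStream out, byte[] payload) throws IOException {
        int n = payload.length;
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80); // low 7 bits, continuation bit set
            n >>>= 7;
        }
        out.write(n);
        out.write(payload);
    }

    /** Reads a varint length prefix; returns -1 on a clean end of stream. */
    static int readLength(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = in.read();
            if (b < 0) {
                if (shift == 0) return -1;
                throw new EOFException("truncated length prefix");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                if (result < 0) throw new IOException("length too large");
                return result;
            }
        }
        throw new IOException("malformed length prefix");
    }

    /** Reads every frame, reusing one buffer; returns the number of frames. */
    static int readAll(InputStream raw, FrameHandler handler) throws IOException {
        InputStream in = new BufferedInputStream(raw, 1 << 16);
        byte[] buf = new byte[256]; // grown only when a frame does not fit
        int count = 0;
        int len;
        while ((len = readLength(in)) >= 0) {
            if (len > buf.length) {
                buf = new byte[Math.max(len, buf.length * 2)];
            }
            int off = 0;
            while (off < len) {
                int r = in.read(buf, off, len - off);
                if (r < 0) throw new EOFException("truncated frame");
                off += r;
            }
            handler.handle(buf, len);
            count++;
        }
        return count;
    }

    /** Builds a gzip-compressed, length-delimited stream in memory. */
    static byte[] gzipFrames(List<String> messages) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bytes)) {
            for (String m : messages) {
                writeFrame(out, m.getBytes(StandardCharsets.UTF_8));
            }
        }
        return bytes.toByteArray();
    }

    /** Decompresses and decodes the frames back into strings. */
    static List<String> readBack(byte[] gz) throws IOException {
        List<String> out = new ArrayList<>();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            readAll(in, (buf, len) -> out.add(new String(buf, 0, len, StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzipFrames(Arrays.asList("first", "second", "third"));
        System.out.println(readBack(gz));
    }
}
```

In a real reader, the handler would hand the buffer and length to a protobuf parse call on a reused builder instead of building a string. Switching from gzip to a Snappy or LZ4 stream should only change the line that opens the input, which makes it fairly easy to time each codec on your own data.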
