After several days of experimenting with Protocol Buffers, I tried to compress the files. With Python this is fairly easy to do and does not require any games with threads.
Since most of our code is written in C++, I would like to compress/decompress the files in that same language. I tried the Boost gzip library (Boost.Iostreams) but couldn't get it working (the output is not compressed):
int writeEventCollection(HEP::MyProtoBufClass* protobuf, std::string filename,
                         unsigned int compressionLevel) {
    // (std and boost::iostreams namespaces assumed)
    ofstream file(filename.c_str(), ios_base::out | ios_base::binary);
    filtering_streambuf<output> out;
    out.push(gzip_compressor(compressionLevel));
    out.push(file);
    // This serializes straight to 'file', so the gzip filter never sees the data
    if (!protobuf->SerializeToOstream(&file)) {
        cerr << "Failed to write event collection." << endl;
        return -1;
    }
    return 0;
}
I searched for examples using GzipOutputStream and GzipInputStream with protocol buffers, but could not find any.
As you have probably noticed, I am at best a newbie with streams and would really appreciate a fully working example like the one in http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html (I have my address book; how do I save it in a gzipped file?).
Thanks in advance.
EDIT: working examples.
Example 1, following the answer here on Stack Overflow:
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/device/file.hpp>

int writeEventCollection(shared_ptr<HEP::EventCollection> eCollection,
                         std::string filename, unsigned int compressionLevel) {
    filtering_ostream out;
    out.push(gzip_compressor(compressionLevel));
    out.push(file_sink(filename, ios_base::out | ios_base::binary));
    if (!eCollection->SerializeToOstream(&out)) {
        cerr << "Failed to write event collection." << endl;
        return -1;
    }
    return 0;
}
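Since the question is about compressing and decompressing, here is what I believe the read side looks like with Boost.Iostreams: a gzip_decompressor in front of a file_source, with ParseFromIstream reading from the filtering stream. The readEventCollection name is just illustrative and I have not benchmarked this path:

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/device/file.hpp>

// Sketch: read back a file written by writeEventCollection
int readEventCollection(shared_ptr<HEP::EventCollection> eCollection,
                        std::string filename) {
    filtering_istream in;
    in.push(gzip_decompressor());   // reverse of gzip_compressor
    in.push(file_source(filename, ios_base::in | ios_base::binary));
    // protobuf can parse from any std::istream, including a filtering_istream
    if (!eCollection->ParseFromIstream(&in)) {
        cerr << "Failed to read event collection." << endl;
        return -1;
    }
    return 0;
}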
Example 2, following the answer on the Google Protobuf discussion group:
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/gzip_stream.h>

int writeEventCollection2(shared_ptr<HEP::EventCollection> eCollection,
                          std::string filename, unsigned int compressionLevel) {
    using namespace google::protobuf::io;
    int filedescriptor = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC,
                              S_IREAD | S_IWRITE);
    if (filedescriptor == -1) {
        throw "open failed on output file";
    }
    google::protobuf::io::FileOutputStream file_stream(filedescriptor);
    GzipOutputStream::Options options;
    options.format = GzipOutputStream::GZIP;
    options.compression_level = compressionLevel;
    google::protobuf::io::GzipOutputStream gzip_stream(&file_stream, options);
    if (!eCollection->SerializeToZeroCopyStream(&gzip_stream)) {
        cerr << "Failed to write event collection." << endl;
        return -1;
    }
    close(filedescriptor);
    return 0;
}
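The corresponding read path with Google's streams, again only a sketch (readEventCollection2 is an illustrative name, not benchmarked): wrap the file descriptor in a FileInputStream, layer a GzipInputStream on top and call ParseFromZeroCopyStream:

#include <fcntl.h>
#include <unistd.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/gzip_stream.h>

// Sketch: read back a file written by writeEventCollection2
int readEventCollection2(shared_ptr<HEP::EventCollection> eCollection,
                         std::string filename) {
    using namespace google::protobuf::io;
    int filedescriptor = open(filename.c_str(), O_RDONLY);
    if (filedescriptor == -1) {
        throw "open failed on input file";
    }
    FileInputStream file_stream(filedescriptor);
    GzipInputStream gzip_stream(&file_stream);   // default format is AUTO (detects GZIP/ZLIB)
    if (!eCollection->ParseFromZeroCopyStream(&gzip_stream)) {
        cerr << "Failed to read event collection." << endl;
        close(filedescriptor);
        return -1;
    }
    close(filedescriptor);
    return 0;
}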
Some performance comments (reading our current format and writing 11146 ProtoBuf files):

Example 1:

real    13m1.185s
user    11m18.500s
sys     0m13.430s
CPU usage: 65-70%
Size of test sample: 4.2 GB (uncompressed: 7.7 GB, our current compressed format: 7.7 GB)
Example 2:
real    12m37.061s
user    10m55.460s
sys     0m11.900s
CPU usage: 90-100%
Size of test sample: 3.9 GB
It seems that Google's method uses the CPU more efficiently, is slightly faster (although I would expect that to be within measurement accuracy) and produces a dataset about 7% smaller with the same compression setting.