How to change the buffer size with boost::iostreams?

My program reads dozens of very large files in parallel, just one line at a time. It seems that the main performance bottleneck is HDD seek time as it jumps from file to file (although I'm not quite sure how to verify this), so I think it would be faster if I could buffer the input.

I use C++ code like this to read my files via boost::iostreams "filtering streams":

 input = new filtering_istream;
 input->push(gzip_decompressor());
 file_source in(fname);
 input->push(in);

According to the documentation, file_source has no way to set the buffer size, but filtering_stream::push looks like this:

 void push( const T& t, std::streamsize buffer_size, std::streamsize pback_size ); 

So I tried input->push(in, 1E9), and indeed my program's memory usage shot up, but the speed did not change at all.

Am I just mistaken that buffered reading will improve performance? Or did I do it wrong? Can I buffer the file_source directly, or should I create a filtering_streambuf? If the latter, how does that work? The documentation is not exactly full of examples.

1 answer

You should first profile to see where the bottleneck really is.

Perhaps it is in the kernel, perhaps you are at your hardware's limit. Until you profile it, you are stumbling in the dark.

EDIT:

Well, then, a more thorough answer. According to the Boost.Iostreams documentation, basic_file_source is just a wrapper around std::filebuf, which in turn is built on std::streambuf. To quote the documentation:

CopyConstructible and Assignable wrapper for a std::basic_filebuf opened in read-only mode.

streambuf provides a pubsetbuf method (maybe not the best reference, but the first one Google turned up), which you can apparently use to control the buffer size.

For example:

 #include <fstream>

 int main()
 {
     char buf[4096];
     std::ifstream f;
     // pubsetbuf must be called before the file is opened
     f.rdbuf()->pubsetbuf(buf, sizeof buf);
     f.open("/tmp/large_file", std::ios::binary);

     char rbuf[1024];
     // using read() as the loop condition avoids the classic
     // while (!f.eof()) off-by-one bug
     while (f.read(rbuf, sizeof rbuf) || f.gcount() > 0) { }
     return 0;
 }

In my test (optimizations off, though) I actually got worse performance with a 4096-byte buffer than with a 16-byte buffer, but YMMV, and it is a good example of why you should always profile first :)

But, as you say, basic_file_source does not provide any means to access this, since it hides the underlying filebuf in its private section.

If you think this is wrong, you can:

  • Encourage the Boost developers to expose such functionality, e.g. via the mailing list or the bug tracker.
  • Create your own filebuf wrapper that sets the buffer size. There is a section in the tutorial on writing custom sources, which could be a good starting point.
  • Write your own source that does whatever caching you like.

Remember that your hard drive and the kernel already cache and buffer file reads, and I don't think you will squeeze out much more performance with additional caching.

And finally, a word about profiling. There are many powerful profiling tools for Linux, and I don't even know half of them by name, but for example there is iotop, which is kind of neat because it is very easy to use. It is much like top, but shows disk-related metrics instead. For example:

 Total DISK READ: 31.23 M/s | Total DISK WRITE: 109.36 K/s
   TID  PRIO  USER     DISK READ   DISK WRITE  SWAPIN      IO>  COMMAND
 19502  be/4  staffan  31.23 M/s    0.00 B/s   0.00 %  91.93 %  ./apa

tells me that my program spends more than 90% of its time waiting for IO, i.e. it is IO-bound. If you need something more powerful, I'm sure Google can help you.

And remember that benchmarking against a hot or cold disk cache greatly affects the results.
