When should you build your own buffering system for I/O (C++)?

I have to deal with very large text files (2 GB), and they must be read and written line by line. Writing 23 million lines with ofstream is very slow, so at first I tried to speed things up by collecting large chunks of lines in a memory buffer (for example, 256 MB or 512 MB) and then writing the buffer to the file. This did not work; the performance is more or less the same. I have the same problem when reading files. I know that I/O operations are buffered by the STL I/O system, and that this also depends on the disk scheduler policy (managed by the OS, in my case Linux).

Any idea on how to improve performance?

PS: I was thinking about using a background thread (or child process) to read/write data blocks while the program is processing data, but I do not know (mainly in the case of a subprocess) whether it would be worthwhile.

+6
c++ performance linux io buffer
7 answers

A 2 GB file is quite large, and you need to know all the possible areas that can act as bottlenecks:

  • Hard drive itself
  • Hard drive interface (IDE/SATA/RAID/USB?)
  • Operating system / file system
  • C/C++ library
  • Your code

I would start with some measurements:

  • How long does your code take to read/write a 2 GB file?
  • How fast can the 'dd' command read and write to disk? Example:

    dd if=/dev/zero bs=1024 count=2000000 of=file_2GB

  • How long does it take to write/read using only large calls to fwrite()/fread()?

Assuming your drive can read/write at around 40 MB/s (which is probably a realistic figure to start with), your 2 GB file cannot be read or written in less than about 50 seconds (2000 MB / 40 MB/s).
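
To take the "large fwrite()/fread() calls" measurement, something along these lines could serve as a starting point. It is only a sketch: the function names, the chunk size and the file path are made up for illustration. Note that fclose() does not force data to the physical disk, so the OS page cache can make the numbers look better than what the drive really delivers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Writes total_bytes to path using fwrite() calls of at most chunk_bytes each.
// Returns the elapsed wall-clock time in seconds, or -1.0 on failure.
double time_large_writes(const char* path, std::size_t total_bytes,
                         std::size_t chunk_bytes) {
    assert(chunk_bytes > 0);
    std::vector<char> chunk(chunk_bytes, 'x');
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return -1.0;
    const auto start = std::chrono::steady_clock::now();
    bool ok = true;
    for (std::size_t written = 0; ok && written < total_bytes;) {
        const std::size_t n = std::min(chunk_bytes, total_bytes - written);
        ok = std::fwrite(chunk.data(), 1, n, f) == n;
        written += n;
    }
    if (std::fclose(f) != 0) ok = false;
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return ok ? elapsed.count() : -1.0;
}

// Reads the whole file with fread() calls of chunk_bytes each and stores the
// number of bytes read in bytes_read.
// Returns the elapsed wall-clock time in seconds, or -1.0 on failure.
double time_large_reads(const char* path, std::size_t chunk_bytes,
                        std::size_t& bytes_read) {
    assert(chunk_bytes > 0);
    bytes_read = 0;
    std::vector<char> chunk(chunk_bytes);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1.0;
    const auto start = std::chrono::steady_clock::now();
    std::size_t n = 0;
    while ((n = std::fread(chunk.data(), 1, chunk.size(), f)) > 0) {
        bytes_read += n;
    }
    const bool ok = std::ferror(f) == 0;
    std::fclose(f);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return ok ? elapsed.count() : -1.0;
}
```

Dividing the byte count by the returned seconds gives a throughput figure that can be compared with what dd reports for the same disk.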

How long is it actually taking?

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB) takes about 43-48 seconds, and it is the same with fstream getline (line by line). cp takes about 2 minutes to copy the file.

In that case, you are hardware-bound. cp has to both read and write, and it will be seeking back and forth across the disk surface like crazy while it does so. So it will (as you see) be more than twice as slow as the simple "read" case.

To increase speed, the first thing I would try is a faster hard drive or an SSD.

You did not say what the disk interface is. SATA is pretty much the easiest and fastest option. Also (obvious as it is) make sure that the disk is physically in the same machine your code is running on, otherwise you are network-bound.

+10

I also suggest memory-mapped files, but if you are going to use Boost, I think boost::iostreams::mapped_file is a better fit than boost::interprocess.

+8

Perhaps you should look into memory-mapped files.

Check them out in this library: Boost.Interprocess
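
For a feel of what memory mapping looks like without any library, here is a minimal Linux-only sketch that uses the POSIX mmap() call directly instead of Boost. The function name count_lines_mmap is invented for this example, and counting newlines is just a stand-in for real line-by-line processing.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts the '\n' characters in a file by mapping it read-only into memory.
// Returns -1 on error.
long long count_lines_mmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    // The file contents are now addressable like an ordinary char array;
    // the kernel pages the data in on demand.
    const char* p = static_cast<const char*>(map);
    const char* const end = p + size;
    long long lines = 0;
    while (p < end) {
        const void* nl = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        if (!nl) break;
        ++lines;
        p = static_cast<const char*>(nl) + 1;
    }

    munmap(map, size);
    return lines;
}
```

Whether this beats buffered stream reads on a given machine is something to measure rather than assume.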

+5

Just a thought, but avoid using std::endl, as it forces a flush of the buffer before the buffer is full. Use "\n" for a new line instead.
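
To make the difference concrete, here is a small sketch with two hypothetical helpers that produce identical files. The first ends each line with '\n' and lets the stream flush when its buffer fills or when the file is closed; the second uses std::endl and therefore flushes after every single line.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Ends each line with '\n'; the stream buffer is flushed only when it
// fills up and when the file is closed.
bool write_lines_newline(const std::string& path, const std::string& line,
                         std::size_t count) {
    std::ofstream out(path);
    if (!out) return false;
    for (std::size_t i = 0; i < count; ++i) {
        out << line << '\n';
    }
    out.close();
    return !out.fail();
}

// Same file contents, but std::endl flushes the stream after every line.
bool write_lines_endl(const std::string& path, const std::string& line,
                      std::size_t count) {
    std::ofstream out(path);
    if (!out) return false;
    for (std::size_t i = 0; i < count; ++i) {
        out << line << std::endl;
    }
    out.close();
    return !out.fail();
}
```

Timing both over a few million lines should show how much the per-line flush costs on your system.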

+3

Do not use new to allocate a buffer like that.

Try std::vector<> instead:

    unsigned int buffer_size = 64 * 1024 * 1024; // 64 MB, for instance.
    std::vector<char> data_buffer(buffer_size);
    _file->read(&data_buffer[0], buffer_size);

Also read the article on using underscores in identifier names. Please note that your code is OK, but.

+2

Using getline() can be inefficient because the string buffer may need to be resized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string with reserve().

You can also set the iostreams buffer size to either very large or NULL (for unbuffered).

    // Unbuffered access:
    fstream file;
    file.rdbuf()->pubsetbuf(NULL, 0);
    file.open("PLOP");

    // Or, a larger buffer:
    std::vector<char> buffer(64 * 1024 * 1024);
    fstream file;
    file.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
    file.open("PLOP");

    // Pre-sized string for getline():
    std::string line;
    line.reserve(64 * 1024 * 1024);

    while (getline(file, line))
    {
        // Do Stuff.
    }

+1

If you are going to buffer the file yourself, I would advise you to do some testing with unbuffered I/O (calling setvbuf on the file you have opened can turn off library buffering).

Basically, if you are going to do your own buffering, you want to disable the library's buffering, as it will only get in your way. I don't know if there is a way to do this for STL I/O, so I recommend dropping down to C-level I/O.
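
As a rough illustration of that idea, the sketch below turns off stdio buffering with setvbuf() and does its own buffering in a std::string, handing data to fwrite() only in large pieces. LineWriter is a hypothetical class written for this example, not part of any library, and the capacity you pass in is something to tune by measurement.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: collects lines in its own buffer and writes them to an
// unbuffered FILE* in large fwrite() calls.
class LineWriter {
public:
    LineWriter(const char* path, std::size_t capacity)
        : file_(std::fopen(path, "wb")), capacity_(capacity) {
        if (file_) {
            // Must be called before any other operation on the stream.
            std::setvbuf(file_, nullptr, _IONBF, 0);
        } else {
            failed_ = true;
        }
        buffer_.reserve(capacity_);
    }
    ~LineWriter() { close(); }
    LineWriter(const LineWriter&) = delete;
    LineWriter& operator=(const LineWriter&) = delete;

    bool ok() const { return !failed_; }

    // Appends the line plus '\n'; flushes first if the line would not fit.
    void write_line(const std::string& line) {
        if (buffer_.size() + line.size() + 1 > capacity_) flush();
        buffer_ += line;
        buffer_ += '\n';
    }

    // Writes everything buffered so far with a single fwrite() call.
    void flush() {
        if (file_ && !buffer_.empty()) {
            const std::size_t n =
                std::fwrite(buffer_.data(), 1, buffer_.size(), file_);
            if (n != buffer_.size()) failed_ = true;
        }
        buffer_.clear();
    }

    // Flushes the remaining data and closes the file; true if nothing failed.
    bool close() {
        flush();
        if (file_) {
            if (std::fclose(file_) != 0) failed_ = true;
            file_ = nullptr;
        }
        return !failed_;
    }

private:
    std::FILE* file_;
    std::size_t capacity_;
    std::string buffer_;
    bool failed_ = false;
};
```

A caller would construct it with something like a 64 MB capacity, call write_line() for each line, and check the result of close() at the end.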

0
