Quickly read specific bytes of multiple files in C/C++

I have searched the Internet for this, and although there are a lot of questions about reading/writing files in C/C++, I did not find this specific task.

I want to read only sizeof(double) bytes, located at a specific position, from each of a large number of files (256x256 of them). Right now my solution for each file is:

  • Open the file (read, binary mode):

    fstream fTest("current_file", ios_base::in | ios_base::binary);

  • Find the position I want to read:

    fTest.seekg(position*sizeof(test_value), ios_base::beg);

  • Read the bytes:

    fTest.read((char *) &(output[i][j]), sizeof(test_value));

  • And close the file:

    fTest.close();

It takes about 350 ms to run the nested for { for { } } loop over 256x256 iterations (one per file).
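Putting the four steps together, the per-file loop currently looks roughly like this (a minimal sketch; the file-naming scheme and the output array below are placeholders, not my real code):

    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t N = 256;
        const std::size_t position = 1000;   // placeholder: index of the wanted value in every file
        std::vector<std::vector<double>> output(N, std::vector<double>(N));

        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                // placeholder naming scheme "file_i_j.bin"
                std::string name = "file_" + std::to_string(i) + "_" + std::to_string(j) + ".bin";

                std::ifstream fTest(name, std::ios_base::binary);             // open for reading
                fTest.seekg(position * sizeof(double), std::ios_base::beg);   // jump to the value
                fTest.read(reinterpret_cast<char*>(&output[i][j]), sizeof(double));
            }   // fTest closes automatically at the end of the scope
        }
    }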


Q: What do you think is the best way to implement this operation? How do you do this?

+7
c++ performance optimization c
6 answers

Maybe threading will help.

But first you could try something simpler. Make two copies of your program, one reading the first 32768 files and the other reading the second half. Run both programs at the same time. Does it take less than 14 hours?

If not, then adding threads is probably useless. Defragmentation, as suggested above, can help.

Added: 14 hours is clearly wrong, since that would be almost 1 second per file. Alejandro's comment above says that with a solid-state drive the time is only 0.1 ms per file, i.e. only about 6.5 seconds in total, which seems fast to me.

So I assume Alejandro has to repeat this approximately 7000 times, each time with a different piece of data from the 65536 files. If so, two more suggestions:

  • Write a program to cat the files into a single new file. You probably have enough room on your SSD to do this, since your other SO question mentions 32 GB of data and the SSD is probably several times that. Then each run uses only this single huge file, which removes 65535 opens and closes (a sketch of this follows below).

  • And instead of just concatenating when creating the huge file, you could "swap rows and columns" or "stripe the data", improving locality.

Further addition: perhaps you have already considered this, given your phrase "writing the read data to a single file".
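For illustration, a minimal sketch of the single-big-file idea, assuming the 65536 files were concatenated in row-major order and each original file holds the same number of doubles (both are assumptions, not something stated in the question):

    #include <fstream>
    #include <vector>

    int main() {
        const std::size_t N = 256;
        const std::size_t doubles_per_file = 4096;   // assumed size of each original file
        const std::size_t position = 1000;           // assumed index wanted within every original file

        std::vector<std::vector<double>> output(N, std::vector<double>(N));

        // hypothetical concatenated file; opened once instead of 65536 times
        std::ifstream big("all_files.bin", std::ios_base::binary);

        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                // the (i,j)-th original file starts at chunk index i*N + j in the big file
                std::streamoff offset =
                    static_cast<std::streamoff>((i * N + j) * doubles_per_file + position)
                    * static_cast<std::streamoff>(sizeof(double));
                big.seekg(offset, std::ios_base::beg);
                big.read(reinterpret_cast<char*>(&output[i][j]), sizeof(double));
            }
        }
    }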

+2

If possible, I suggest reorganizing the data. For example, put all of these doubles in a single file instead of spreading them across multiple files.

If you need to run the program several times, and the data does not change, you can create a tool that first optimizes the data.

The problem with many files is the overhead involved:

  • {overhead} Spinning up the hard drive.
  • {overhead} Locating the file.
  • Positioning inside the file.
  • Reading the data.
  • {Closing the file costs very little.}

Most file systems that handle large amounts of data are optimized so that reading the data takes longer than any of the overhead. Requests are cached and sorted for optimal disk access. Unfortunately, in your case, you are not reading enough data, so the overhead now takes longer than the reads.

I suggest trying to pipeline the read operations. Take 4 threads; each opens a file, reads the double, and puts it into a buffer. The idea is to stagger the operations:

  • Thread 1 opens a file.
  • Thread 2 opens a file while thread 1 is seeking.
  • Thread 3 opens a file while thread 2 is seeking and thread 1 is reading.
  • Thread 4 opens a file while thread 3 is seeking, thread 2 is reading, and thread 1 is closing.

Hopefully these threads can keep the hard drive busy enough that it does not slow down, i.e. continuous activity. You could try this with a single thread first. If you need even better performance, you may want to send commands directly to the drive (and order them first).
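A minimal sketch of the threaded idea, simplified so that each of 4 threads just handles its own quarter of the files (the open/seek/read stages then overlap naturally instead of being scheduled by hand; the file-naming scheme is an assumption):

    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t N = 256;
        const std::size_t position = 1000;            // assumed index of the wanted double
        std::vector<double> output(N * N);

        auto worker = [&](std::size_t first_row, std::size_t last_row) {
            for (std::size_t i = first_row; i < last_row; ++i) {
                for (std::size_t j = 0; j < N; ++j) {
                    // assumed naming scheme; each thread writes only to its own slots, so no locking needed
                    std::string name = "file_" + std::to_string(i) + "_" + std::to_string(j) + ".bin";
                    std::ifstream f(name, std::ios_base::binary);
                    f.seekg(position * sizeof(double), std::ios_base::beg);
                    f.read(reinterpret_cast<char*>(&output[i * N + j]), sizeof(double));
                }
            }
        };

        std::vector<std::thread> pool;
        for (std::size_t t = 0; t < 4; ++t)           // 4 threads, 64 rows of files each
            pool.emplace_back(worker, t * 64, (t + 1) * 64);
        for (auto& th : pool) th.join();
    }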

+2

If you really want to optimize this, you probably want to drop the C++ fstream stuff, or at least disable its buffering. fstream does a lot of memory allocation and freeing, and its buffering may read in more data than necessary. The OS will most likely need to read a whole page to get the few bytes you need, but fstream will probably want it to copy at least that much (and possibly more, requiring more reads) into its buffers, which will take time.

Then we can move on to the bigger wins. You might want to use the OS I/O routines directly. If you are using a POSIX system (such as Linux), then open, lseek, read, and close are well suited for this, and may be what you have to use if you don't have the system calls mentioned below.

If all the files that you are trying to read from live in one directory (folder), or under one, you may find that opening the directory with opendir or open("directory_name", O_DIRECTORY) (depending on whether you need to read the directory entries yourself), and then calling openat, which takes a directory's file descriptor as one of its arguments, speeds up opening each file, since the OS does not have to work as hard to look up the file you are trying to open each time (that data will probably already be in the OS's file system cache, but it still takes time and involves lots of checks).

Then you can read the data with the pread system call, without having to seek to the data you need first. pread takes the offset as an argument rather than relying on the OS's idea of the current seek position, which saves you at least one system call per file.
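A minimal sketch of the openat + pread combination on a POSIX system (the directory name, file names, and record index are assumptions):

    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t N = 256;
        const std::size_t position = 1000;                  // assumed index of the wanted double
        std::vector<double> output(N * N);

        // hypothetical directory holding the 65536 files
        int dirfd = open("data_dir", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0) return 1;

        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                std::string name = "file_" + std::to_string(i) + "_" + std::to_string(j) + ".bin"; // assumed
                int fd = openat(dirfd, name.c_str(), O_RDONLY);   // path lookup relative to dirfd
                if (fd < 0) continue;
                // pread reads at an explicit offset, so no separate lseek call is needed
                pread(fd, &output[i * N + j], sizeof(double),
                      static_cast<off_t>(position * sizeof(double)));
                close(fd);
            }
        }
        close(dirfd);
    }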

Edit

If your system supports asynchronous I/O, it could speed things up, since you can keep working after letting the OS know what you need before you actually use it (this lets the OS schedule the disk reads better, especially for spinning disks), but it can get complicated. It could probably save you a lot of time, though.
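If POSIX asynchronous I/O is available, a batch of reads can be issued before waiting on any of them; a minimal sketch under that assumption (the naming scheme and batch size are placeholders, and error handling is omitted):

    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstring>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t batch = 64;                 // placeholder batch size
        const std::size_t position = 1000;            // assumed index of the wanted double
        std::vector<double> values(batch);
        std::vector<aiocb> cbs(batch);
        std::vector<int> fds(batch);

        for (std::size_t k = 0; k < batch; ++k) {
            std::string name = "file_" + std::to_string(k) + ".bin";   // placeholder naming scheme
            fds[k] = open(name.c_str(), O_RDONLY);
            std::memset(&cbs[k], 0, sizeof(aiocb));
            cbs[k].aio_fildes = fds[k];
            cbs[k].aio_buf    = &values[k];
            cbs[k].aio_nbytes = sizeof(double);
            cbs[k].aio_offset = static_cast<off_t>(position * sizeof(double));
            aio_read(&cbs[k]);                        // queue the read; returns immediately
        }

        for (std::size_t k = 0; k < batch; ++k) {
            const aiocb* list[1] = { &cbs[k] };
            while (aio_error(&cbs[k]) == EINPROGRESS)
                aio_suspend(list, 1, nullptr);        // block until this request completes
            aio_return(&cbs[k]);                      // collect the result and release the request
            close(fds[k]);
        }
    }

Note that glibc implements POSIX AIO with user-space helper threads, so the gain comes mainly from overlapping the waits; older glibc versions also need -lrt at link time.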

+1

Given the nature of the problem, I'm not sure how much more performance you can squeeze out of it. If the files are distributed across several different disks, I would create a thread for each disk; that way you could parallelize several reads at a time. However, if they are all on the same disk, then at some level all reads will be serialized (I think; I'm not a storage expert).

I/O is your limiting factor, not the algorithm.

0

Is the fstream API buffered by default? I wonder if switching to an API that doesn't buffer, or disabling buffering with setvbuf, would speed things up. The underlying OS caching may well mean there is no difference, but it would be interesting to know.
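setvbuf belongs to the C stdio API; the iostream counterpart is pubsetbuf on the underlying filebuf, which on common implementations has to be called before the file is opened to take effect. A minimal sketch of that variant (the file name and position are placeholders):

    #include <fstream>

    int main() {
        double value = 0.0;
        const std::size_t position = 1000;       // placeholder index of the wanted double

        std::ifstream f;
        f.rdbuf()->pubsetbuf(nullptr, 0);        // request an unbuffered filebuf (call before open)
        f.open("current_file", std::ios_base::binary);

        f.seekg(position * sizeof(double), std::ios_base::beg);
        f.read(reinterpret_cast<char*>(&value), sizeof(double));
    }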

0

Invert the iteration order. Or at least read an entire page of data from disk (say, 4 KB per file) and keep it in memory for the next passes. Then you only have to hit the file system on every 512th pass. It will cost about 256 MB of RAM, but it saves hundreds of GB of file I/O (even if you request only 8 bytes, a whole disk page has to be transferred into the cache). And your OS's disk-cache replacement algorithm will most likely have evicted a file's pages by the time you come back to it after 65k other opens, so do not trust it to do this optimization for you.
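A minimal sketch of that caching idea: on the pass that first touches a file, read a whole 4 KB page (512 doubles) from it and keep it in memory, then serve the next 511 passes from the cache (the naming scheme and the assumption that the wanted values are contiguous from offset 0 are placeholders):

    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t num_files = 65536;
        const std::size_t page_doubles = 512;               // 4 KB / sizeof(double)
        // one cached 4 KB page per file: 65536 * 4 KB = 256 MB of RAM
        std::vector<std::vector<double>> cache(num_files, std::vector<double>(page_doubles));

        // fill the cache once; naming scheme and starting offset are assumptions
        for (std::size_t f = 0; f < num_files; ++f) {
            std::ifstream in("file_" + std::to_string(f) + ".bin", std::ios_base::binary);
            in.read(reinterpret_cast<char*>(cache[f].data()), page_doubles * sizeof(double));
        }

        // the next 512 passes pull their values straight from memory instead of the file system
        for (std::size_t pass = 0; pass < page_doubles; ++pass) {
            for (std::size_t f = 0; f < num_files; ++f) {
                double value = cache[f][pass];
                (void)value;                                 // ... process value ...
            }
        }
    }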

0
