Multithreaded reading from a file in C++?

My application stores its data in a text file. I benchmarked the fastest way to read it with multithreading, comparing the following two methods:

  • Create as many threads as the NUMBER_OF_PROCESSORS environment variable indicates, each with its own read stream on the file. Divide the total number of lines in the file equally among the threads; each thread parses its share of the text.

  • Only one thread reads the entire file and loads the data into memory. (NUMBER_OF_PROCESSORS - 1) threads then parse the data from memory.

The test was performed on files ranging from 100 kB to 800 MB in size. Data in the file:

100.23123 -42343.342555 ... (and so on)
4928340 -93240.2 349 ...
...

The data is stored in a 2D array of doubles.

Result: Both methods take approximately the same time to parse the file.

Question: Which method to choose?

Method 1 seems bad for a hard drive, since multiple reads are issued to random locations at the same time.

Method 2 is bad because the memory required is proportional to the file size. This can be partially mitigated by limiting the container to a fixed size, deleting parsed content, and refilling it from the reader, but that increases processing time.

+6
3 answers

Method 2 has a sequential bottleneck (the single-threaded reading and handing out of work items). By Amdahl's law, that will not scale indefinitely. It is a very fair and reliable method, though.

Method 1 has no bottleneck and will scale. Just make sure not to cause random disk I/O: I would use a mutex so that only one thread reads at a time, each read grabbing a large sequential block, perhaps 4-16 MB. In the time the disk performs a single head seek it could have read about 1 MB of data instead.

If parsing the lines takes considerable time, you cannot use method 2 because of the big sequential part; it will not scale. If parsing is fast, use method 2, because it is easier to get right.

To illustrate the bottleneck concept: imagine 1,000,000 compute threads asking a single reader thread to hand them lines. That reader thread could not dish out lines as fast as they are demanded; you would not get 1e6 times the throughput. It would not scale. But if those 1e6 threads read independently from a very fast I/O device, you would get 1e6 times the throughput, because there is no bottleneck. (I used extreme numbers to make the point; the same idea applies at small scale.)

+4

I would prefer a slightly modified approach: read the data sequentially in one thread, in large chunks. Each finished chunk is handed off to a thread pool, where the data is processed. That way you have reading and processing in parallel.

+1

With enough RAM, you can do this without a single-threaded bottleneck. For Linux:

1) mmap the entire file into RAM with MAP_LOCKED; this requires root privileges or raised system limits for locked memory. Or skip MAP_LOCKED on SSDs, since they handle random access well.

2) Give each thread a start offset. Each thread processes the data from the first newline after its own start offset up to the first newline after the next thread's start offset.

P.S. What is your CPU load during the run? The HDD is probably the bottleneck.

0
