Efficiently reading data from a structured file in C/C++

I have a file with the following structure:


The file consists of two parts: a header part and a data part.

The data part is divided into pages of the same size. Each page contains data for one specific metric. Storing the data for one metric may take several pages (not necessarily consecutive). Each page consists of a page header and a page body. The page header has a "next page" field, which is the index of the next page containing data for the same metric. The page body contains the actual data. All pages have the same fixed size: 20 bytes for the header and 800 bytes for the body (if there are fewer than 800 bytes of data, the rest is zero-filled).

The header part consists of 20,000 elements; each element holds information about one specific metric (metric 1 → metric 20,000). Each element has a "first page" field, which is the index of the first page containing the data for that metric.
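
For illustration, a hedged C++ view of that layout could look like the sketch below. Only the 20-byte page header and the 800-byte page body are given above; the field names, the metric-name length and the reserved/padding fields are assumptions:

    #include <cstdint>

    #pragma pack(push, 1)
    struct HeaderEntry {              // one of the 20,000 header elements
        char          name[12];       // metric name (length assumed)
        std::uint32_t firstPage;      // index of the first page holding this metric's data
        std::uint32_t reserved;       // other/unknown fields (assumed)
    };

    struct PageHeader {               // 20 bytes
        std::uint32_t nextPage;       // index of the next page for the same metric
        std::uint8_t  reserved[16];   // remaining header fields (assumed)
    };

    struct Page {                     // 820 bytes total
        PageHeader header;
        char       body[800];         // zero-filled if the metric has less data
    };
    #pragma pack(pop)

    static_assert(sizeof(PageHeader) == 20, "page header is 20 bytes");
    static_assert(sizeof(Page) == 820, "page is 820 bytes");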

The file can be up to 10 GB.

Requirement: reorder the file data as fast as possible, so that the pages containing data for one metric are consecutive and the metrics run from metric 1 to metric 20,000 in alphabetical order (the header part must be updated accordingly).

The obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading from the input file.

Is there a more efficient approach?

+4
3 answers

First I read the header part and sorted the metrics alphabetically. Then, for each metric in the sorted list, I read all of its data from the input file and wrote it to the output file. To remove the read bottleneck I used memory mapping. With memory mapping, the runtime for a 5 GB input file dropped by a factor of 5–6 compared to the version without it. This solves my problem for the time being; I will also look into @utnapistim's suggestions.
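
For reference, a minimal POSIX sketch of the memory-mapped read path (Linux/macOS; on Windows the equivalent would be CreateFileMapping/MapViewOfFile). The constants, the end-of-chain marker, the file names and the position of the "next page" field are assumptions and have to be adapted to the real format:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <fstream>

    constexpr std::size_t   kPageSize   = 820;        // 20-byte header + 800-byte body
    constexpr std::uint32_t kEndOfChain = 0xFFFFFFFF; // assumed "no next page" marker

    // Copies every page of one metric to 'out' by following its page chain.
    void copyMetric(const char* base, std::uint64_t dataOffset,
                    std::uint32_t firstPage, std::ofstream& out) {
        std::uint32_t page = firstPage;
        while (page != kEndOfChain) {
            const char* p = base + dataOffset + std::uint64_t(page) * kPageSize;
            out.write(p, kPageSize);
            std::memcpy(&page, p, sizeof(page));      // assumes "next page" is the first header field
        }
    }

    int main() {
        int fd = open("metrics.dat", O_RDONLY);       // hypothetical input file name
        if (fd < 0) return 1;

        struct stat st{};
        fstat(fd, &st);
        const std::size_t fileSize = static_cast<std::size_t>(st.st_size);

        // Map the whole input read-only; the OS pages data in on demand,
        // avoiding one explicit copy per read() call.
        void* mapped = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapped == MAP_FAILED) { close(fd); return 1; }
        const char* base = static_cast<const char*>(mapped);

        std::ofstream out("metrics_sorted.dat", std::ios::binary);
        // ... read the header, sort the metrics alphabetically, then for each
        // metric call copyMetric(base, headerSize, firstPage, out) ...

        munmap(mapped, fileSize);
        close(fd);
        return 0;
    }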

0

One possible solution is to build an index over the file, recording for each page its page number and the metric it belongs to. Create this index as an array so that the first record (index 0) corresponds to the first page, the second record (index 1) to the second page, and so on.

Then you sort that index by metric.

After sorting you have a new array that tells you which page should come first, second, and so on; you then read the pages from the input file and write them to the output file in that sorted order.
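
A minimal sketch of that idea, assuming the index records have already been filled in by walking the "first page" / "next page" links in the header (all names and layout constants are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    constexpr std::size_t kPageSize = 820;   // 20-byte header + 800-byte body

    struct PageRef {
        std::uint32_t metricRank;   // alphabetical rank of the metric this page belongs to
        std::uint32_t seq;          // position of the page within that metric's chain
        std::uint64_t pageIndex;    // index of the page in the input file
    };

    // 'refs' holds one entry per page, in any order.
    void copyPagesInSortedOrder(std::vector<PageRef>& refs,
                                std::ifstream& in, std::ofstream& out,
                                std::uint64_t dataSectionOffset) {
        std::sort(refs.begin(), refs.end(),
                  [](const PageRef& a, const PageRef& b) {
                      return a.metricRank != b.metricRank ? a.metricRank < b.metricRank
                                                          : a.seq < b.seq;
                  });

        std::vector<char> page(kPageSize);
        for (const PageRef& r : refs) {
            in.seekg(static_cast<std::streamoff>(dataSectionOffset + r.pageIndex * kPageSize));
            in.read(page.data(), static_cast<std::streamsize>(kPageSize));
            // Pages land contiguously in the output, metric by metric.
            out.write(page.data(), static_cast<std::streamsize>(kPageSize));
        }
    }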

+3

The obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading from the input file.

Is there a more efficient approach?

Yes. Once you have a working solution, measure its performance, and then decide which parts you want to optimize. What and how you optimize will depend heavily on the results you get (where your bottlenecks are).

A few general things to consider:

  • If you have one set of steps that reads the data for one metric and transfers it to the output, you should be able to parallelize it (running, say, 20 such pipelines instead of one; see the sketch after this list).
  • A 10 GB file will take a fair amount of time to process no matter what hardware you run the code on (unless you can run it on a supercomputer, a case I am ignoring). You / your client may be able to accept a slower solution if it reports its progress / shows a progress bar.
  • Do not use string comparisons for the sort.
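
A rough sketch of the "parallel pipelines" idea from the first bullet, using std::async: several metrics are read concurrently (each task with its own std::ifstream), while the output is still written strictly in metric order. The helper names, page layout and end-of-chain marker are assumptions:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <deque>
    #include <fstream>
    #include <future>
    #include <string>
    #include <vector>

    constexpr std::size_t   kPageSize   = 820;
    constexpr std::uint32_t kEndOfChain = 0xFFFFFFFF;  // assumed end-of-chain marker

    // Reads all pages of one metric (own ifstream, so it is safe to run in parallel).
    std::vector<char> readMetricPages(std::string path, std::uint64_t dataOffset,
                                      std::uint32_t firstPage) {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> result;
        std::vector<char> page(kPageSize);
        for (std::uint32_t p = firstPage; p != kEndOfChain;) {
            in.seekg(static_cast<std::streamoff>(dataOffset + std::uint64_t(p) * kPageSize));
            in.read(page.data(), static_cast<std::streamsize>(kPageSize));
            result.insert(result.end(), page.begin(), page.end());
            std::memcpy(&p, page.data(), sizeof(p));   // assumes "next page" is the first field
        }
        return result;
    }

    void reorder(const std::string& inPath, const std::string& outPath,
                 const std::vector<std::uint32_t>& firstPagesAlphabetical,
                 std::uint64_t dataOffset) {
        std::ofstream out(outPath, std::ios::binary);
        constexpr std::size_t kInFlight = 20;          // metrics read concurrently
        std::deque<std::future<std::vector<char>>> pending;
        std::size_t next = 0;

        while (next < firstPagesAlphabetical.size() || !pending.empty()) {
            while (pending.size() < kInFlight && next < firstPagesAlphabetical.size()) {
                pending.push_back(std::async(std::launch::async, readMetricPages,
                                             inPath, dataOffset,
                                             firstPagesAlphabetical[next++]));
            }
            std::vector<char> data = pending.front().get();  // oldest metric first
            pending.pop_front();
            out.write(data.data(), static_cast<std::streamsize>(data.size()));
        }
    }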

Edit (to address a comment)

Consider the following:

  • create a list of block offsets for the blocks you want to read

  • create a fixed-size pool of worker threads (for example, 10 workers)

  • each idle worker receives the file name and a block offset, opens its own std::ifstream on the file, reads the block and returns it to a consumer object (then asks for another block offset if any are left)

  • the pages that have been read should be handed to a central structure that manages / stores them.

Also consider managing the memory for the blocks yourself (for example, pre-allocating a pool of blocks once you know how many blocks you will need to read). A sketch of this worker-pool design follows below.
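
A sketch of that worker-pool design under the same assumptions (fixed page size, a pre-built list of block offsets): each worker pulls the next offset from a shared counter, reads the block with its own std::ifstream and hands it to a central store:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    constexpr std::size_t kBlockSize = 820;   // one page: 20-byte header + 800-byte body

    struct BlockStore {                       // central structure that owns the read pages
        std::mutex m;
        std::map<std::uint64_t, std::vector<char>> blocks;   // offset -> data
        void put(std::uint64_t offset, std::vector<char> data) {
            std::lock_guard<std::mutex> lock(m);
            blocks.emplace(offset, std::move(data));
        }
    };

    void readBlocks(const std::string& path,
                    const std::vector<std::uint64_t>& offsets,   // block offset list
                    BlockStore& store,
                    std::size_t workerCount = 10) {
        std::atomic<std::size_t> nextIndex{0};

        auto worker = [&]() {
            std::ifstream in(path, std::ios::binary);            // one stream per worker
            std::vector<char> buf(kBlockSize);                    // could come from a pre-allocated pool
            for (std::size_t i = nextIndex.fetch_add(1); i < offsets.size();
                 i = nextIndex.fetch_add(1)) {
                in.seekg(static_cast<std::streamoff>(offsets[i]));
                in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
                store.put(offsets[i], buf);                       // copies buf into the store
            }
        };

        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < workerCount; ++w) workers.emplace_back(worker);
        for (std::thread& t : workers) t.join();
    }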

+2
