Fastest way to do many small, blind writes on a huge file (in C++)?

I have some very large (> 4 GB) files containing millions of fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.

To elaborate, I have a pair of lists of (key, record number) tuples, sorted by key, for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is: what is the fastest way to actually do this?
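The matching step described above is an ordinary merge join over two key-sorted lists. A minimal sketch follows, assuming 64-bit keys and the 1:1 mapping mentioned in the question; the names `KeyRec`, `Match`, and `joinSorted` are made up for illustration and are not from the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types: the question only says "(key, record number)" tuples.
struct KeyRec { std::uint64_t key; std::uint64_t rec; };
struct Match  { std::uint64_t key; std::uint64_t recA; std::uint64_t recB; };

// Merge-join two lists that are already sorted by key.
// Assumes each key occurs at most once per list (the 1:1 case).
std::vector<Match> joinSorted(const std::vector<KeyRec>& a,
                              const std::vector<KeyRec>& b)
{
    auto byKey = [](const KeyRec& x, const KeyRec& y) { return x.key < y.key; };
    assert(std::is_sorted(a.begin(), a.end(), byKey));
    assert(std::is_sorted(b.begin(), b.end(), byKey));

    std::vector<Match> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].key < b[j].key) {
            ++i;                       // key only in A, skip it
        } else if (b[j].key < a[i].key) {
            ++j;                       // key only in B, skip it
        } else {
            out.push_back({a[i].key, a[i].rec, b[j].rec});
            ++i;
            ++j;
        }
    }
    return out;
}
```

The output is ordered by key, which is why the record numbers it contains are effectively in random order with respect to the files.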

Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array and flushing the densest clusters of entries to disk whenever I run out of memory. This greatly increases the chance that the relevant records will already be cached for a cluster after its first pointer is updated. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the relevant pointers, and write the chunks back out? The former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), but it seems that it would incur extremely high overhead.

( , Boost), Windows Linux - , API (, CreateFile / -). , , , , - , .

+5
4

, A- > B B- > A , . , .

, , , . mmap() * NIX CreateFileMapping() Windows.

, . 32MB. - , mmap() , , msync(), , munmap(), .
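The calls this answer names, mmap(), msync(), and munmap(), could be combined into a sliding-window update roughly as follows. This is a POSIX-only sketch under several assumptions: the patches are already sorted by byte offset, no patch straddles a window boundary, and the window size (32MB is the figure that appears in the answer) is a multiple of the page size. `Patch` and `patchViaMmap` are hypothetical names; the Windows CreateFileMapping() route is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical: write the 64-bit `value` at byte `offset` in the file.
struct Patch { std::uint64_t offset; std::uint64_t value; };

void patchViaMmap(const char* path, const std::vector<Patch>& patches,
                  std::size_t windowBytes)
{
    // mmap offsets must be page-aligned, so windows must be page multiples.
    assert(windowBytes > 0 &&
           windowBytes % static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) == 0);
    assert(std::is_sorted(patches.begin(), patches.end(),
        [](const Patch& x, const Patch& y) { return x.offset < y.offset; }));

    int fd = open(path, O_RDWR);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    const std::uint64_t fileSize = static_cast<std::uint64_t>(st.st_size);
    const std::uint64_t width = sizeof(std::uint64_t);

    std::size_t i = 0;
    while (i < patches.size()) {
        const std::uint64_t off = patches[i].offset;
        if (off + width > fileSize) {
            close(fd);
            throw std::out_of_range("patch past end of file");
        }
        // Window containing the next pending patch; the last one may be short.
        const std::uint64_t winStart = off / windowBytes * windowBytes;
        const std::uint64_t len =
            std::min<std::uint64_t>(windowBytes, fileSize - winStart);
        if (off + width > winStart + len) {
            close(fd);
            throw std::out_of_range("patch straddles a window boundary");
        }
        void* map = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, static_cast<off_t>(winStart));
        if (map == MAP_FAILED) { close(fd); throw std::runtime_error("mmap failed"); }

        // Apply every patch that lies entirely inside this window.
        char* base = static_cast<char*>(map);
        while (i < patches.size() && patches[i].offset + width <= winStart + len) {
            std::memcpy(base + (patches[i].offset - winStart),
                        &patches[i].value, width);
            ++i;
        }
        msync(map, len, MS_SYNC);   // flush the dirty pages of this window
        munmap(map, len);
    }
    close(fd);
}
```

Windows that contain no patches are never mapped, so the amount of I/O depends on how well the patches cluster.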

, . , ( ), IO.

, , , IO . , , : (1) IO (IOPS), (2) . ( IOPS . 3-5 .) , , / 50 /: 50 . - 50 , . , , .

- - : , - 128 . , .

. , . , crapload IO, . / ​​ RAID10 800 IOPS 400 IOPS. ( .)

, . Boost.Asio, - .

P.S. , ( ) . , - . //etc, IO ( ), ;)

+3

, , . :

  • .

B + Trees , . .

, B + , , . , , node, B + . , , .

EDIT: , - :


+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
      ||      ^     |     ^    |          ^
      |\------/     |     |    |          |
      \-------------------/    |          |
                    |          |          |
                    \----------+----------/

.. B + , B +.

+4

, (key, A, B), , ( A, B). A, A, B, B, B-, A.

, , :

2,4 HP Pavilion 3- 32- Vista, 3 1,008- 56 , Delphi ( Win API).

8 Win API FileSeek/FileWrite 136 . 3 . 108 , O/S .

, - .

+1

, , , . , , . , :

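Build a file of (A, B) record-number pairs.

Sort the pair file by index A. If it does not fit in memory, use external sorting (a library such as STXXL, stxxl.sourceforge.net, may help here).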

Walk through record file A and the sorted pair list together. Read in a huge chunk, make all the necessary changes in memory, and write the chunk back out. You never touch that part of record file A again, since the changes you planned to make arrive in sequential order.

Go back, sort the pair file by index B (again, using external sorting). Use this to update the record file B in the same way.
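One pass of the procedure above (sort the pairs by target record, then patch the file chunk by chunk in ascending order) might look like the sketch below. It makes simplifying assumptions: the pair list fits in memory, so a plain std::sort stands in for the external sort, and the record size, pointer offset, and chunk size are parameters chosen by the caller. `Link` and `applyLinks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical: store `value` (the partner's record number) in record `target`.
struct Link { std::uint64_t target; std::uint64_t value; };

void applyLinks(const std::string& path, std::vector<Link> links,
                std::size_t recordSize, std::size_t ptrOffset,
                std::size_t recordsPerChunk)
{
    assert(recordsPerChunk > 0);
    assert(ptrOffset + sizeof(std::uint64_t) <= recordSize);

    // In-memory stand-in for the external sort by target index.
    std::sort(links.begin(), links.end(),
              [](const Link& x, const Link& y) { return x.target < y.target; });

    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(recordSize * recordsPerChunk);
    std::size_t i = 0;
    while (i < links.size()) {
        // Chunk holding the next pending link; chunks with no links are skipped.
        const std::uint64_t firstRec =
            links[i].target / recordsPerChunk * recordsPerChunk;
        const std::streamoff pos =
            static_cast<std::streamoff>(firstRec * recordSize);

        f.seekg(pos, std::ios::beg);
        f.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = f.gcount();
        f.clear();  // a short read at end of file sets failbit/eofbit

        // Patch every link that falls inside this chunk, in memory.
        while (i < links.size() && links[i].target < firstRec + recordsPerChunk) {
            const std::size_t off = static_cast<std::size_t>(
                (links[i].target - firstRec) * recordSize + ptrOffset);
            if (off + sizeof(std::uint64_t) > static_cast<std::size_t>(got))
                throw std::out_of_range("record number past end of file");
            std::memcpy(buf.data() + off, &links[i].value, sizeof(std::uint64_t));
            ++i;
        }

        // Write the chunk back; this part of the file is never touched again.
        f.seekp(pos, std::ios::beg);
        f.write(buf.data(), got);
        if (!f) throw std::runtime_error("write failed");
    }
}
```

Running it once with (A, B) pairs on file A and once with (B, A) pairs on file B covers both directions of the join.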

+1