I have very large (> 4 GB) files containing millions of fixed-length binary records. I want to (efficiently) link them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.
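To make the operation concrete, here is a minimal sketch of what writing one such pointer amounts to (C++ with POSIX pwrite; the record size and field offset are placeholders I made up, not values from my real format):

```cpp
// Each record in the target file is RECORD_SIZE bytes, and the 64-bit
// pointer field lives PTR_OFFSET bytes into the record. Writing
// "this record points at record N" is then one 8-byte pwrite at a
// computed file offset.
#include <cstdint>
#include <unistd.h>   // pwrite

constexpr uint64_t RECORD_SIZE = 256;  // hypothetical record length
constexpr uint64_t PTR_OFFSET  = 16;   // hypothetical field offset

bool write_pointer(int fd, uint64_t target_record, uint64_t pointer_value) {
    const off_t pos = static_cast<off_t>(target_record * RECORD_SIZE + PTR_OFFSET);
    return pwrite(fd, &pointer_value, sizeof(pointer_value), pos)
           == static_cast<ssize_t>(sizeof(pointer_value));
}
```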
To elaborate: for each join I want to perform on a pair of files, say A and B, I have a pair of lists of (key, record number) tuples, each sorted by key. Iterating through the pair of lists yields matched (key, record number in A, record number in B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually have to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is: what is the fastest way to actually do this?
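For clarity, this is roughly the merge step I have in mind; the struct and function names are just placeholders of mine:

```cpp
// Merge two lists sorted by key into (key, record in A, record in B)
// matches, assuming the mapping is 1:1.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry { uint64_t key; uint64_t record; };
struct Match { uint64_t key; uint64_t record_a; uint64_t record_b; };

std::vector<Match> merge_join(const std::vector<Entry>& a,
                              const std::vector<Entry>& b) {
    std::vector<Match> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].key < b[j].key)      ++i;
        else if (b[j].key < a[i].key) ++j;
        else {                        // keys match: join the two records
            out.push_back({a[i].key, a[i].record, b[j].record});
            ++i; ++j;
        }
    }
    return out;
}
```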
Since the list of joined records is sorted by key, the corresponding record numbers are essentially random. Assuming the file is much larger than the OS cache, doing a bunch of random seeks and writes seems extremely inefficient. I have tried partially sorting the record numbers by putting the A->B and B->A mappings into a sparse array and flushing the densest clusters of entries to disk whenever I run out of memory. This significantly increases the chance that the relevant records will already be cached for a cluster after its first pointer has been updated. Even at that point, however, is it generally better to issue a bunch of seeks and blind writes, or to read chunks of the file manually, update the relevant pointers, and write the chunks back? While the former method is much simpler and the OS can optimize it to perform the minimum number of sector reads (since it knows the sector size) and copies (it can avoid copying by reading directly into properly aligned buffers), it still seems to result in excessively high overhead.
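To illustrate the second option, here is a rough sketch (again POSIX pread/pwrite, with an arbitrary chunk size and an Update struct of my own) of reading a chunk, patching every pending pointer that falls inside it, and writing it back:

```cpp
// Read-modify-write alternative: group pending pointer updates by file
// position, then process the file one chunk at a time.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>
#include <unistd.h>   // pread, pwrite

constexpr uint64_t CHUNK_SIZE = 1 << 20;  // 1 MiB, arbitrary

struct Update { uint64_t file_offset; uint64_t pointer_value; };

void apply_updates(int fd, std::vector<Update> updates) {
    // Sort updates so all writes into one chunk are handled together.
    std::sort(updates.begin(), updates.end(),
              [](const Update& l, const Update& r) { return l.file_offset < r.file_offset; });
    std::vector<char> buf(CHUNK_SIZE);
    std::size_t i = 0;
    while (i < updates.size()) {
        const uint64_t chunk_start = (updates[i].file_offset / CHUNK_SIZE) * CHUNK_SIZE;
        const ssize_t got = pread(fd, buf.data(), CHUNK_SIZE, static_cast<off_t>(chunk_start));
        if (got <= 0) break;  // error handling elided
        // Patch every update whose offset lands inside this chunk.
        while (i < updates.size() &&
               updates[i].file_offset < chunk_start + static_cast<uint64_t>(got)) {
            const uint64_t rel = updates[i].file_offset - chunk_start;
            if (rel + sizeof(uint64_t) <= static_cast<uint64_t>(got)) {
                std::memcpy(buf.data() + rel, &updates[i].pointer_value, sizeof(uint64_t));
            } else {
                // Pointer straddles the chunk boundary: fall back to a direct write.
                pwrite(fd, &updates[i].pointer_value, sizeof(uint64_t),
                       static_cast<off_t>(updates[i].file_offset));
            }
            ++i;
        }
        pwrite(fd, buf.data(), static_cast<std::size_t>(got), static_cast<off_t>(chunk_start));
    }
}
```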
I would prefer a portable solution (e.g. using Boost), but I am also open to platform-specific APIs on Windows or Linux (e.g. CreateFile and the like).