Premise
From what I understand, using the Fisher-Yates algorithm and the data that you have about the positions of the records, you should be able to get (and calculate) the list:
struct Entry { long long sourceStartIndex; long long sourceEndIndex; long long destinationStartIndex; long long destinationEndIndex; }
Problem
From now on, the naive solution is to search for each record in the source file, read it, and then search for the new record position in the target file and write it.
The problem with this approach is that it uses too many queries.
Decision
The best way to do this is to reduce the number of requests using two huge buffers for each of the files.
I recommend a small buffer for the source file (say, 64 MB) and a large buffer for the target file (as large as the user can afford - say, 2 GB).
Initially, the destination buffer will be mapped to the first 2 GB of the destination file. At this point, read the entire source file in pieces of 64 MB in the source buffer. When you read it, copy the corresponding entries to the destination buffer. When you get to the end of the file, the output buffer should contain all the necessary data. Write it to the destination file.
Then map the output buffer to the next 2 GB of the destination file and repeat the procedure. Continue until you write the entire output file.
Attention
Since the records are arbitrary sizes, it is very likely that you will have suffixes and prefixes for the records at the beginning and end of the buffers, so you need to make sure that you copied the data correctly!
Estimated Cost of Time
The execution time depends mainly on the size of the source file, the RAM available for the application, and the speed of reading on the hard disk. Assuming that the file is 40 GB in size, 2 GB of RAM and the read speed on the hard disk is 200 MB / s, the program will need to read 800 GB of data (40 GB * (40 GB / 2 GB)). Assuming the hard drive is not very fragmented, the time taken to search will be negligible. This means that reading will take one hour! But if, fortunately, the user has 8 GB of RAM available for your application, the time can only be reduced to 15-20 minutes.
I hope this will be enough for you, because I do not see another faster way.