Create a file containing the integers common to two large files of integers

In particular, given two large files of 64-bit integers, produce a file containing the integers present in both files, and estimate the time complexity of your algorithm.

How would you solve this?

+7
4 answers

You can radix sort both files and then iterate over the sorted results, keeping the matches. Radix sort is O(DN), where D is the number of digits in the numbers. The largest 64-bit number has about 19 decimal digits, so a base-10 radix sort of 64-bit integers runs in roughly 19N steps, i.e. O(N), and the final scan is also O(N). So this works in O(N) time, where N is the number of integers in the two files.
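As a rough illustration, here is a minimal in-memory sketch of the sorting step in C++, assuming both files have already been read into vectors of uint64_t (the answers below relax that assumption). It uses base 256 rather than base 10, so it makes 8 passes instead of about 19:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort of 64-bit values: one byte (base 256) per pass, 8 passes total.
    void radix_sort_u64(std::vector<uint64_t>& v) {
        std::vector<uint64_t> buf(v.size());
        for (int pass = 0; pass < 8; ++pass) {
            std::size_t count[257] = {0};
            const int shift = pass * 8;
            for (uint64_t x : v) ++count[((x >> shift) & 0xFF) + 1];    // histogram
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i];     // prefix sums
            for (uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x; // stable scatter
            v.swap(buf);
        }
    }

After sorting both vectors this way, a single linear pass over the two sorted sequences yields the common values (see the iterator-based intersection described in the last answer).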

+3

I changed my mind; I really like @Ryan's radix sort idea, except that I am adapting it slightly for this particular problem.

Suppose there are so many numbers that they do not fit in memory, but we have all the disk space we want. (Not unreasonable, given how the question is phrased.)

Call input files A and B.

So, create 512 new files; call them A_0 through A_255 and B_0 through B_255. File A_0 gets all the numbers from file A whose high byte is 0. File A_1 gets all the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.

Now all possible matches are in the pairs (A_0, B_0), (A_1, B_1), etc., and these pairs can be analyzed independently (and, if necessary, recursively). All disk accesses are reasonably linear, which should be quite efficient. (If not, adjust the number of bits you use for the buckets...)
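A minimal sketch of this first split, under the assumption that the files store raw binary uint64_t values; the function and file names are just illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Split one binary file of uint64_t values into 256 bucket files by high byte.
    // "prefix" would be "A" or "B", producing names like "A_0" ... "A_255".
    void split_by_high_byte(const std::string& input, const std::string& prefix) {
        FILE* in = std::fopen(input.c_str(), "rb");
        FILE* out[256];
        for (int i = 0; i < 256; ++i)
            out[i] = std::fopen((prefix + "_" + std::to_string(i)).c_str(), "wb");
        uint64_t x;
        while (std::fread(&x, sizeof x, 1, in) == 1) {
            int bucket = int(x >> 56);                 // high byte selects the bucket
            std::fwrite(&x, sizeof x, 1, out[bucket]);
        }
        std::fclose(in);
        for (int i = 0; i < 256; ++i) std::fclose(out[i]);
    }

Error handling is omitted for brevity; the reads and writes are sequential, which is what keeps the disk access pattern close to linear.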

It is still O(n log n), but it never requires keeping everything in memory. (Here the constant factor hidden in the radix sort is log(2^64) or so, so it is not really linear unless you have a lot more than 2^64 numbers, which is implausible even for the largest disks.)

[edit, clarify]

The whole point of this approach is that you never actually need to sort the two lists. That is, with this algorithm you never actually enumerate the elements of either list in order.

Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you just notice that no number in A_0 can be equal to any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those two files, add them to the output, and then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, etc.

To find the numbers common to A_0 and B_0, you simply repeat the process... Create a file A_0_0 containing all the elements of A_0 whose second byte is zero. Create a file A_0_1 containing all the elements of A_0 whose second byte is 1. And so on. Once all the elements of A_0 and B_0 have been distributed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves, because you no longer need them.

Then you recurse on A_0_0 and B_0_0 to find the common elements, deleting files as soon as they are no longer needed... And so on.

When you finally get down to buckets that hold only one distinct value (possibly repeated), you can immediately decide whether to add that value to the output file.

At no point does this algorithm consume more than 2 + epsilon times the space required to store the two original files, where epsilon is less than half a percent. (The proof is left as an exercise for the reader.)

I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the buckets get small enough.)
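For a bucket pair that has become small enough to fit in memory, that std::set fallback could look like the following sketch (again assuming raw binary uint64_t bucket files; the function name is illustrative and error handling is omitted):

    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <string>

    // Once a bucket pair (say A_0_17 and B_0_17) fits in memory, load one side
    // into a std::set and stream the other side against it.
    void intersect_small_buckets(const std::string& a, const std::string& b, FILE* output) {
        std::set<uint64_t> seen;
        uint64_t x;
        FILE* fa = std::fopen(a.c_str(), "rb");
        while (std::fread(&x, sizeof x, 1, fa) == 1) seen.insert(x);
        std::fclose(fa);
        FILE* fb = std::fopen(b.c_str(), "rb");
        while (std::fread(&x, sizeof x, 1, fb) == 1) {
            if (seen.erase(x))                        // erase returns 1 if x was present
                std::fwrite(&x, sizeof x, 1, output); // emit each common value once
        }
        std::fclose(fb);
    }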

+4

Read the integers from both files into two sets (this takes O(N log N)), then iterate over the two sets and write the common elements to the output file (this takes O(N)). The overall complexity is O(N log N).

Note: the iteration part will be faster if we store the integers in a vector and then sort it, but that would use much more memory if the files contain many duplicate integers.

UPD: You can also keep in memory only the distinct integers from one of the files:

Read the values from the smaller file into a set. Then read the values from the second file one at a time. For each number x, check whether it is present in the set, which takes O(log N). If it is, print it and remove it from the set so that it is not printed twice. The complexity remains O(N log N), but you only need enough memory to store the distinct integers from the smaller file.
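A minimal sketch of this variant, assuming the integers are stored as text, one value per line, and that the file names are placeholders:

    #include <cstdint>
    #include <fstream>
    #include <set>

    int main() {
        // Load the distinct values of the smaller file into a set: O(N log N).
        std::ifstream small("small.txt"), large("large.txt");  // illustrative names
        std::set<uint64_t> values;
        uint64_t x;
        while (small >> x) values.insert(x);

        // Stream the larger file; each lookup is O(log N).
        std::ofstream out("common.txt");
        while (large >> x) {
            if (values.erase(x))   // erase returns 1 if x was present; prevents double printing
                out << x << '\n';
        }
    }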

+3

Assuming the files are too large to fit in memory, run an external least-significant-digit (LSD) radix sort on each file, and then iterate through both sorted files to find the intersection:


LSD radix sort with base N (N = 10 or N = 100 if the numbers are in text format, N = 16/32/64 if in binary format):

Create N temporary files (0 to N-1). Iterate through the input file. For each integer, find its rightmost digit in base N and append that integer to the temporary file corresponding to that digit.

Then create a new set of N temporary files, iterate over the previous set of temporary files, find the second-rightmost digit in base N (padding with 0s if necessary), and append each integer to the new temporary file corresponding to that digit (then delete the previous set of temporary files).

Repeat until all digit positions have been covered. The last set of temporary files contains the integers in sorted order. (Concatenate them into one file if you want; otherwise treat the temporary files as a single list.)
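As an illustration, one distribution pass of this external sort might look like the sketch below (base 256 over raw binary uint64_t values, so 8 passes in total). The very first pass would read from the original input file rather than a previous round's buckets, and error handling is omitted; the names are illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // One LSD pass: read every value from the previous round's temporary files in
    // order (which preserves stability) and append it to the new temporary file
    // selected by the byte at position `pass` (0 = least significant).
    void lsd_pass(const std::string& old_prefix, const std::string& new_prefix, int pass) {
        FILE* out[256];
        for (int d = 0; d < 256; ++d)
            out[d] = std::fopen((new_prefix + "_" + std::to_string(d)).c_str(), "wb");
        for (int d = 0; d < 256; ++d) {
            FILE* in = std::fopen((old_prefix + "_" + std::to_string(d)).c_str(), "rb");
            uint64_t x;
            while (std::fread(&x, sizeof x, 1, in) == 1) {
                int digit = int((x >> (8 * pass)) & 0xFF);
                std::fwrite(&x, sizeof x, 1, out[digit]);
            }
            std::fclose(in);
            // the previous round's file can be deleted here
        }
        for (int d = 0; d < 256; ++d) std::fclose(out[d]);
    }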


Intersection Search:

Iterate through the sorted integers in the two files with a pair of iterators, each pointing to the current integer in its file. If the two numbers match, add the value to the output and advance both iterators. Otherwise, discard the smaller number and advance the iterator for its file. Stop when either iterator reaches the end.

(This outputs duplicates where there is duplicate input. If you want to remove duplicates, then the “advance iterator” step should advance the iterator until the next larger number appears or the file ends.)
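A sketch of this merge-style intersection over two already-sorted files of binary uint64_t values (as noted above, it outputs duplicates where the input contains them):

    #include <cstdint>
    #include <cstdio>

    // Merge-style intersection of two sorted binary files; writes matches to `out`.
    void intersect_sorted(FILE* fa, FILE* fb, FILE* out) {
        uint64_t a, b;
        bool has_a = std::fread(&a, sizeof a, 1, fa) == 1;
        bool has_b = std::fread(&b, sizeof b, 1, fb) == 1;
        while (has_a && has_b) {
            if (a < b) {
                has_a = std::fread(&a, sizeof a, 1, fa) == 1;  // discard smaller value
            } else if (b < a) {
                has_b = std::fread(&b, sizeof b, 1, fb) == 1;
            } else {
                std::fwrite(&a, sizeof a, 1, out);             // match: emit, advance both
                has_a = std::fread(&a, sizeof a, 1, fa) == 1;
                has_b = std::fread(&b, sizeof b, 1, fb) == 1;
            }
        }
    }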

+3
