I changed my mind; I really like @Ryan's radix sort idea, except that I am adapting it slightly for this particular problem.
Suppose there are so many numbers that they do not fit in memory, but we have all the disk space we want. (Not unreasonable, given how the question was phrased.)
Call the input files A and B.
So, create 512 new files; call them A_0 through A_255 and B_0 through B_255. File A_0 gets all the numbers from file A whose high byte is 0. File A_1 gets all the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.
Now all possible duplicates are confined to (A_0, B_0), (A_1, B_1), etc., and these pairs can be analyzed independently (and, if necessary, recursively). And all the disk accesses are fairly linear, which should be quite efficient. (If not, adjust the number of bits you use for the buckets...)
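A minimal sketch of this first pass, assuming each input is a flat binary file of 64-bit unsigned integers; the helper names and file layout here are illustrative, not anything prescribed by the question:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Byte used for bucketing at a given depth (depth 0 = the high byte).
static unsigned bucket_of(std::uint64_t x, int depth) {
    return static_cast<unsigned>((x >> (8 * (7 - depth))) & 0xFF);
}

// Split `input` into 256 files named <prefix>_0 .. <prefix>_255, keyed on the
// byte at `depth`. Reads and writes are sequential, so disk access stays
// close to linear.
void partition_file(const std::string& input, const std::string& prefix, int depth) {
    std::ifstream in(input, std::ios::binary);
    std::vector<std::ofstream> out;
    out.reserve(256);
    for (int b = 0; b < 256; ++b)
        out.emplace_back(prefix + "_" + std::to_string(b), std::ios::binary);

    std::uint64_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof x))
        out[bucket_of(x, depth)].write(reinterpret_cast<const char*>(&x), sizeof x);
}

int main() {
    partition_file("A", "A", 0);  // creates A_0 .. A_255
    partition_file("B", "B", 0);  // creates B_0 .. B_255
}
```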
It is still O(n log n), but it never requires holding everything in memory at once. (Here the constant factor in the radix sort is log(2^64) or so, so it is not really linear unless you have a lot more than 2^64 numbers, which is unlikely even for the largest disks.)
[edit, to clarify]
The whole point of this approach is that you never have to sort the two lists. In fact, at no point while running this algorithm can you even enumerate the elements of either list in sorted order.
Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you simply observe that no number in A_0 can equal any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those files, add them to the output, and then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, and so on.
To find the numbers common to A_0 and B_0, you simply recurse... Create a file A_0_0 containing all the elements of A_0 whose second byte is 0. Create a file A_0_1 containing all the elements of A_0 whose second byte is 1. And so on. Once all the elements of A_0 and B_0 have been placed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves, because you no longer need them.
Then you recurse into A_0_0 and B_0_0 to find the common elements, deleting those files as soon as you are done with them... And so on.
When you finally get down to buckets that contain only one distinct element (possibly repeated), you can immediately decide whether to add that element to the output file.
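Putting the recursion together, here is a rough sketch under the same assumptions as above (flat binary files of 64-bit integers), reusing the hypothetical partition_file() helper from the earlier sketch; the depth argument counts how many leading bytes a bucket pair already shares:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// From the earlier sketch: split a bucket into 256 sub-buckets on byte `depth`.
void partition_file(const std::string& input, const std::string& prefix, int depth);

// Returns true if the bucket file is non-empty; optionally reports its first value.
static bool nonempty(const std::string& path, std::uint64_t* first = nullptr) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t x;
    if (!in.read(reinterpret_cast<char*>(&x), sizeof x)) return false;
    if (first) *first = x;
    return true;
}

// Emit the numbers common to bucket files `a` and `b`, whose contents already
// agree on their first `depth` bytes. Called as find_common("A_0", "B_0", 1, out)
// and so on after the first pass.
void find_common(const std::string& a, const std::string& b,
                 int depth, std::ofstream& out) {
    std::uint64_t value;
    if (!nonempty(a, &value) || !nonempty(b)) return;   // nothing can be common

    if (depth == 8) {
        // All eight bytes have been used for bucketing, so every number in both
        // buckets is this same value; emit it exactly once.
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
        return;
    }

    // Split each bucket on the next byte; the parent buckets are then obsolete.
    partition_file(a, a, depth);
    partition_file(b, b, depth);
    std::remove(a.c_str());
    std::remove(b.c_str());

    for (int k = 0; k < 256; ++k) {
        const std::string ak = a + "_" + std::to_string(k);
        const std::string bk = b + "_" + std::to_string(k);
        find_common(ak, bk, depth + 1, out);
        std::remove(ak.c_str());    // done with this pair of sub-buckets
        std::remove(bk.c_str());
    }
}
```

Deleting each bucket as soon as it has been split or consumed is what keeps the peak disk usage down, which is the point of the space bound in the next paragraph.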
At no point does this algorithm consume more than 2 + epsilon times the space originally required to store the two files, where epsilon is less than half a percent. (The proof is left as an exercise for the reader.)
I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the buckets get small enough.)
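For example, the small-bucket fallback could look something like this (same assumed binary layout as the sketches above; the size threshold at which to stop recursing and call this instead is up to you):

```cpp
#include <cstdint>
#include <fstream>
#include <set>
#include <string>

// Read an entire (small) bucket file of 64-bit integers into a std::set.
static std::set<std::uint64_t> read_set(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::set<std::uint64_t> s;
    std::uint64_t x;
    while (in.read(reinterpret_cast<char*>(&x), sizeof x)) s.insert(x);
    return s;
}

// Emit each value present in both bucket files, once per value.
void intersect_small(const std::string& a, const std::string& b, std::ofstream& out) {
    const std::set<std::uint64_t> sa = read_set(a);
    const std::set<std::uint64_t> sb = read_set(b);
    for (std::uint64_t x : sa)
        if (sb.count(x))
            out.write(reinterpret_cast<const char*>(&x), sizeof x);
}
```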