Running time of your program: O(l1 x bs1 x l2 x bs2) (where l1 is the number of lines in the first file, bs1 is the block size for the first buffer, l2 is the number of lines in the second file, and bs2 is the block size for the second buffer), since you have four nested loops. Since your block sizes are constant, your order is O(n x 400 x m x 400), or O(160,000mn), or in the worst case O(160,000n^2), which essentially comes down to O(n^2).
You can get an O(n) algorithm by doing something like this (pseudocode follows):
map = new Map()
duplicate = new List()
unique = new List()

for each line in file1
    map.put(line, true)
end for

for each line in file2
    if (map.get(line))
        duplicate.add(line)
    else
        unique.add(line)
    fi
end for
Now duplicate will contain a list of duplicate elements, and unique will contain a list of unique elements.
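For example, a minimal Java sketch of that pseudocode might look like this (assuming the lines of both files fit in memory; the file names are just placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LineCompare {
    public static void main(String[] args) throws IOException {
        // Load every line of the first file into a hash set: O(l1) inserts.
        Set<String> seen = new HashSet<>(Files.readAllLines(Paths.get("file1.txt")));

        List<String> duplicate = new ArrayList<>();
        List<String> unique = new ArrayList<>();

        // One pass over the second file: each contains() check is O(1) on average.
        for (String line : Files.readAllLines(Paths.get("file2.txt"))) {
            if (seen.contains(line)) {
                duplicate.add(line);
            } else {
                unique.add(line);
            }
        }

        System.out.println("duplicates: " + duplicate.size());
        System.out.println("unique:     " + unique.size());
    }
}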
In your original algorithm, you unnecessarily go through the second file once for each line in the first file, so you lose the advantage of hashing (which gives you O(1) lookup time). The trade-off, of course, is that you need to hold all 10 GB in memory, which is probably not feasible. Typically, in such cases, there is a trade-off between runtime and memory.
There is probably a better way to do this. I need to think about this a little more. If not, I'm sure someone will come up with a better idea :).
UPDATE
You can probably reduce memory usage if you can find a good way to hash each line (read from the first file) into a unique value (i.e., a 1-to-1 mapping between a line and its hash value). Essentially, you would do something like this:
for each line in file1
    map.put(hash(line), true)
end for

for each line in file2
    if (map.get(hash(line)))
        duplicate.add(line)
    else
        unique.add(line)
    fi
end for
Here, hash(line) is whatever hash function you choose. This way you do not need to store all the lines in memory, only their hash values, which may help a bit. Even so, in the worst case (comparing two files that are either identical or completely different), the duplicate or unique list itself can still grow to 10 GB in memory. You can work around that, at the cost of some information, by keeping only the count of unique or duplicate lines instead of the lines themselves.
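A rough Java sketch of that variant, streaming both files and keeping only hash values and counters (the file names and the hash() helper are placeholders; String.hashCode() is just a stand-in for whatever hash you pick):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class HashedLineCompare {
    public static void main(String[] args) throws IOException {
        Set<Long> hashes = new HashSet<>();

        // First pass: store only a hash per line, not the line itself.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("file1.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                hashes.add(hash(line));
            }
        }

        long duplicates = 0;
        long uniques = 0;

        // Second pass: keep only the counts instead of collecting the lines,
        // so memory stays bounded even if almost everything matches (or nothing does).
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("file2.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (hashes.contains(hash(line))) {
                    duplicates++;
                } else {
                    uniques++;
                }
            }
        }

        System.out.println("duplicate lines: " + duplicates);
        System.out.println("unique lines:    " + uniques);
    }

    // Placeholder hash; swap in a stronger one if you need fewer collisions.
    private static long hash(String line) {
        return line.hashCode(); // int result widened to long
    }
}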