Find two lines in the file that are the same

I was asked this question in an interview with Amazon.

You have a file with many lines, and two of the lines are the same. Find those two lines. I gave the obvious O(N^2) answer. Then I came up with an answer that used a hash table, but they did not like that one either, saying it would not work if the file is gigabytes in size. My other answer, instead of keeping the hash table in memory, created a file named after each hash value and stored the lines with that hash value in it. Either they could not understand my solution, or they did not like it.

Any thoughts?

thanks

+6
3 answers

I can imagine two main classes of solutions to this problem:

  • Probabilistic in-memory solutions. You can try to solve this problem by storing a summary of the file's lines in main memory. You then do the work in main memory to find possible duplicates, and check each possible duplicate by going back to the disk. These solutions are probably the best, since they have low memory usage, high efficiency, and minimal disk access. Solutions in this category include

    • Compute a hash of each line of the file and store the hashes. Any lines whose hashes collide form a candidate pair of duplicates, and only those lines need to be examined (a minimal sketch of this approach appears after this list).
    • Use a Bloom filter to store all the lines of the file, then check only the pairs of lines that collide in the Bloom filter. This is essentially a more space-efficient variant of option (1).
  • Deterministic on-disk solutions. You can try to do the computation over the entire data set on disk, using main memory as temporary scratch space. This gives exact answers without holding the whole file in memory, but it will probably be slower unless you do later processing that can reuse the restructured data. Solutions in this category include

    • Sort the file using an external sorting algorithm (external merge sort, external radix sort, etc.), then scan it linearly for a pair of adjacent duplicate lines (see the short scan sketch after this list).
    • Build an on-disk data structure, such as a B-tree, containing all the lines, then query it. This requires a lot of preprocessing time, but makes later operations on the file much faster.
    • Put everything into a database and query the database.
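
A minimal sketch of the first in-memory approach above (hash every line, then recheck only the lines whose hashes collide). This is my illustration, not code from the answer; the SHA-1 choice and the function name are assumptions:

```python
import hashlib
from collections import defaultdict

def find_duplicate_pair(path):
    """Hash every line, then re-read and compare only lines whose hashes collide."""
    # Pass 1: map each line's hash to the byte offsets where that line starts.
    offsets_by_hash = defaultdict(list)
    offset = 0
    with open(path, "rb") as f:
        for line in f:  # iterating in binary mode keeps byte offsets exact
            offsets_by_hash[hashlib.sha1(line).digest()].append(offset)
            offset += len(line)

    # Pass 2: for each colliding hash, read the actual lines back and compare them.
    with open(path, "rb") as f:
        for offsets in offsets_by_hash.values():
            if len(offsets) < 2:
                continue
            lines = []
            for off in offsets:
                f.seek(off)
                lines.append(f.readline())
            for i in range(len(lines)):
                for j in range(i + 1, len(lines)):
                    if lines[i] == lines[j]:
                        return offsets[i], offsets[j]  # byte offsets of the duplicate pair
    return None
```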

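And a tiny sketch of the scan step for the first on-disk option, assuming the file has already been sorted (for example by an external merge sort or the Unix sort utility), so any duplicate lines end up adjacent:

```python
def find_adjacent_duplicate(sorted_path):
    """After sorting, duplicate lines are adjacent; one linear scan finds them."""
    prev = None
    with open(sorted_path, "rb") as f:
        for line in f:
            if line == prev:
                return line
            prev = line
    return None
```
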
Hope this helps!

+4

You can use a Bloom filter:

http://en.wikipedia.org/wiki/Bloom_filter

It lets you detect (with some false positives) lines that may be repeated; store those candidate lines in memory, then scan the file again to verify them.

Two passes through the file, very little memory usage, beautiful
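
A rough two-pass sketch of this idea. The filter size, number of hashes, and class name here are illustrative assumptions; for a real gigabyte-scale file you would size the bit array using the usual Bloom filter formulas:

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: k salted SHA-1 hashes over an m-bit array."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            h = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def find_duplicates_with_bloom(path):
    # Pass 1: remember lines that the filter claims it has already seen.
    bloom = BloomFilter()
    candidates = set()
    with open(path, "rb") as f:
        for line in f:
            if bloom.might_contain(line):
                candidates.add(line)  # possible duplicate (may be a false positive)
            bloom.add(line)

    # Pass 2: count real occurrences of the candidate lines only.
    counts = {line: 0 for line in candidates}
    with open(path, "rb") as f:
        for line in f:
            if line in counts:
                counts[line] += 1
    return [line for line, n in counts.items() if n > 1]
```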

+2

Walk through the lines and compute each line's length. You will get something like:

0: 4, 1: 6, 2: 10, 3: 4, ...

Then compare only those lines that have the same length. Working with such an index can be optimized further (for example, by storing it in a tree instead of a flat array).
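
A small sketch of building such a length index; the function name and the idea of recording byte offsets are my assumptions for illustration:

```python
from collections import defaultdict

def build_length_index(path):
    """Map each line length to the byte offsets of the lines having that length."""
    index = defaultdict(list)
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            index[len(line)].append(offset)
            offset += len(line)
    return index

# Only lengths that occur more than once can contain the duplicate pair,
# so only those lines ever need to be read back and compared in full.
```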

By the way, your second idea (a file per hash value) was probably rejected for performance reasons. Frequent random I/O against a hard drive is usually a bad idea: try to keep as much as you can in memory.

0
