Algorithm for matching one input file with given file numbers

I had an interview last week. I am stuck in one of the questions in the algorithm. I answered this question, but the interviewer did not seem convinced. That is why I share the same thing.

Please tell me some optimized method for this question so that it helps me in future interviews.

Question : -

There are 20 text files, all files are ASCII text files that are less than 10 ^ 9 bytes in size. There is also one input, it is also one ASCII file, say, input.txt.

Our task is to strategically coordinate the contents of this input file with 20 files given and print the name of the nearest matching file. the contents of the input file can only match part

Thanks in advance. Look for your kind answer.

+7
source share
3 answers

separate them and go through wc -l or implement Levenshtein distance in C ++, treating each line as one character (or a more suitable unit for determining the domain)

+3
source

You can create some indexing (example: trie) to sum the input file. Then you can check how many indexes match between documents.

Eg. Create a trie for an input file of length 10. For each line of length 10 (overlap) in text files, check how many of them match in trie.

+1
source

As a proposal for developing truly capable, scalable systems for document similarity, I would suggest reading Chapter 3 , Massive Arrays of Arrays , which is freely available online. One approach presented here is to β€œshingle” the data by vectorizing word counts in sets, then hashing those words and comparing the hash result families with similarities to Jaccard to get an estimate between all documents. This can work with petabytes of files with high precision, if done correctly. Explicit data with good charts can be read from the Stanford CS246 Location Sensitivity Slides . Simpler approaches, such as counting word frequency, are also described in the book.

0
source

All Articles