How to improve your keyword search performance?

Question


This is an interview question.

You need to write a program that finds all files containing all of the given keywords. How would you preprocess the files to improve search performance?

My answer:

I would use Lucene (or any other text search engine). If I had to implement it by hand, I would build an index that maps each word to the IDs of the documents containing it. We should probably implement this index with B-trees. An alternative is to use an RDBMS (MySQL or similar), but that seems like overkill to me.

Does it make sense? How would you answer this question?

+4

algorithm data-structures indexing full-text-search

Michael Mar 03 '13 at 13:38


1 answer

Karoly Horvath · Accepted Answer · 2013-03-03T14:02:13+0000

I agree: in most cases a text search engine is the way to go. It is easy to set up and reliable. One detail worth adding: most engines do an OR search by default, so you need to specify that all of the words must match.

If you have to build your own solution then yes, obviously you need to build a mapping from words to documents. I would use a hash lookup rather than a tree index, although your tree would not be very deep, so the performance gain is only slight. Still, I see no reason to use a tree: you do not need its traversal features, because you will never look for the previous or next word.
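A minimal sketch of such a hash-based mapping, assuming the documents fit in memory and that splitting on non-alphanumeric characters is an acceptable way to find words. The names `tokenize` and `build_index` and the sample file names are made up for illustration:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)  # dict = hash lookup, no ordering or traversal needed
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hypothetical sample data: file name -> file contents.
docs = {
    "a.txt": "The pony he comes",
    "b.txt": "The quick brown fox",
    "c.txt": "He comes and he goes",
}
index = build_index(docs)
print(sorted(index["comes"]))  # ['a.txt', 'c.txt']
```

Looking up a word is a single average-case constant-time dictionary access, which is all this query pattern requires.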

More interesting details come up when you look at how you will actually use the data structure. Take an example search: "The pony he comes". Intuitively, you would not start the search with "the", because probably every document contains it (assuming English texts). "pony" is a good choice, since it narrows the search down quickly. Most text search engines keep a metric for this: how many documents contain a given word. Based on that, you start with the least frequent word and check the remaining words in order of increasing frequency.
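The rarest-first ordering can be sketched as follows, assuming an index that maps each word to a set of document IDs. Here the document frequency of a word is simply the size of its set. The function name `search_all` and the sample IDs are hypothetical:

```python
def search_all(index, keywords):
    """Return the IDs of the documents that contain every keyword."""
    postings = []
    for word in keywords:
        ids = index.get(word.lower())
        if not ids:
            return set()  # a keyword that occurs nowhere rules out every document
        postings.append(ids)
    if not postings:
        return set()  # assumption: an empty query matches nothing

    postings.sort(key=len)     # least frequent word first
    result = set(postings[0])  # copy, so the index itself is never modified
    for ids in postings[1:]:
        result &= ids
        if not result:
            break  # nothing left to narrow down
    return result


# Hypothetical index: "the" is in every document, "pony" in only two.
index = {
    "the": {1, 2, 3, 4, 5},
    "pony": {2, 4},
    "he": {1, 2, 4, 5},
    "comes": {2, 3, 4},
}
print(sorted(search_all(index, ["The", "pony", "he", "comes"])))  # [2, 4]
```

Starting from the smallest set keeps every intermediate result no larger than the rarest word's document list.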

Once you have narrowed the search down, you start to notice that the index no longer works very well. You still have the word "the" to check, and in your index it maps to a million documents. At this point it is better to use a reverse mapping, from document to words (again, a hash lookup or a trie): you go through the few remaining candidate documents and check whether each one contains the remaining words.
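One way to combine the two mappings, as a rough sketch. It assumes `index` maps words to sets of document IDs and `forward` maps document IDs to sets of words. The cutoff `max_postings`, above which a word is checked per candidate document instead of being intersected, is a hypothetical tuning parameter, and `search_hybrid` is a made-up name:

```python
def search_hybrid(index, forward, keywords, max_postings=1000):
    """Intersect the rare words, then verify the frequent ones per candidate document."""
    words = sorted({w.lower() for w in keywords},
                   key=lambda w: len(index.get(w, ())))
    if not words or any(w not in index for w in words):
        return set()

    candidates = set(index[words[0]])  # start from the least frequent word
    remaining = []
    for w in words[1:]:
        if len(index[w]) <= max_postings:
            candidates &= index[w]     # small document list: intersect directly
        else:
            remaining.append(w)        # huge document list: check per document instead

    # Reverse mapping: look only inside the few candidate documents.
    return {d for d in candidates if all(w in forward[d] for w in remaining)}


# Hypothetical data: document ID -> set of words in that document.
forward = {
    1: {"the", "he"},
    2: {"the", "pony", "he", "comes"},
    3: {"the", "comes"},
    4: {"the", "pony", "he", "comes"},
    5: {"the", "he"},
}
index = {}
for doc_id, doc_words in forward.items():
    for w in doc_words:
        index.setdefault(w, set()).add(doc_id)

print(sorted(search_hybrid(index, forward, ["the", "pony", "he", "comes"], max_postings=3)))  # [2, 4]
```

With the cutoff set to 3 in this toy example, "pony" and "comes" are intersected through the index, while "he" and "the" are only looked up inside the two surviving candidates.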

Note: there are many possible solutions (how to store the mapping, single or double mapping, B-tree / hash / trie / ...), and the right one depends on the scale of the project. Obviously you build something simple if you only need to search a few files, and something quite different if you need to index every file on GitHub or search gene sequences, where even the index may not fit into memory.
