Search Engines - Best Practices

A common practice is not to index so-called stop words when analyzing documents for a search engine. Stop words are common words like "a", "the" and "this" that appear very frequently in the language. The idea is that if you index stop words, they take up too much space in the index and add little to the quality of the search results.

I would like to know if this is always the case.

In modern search engines, does indexing stop words make the size of the index explode, or is it just a slight increase?

Also, how does removing stop words affect phrase searches? Searching for "the Beatles" and searching for "Beatles" seem to be two very different things.

I am building an application with Elasticsearch, but this question applies equally to Solr, plain Lucene, or any other option.

2 answers
  • The main problem with stop words is not the size of the index but the quality of the results. Stop words tend to dominate: they have very high tf (term frequency) values and can therefore skew the ranking.
    In any case, indexing stop words does not significantly increase the size of the index (and it definitely does not "explode").

  • One way to overcome this is to keep stop words (rather than omitting them completely) when indexing n-grams. I don't know whether this is actually done in practice, but it could definitely help improve the returned results.
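
The n-gram idea can be sketched roughly as follows. This is only a toy illustration, not how any particular engine implements it: the stop list, the naive whitespace tokenizer and the `index_terms` helper are all assumptions made up for this example. Stop words are dropped as single terms but kept inside word bigrams, so "the Beatles" and "Beatles" no longer produce the same index terms.

```python
# Illustrative stop list; real analyzers ship their own, longer lists.
STOP_WORDS = {"a", "an", "the", "this", "and", "of"}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def index_terms(text):
    """Hypothetical helper: non-stop unigrams plus bigrams containing a stop word."""
    tokens = tokenize(text)
    # Single terms: stop words are omitted, as in a classic stop filter.
    terms = [t for t in tokens if t not in STOP_WORDS]
    # Bigrams: keep only the pairs in which at least one word is a stop word.
    for left, right in zip(tokens, tokens[1:]):
        if left in STOP_WORDS or right in STOP_WORDS:
            terms.append(left + "_" + right)
    return terms


print(index_terms("the Beatles"))  # ['beatles', 'the_beatles']
print(index_terms("Beatles"))      # ['beatles']
```

Lucene and Elasticsearch offer a token filter built on a similar idea (the common grams filter), which may be worth a look before rolling something like this by hand.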

Also, stop words are not always omitted. In sarcasm detectors, for example, stop words seem (empirically) to be very important for the answer.
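
To make the term-frequency point from the start of this answer concrete, here is a small sketch that counts raw term frequencies with and without stop words. The sample sentence and the stop list are invented for illustration, and the counts are plain tf values; any idf weighting or normalization a real engine applies is ignored here.

```python
from collections import Counter

# Illustrative stop list, not taken from any particular engine.
STOP_WORDS = {"a", "the", "this", "and", "of", "to", "in", "is"}

# Made-up sample text, used only to show the counting.
DOC = "the band played the song and the crowd sang the chorus of the song"


def term_frequencies(text, drop_stop_words=False):
    """Count how often each whitespace-separated token occurs."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)


raw = term_frequencies(DOC)
filtered = term_frequencies(DOC, drop_stop_words=True)

print(raw.most_common(1))       # [('the', 5)]
print(filtered.most_common(1))  # [('song', 2)]
```

With stop words kept, "the" is by far the most frequent term; once they are filtered out, the most frequent term is one that actually describes the content.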


I think all search engines handle this differently. You can read about it: http://searchenginewatch.com

But if you are a single developer building a (small) application, I don't think you should focus on these small details. Just leave these words out and focus on the more important words.

