Recommendations for a simple search engine for a bag of words?

Any recommendations for a small, lightweight bag-of-words search engine?

I have a set of “documents”, each of which is a small bag of arbitrary words. Given a new document, I need to get a list of "similar" documents along with some weight for how similar they are. Documents are likely to be small: a couple of paragraphs at most.

  • Stemming would be great, but is not strictly necessary.
  • Word extension with layers is not required.
  • Open source or freeware is preferred, as this is a prototype, not a full-blown project.
  • Preferred platform: Unix / Linux.

I would use it as a subcomponent: I expect to submit documents to it with an identifier, and then search for documents similar to one I have.
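To make the intended workflow concrete (submit documents under an identifier, then ask for similar ones with a weight), here is a minimal in-memory sketch using only the Python standard library. It is an illustration, not a recommendation of any particular engine: the class name `BagOfWordsIndex`, its methods `add` and `similar`, and the choice of TF-IDF weighting with cosine similarity are all assumptions made for this example, and there is no stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (no stemming)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BagOfWordsIndex:
    """Hypothetical tiny index: add documents by id, query for similar ones."""

    def __init__(self):
        self.docs = {}             # doc_id -> Counter of word counts
        self.doc_freq = Counter()  # word -> number of documents containing it

    def add(self, doc_id, text):
        # Re-adding an existing id replaces the old document.
        if doc_id in self.docs:
            self.doc_freq.subtract(self.docs[doc_id].keys())
        counts = Counter(tokenize(text))
        self.docs[doc_id] = counts
        self.doc_freq.update(counts.keys())

    def _weights(self, counts):
        # Smoothed TF-IDF: rarer words get a higher weight.
        n = len(self.docs)
        return {
            word: tf * (math.log((1 + n) / (1 + self.doc_freq[word])) + 1)
            for word, tf in counts.items()
        }

    def similar(self, text, top_n=5):
        """Return up to top_n (doc_id, score) pairs, best match first."""
        query = self._weights(Counter(tokenize(text)))
        scored = []
        for doc_id, counts in self.docs.items():
            score = cosine(query, self._weights(counts))
            if score > 0:
                scored.append((doc_id, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]


index = BagOfWordsIndex()
index.add("a", "the quick brown fox")
index.add("b", "a lazy brown dog")
index.add("c", "stock market prices fell")
print(index.similar("quick brown fox jumps"))
```

The scan over every stored document is linear in the collection size, which is probably acceptable for a prototype with short documents; a dedicated engine would use an inverted index instead.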

+4
4 answers

Whoosh is a pure-Python indexing and search library (no C, no external database). See the documentation for more information. It supports stemming.

I tried it on an XML dump of a MediaWiki instance, and it seemed to work very well!

+1

Solr or Sphinx. They are not very lightweight, but I would not recommend anything smaller: if the project succeeds and needs to grow, switching search engines can be painful.

0

I think Lucene is an option. It should let you build a custom bag-of-words search.

0

I'm curious about MongoDB: http://www.mongodb.org/display/DOCS/Home

It seems that its “full-text search” might be what I need ... and having additional search fields might be convenient.

0
