Recommendations for a simple search engine for a bag of words?

Question

Any recommendations for a small, lightweight bag-of-words search engine?

I have a set of "documents", each of which is a small bag of arbitrary words. Given a new document, I need to get a list of "similar" documents, along with a weight for how similar each one is. The documents are likely to be small: a couple of paragraphs at most.

Stemming would be great, but is not strictly necessary.
Word extension with layers is not required.
Open source or freeware is preferred, as this is a prototype, not a full-blown project. Preferred platform: Unix/Linux.

I would use it as a subcomponent: I expect to submit documents with an identifier, and later search for documents similar to one I have at hand.
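To make the requirement concrete, here is a minimal sketch of the interface described above: submit documents with an identifier, then ask for similar documents along with a weight. It uses plain word counts and cosine similarity from the Python standard library. The class `BagOfWordsIndex` and its methods `add` and `similar` are hypothetical names chosen for illustration, not the API of any engine mentioned in this thread, and no stemming is attempted.

```python
import math
import re
from collections import Counter


class BagOfWordsIndex:
    """Hypothetical in-memory index; not the API of any real search engine."""

    def __init__(self):
        self._docs = {}  # doc_id -> Counter of word counts

    @staticmethod
    def _bag(text):
        # Lowercase and split on non-alphanumerics; no stemming here.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Store a document under the given identifier."""
        self._docs[doc_id] = self._bag(text)

    def similar(self, text, top=5):
        """Return (doc_id, cosine similarity) pairs, best match first."""
        query = self._bag(text)
        q_norm = math.sqrt(sum(c * c for c in query.values()))
        results = []
        for doc_id, bag in self._docs.items():
            # Dot product over the words the query contains.
            dot = sum(count * bag[word] for word, count in query.items())
            d_norm = math.sqrt(sum(c * c for c in bag.values()))
            if dot and q_norm and d_norm:
                results.append((doc_id, dot / (q_norm * d_norm)))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results[:top]


index = BagOfWordsIndex()
index.add("a", "the quick brown fox jumps over the lazy dog")
index.add("b", "a quick brown dog")
index.add("c", "stock markets fell sharply today")
matches = index.similar("quick brown fox")
```

Documents that share no words with the query are left out of the result. Raw counts favor short documents with a high share of matching words; a real engine would typically add term weighting such as TF-IDF on top of this.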

+4

search tags full-text-search tagging

ericslaw 21 sept '09 at 10:55

4 answers

Steven Kryskalla · Answer 1 · 2009-09-21T23:30:50+0000

Whoosh is a pure-Python indexing/search system (no C, no external database). See the documentation for more information. It supports stemming.

I tried it on an XML dump of a MediaWiki instance, and it seemed to work very well!

Mauricio Scheffer · Answer 2 · 2009-09-21T23:12:41+0000

Solr or Sphinx. They are not very lightweight, but I wouldn't recommend anything less: if the project succeeds and needs to grow, switching search engines can be painful.

Pascal Thivent · Answer 3 · 2009-09-21T23:13:57+0000

I think Lucene is an option. It should let you build a custom bag-of-words search.

ericslaw · Answer 4 · 2009-09-22T01:45:45+0000

I'm curious about MongoDB http://www.mongodb.org/display/DOCS/Home

It seems that its "full-text search" might be what I need, and having additional searchable fields might be convenient.
