Any recommendations for a small, lightweight bag-of-words search engine?
I have a set of "documents", each of which is a small bag of arbitrary words. Given a new document, I need to get a list of "similar" documents, along with a weight for how similar each one is. Documents are likely to be small: a couple of paragraphs, no more.
- Stemming would be great, but is not essential.
- Word extension with layers is not required.
- Open source or freeware is preferred, since this is a prototype, not a full-blown project.
- Preferred platform: Unix/Linux.
I would use it as a subcomponent: I expect to submit documents to it along with an identifier, and then search for documents similar to the one I have.
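To make the request concrete, here is a rough sketch of the kind of interface I have in mind. It is only an illustration, not an existing library: the class name `BagOfWordsIndex` and the methods `add` and `similar` are hypothetical, and it assumes cosine similarity over raw word counts as the weight, with no stemming.

```python
import math
import re
from collections import Counter


class BagOfWordsIndex:
    """Hypothetical in-memory index, for illustration only."""

    def __init__(self):
        self._docs = {}  # doc_id -> Counter mapping word -> count

    @staticmethod
    def _bag(text):
        # Lowercase and split on non-alphanumerics; no stemming (assumption).
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Submit a document under an identifier."""
        self._docs[doc_id] = self._bag(text)

    def similar(self, text, limit=10):
        """Return (doc_id, weight) pairs, most similar first.

        The weight is the cosine similarity of the word-count vectors,
        so it lies in (0, 1]; documents sharing no words are omitted.
        """
        query = self._bag(text)
        q_norm = math.sqrt(sum(c * c for c in query.values()))
        results = []
        for doc_id, bag in self._docs.items():
            dot = sum(count * bag[word] for word, count in query.items())
            if dot == 0:
                continue
            d_norm = math.sqrt(sum(c * c for c in bag.values()))
            results.append((doc_id, dot / (q_norm * d_norm)))
        results.sort(key=lambda pair: pair[1], reverse=True)
        return results[:limit]


index = BagOfWordsIndex()
index.add("a", "the quick brown fox")
index.add("b", "lazy dogs sleep all day")
index.add("c", "the quick red fox jumps")
hits = index.similar("quick fox")  # documents "a" and "c", each with a weight
```

A linear scan like this is obviously fine only for a small collection; I am asking whether a small existing engine already offers this submit-then-query workflow.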