Search Engine for Linguistic Corpora

I am trying to find a good library for building a search engine over a linguistic corpus. Such an engine should provide completely transparent search results (the exact number of matches found, with no cutoff, even if the whole corpus matches), basic query syntax (AND, OR, NOT operators, proximity search, wildcard search), and the ability to restrict the set of documents searched (i.e., to define a subcorpus). An important detail is the ability to split indexes and run searches in parallel: the corpus size is about 10^8 words, and the search service should respond in real time.

The main choice is between Sphinx and CLucene (a C++ port of Lucene). Unfortunately, I know little about this kind of library, so it would be very useful to know which one is better suited to my requirements.

(I also tried a specialized engine, IMS Corpus Workbench, which turned out not to be as scalable as I need.)

1 answer

I would suggest setting up a Solr server, which is built on Lucene and exposes a RESTful interface. The newer Lucene (Solr) features are unmatched by its competitors. A corpus of 10^8 distinct words may be a concern, but I hope they are not all distinct; if they are, my guess is that you would see some loss in performance. Implementing index splitting and parallel search yourself on bare Lucene would be wasted effort, since Solr provides both. I do not know Sphinx very well, but Lucene and its derivatives are certainly on the bleeding edge.
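
To give a rough idea of what such a request could look like, here is a minimal Python sketch that only builds a Solr select URL; it does not contact a server. It combines the standard Lucene query syntax (AND, OR, NOT, a wildcard, and a proximity phrase) with the `fq` parameter to restrict the search to a subcorpus, `rows=0` to ask for the hit count alone, and `shards` to fan the query out over several indexes. The host names, the core name `corpus`, the fields `text` and `genre`, and the helper `build_solr_count_url` are all assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_solr_count_url(base_url, query, subcorpus_filter=None, shards=None):
    """Build a Solr /select URL that asks only for the total hit count.

    base_url, the field names and the shard addresses are hypothetical;
    q, fq, rows, wt and shards are standard Solr request parameters.
    """
    params = [
        ("q", query),   # Lucene query syntax: AND / OR / NOT, wildcards, "..."~N
        ("rows", "0"),  # return no documents, only the count (numFound)
        ("wt", "json"),
    ]
    if subcorpus_filter:
        # Filter query: limits the searched document set (the subcorpus).
        params.append(("fq", subcorpus_filter))
    if shards:
        # Distributed search: Solr queries each listed shard and merges results.
        params.append(("shards", ",".join(shards)))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_solr_count_url(
    "http://localhost:8983/solr/corpus",
    'text:(ling* AND NOT "corpus workbench") OR text:"search engine"~3',
    subcorpus_filter="genre:fiction",
    shards=["host1:8983/solr/corpus", "host2:8983/solr/corpus"],
)
print(url)
```

In the JSON response, the total is reported as `response.numFound`. Note that this counts matching documents, not individual token occurrences, so if you need per-occurrence counts you would probably have to index small units (for example, one sentence per document).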
