Lucene.NET clustering options?

I am interested in using Lucene.NET for an application that runs on Windows clusters. The search problem itself is small enough, but the statelessness / clustering problem still needs to be handled.

I understand that SOLR covers my scenario (and much more), but there are some concerns around the servlet container (and Java). Depending on the complexity of the Lucene.NET-based approach, it may still be a viable option.

So my question is: what are my options for dealing with the problem of running on multiple hosts?

  • Storing the index on shared storage common to all nodes? Will Lucene.NET handle concurrency transparently? Will the servers cache in RAM, and if so, does Lucene.NET transparently invalidate those caches when the underlying files are updated?

  • Replication? Each server keeps its own copy of everything it needs; on every update, all servers receive a new replica (or a diff, if that is simple enough). Are there existing tools for this, or is it up to me to handle?

  • Sharding / partitioning the workload? Each server handles only its own slice of the data, for both reads and updates. Are there tools for handling this, combining partial results, etc.?

  • Other options I may have missed in my initial research?
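For the sharding option, the "combining partial results" step is typically a scatter-gather merge: each shard returns its own top hits, and a coordinator merges them into one global top-k. A minimal sketch of that merge (all names here are hypothetical, not a Lucene.NET API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the "combine partial results" step when sharding:
// each shard returns its own scored top hits, and the coordinator merges
// them into a single global top-k ordered by score.
class ScatterGather {
    record Hit(String id, double score) {}

    static List<Hit> mergeTopK(List<List<Hit>> shardResults, int k) {
        // A min-heap of size k keeps the k best hits seen so far;
        // the heap root is always the current worst of the kept hits.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (List<Hit> shard : shardResults) {
            for (Hit h : shard) {
                heap.offer(h);
                if (heap.size() > k) heap.poll(); // drop current worst
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble(Hit::score).reversed());
        return out;
    }
}
```

The same pattern works whether the shards are queried in parallel or sequentially; only the merge has to see all partial result sets.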

When experimenting with a local version, my Lucene directory was on the order of several hundred megabytes. In the long run, I can see it reaching 1-5 GB. If update frequency is a problem, I can throttle it quite flexibly. Concurrent read / search load is expected to be very moderate.

1 answer

You can use Lucene.NET with multiple servers, but you need to implement an index server.

All changes you make must be queued, and pending documents indexed from time to time. You should also index immediately once x elements are in the queue (x depends on your merge settings; for me it was 25,000).

The reason for the above is that you need to avoid making many small changes to the index, since that degrades performance over time by creating a large number of small files. You can run 2 index servers, but only one will index at a time by taking a lock on the index; the only reason for the second is failover if the first one goes down, and whether you need it depends on your requirements.
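The queue-then-flush logic described above can be sketched as follows. This is an illustrative skeleton, not the answerer's actual code; the class and method names are hypothetical, and the real flush would call the Lucene index writer instead of recording batches:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer pending documents and flush them to the index
// either when the batch reaches a size threshold (25,000 in the answer's
// setup) or when a periodic timer fires, so the index receives a few large
// writes instead of many small ones.
class BatchingIndexer {
    private final int flushThreshold;                    // tuned to merge settings
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();

    BatchingIndexer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Called for every change instead of writing to the index immediately.
    synchronized void enqueue(String docId) {
        pending.add(docId);
        if (pending.size() >= flushThreshold) {
            flush();                                     // size-triggered flush
        }
    }

    // Also called periodically by a scheduler (time-triggered flush).
    synchronized void flush() {
        if (pending.isEmpty()) return;
        // A real indexer would add the documents and commit here,
        // producing one larger segment instead of many small files.
        flushed.add(new ArrayList<>(pending));
        pending.clear();
    }

    synchronized List<List<String>> batches() { return flushed; }
}
```

Locking the whole enqueue/flush path, as here, is also what makes the "only one index server writes at a time" rule easy to enforce.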

I used a 15 GB index with 30 million records. The scenario I had for this ran on Azure:

  • 1 worker role indexing changes

  • 2 to 20 web roles serving content, each holding its own copy of the index.

Changes were pushed every 15 minutes, the index was merged once there were 25,000 changes, and each merged index contained 250,000 documents. Each web server checked the store for changes every 15 minutes, taking a lock on the index reader, which was then invalidated if changes were downloaded. The maximum documents per file is mainly to stop web servers from downloading a long backlog of previous changes.
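The per-web-server refresh loop described above can be sketched like this. It is a simplified stand-in (the string substitutes for a real index reader, and the version supplier stands in for whatever change marker the store exposes); the key point is that a new reader is only opened when the marker advances, and is swapped in atomically so in-flight searches keep their old snapshot:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of the web-server refresh loop: poll a version
// marker on a timer (every 15 minutes in the answer's setup), and only
// when it advances, open a new reader and swap it in atomically.
class RefreshingSearcher {
    private final Supplier<Long> remoteVersion;            // change marker lookup
    private final AtomicReference<Long> current = new AtomicReference<>(-1L);
    private final AtomicReference<String> reader = new AtomicReference<>("empty");

    RefreshingSearcher(Supplier<Long> remoteVersion) {
        this.remoteVersion = remoteVersion;
    }

    // Invoked periodically; returns true if the reader was replaced.
    boolean refreshIfChanged() {
        long latest = remoteVersion.get();
        if (latest == current.get()) return false;         // nothing new
        // Real code would download the new segment files and open a new
        // index reader here; the string stands in for that reader.
        reader.set("reader@" + latest);
        current.set(latest);
        return true;
    }

    String activeReader() { return reader.get(); }
}
```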

I used Lucene.AzureDirectory to start with, but it could not reliably detect modified blobs in blob storage, so I ended up iterating over the blobs myself, comparing them against the local copies, and downloading as needed.
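The manual comparison described above amounts to a diff between remote and local file metadata. A minimal sketch under the assumption that each blob exposes a change marker such as an ETag or content hash (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the manual blob sync: list remote blobs with a
// change marker (e.g. ETag), compare against the local copies, and plan
// downloads only for files that are new or have changed.
class BlobSync {
    // Returns the blob names that need downloading.
    static List<String> plan(Map<String, String> remoteEtags,
                             Map<String, String> localEtags) {
        List<String> toDownload = new ArrayList<>();
        for (Map.Entry<String, String> e : remoteEtags.entrySet()) {
            String local = localEtags.get(e.getKey());
            if (local == null || !local.equals(e.getValue())) {
                toDownload.add(e.getKey());   // new or modified blob
            }
        }
        return toDownload;
    }
}
```

Because Lucene segment files are immutable once written, comparing markers per file like this is cheap; most refreshes only download the few newly created segments.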

Would I implement something like this again today? The answer is no. I would use Elasticsearch or Solr, since otherwise you are reinventing the wheel.
