Solr caching with EHCache / BigMemory

We are implementing a large Lucene / Solr installation with more than 150 million documents. We will also have moderately sized document updates every day.

My question is really two-part:

What are the consequences of using a different caching implementation in Solr, i.e. EHCache instead of native Solr LRUCache / FastLRUCache?
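
For reference, the caches I mean are the ones declared in solrconfig.xml, where the class attribute appears to be the pluggable point; my understanding is that a replacement would have to implement org.apache.solr.search.SolrCache. A sketch (the EhCache adapter class name is purely hypothetical):

```xml
<!-- solrconfig.xml: stock cache declarations inside the <query> section -->
<query>
  <filterCache class="solr.FastLRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="1024"/>

  <!-- a hypothetical EhCache-backed replacement would drop in like this: -->
  <!--
  <filterCache class="com.example.solr.EhCacheSolrCache"
               size="16384"/>
  -->
</query>
```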

Terracotta has announced BigMemory, designed to be used in conjunction with EHCache as an in-process but off-heap cache. According to TC, this allows you to store large amounts of data without incurring JVM garbage-collection overhead. Is this a good idea to use with Solr? Does it really help?
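
As far as I can tell from TC's announcement, wiring this up would mean an off-heap store in ehcache.xml plus a matching JVM direct-memory limit; a sketch, assuming the BigMemory attributes from EhCache 2.3 (cache name and sizes are arbitrary):

```xml
<!-- ehcache.xml: spill a cache to off-heap memory via BigMemory.
     The JVM must also be started with -XX:MaxDirectMemorySize at least
     as large as the off-heap store. -->
<ehcache>
  <cache name="solrResults"
         maxElementsInMemory="10000"
         overflowToOffHeap="true"
         maxMemoryOffHeap="4g"/>
</ehcache>
```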

I would especially like to hear from people with real production experience with EHCache / BigMemory and/or Solr cache tuning.

+6
garbage-collection lucene solr ehcache
2 answers

A lot of thoughts on this topic, though my answer has no bearing on EhCache specifically.

First of all, I don't think documents should be stored in your search index. Searchable content should be stored there, not the entire document. What I mean is that what comes back from your search should be document IDs, not the contents of the documents themselves. The documents themselves should be stored in and retrieved from a second system, probably the original file store from which they were indexed. This will reduce the index size, reduce your document cache size, reduce master-slave replication time (this can become a bottleneck if you update frequently), and reduce the overhead of writing out search responses. A minimal schema along those lines is sketched below.
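
Here is a minimal sketch of such a schema.xml, with illustrative field names: the body is indexed so it can be searched, but only the ID is stored and returned.

```xml
<!-- schema.xml: index the content, store only the key -->
<fields>
  <field name="id"   type="string" indexed="true" stored="true" required="true"/>
  <field name="body" type="text"   indexed="true" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
```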

Second, consider putting a reverse HTTP proxy in front of Solr. Although the query caches let Solr respond to queries quickly, a cache such as Varnish sitting in front of Solr is faster still. This offloads Solr, letting it spend its time on queries it has not seen before. The second effect is that you can now throw most of your memory at filesystem caches instead of query caches. If you followed my first suggestion, your documents will be incredibly small, which will let you hold most, if not all, of them in memory.
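
One caveat: Solr will only cooperate with an upstream HTTP cache if you enable its cache headers, which, if I remember the stock solrconfig.xml correctly, ship with never304="true" so nothing is cacheable. A sketch of the change, with an arbitrary max-age:

```xml
<!-- solrconfig.xml: emit validation and Cache-Control headers so a
     reverse proxy such as Varnish can cache query responses -->
<requestDispatcher>
  <httpCaching never304="false" lastModifiedFrom="openTime">
    <cacheControl>max-age=300, public</cacheControl>
  </httpCaching>
</requestDispatcher>
```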

A quick back-of-the-envelope calculation on document sizes: I can easily provision a 32-bit int as an identifier for 150 million documents and still have 10x headroom for document growth. 150 million identifiers occupy 600 MB. Add a fudge factor for Solr's wrapping of the documents, and you can probably fit all of your Solr documents in 1-2 GB. Given how easy it is to get 12-24 GB of RAM these days, I would say you could do all of this on one box and get incredible performance. Nothing extra like EhCache is needed; just make sure you use your search index as efficiently as possible.
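
Spelled out, with the 2-3x wrapping overhead being my own rough guess rather than a measured number:

```text
150,000,000 docs x 4 bytes per int ID       = 600 MB of raw IDs
600 MB x 2-3x Solr per-document overhead    = 1.2-1.8 GB total
```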

As for GC: I have not seen much GC time spent on my Solr servers. Most of what needed collecting was very short-lived objects tied to the HTTP request and response cycle, which never leave the eden space. With the caches tuned sensibly, there wasn't much churn beyond that. The only big spikes came when a new index was loaded and the caches were flushed, and that did not happen all the time.

EDIT: For background, I spent a lot of time tuning Solr caching for a large company that sells consoles and serves millions of requests per day from its Solr servers.

+7

I'm not aware of anyone else who has tried this yet. Of course, we would love to chat with the Solr guys to find out how useful it would be. We might even optimize for that use case.

0
