I have a two-node Elasticsearch cluster that my live website queries directly, constantly issuing search and index requests.
My problem is that, on a regular (and unpredictable) basis, the entire cluster becomes inaccessible while one of the nodes runs a garbage collection. The message in the node log looks like this:
[2015-07-01 06:43:19,525][INFO ][monitor.jvm] [my_node] [gc][old][205450][116] duration [5.7s], collections [1]/[6.3s], total [5.7s]/[1m], memory [22.3gb]->[4.9gb]/[30.9gb], all_pools {[young] [392.9mb]->[17.2mb]/[665.6mb]} {[survivor] [29.1mb]->[0b]/[83.1mb]} {[old] [21.9gb]->[4.9gb]/[30.1gb]}
From what I understand (I'm not a Java person), these lines indicate that Elasticsearch is performing an old-generation garbage collection. During those 5.7 seconds the node does not respond, and neither does my cluster nor my site. This downtime occurs 5 to 10 times a day.
Am I doing something wrong here, or is this downtime inevitable? Should I add a client node to the cluster (i.e. a node with data = false and master = false) and point my site at it as a load balancer? Or should I put some other kind of load balancing (HAProxy?) in front of my nodes? Or does this mean that something is wrong with the servers or the data?
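For reference, here is roughly what I have in mind for that client node — a sketch based on my reading of the Elasticsearch 1.x node settings, where the cluster and node names are placeholders for my actual values:

```yaml
# elasticsearch.yml for the proposed client ("load balancer") node.
# It would hold no data and never be elected master; it only routes
# search and index requests to the two data nodes.
cluster.name: my_cluster     # placeholder: must match the existing cluster name
node.name: client-node-1     # placeholder node name
node.master: false           # never eligible as master
node.data: false             # holds no shards
```

My site would then send all requests to this node instead of hitting the data nodes directly.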
Thank you very much in advance
Some cluster configuration information
- Elasticsearch 1.6.0 cluster of 2 nodes (5 shards, 1 replica)
- The cluster contains ~10 million documents, occupying ~30 GB.
- Each node is a server with 64 GB of RAM with MAX_HEAP_SIZE set to 31g
- The website launches ~ 300 search queries per second and ~ 100 index queries per second
- JVM heap utilization is always between 50% and 75%, never higher