Restarting an elicsearch 2-node cluster with zero downtime

Question

Restarting an elicsearch 2-node cluster with zero downtime

We used the ES <2> w630> cluster (now on 1.4.1 ) with all the default values and the following overrides:

config.cluster.name = "..."
config.discovery.zen.ping_timeout = "5s";
config.discovery.zen.ping.multicast.enabled = false;
config.discovery.zen.ping.unicast.hosts = ["IP1", "IP2"];

Recently, we began to notice that when we complete each node with a query http://127.0.0.1:9200/_cluster/nodes/_local/_shutdown, the cluster becomes immune to 30 seconds .

When the node wizard is explicitly disabled, the other node does not seem to immediately resume the wizard role ... instead, it continues trying until 30 seconds elapse (by default discovery.zen.fd.ping_timeout).

During this period, the cluster has a hostless block and returns 503 to the root node request:

{
"status" : 503,
"name" : "...",
"cluster_name" : "...",
"version" : {
"number" : "1.4.1",
"build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
"build_timestamp" : "2014-11-26T15:49:29Z",
"build_snapshot" : false,
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}

Block Levels: ["write", "metadata"].

You can see this reproduction in the logs:

[2014-12-04 17:46:16,000][INFO ][discovery.zen            ] [NODE_1] master_left [[NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]]], reason [shut_down]
[2014-12-04 17:46:16,012][WARN ][discovery.zen            ] [NODE_1] master left (reason = shut_down), current nodes: {[NODE_1][WoVynRBhQvSwvxNp1nj8kw][RD000D3A109006][inet[/100.78.140.38:9300]],}
[2014-12-04 17:46:16,012][INFO ][cluster.service          ] [NODE_1] removed {[NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]],}, reason: zen-disco-master_failed ([NODE_0][VXtqWIw2Q2C9b5UHvWlZyQ][RD000D3A1024B8][inet[/100.72.14.37:9300]])
[2014-12-04 17:46:16,497][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:18,358][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:19,508][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:20,384][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:21,150][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:21,915][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:22,540][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:23,384][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:23,900][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:24,572][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:26,794][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:27,783][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:28,441][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:29,330][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:30,393][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:31,264][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:31,905][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:32,608][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:35,572][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:36,529][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:37,295][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:37,911][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:38,661][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:39,411][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:40,032][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:40,643][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:41,505][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:41,927][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:42,630][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:43,380][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:44,193][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:44,963][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:45,824][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:46,511][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:46,574][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:47,278][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:48,028][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:48,373][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:48,811][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:49,530][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:49,530][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:50,155][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:50,405][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:51,030][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:51,170][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:51,889][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:51,938][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:52,530][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:52,561][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:53,406][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:53,596][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:53,908][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:54,353][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:54,587][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:55,056][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:55,712][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:56,322][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:56,806][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:56,962][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:57,791][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:57,806][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:58,228][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:58,447][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:59,088][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:46:59,355][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:46:59,775][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:47:00,400][DEBUG][action.admin.cluster.state] [NODE_1] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]
[2014-12-04 17:47:00,400][DEBUG][action.admin.cluster.state] [NODE_1] no known master node, scheduling a retry
[2014-12-04 17:47:00,619][INFO ][cluster.service          ] [NODE_1] new_master [NODE_1][WoVynRBhQvSwvxNp1nj8kw][RD000D3A109006][inet[/100.78.140.38:9300]], reason: zen-disco-join (elected_as_master)

node shutdown, node 30- ? "" , .

+4

elasticsearch

Nariman 07 . '14 17:28

1

chani · Answer 1 · 2014-12-27T00:44:31+0000

discovery.zen.rejoin_on_master_gone false .
true, , node , . false, node , , ( , discovery.zen.minimum_master_nodes 2).

, : Elasticsearch 1.4 , . ( ), .

Restarting an elicsearch 2-node cluster with zero downtime

More articles: