CoreOS, Fleet, and Etcd2 Failure Resilience

I have a 23-node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes (voting members); the rest are specifically configured as etcd2 proxies.
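For reference, a quick way to confirm the voting-member/proxy split from any node is to query the etcd v2 members API; a minimal sketch (the endpoint is an assumption, any local member or proxy on the standard client port should answer):

    # Sketch: list the etcd2 voting members via the v2 members API and
    # compare against the expected count (8 in the setup described above).
    # The endpoint is an assumption for illustration.
    import requests

    ETCD = "http://127.0.0.1:2379"
    EXPECTED_VOTING_MEMBERS = 8

    members = requests.get(ETCD + "/v2/members", timeout=5).json()["members"]
    for m in members:
        print(m["name"], m.get("clientURLs"))

    if len(members) != EXPECTED_VOTING_MEMBERS:
        print("expected %d voting members, found %d"
              % (EXPECTED_VOTING_MEMBERS, len(members)))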

Scheduled onto the cluster are 3 nginx-plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2, and the nginx containers pick up any changes, render the necessary config files, and reload.
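The register/watch flow looks roughly like this; a minimal sketch of both sides against the etcd v2 keys API (the key prefix, TTL, and file path are placeholders, not our actual layout):

    # Sketch of the register/watch pattern described above (etcd v2 keys API).
    # Key prefix, TTL and nginx paths are illustrative assumptions.
    import requests
    import subprocess

    ETCD = "http://127.0.0.1:2379"            # assumed local etcd2 proxy
    PREFIX = "/v2/keys/services/app"          # assumed key layout

    def register(instance, address, ttl=30):
        """App container announces itself; the TTL lets dead entries expire."""
        requests.put("%s%s/%s" % (ETCD, PREFIX, instance),
                     data={"value": address, "ttl": str(ttl)}, timeout=5)

    def render_upstreams(path="/etc/nginx/conf.d/app_upstream.conf"):
        """Re-render the nginx upstream block from the current etcd keys."""
        resp = requests.get(ETCD + PREFIX, params={"recursive": "true"}, timeout=5)
        nodes = resp.json().get("node", {}).get("nodes", [])
        servers = "\n".join("    server %s;" % n["value"] for n in nodes)
        with open(path, "w") as f:
            f.write("upstream app {\n%s\n}\n" % servers)

    def watch_loop():
        """nginx side: block until something changes, then re-render and reload."""
        while True:
            requests.get(ETCD + PREFIX,
                         params={"wait": "true", "recursive": "true"}, timeout=None)
            render_upstreams()
            subprocess.call(["nginx", "-s", "reload"])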

All of this works fine until, for whatever reason, a single etcd2 node becomes unavailable.

If the group of etcd2 voting members loses contact with even one other voting member, all services scheduled by fleet become unstable: scheduled services begin stopping and starting without my intervention.
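For context, etcd only needs a simple majority of the voting members to stay available, so losing a single member should not, by itself, cost quorum; a quick sketch of the arithmetic:

    # Quorum arithmetic for an etcd cluster with N voting members:
    # quorum = N // 2 + 1, so the cluster tolerates N - quorum failed members.
    def quorum(n_members):
        return n_members // 2 + 1

    for n in (3, 5, 8):
        q = quorum(n)
        print("%d members: quorum %d, tolerates %d failures" % (n, q, n - q))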

As a test, I began stopping the EC2 instances hosting the etcd2 voting nodes until quorum was lost. After the first etcd2 node was stopped, the symptoms above appeared. After the second, the services remained unstable, with no visible change. Then, after the third was stopped, quorum was lost and all units were unscheduled. I then started the three etcd2 nodes again, and within 60 seconds the cluster returned to a stable state.

Subsequent tests gave identical results.
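A simple way to watch cluster health from a proxy node while repeating this kind of test, to correlate fleet behaviour with actual quorum loss (endpoint and interval here are illustrative assumptions):

    # Sketch: poll etcd2 health and member count while stopping instances.
    import time
    import requests

    ETCD = "http://127.0.0.1:2379"

    while True:
        try:
            health = requests.get(ETCD + "/health", timeout=2).json()
            members = requests.get(ETCD + "/v2/members", timeout=2).json()["members"]
            print(time.strftime("%H:%M:%S"),
                  "health:", health.get("health"),
                  "members:", len(members))
        except requests.RequestException as exc:
            print(time.strftime("%H:%M:%S"), "etcd unreachable:", exc)
        time.sleep(5)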

Have I hit a known bug in etcd2, fleet, or CoreOS?

1 answer

It is hard to say for certain from the description alone, but this sounds like confusion between etcd and etcd2. If there is an etcd.service unit in play (even if it is not meant to be used), keep in mind that CoreOS ships both an etcd.service and an etcd2.service, and they are separate daemons with separate configuration.

Double-check that everything on every node is consistently pointed at etcd2 rather than the legacy etcd, so that fleet, flannel, and the containers are all talking to the same cluster.
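One quick way to see which daemon is actually answering on a node is to probe the standard client ports and print the reported version; a minimal sketch (host and ports are the stock defaults, adjust for your cloud-config):

    # Sketch: probe the default client ports to see whether the legacy etcd
    # (0.4.x, port 4001) or etcd2 (2.x, port 2379) is answering on this node.
    import requests

    HOST = "127.0.0.1"

    for port in (2379, 4001):
        url = "http://%s:%d/version" % (HOST, port)
        try:
            resp = requests.get(url, timeout=2)
            print(url, "->", resp.text.strip())
        except requests.RequestException as exc:
            print(url, "-> no response (%s)" % exc)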
