Random disconnects from master node NoNodeAvailableException using Elastic Cloud / Found

I use Elastic Cloud (formerly Found) with Shield and the Java transport client. The application that talks to ES runs on Heroku. I am running a stress test against a single-node staging environment:

 {
   "cluster_name": ...,
   "status": "yellow",
   "timed_out": false,
   "number_of_nodes": 1,
   "number_of_data_nodes": 1,
   "active_primary_shards": 19,
   "active_shards": 19,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 7,
   "delayed_unassigned_shards": 0,
   "number_of_pending_tasks": 0,
   "number_of_in_flight_fetch": 0
 }

At first everything works fine, but after a while (3-4 minutes) I start to get errors. I set the log level to trace, and these are the errors I got (I replaced anything irrelevant with ...):

 org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available: [[...][...][...][inet[...]]{logical_availability_zone=..., availability_zone=..., max_local_storage_nodes=1, region=..., master=true}]
     at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:242)
     at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:78)
     at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: org.elasticsearch.transport.SendRequestTransportException: [...][inet[...]][indices:data/read/search]
     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
     at org.elasticsearch.shield.transport.ShieldClientTransportService.sendRequest(ShieldClientTransportService.java:41)
     at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
     at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
     at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205)
     at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
     at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
     at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:416)
     at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1122)
     at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
     at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
     ...
 Caused by: org.elasticsearch.transport.NodeNotConnectedException: [...][inet[...]] Node not connected
     at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
     at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
     ...

These are my client settings:

 settings = ImmutableSettings.settingsBuilder()
     .put("client.transport.nodes_sampler_interval", "5s") // Tried it with 30s, same outcome
     .put("client.transport.ping_timeout", "30s")
     .put("cluster.name", clusterName)
     .put("action.bulk.compress", false)
     .put("shield.transport.ssl", true)
     .put("request.headers.X-Found-Cluster", clusterName)
     .put("shield.user", user + ":" + password)
     .put("transport.ping_schedule", "1s") // Tried with 5s, same outcome
     .build();

I also set the following on every request:

 max_query_response_size=100000 timeout_seconds=30 

I use Elasticsearch 1.7.2 and Shield 1.3.2 with the matching (same version) clients, Java 1.8.0_65 on my machine and Java 1.8.0_40 on the node.

I got the same errors without a stress test, but they occurred very randomly, so I wanted a way to reproduce them. That is why I run this against a single node.

I noticed another error in my logs:

 2016-03-07 23:35:52,177 DEBUG [elasticsearch[Vermin][transport_client_worker][T#7]{New I/O worker #16}] ssl.SslHandler (NettyInternalESLogger.java:debug(63)) - Swallowing an exception raised while writing non-app data
 java.nio.channels.ClosedChannelException
     at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
     at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
     at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
     at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
     at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
     at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
     at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
     at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
     at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Hot threads:

 0.0% (111.6micros out of 500ms) cpu usage by thread 'elasticsearch[...][transport_client_timer][T#1]{Hashed wheel timer #1}'
   10/10 snapshots sharing following 5 elements
     java.lang.Thread.sleep(Native Method)
     org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
     org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
     org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
     java.lang.Thread.run(Thread.java:745)

After reading http://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/ I understand a little better how the communication works. I have not tested this yet, but I believe the problem lies there. Still, even if I confirm that the problem is closed connections, how should I deal with it? Keep the configuration as it is and just reconnect? Disable keepAlive? If so, is there anything else I should worry about?

1 answer

Quoting Conrad Beisk from this thread: https://discuss.elastic.co/t/nonodeavailableexception-with-java-transport-client/37702

Your application may be resolving the IP address only once, at boot time. The ELB can change its IP addresses at any time. For best reliability, your application should add all of the ELB's IPs to the client and periodically check DNS for changes.

Our ELB connection timeout is 5 minutes.

The following should help you fix it:

Creating a new TransportClient for each request is not ideal: it means a new connection for each request, which will hurt your response time. You could keep a pool of TransportClients if you prefer, but that is most likely unnecessary overhead, since the client is thread-safe.
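
As a rough illustration of sharing one client instead of building one per request, here is a minimal sketch. `SharedClient` is a hypothetical holder, not part of Elasticsearch; it is generic so that it compiles without the Elasticsearch jar, and in a real application the type parameter would be the transport client and the factory would build it from your settings.

```java
import java.util.function.Supplier;

/** Hypothetical holder: builds one client lazily and returns the same instance to every caller. */
final class SharedClient<T> {
    private final Supplier<T> factory;
    private volatile T instance; // volatile so the fully built client is visible to all threads

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on first use (double-checked locking). */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Every request thread would call `get()` on the same holder, so only the first call pays the cost of opening connections.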

My suggestion is that you create a small singleton service that periodically checks DNS for changes and adds any new IPs to the existing transport client. In theory, this could be as naive as simply adding all the IPs every time it checks, since the transport client will discard duplicate addresses and also purge old addresses that are no longer available.
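
The periodic DNS check could be sketched roughly as follows. `DnsRefresher`, its `Resolver` interface, and the sink callback are hypothetical names introduced here for illustration; the sketch deliberately avoids any Elasticsearch classes so it stays self-contained. It takes the naive approach: every pass hands all resolved addresses to the sink and leaves de-duplication to the receiver.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: periodically re-resolves a host name and passes every IP to a sink. */
final class DnsRefresher implements AutoCloseable {

    /** Abstraction over the DNS lookup; InetAddress::getAllByName fits this shape. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final Resolver resolver;
    private final Consumer<InetAddress> sink;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "dns-refresher");
                t.setDaemon(true); // do not keep the JVM alive just for refreshing
                return t;
            });

    DnsRefresher(String host, Resolver resolver, Consumer<InetAddress> sink) {
        this.host = host;
        this.resolver = resolver;
        this.sink = sink;
    }

    /** One refresh pass; returns how many addresses were handed to the sink. */
    int refreshOnce() {
        try {
            InetAddress[] addresses = resolver.resolve(host);
            for (InetAddress address : addresses) {
                sink.accept(address);
            }
            return addresses.length;
        } catch (UnknownHostException | RuntimeException e) {
            // Swallow the failure: an exception escaping a scheduled task would
            // cancel all future runs. The next tick simply tries again.
            return 0;
        }
    }

    /** Starts refreshing immediately and then after every period. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With the 1.x transport client, the sink would presumably be something like `address -> client.addTransportAddress(new InetSocketTransportAddress(address, port))`, where `client` and `port` are your own variables; treat that wiring as an assumption to check against your client version. Note also that the JVM caches successful DNS lookups (see the `networkaddress.cache.ttl` security property), so the refresh period only helps if that cache expires at least as often.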
