I use Elastic Cloud (formerly Found) with Shield and the transport Java client. The application that interacts with ES runs on Heroku. I am running a stress test against a single-node staging environment:
{
  "cluster_name": ...,
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 19,
  "active_shards": 19,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 7,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0
}
At first everything works fine, but after a while (3-4 minutes) I start getting errors. I set the log level to TRACE, and these are the errors I get (I replaced anything irrelevant with ...):
org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available: [[...][...][...][inet[...]]{logical_availability_zone=..., availability_zone=..., max_local_storage_nodes=1, region=..., master=true}]
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:242)
    at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:78)
    at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [...][inet[...]][indices:data/read/search]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.shield.transport.ShieldClientTransportService.sendRequest(ShieldClientTransportService.java:41)
    at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
    at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
    at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205)
    at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
    at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:416)
    at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1122)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    ...
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [...][inet[...]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
    ...
These are my client settings:
settings = ImmutableSettings.settingsBuilder()
    .put("client.transport.nodes_sampler_interval", "5s") // tried 30s, same outcome
    .put("client.transport.ping_timeout", "30s")
    .put("cluster.name", clusterName)
    .put("action.bulk.compress", false)
    .put("shield.transport.ssl", true)
    .put("request.headers.X-Found-Cluster", clusterName)
    .put("shield.user", user + ":" + password)
    .put("transport.ping_schedule", "1s") // tried 5s, same outcome
    .build();
I also set the following on each request:
max_query_response_size=100000
timeout_seconds=30
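For what it's worth, the timeout_seconds=30 above is only enforced server-side; a hard client-side deadline can be added independently. A minimal sketch using only the JDK (the Callable here is a placeholder standing in for the blocking search call):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class DeadlineCall {
    // Run a blocking call with a hard client-side deadline. Throws
    // TimeoutException if the deadline passes before the call returns.
    static <T> T callWithDeadline(Callable<T> call, long timeout, TimeUnit unit) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(call);
            return future.get(timeout, unit);
        } finally {
            pool.shutdownNow(); // interrupt the call if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder for client.prepareSearch(...).get() or similar.
        String result = callWithDeadline(() -> "search result", 30, TimeUnit.SECONDS);
        System.out.println(result);
    }
}
```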
I am using Elasticsearch 1.7.2 and Shield 1.3.2 with the matching (same-version) clients, Java 1.8.0_65 on my machine and Java 1.8.0_40 on the node.
I was getting the same errors without the stress test, but they occurred very randomly, so I wanted a reliable way to reproduce them. That is why I am running this against a single node.
I also noticed another error in my logs:
2016-03-07 23:35:52,177 DEBUG [elasticsearch[Vermin][transport_client_worker][T#7]{New I/O worker #16}] ssl.SslHandler (NettyInternalESLogger.java:debug(63)) - Swallowing an exception raised while writing non-app data
java.nio.channels.ClosedChannelException
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Hot threads output:
0.0% (111.6micros out of 500ms) cpu usage by thread 'elasticsearch[...][transport_client_timer][T#1]{Hashed wheel timer #1}'
  10/10 snapshots sharing following 5 elements
    java.lang.Thread.sleep(Native Method)
    org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
    org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
    org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    java.lang.Thread.run(Thread.java:745)
After reading http://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/ I understand a little better how the communication works. I have not verified it yet, but I believe the problem lies there. The thing is, even if I confirm that the problem is idle connections being closed, how do I deal with it? Keep the configuration as is and just reconnect? Disable keepAlive? If so, is there anything else I should worry about?
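One mitigation I am considering, regardless of the keepAlive question, is simply retrying failed requests so the transport client gets a chance to reconnect on its next sampler tick. A minimal generic sketch with only the JDK (the simulated failure stands in for NoNodeAvailableException; attempt counts and backoff are placeholders):

```java
import java.util.function.Supplier;

public class RetryOnDisconnect {
    // Retry a request up to maxAttempts times with linear backoff, giving
    // the transport client time to re-establish a closed connection.
    static <T> T withRetry(Supplier<T> request, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (RuntimeException e) { // e.g. NoNodeAvailableException
                last = e;
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) {
        // Simulated flaky request: fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("simulated NoNodeAvailableException");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

I am not sure whether retrying is enough on its own, or whether it just hides the underlying connection churn, which is part of my question.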