I wrote a topology that reads topics from Kafka, does some aggregation, and then saves the result in a database. The topology works fine for several hours, but then the worker dies, and eventually the supervisor dies too. This happens every time, a few hours after startup.
I am running Storm 0.9.5 on 3 nodes (1 for Nimbus, 2 for workers).
This is the error I get in one of the worker logs:
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [ERROR] connection attempt 101 to Netty-Client-/10.28.18.213:6700 failed: java.lang.RuntimeException: Returned channel was actually not established
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.404+0000 STDIO [ERROR] Aug 12, 2015 4:10:38 AM org.apache.storm.guava.util.concurrent.ExecutionList executeListener
SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@632ef20f with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@1f15e9a8
java.lang.RuntimeException: Failed to connect to Netty-Client-/10.28.18.213:6700
    at backtype.storm.messaging.netty.Client.connect(Client.java:308)
    at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
    at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
    at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
    at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
    at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
    at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
    at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
    at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
    at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:380)
    at java.util.concurrent.FutureTask.set(FutureTask.java:229)
    at java.util.concurrent.FutureTask.run(FutureTask.java:270)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-/10.28.18.213:6700 after 102 failed attempts
    at backtype.storm.messaging.netty.Client.connect(Client.java:303)
    ... 19 more
And this is my configuration for each worker node:
storm.zookeeper.servers:
    - "10.28.19.230"
    - "10.28.19.224"
    - "10.28.19.223"
storm.zookeeper.port: 2181
nimbus.host: "10.28.18.211"
storm.local.dir: "/mnt/storm/storm-data"
storm.local.hostname: "10.28.18.213"
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 300
storm.messaging.netty.max_wait_ms: 4000
storm.messaging.netty.min_wait_ms: 100
supervisor.slots.ports:
    - 6700
supervisor.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346
#worker.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=1%ID%"
#supervisor.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346"
worker.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID% -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx10240m -Xms10240m -XX:MaxNewSize=6144m
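For reference, here is a rough sketch of how long a Netty client could keep retrying under the settings above. It assumes only that every wait between attempts lies somewhere between min_wait_ms and max_wait_ms; I have not checked Storm's exact backoff formula, and the function name is made up for illustration. It computes the bounds both for the configured 300 retries and for the 102 attempts reported in the log.

```python
# Rough bounds on the total time a Netty client spends waiting between
# reconnect attempts. Assumption (not verified against Storm's source):
# each individual wait lies between min_wait_ms and max_wait_ms.

def retry_window_bounds_ms(attempts, min_wait_ms, max_wait_ms):
    """Return (shortest, longest) total wait in ms over `attempts` retries."""
    if attempts < 0 or min_wait_ms < 0 or min_wait_ms > max_wait_ms:
        raise ValueError("invalid retry settings")
    return attempts * min_wait_ms, attempts * max_wait_ms


if __name__ == "__main__":
    # Values taken from the configuration above.
    MIN_WAIT_MS = 100
    MAX_WAIT_MS = 4000

    # 300 is the configured max_retries; 102 is the attempt count in the log.
    for label, attempts in (("configured", 300), ("seen in log", 102)):
        low, high = retry_window_bounds_ms(attempts, MIN_WAIT_MS, MAX_WAIT_MS)
        print(f"{label}: {attempts} attempts, "
              f"between {low / 1000:.1f} s and {high / 60000:.1f} min of waiting")
```

Under that assumption, 300 retries give at most 20 minutes of waiting and 102 attempts give at most about 6.8 minutes.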