Storm worker dies after running for 2 hours

I wrote a topology that reads topics from Kafka, does some aggregation, and then saves the results to a database. The topology runs fine for several hours, but then the worker dies and eventually the supervisor dies too. This happens every time, a few hours after startup.

I run Storm 0.9.5 on 3 nodes (1 for nimbus, 2 for workers).

This is the error I see in one of the worker logs:

2015-08-12T04:10:38.395+0000 b.s.m.n.Client [ERROR] connection attempt 101 to Netty-Client-/10.28.18.213:6700 failed: java.lang.RuntimeException: Returned channel was actually not established
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.404+0000 STDIO [ERROR] Aug 12, 2015 4:10:38 AM org.apache.storm.guava.util.concurrent.ExecutionList executeListener
SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@632ef20f with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@1f15e9a8
java.lang.RuntimeException: Failed to connect to Netty-Client-/10.28.18.213:6700
    at backtype.storm.messaging.netty.Client.connect(Client.java:308)
    at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
    at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
    at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
    at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
    at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
    at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
    at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
    at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
    at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:380)
    at java.util.concurrent.FutureTask.set(FutureTask.java:229)
    at java.util.concurrent.FutureTask.run(FutureTask.java:270)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-/10.28.18.213:6700 after 102 failed attempts
    at backtype.storm.messaging.netty.Client.connect(Client.java:303)
    ... 19 more

And this is my configuration for each worker node:

storm.zookeeper.servers:
    - "10.28.19.230"
    - "10.28.19.224"
    - "10.28.19.223"
storm.zookeeper.port: 2181
nimbus.host: "10.28.18.211"
storm.local.dir: "/mnt/storm/storm-data"
storm.local.hostname: "10.28.18.213"
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 300
storm.messaging.netty.max_wait_ms: 4000
storm.messaging.netty.min_wait_ms: 100
supervisor.slots.ports:
    - 6700
supervisor.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346
#worker.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=1%ID%"
#supervisor.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346"
worker.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID% -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx10240m -Xms10240m -XX:MaxNewSize=6144m
1 answer

I think I found the root cause in this thread: https://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%3C20140214170209.GC55319@animetrics.com%3E

The Netty client errors are only a symptom; the root cause was that GC logging was enabled for the workers without specifying an output file. As a result, the GC output went to standard output, which is not redirected to logback. Eventually the buffer filled up, the JVM hung (and therefore stopped heartbeating) and the worker was killed. How long a worker runs before this happens depends on memory pressure and heap size (obviously, in hindsight: the more GC activity, the larger the GC log and the faster the buffer fills up).
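
For reference, a minimal sketch of how the worker.childopts from the question could be changed so that GC output goes to a file instead of the worker's stdout; the -Xloggc path is just an example (adapt it to your own layout), and I'm reusing Storm's %ID% placeholder, which the question already uses for the JMX port:

# same JVM flags as before, plus -Xloggc so GC output is written to a per-worker file instead of stdout
worker.childopts: "-Xmx10240m -Xms10240m -XX:MaxNewSize=6144m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:/mnt/storm/logs/gc-worker-%ID%.log -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID%"

Alternatively, simply dropping the -verbose:gc / -XX:+PrintGC* flags should also avoid the problem, at the cost of losing GC diagnostics.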

Hope this helps you guys.
