Grizzly pipe leak - what am I doing wrong?

I wrote the following test code:

    @Test
    public void testLeakWithGrizzly() throws Throwable {
        ExecutorService executor = Executors.newFixedThreadPool(N_THREADS);
        Set<Future<Void>> futures = new HashSet<>();
        InetSocketAddress inetSocketAddress = new InetSocketAddress(localhostAddress, 111);
        for (int i = 0; i < N_THREADS; i++) {
            Future<Void> future = executor.submit(new GrizzlyConnectTask(inetSocketAddress,
                    requests, bindFailures, successfulOpens, failedOpens, successfulCloses, failedCloses));
            futures.add(future);
        }
        for (Future<Void> future : futures) {
            future.get(); //block
        }
        Thread.sleep(1000); //let everything calm down
        reporter.report();
        throw causeOfDeath;
    }

    private static class GrizzlyConnectTask implements Callable<Void> {
        private final InetSocketAddress address;
        private final Meter requests;
        private final Meter bindFailures;
        private final Counter successfulOpens;
        private final Counter failedOpens;
        private final Counter successfulCloses;
        private final Counter failedCloses;

        public GrizzlyConnectTask(InetSocketAddress address, Meter requests, Meter bindFailures,
                                  Counter successfulOpens, Counter failedOpens,
                                  Counter successfulCloses, Counter failedCloses) {
            this.address = address;
            this.requests = requests;
            this.bindFailures = bindFailures;
            this.successfulOpens = successfulOpens;
            this.failedOpens = failedOpens;
            this.successfulCloses = successfulCloses;
            this.failedCloses = failedCloses;
        }

        @Override
        public Void call() throws Exception {
            while (!die) {
                TCPNIOTransport transport = null;
                boolean opened = false;
                try {
                    transport = TCPNIOTransportBuilder.newInstance().build();
                    transport.start();
                    transport.connect(address).get(); //block
                    opened = true;
                    successfulOpens.inc(); //successful open
                    requests.mark();
                } catch (Throwable t) {
                    //noinspection ThrowableResultOfMethodCallIgnored
                    Throwable root = getRootCause(t);
                    if (root instanceof BindException) {
                        bindFailures.mark(); //ephemeral port exhaustion
                        continue;
                    }
                    causeOfDeath = t;
                    die = true;
                } finally {
                    if (!opened) {
                        failedOpens.inc();
                    }
                    if (transport != null) {
                        try {
                            transport.shutdown().get(); //block
                            successfulCloses.inc(); //successful close
                        } catch (Throwable t) {
                            failedCloses.inc();
                            System.err.println("while trying to close transport");
                            t.printStackTrace();
                        }
                    } else {
                        //no transport == successful close
                        successfulCloses.inc();
                    }
                }
            }
            return null;
        }
    }

On my Linux laptop, this dies after ~5 minutes with the following exception:

    java.io.IOException: Too many open files
        at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
        at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:68)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
        at org.glassfish.grizzly.nio.Selectors.newSelector(Selectors.java:62)
        at org.glassfish.grizzly.nio.SelectorRunner.create(SelectorRunner.java:109)
        at org.glassfish.grizzly.nio.NIOTransport.startSelectorRunners(NIOTransport.java:256)
        at org.glassfish.grizzly.nio.NIOTransport.start(NIOTransport.java:475)
        at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:137)
        at net.radai.LeakTest$GrizzlyConnectTask.call(LeakTest.java:111)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The success/failure counters are as follows:

    -- Counters --------------------------------------------------------------------
    failedCloses
                 count = 0
    failedOpens
                 count = 40999
    successfulCloses
                 count = 177177
    successfulOpens
                 count = 136178

    -- Meters ----------------------------------------------------------------------
    bindFailures
                 count = 40998
             mean rate = 153.10 events/second
         1-minute rate = 144.61 events/second
         5-minute rate = 91.12 events/second
        15-minute rate = 39.56 events/second
    requests
                 count = 136178
             mean rate = 508.54 events/second
         1-minute rate = 547.38 events/second
         5-minute rate = 442.76 events/second
        15-minute rate = 391.53 events/second

which tells me that:

  • there were no close failures (failedCloses == 0)
  • every connection was either never created or was successfully closed (136178 + 40999 = 177177)
  • all open failures except the last one were ephemeral port exhaustion (40999 = 40998 + 1)

full github code here - https://github.com/radai-rosenblatt/oncrpc4j-playground/blob/master/src/test/java/net/radai/LeakTest.java

So am I somehow abusing the Grizzly API, or is this a real leak? (Note: I'm using Grizzly 2.3.12, which I know is not the latest. Upgrading would require convincing people, so I want to be sure this is not a user error on my end.)

EDIT - this leaks even when nothing is ever thrown. Cutting down to a single thread and adding a 2 ms sleep, ~800 pipes still leak over 50 minutes.
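For readers who want to watch the descriptor count climb the way the edit describes, one way on Linux is to sample /proc/self/fd between iterations. This is an illustrative JDK-only helper, not part of the original test; it assumes a Linux /proc filesystem and prints -1 where that is unavailable:

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.Selector;

public class FdCount {
    // number of file descriptors currently open in this JVM (Linux only, else -1)
    static int openFds() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) throws IOException {
        int before = openFds();
        Selector selector = Selector.open(); // +2 or +3 fds on Linux, depending on JDK version
        int during = openFds();
        selector.close();
        int after = openFds();
        System.out.println("before=" + before + " during=" + during + " after=" + after);
        System.out.println("delta while open: " + (during - before));
    }
}
```

Dropping such a probe into the test loop makes the leak visible long before "Too many open files" is thrown.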

2 answers

I found the problem deep inside Grizzly. It is an internal multithreading issue (a race condition). The file descriptors are leaked by the class sun.nio.ch.EPollSelectorImpl; each instance holds 3 file descriptors (2 for the interrupt pipe and 1 from the epoll_create syscall). Grizzly dispatches the close/shutdown in the SelectorRunner class:

    public synchronized void stop() {
        stateHolder.set(State.STOPPING);
        wakeupSelector();

        // we prefer the Selector thread to shut down the selector,
        // but if it is not running - do that ourselves
        if (runnerThreadActivityCounter.compareAndSet(0, -1)) {
            // The thread is not running
            shutdownSelector();
        }
    }

Usually everything is fine, but sometimes the selector never wakes up. wakeupSelector() dispatches an interrupt via the native method sun.nio.ch.EPollArrayWrapper#interrupt(int), which has a simple implementation:

    JNIEXPORT void JNICALL
    Java_sun_nio_ch_EPollArrayWrapper_interrupt(JNIEnv *env, jobject this, int fd)
    {
        int fakebuf[1];
        fakebuf[0] = 1;
        if (write(fd, fakebuf, 1) < 0) {
            JNU_ThrowIOExceptionWithLastError(env, "write to interrupt fd failed");
        }
    }

So it just writes one byte to wake up the waiting selector. But you close the transport immediately after creating it. That rarely happens in real life, but in your test case it happens regularly. Sometimes Grizzly calls NIOConnection.enableIOEvent after the close and the wakeup/interrupt have already happened. I think in that case the selector never wakes up and the file descriptors are never released.
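For readers unfamiliar with the mechanism: in plain JDK NIO, Selector.wakeup() does what the JNI snippet above shows - it writes one byte to the selector's internal interrupt pipe so that a blocked select() returns. A minimal JDK-only sketch of those semantics (illustrative, not Grizzly code):

```java
import java.io.IOException;
import java.nio.channels.Selector;

public class WakeupDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        Selector selector = Selector.open(); // on Linux this allocates the epoll fd and interrupt pipe

        Thread runner = new Thread(() -> {
            try {
                // blocks until wakeup() is called or a registered channel becomes ready
                int ready = selector.select();
                System.out.println("woke up, ready=" + ready);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        runner.start();

        Thread.sleep(200);  // let the runner block inside select()
        selector.wakeup();  // writes one byte to the interrupt pipe, select() returns 0
        runner.join();

        selector.close();   // releases all of the selector's descriptors
        System.out.println("selector open: " + selector.isOpen());
    }
}
```

The leak scenario is the failure mode of this pattern: if the runner thread never gets back into select() to observe the stop request, nothing ever calls close(), and the selector's descriptors stay allocated.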

For now, the only fix I can offer for this situation is to use a timer task to call selector.close directly after a short timeout:

    //hotfix code below
    private static final Timer timer = new Timer();
    //hotfix code above

    protected synchronized void stopSelectorRunners() {
        if (selectorRunners == null) {
            return;
        }

        for (int i = 0; i < selectorRunners.length; i++) {
            SelectorRunner runner = selectorRunners[i];
            if (runner != null) {
                runner.stop();
                //hotfix code below
                final Selector selector = runner.getSelector();
                if (selector != null) {
                    timer.schedule(new TimerTask() {
                        @Override
                        public void run() {
                            try {
                                selector.close();
                            } catch (IOException e) {
                            }
                        }
                    }, 100);
                }
                //hotfix code above
                selectorRunners[i] = null;
            }
        }
        selectorRunners = null;
    }

The leak stopped after I added this to org.glassfish.grizzly.nio.NIOTransport#stopSelectorRunners.


We found the actual underlying problem in Grizzly and fixed it.

The root of the problem, based on the test case, is that Transport.stop() is called early enough in the execution of SelectorRunner.run() that the run() method completes early (because the StateHolder is already in the STOPPING state at that point).

Additionally, since SelectorRunner.run() CAS-updates the selector activity state at the very beginning of the run() method, the thread calling Transport.stop() sees the selector runner as active. Because of these two conditions, SelectorRunner.shutdownSelector() is never called by either side, and the selector is leaked.
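The losing interleaving can be replayed deterministically with stand-in fields (the names state, activity, and selectorClosed are illustrative, not Grizzly's actual members): stop() skips the shutdown because its CAS from 0 to -1 fails, and run() skips it because it observes STOPPING and exits early - so neither side closes the selector.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostShutdownDemo {
    enum State { STARTED, STOPPING }

    static volatile State state = State.STARTED;
    static final AtomicInteger activity = new AtomicInteger(0); // 0 = not running
    static boolean selectorClosed = false;

    // what Transport.stop() does: only shuts the selector down itself
    // if it believes the runner thread never started
    static void stop() {
        state = State.STOPPING;
        if (activity.compareAndSet(0, -1)) {
            selectorClosed = true; // stands in for shutdownSelector()
        }
    }

    public static void main(String[] args) {
        // 1. runner thread enters run() and CAS-marks itself active
        activity.compareAndSet(0, 1);

        // 2. stop() runs now: sees activity == 1, so its CAS fails and it
        //    leaves the shutdown to the runner thread
        stop();

        // 3. the runner's state check now sees STOPPING and run() exits early,
        //    before the code path that would have closed the selector
        boolean keepRunning = (state != State.STOPPING);
        if (keepRunning) {
            selectorClosed = true; // normal path: runner closes the selector
        }

        System.out.println("selector closed: " + selectorClosed);
    }
}
```

With this ordering neither stop() nor the runner ever reaches the shutdown, which is exactly why the file descriptors accumulate in the test.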

A fix will be available in the evening.

