Unresponsive actor system: ThreadPoolExecutor only creates a core-sized pool of threads, apparently ignoring the maximum pool size

Update: I found that my program stays responsive if I set the ThreadPoolExecutor's core pool size to the maximum pool size (29 threads). However, if I set the core pool size to 11 and the maximum pool size to 29, the actor system only ever creates 11 threads. How can I configure the ActorSystem / ThreadPoolExecutor to keep creating threads beyond the core count, up to the maximum number of threads? I would prefer not to set the core thread count to the maximum, since the extra threads are only needed for job cancellation (which should be a rare event).


I have a batch program that works against an Oracle database, implemented in Java / Akka with typed actors, with the following members:

  • BatchManager is responsible for talking to the REST controller. It manages a Queue of uninitialized batch jobs; when an uninitialized batch job is enqueued, it is converted into a JobManager actor and executed.
  • JobManager maintains the queue of stored procedures and the pool of Workers; it initializes each Worker with a stored procedure, and when a Worker finishes it sends the procedure's result back to the JobManager, and the JobManager sends the Worker another stored procedure. The batch ends when the job queue is empty and all Workers are idle, at which point the JobManager reports its results to the BatchManager, stops its Workers (via TypedActor.context().stop()), and then shuts itself down. The JobManager has a Promise<Status> completion that is completed when the job finishes, or when the job is terminated by a cancellation or a fatal exception.
  • Worker executes the stored procedures. It creates an OracleConnection and a CallableStatement to run each procedure, and registers an onFailure callback with JobManager.completion that aborts the connection and cancels the statement. This callback does not use the actor system's execution context; instead, it uses an execution context built from a cached executor service created in the BatchManager.

My configuration

 {"akka" : { "actor" : { "default-dispatcher" : {
   "type" : "Dispatcher",
   "executor" : "default-executor",
   "throughput" : "1",
   "default-executor" : { "fallback" : "thread-pool-executor" },
   "thread-pool-executor" : {
     "keep-alive-time" : "60s",
     "core-pool-size-min" : coreActorCount,
     "core-pool-size-max" : coreActorCount,
     "max-pool-size-min" : maxActorCount,
     "max-pool-size-max" : maxActorCount,
     "task-queue-size" : "-1",
     "task-queue-type" : "linked",
     "allow-core-timeout" : "on"
   }
 }}}}

The worker count is configured elsewhere; currently workerCount = 8, coreActorCount is workerCount + 3, and maxActorCount is workerCount * 3 + 5. I am testing this on a MacBook Pro 10 with two cores and 8 GB of memory; the production server is much beefier. The database I'm talking to is behind an extremely slow VPN. I'm running all of this on an Oracle Java SE 1.8 JVM. The local server is Tomcat 7. The Oracle JDBC drivers are version 10.2 (I might be able to convince the powers that be to use a newer version). All methods return void or Future<?> and are supposed to be non-blocking.
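Spelling out the pool-size arithmetic above (a trivial sketch; with workerCount = 8 these formulas produce the 11-core / 29-max figures quoted later in the question):

```java
public class PoolSizes {
    public static void main(String[] args) {
        int workerCount = 8;
        int coreActorCount = workerCount + 3;      // 8 + 3  = 11 core threads
        int maxActorCount  = workerCount * 3 + 5;  // 24 + 5 = 29 max threads
        System.out.println(coreActorCount + " core, " + maxActorCount + " max");
        // → 11 core, 29 max
    }
}
```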

When one batch completes successfully there is no problem — the next batch starts immediately with a full complement of workers. However, if I kill the current batch via JobManager#completion.tryFailure(new CancellationException("Batch cancelled")), then the onFailure callbacks registered by the Workers fire, and after that the system stops responding. Debug printlns show that the new batch starts with three of its eight workers running, and that the BatchManager becomes completely unresponsive (I added a Future<String> ping command that simply returns Futures.successful("ping"), and that times out too). The onFailure callbacks run on a separate thread pool, and even if they ran on the actor system's pool, my max-pool-size should be high enough to accommodate the original JobManager, its Workers, its onFailure callbacks, and the second JobManager and its Workers. Instead, I seem to have threads for the original JobManager and its Workers, the new JobManager and fewer than half of its Workers, and nothing left over for the BatchManager. The machine I'm running this on isn't short on resources — it ought to be able to run more than a dozen threads.

Is this a configuration problem? Is it due to a limit imposed by the JVM and/or Tomcat? Is it a problem with how I handle blocking I/O? There are probably several other things I could be doing wrong — those are just the ones that came to mind.

  • Gist of CancellableStatement, where the CallableStatement and OracleConnection are cancelled
  • Gist of Cancellable, created where the CancellableStatements are created
  • Gist of the JobManager cleanup code
  • Config dump, obtained via System.out.println(mergedConfig.toString());


Edit: I believe I have narrowed the problem down to the actor system (either its configuration, or its interaction with blocking database calls). I removed the Worker actors and moved their workload into Runnables run on a fixed-size ThreadPoolExecutor, where each JobManager creates its own ThreadPoolExecutor and shuts it down when the batch completes (shutdown() on normal termination, shutdownNow() on exceptional termination). Cancellation runs on the cached thread pool created in the BatchManager. The actor system still uses a ThreadPoolExecutor, but with only half a dozen threads. With this alternative configuration, cancellation behaves as expected — the workers are interrupted when their database connections are aborted, and the new JobManager starts immediately with a full set of workers. This indicates that the problem is not with the hardware / JVM / Tomcat.
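The per-job executor lifecycle described above can be sketched as follows (a minimal sketch under stated assumptions: the pool size, the sleep standing in for a blocking stored-procedure call, and all names are illustrative, not the actual JobManager code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerJobPool {
    public static void main(String[] args) throws InterruptedException {
        // Each JobManager owns one fixed-size pool, sized to the worker count.
        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                try {
                    Thread.sleep(60_000); // stand-in for a blocking stored-procedure call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shutdownNow() interrupts us here
                }
            });
        }
        // Exceptional termination: interrupt all in-flight workers.
        workers.shutdownNow();
        boolean done = workers.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("terminated: " + done); // → terminated: true
    }
}
```

On normal termination, shutdown() would be called instead, letting queued work drain before the pool exits.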


Update: I took a thread dump using the Eclipse Memory Analyzer. I found that the cancel threads were hanging on CallableStatement.close(), so I reordered the cancellation so that OracleConnection.abort() precedes CallableStatement.cancel(), and that fixed the problem — all the cancellations (apparently) ran correctly. The worker threads continued to hang on their statements, though — I suspect my VPN may be partially or fully to blame for that.
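The reordered cancellation can be sketched against the plain JDBC interfaces. This is a minimal sketch, not the post's actual code: the post calls the Oracle-specific OracleConnection.abort(), whereas standard JDBC 4.1 exposes Connection.abort(Executor); the helper class and method names here are illustrative.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executor;

public class CancelOrder {
    /**
     * Cancel a hung database call: abort the connection FIRST, then cancel
     * the statement. Cancelling/closing the statement first can block
     * forever waiting on the dead socket — the hang described above.
     */
    public static void cancel(Connection conn, Statement stmt, Executor ex)
            throws SQLException {
        conn.abort(ex);  // kill the physical connection out from under the call
        stmt.cancel();   // now the statement can be cancelled without blocking
    }
}
```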

 PerformanceAsync-akka.actor.default-dispatcher-19
   at java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I (Native Method)
   at java.net.SocketInputStream.read([BIII)I (SocketInputStream.java:150)
   at java.net.SocketInputStream.read([BII)I (SocketInputStream.java:121)
   at oracle.net.ns.Packet.receive()V (Unknown Source)
   at oracle.net.ns.DataPacket.receive()V (Unknown Source)
   at oracle.net.ns.NetInputStream.getNextPacket()V (Unknown Source)
   at oracle.net.ns.NetInputStream.read([BII)I (Unknown Source)
   at oracle.net.ns.NetInputStream.read([B)I (Unknown Source)
   at oracle.net.ns.NetInputStream.read()I (Unknown Source)
   at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1()S (T4CMAREngine.java:1109)
   at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1()B (T4CMAREngine.java:1080)
   at oracle.jdbc.driver.T4C8Oall.receive()V (T4C8Oall.java:485)
   at oracle.jdbc.driver.T4CCallableStatement.doOall8(ZZZZ)V (T4CCallableStatement.java:218)
   at oracle.jdbc.driver.T4CCallableStatement.executeForRows(Z)V (T4CCallableStatement.java:971)
   at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout()V (OracleStatement.java:1192)
   at oracle.jdbc.driver.OraclePreparedStatement.executeInternal()I (OraclePreparedStatement.java:3415)
   at oracle.jdbc.driver.OraclePreparedStatement.execute()Z (OraclePreparedStatement.java:3521)
   at oracle.jdbc.driver.OracleCallableStatement.execute()Z (OracleCallableStatement.java:4612)
   at com.util.CPProcExecutor.execute(Loracle/jdbc/OracleConnection;Ljava/sql/CallableStatement;Lcom/controller/BaseJobRequest;)V (CPProcExecutor.java:57)

However, even with the corrected cancellation order, I still have the problem of the actor system not creating enough threads: I still get only three of the eight workers in the new batch, with new workers added as the cancelled workers' network connections time out. In total I have 11 threads — my core pool size — out of the 29 threads of my maximum pool size. Apparently the actor system is ignoring my maximum pool size setting, or I'm not setting the maximum pool size correctly.

2 answers

(Disclaimer: I do not know Akka.)

With task-queue-size = -1 in your configuration, I think the task queue is unbounded:

  "task-queue-size": "-1", "task-queue-type": "linked" 

A ThreadPoolExecutor will not grow beyond the core pool size as long as the work queue is not full and tasks can still be queued. Only when the task queue is full will it start spawning threads, up to the maximum.

From the ThreadPoolExecutor javadoc: if fewer than corePoolSize threads are running, the Executor always prefers adding a new thread rather than queuing. If corePoolSize or more threads are running, the Executor always prefers queuing a request rather than adding a new thread. If a request cannot be queued, a new thread is created unless this would exceed maximumPoolSize, in which case the task will be rejected.
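This behavior is easy to demonstrate with a bare ThreadPoolExecutor (a standalone sketch, not the questioner's code; the 2/8 sizes are arbitrary): with an unbounded LinkedBlockingQueue, blocked tasks pile up in the queue and the pool never grows past its core size.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // core 2, max 8 — but the unbounded queue means it never grows past 2.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch block = new CountDownLatch(1);
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try { block.await(); } catch (InterruptedException ignored) {}
            });
        }
        Thread.sleep(200); // let the pool settle
        System.out.println("pool size: " + pool.getPoolSize()); // → pool size: 2
        block.countDown();
        pool.shutdown();
    }
}
```

With a bounded queue (e.g. new LinkedBlockingQueue<>(2)) and the same submissions, the pool would grow toward its maximum once the queue fills.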

Please check whether setting a bounded queue size makes the pool grow to the maximum number of threads. Thanks.
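A sketch of that change against the dispatcher config from the question — the queue size of 100 is an arbitrary illustrative value, not a recommendation:

```
"thread-pool-executor" : {
  ...
  "task-queue-size" : "100",   /* bounded: once full, the pool grows toward max-pool-size */
  "task-queue-type" : "linked"
}
```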


There is not enough code here to offer a solution, but when the system stops responding you can check system resource usage (CPU, RAM); if those look normal, check the Oracle database.

When you kill a group of connections at once, a standoff can occur: I think there are blocking sessions at the Oracle level (one write transaction blocking other write transactions on the same resources).

While it is in the unresponsive state, check for session locks:

 SELECT s1.username || '@' || s1.machine
        || ' ( SID=' || s1.sid || ' ) is blocking '
        || s2.username || '@' || s2.machine
        || ' ( SID=' || s2.sid || ' ) ' AS blocking_status
   FROM v$lock l1, v$session s1, v$lock l2, v$session s2
  WHERE s1.sid = l1.sid
    AND s2.sid = l2.sid
    AND l1.block = 1
    AND l2.request > 0
    AND l1.id1 = l2.id1
    AND l1.id2 = l2.id2
