Based on your answer that the blocking call compares the provided input with each individual element of the RDD, I would strongly recommend rewriting the comparison in Java/Scala so that it can run inside your Spark job. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the reduced complexity and increased stability you gain in the Spark job by removing the remote calls will probably be worth the effort.
It seems unlikely that your remote service can handle 3,000 calls per second, so an in-process local version would be preferable.
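As a minimal sketch of what "pure and in-process" means here, assuming a hypothetical `Data` type and an illustrative similarity function (neither comes from the original question):

```scala
// Hypothetical element type; stand-in for whatever your RDD holds.
case class Data(features: Vector[Double])

// Pure: no side effects, output depends only on the inputs, so it is
// safe to ship to Spark executors inside a map.
def compare(input: Data, elem: Data): Double =
  input.features.zip(elem.features).map { case (a, b) => math.abs(a - b) }.sum

// Inside the Spark job it would be used roughly as:
//   val scores: RDD[Double] = inputData.map(elem => compare(queryInput, elem))

// Local check, no cluster needed:
val d1 = Data(Vector(1.0, 2.0))
val d2 = Data(Vector(1.5, 1.0))
println(compare(d1, d2)) // 1.5
```

Because `compare` closes over nothing mutable, Spark can serialize it to executors and run it at full cluster parallelism, with no external service in the loop.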
If that is truly impossible for some reason, you can create an RDD transformation that turns your data into an RDD of futures. In pseudo-code:

```scala
def callRemote(data: Data): Future[Double] = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Future[Double]] = inputData.map(callRemote)
```
And then continue from there, computing on your Future[Double] objects.
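One common way to "continue from there" is to gather the futures per batch and block once for the whole group. The sketch below uses a stand-in `callRemote` so it runs locally; the Spark usage in the comment is the pattern, not code from the original post:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for the real remote call, so this sketch is runnable locally.
def callRemote(data: Int): Future[Double] = Future(data * 2.0)

// In Spark this would typically run per partition, e.g.:
//   inputData.mapPartitions { it =>
//     val futures = it.map(callRemote).toList
//     Await.result(Future.sequence(futures), 10.minutes).iterator
//   }

// Local demonstration of the same pattern:
val futures = List(1, 2, 3).map(callRemote)
val results = Await.result(Future.sequence(futures), 10.seconds)
println(results) // List(2.0, 4.0, 6.0)
```

`Future.sequence` preserves input order, so results line up with the original elements; the timeout on `Await.result` keeps a hung remote call from stalling the task forever.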
If you know how much parallelism your remote process can handle, it may be best to abandon the Future approach, accept that the remote service is the bottleneck, and cap the job's parallelism at that limit, e.g.:

```scala
val remoteParallelism: Int = 100
```
Your job will probably take quite a while, but it should not overwhelm your remote service and die horribly.
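A hedged sketch of what capping in-flight calls can look like, again with an illustrative service stub (the batching helper and its names are assumptions, not from the original post):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val remoteParallelism: Int = 100

// Stand-in for the real remote call.
def callRemote(data: Int): Future[Double] = Future(data + 0.5)

// Process inputs in windows of remoteParallelism so no more than that
// many requests are ever outstanding at once. In Spark, repartitioning
// to remoteParallelism and making one synchronous call per element gives
// a similar cap:
//   inputData.repartition(remoteParallelism)
//            .map(d => Await.result(callRemote(d), 1.minute))
def throttledCalls(inputs: Seq[Int]): Seq[Double] =
  inputs.grouped(remoteParallelism).flatMap { batch =>
    Await.result(Future.sequence(batch.map(callRemote)), 1.minute)
  }.toSeq

println(throttledCalls(1 to 250).take(3)) // first three results: 1.5, 2.5, 3.5
```

The trade-off is deliberate: throughput is bounded by `remoteParallelism`, but the service sees a steady, survivable request rate instead of 3,000 calls per second.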
The final option: if the inputs are reasonably predictable and the range of results is consistent and bounded by some reasonable number of outputs (millions or so), you can precompute them all as a dataset using the remote service, and look them up at Spark time with a join.
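A minimal sketch of the precompute-and-join idea; the keys and values are illustrative, and the Spark calls in the comments show the pattern rather than code from the original post:

```scala
// Built once, offline, by calling the remote service over the
// predictable input space and saving the results.
val precomputed: Map[String, Double] = Map("a" -> 1.0, "b" -> 2.0)

// In Spark, the same idea as a join of two keyed RDDs:
//   val resultsRDD: RDD[(String, Double)] = sc.parallelize(precomputed.toSeq)
//   val joined = inputData.keyBy(identity).join(resultsRDD)
// Or, if the precomputed table fits in memory, broadcast it and do a
// map-side lookup, avoiding the shuffle entirely:
//   val bcast = sc.broadcast(precomputed)
//   inputData.map(d => d -> bcast.value(d))

// Local demonstration of the lookup:
println(List("a", "b").map(precomputed)) // List(1.0, 2.0)
```

Whether to join or broadcast depends on the table's size: millions of small entries usually still broadcast comfortably, while anything larger is better handled by the shuffle-based join.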
DPM