Spark Streaming Kafka Back Pressure

We have a Spark Streaming application that reads data from a Kafka queue with a receiver, does some transformation, and writes the output to HDFS. The batch interval is 1 minute, and we have already set the backpressure parameters and spark.streaming.receiver.maxRate, so it works fine in most cases.
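
For reference, a minimal sketch of the setup described above (receiver-based Kafka input, 1-minute batches, backpressure and maxRate enabled). The topic name, ZooKeeper quorum, group id, output path, and rate value are placeholders, and the master is assumed to come from spark-submit:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Enable backpressure and cap the per-receiver ingestion rate.
    val conf = new SparkConf()
      .setAppName("KafkaToHdfs")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.receiver.maxRate", "10000")   // records/sec per receiver, example value

    val ssc = new StreamingContext(conf, Minutes(1))      // 1-minute batch interval

    // Receiver-based Kafka stream (old spark-streaming-kafka 0.8 API).
    val lines = KafkaUtils
      .createStream(ssc, "zk1:2181", "hdfs-writer-group", Map("events" -> 1))
      .map(_._2)

    // Some conversion, then write each batch out to HDFS.
    lines.map(_.trim).saveAsTextFiles("hdfs:///data/output/batch")

    ssc.start()
    ssc.awaitTermination()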

But we still have one problem. When HDFS goes down completely, the batch jobs freeze for as long as the outage lasts (say HDFS is unavailable for 4 hours, then the jobs hang for 4 hours), but the receiver does not know that the jobs have not finished, so it keeps receiving data for those 4 hours. This eventually causes an OOM exception, the whole application dies, and we lose a lot of data.

So my question is: is it possible for the receiver to know that batches are not completing, so that it receives less (or even no) data, and then, once the backlog is processed, starts receiving more data to catch up? In the scenario above, while HDFS is down the receiver would read less data from Kafka, the blocks generated during those 4 hours would stay small, and neither the receiver nor the application would crash; after HDFS recovers, the receiver would read more data and start catching up.

1 answer

Setting spark.streaming.backpressure.enabled=true is the right approach; backpressure is exactly the mechanism that should keep the receiver from buffering itself into an OOM when batches fall behind. The ingestion rate is driven by a PID rate estimator, which you can tune with the following settings (a configuration sketch follows the list):

  • spark.streaming.backpressure.pid.proportional - how strongly the estimator reacts to the error of the last batch (default 1.0)
  • spark.streaming.backpressure.pid.integral - how strongly it reacts to the accumulated error, effectively a dampener (default 0.2)
  • spark.streaming.backpressure.pid.derived - how strongly it reacts to the trend of the error, useful for reacting quickly to changes (default 0.0)
  • spark.streaming.backpressure.pid.minRate - the minimum rate, in records per second, that the estimator will ever recommend; it depends on your batch frequency and throughput (default 100)
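
As an illustration, these can be set on the SparkConf (or passed as --conf options to spark-submit) before the StreamingContext is created. The PID weights below are simply the defaults, and the lowered minRate is an example value, not a recommendation:

    import org.apache.spark.SparkConf

    // Illustrative values only; tune them against your own batch interval and throughput.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.pid.proportional", "1.0") // weight of the last batch's error
      .set("spark.streaming.backpressure.pid.integral", "0.2")     // weight of the accumulated error (dampener)
      .set("spark.streaming.backpressure.pid.derived", "0.0")      // weight of the error trend
      .set("spark.streaming.backpressure.pid.minRate", "10")       // floor in records/sec, below the default of 100

In the scenario from the question, minRate is the setting that matters most: even when backpressure has throttled all the way down, each receiver still pulls at least minRate records per second, so with the default of 100 a 4-hour HDFS outage still buffers well over a million records per receiver.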


