Kafka batch size limit when using Spark Streaming

Is it possible to limit the size of batches returned by a Kafka consumer for Spark Streaming?

I ask because the first batch I receive contains hundreds of millions of records, and it takes a lot of time to process and checkpoint.

2 answers

I think your problem can be solved with Spark Streaming backpressure.

Check out spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I assume Spark will take as much as it can.
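A minimal sketch of how these properties can be set on the SparkConf before the StreamingContext is created (the application name and batch interval here are illustrative assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Backpressure is disabled by default; turn it on explicitly.
    val conf = new SparkConf()
      .setAppName("kafka-streaming-backpressure") // illustrative name
      .set("spark.streaming.backpressure.enabled", "true")

    // 10-second batch interval, chosen only for illustration.
    val ssc = new StreamingContext(conf, Seconds(10))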

From the Apache Spark Kafka configuration documentation:

spark.streaming.backpressure.enabled:

This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values of spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition, if they are set (see below).

And since you want to control the first batch, or to be more specific, the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.

It's useful when your Spark job (that is, your Spark workers) can process, say, 10,000 messages from Kafka, but the Kafka brokers hand your job 100,000 messages.
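A sketch of that scenario (the 10,000 figure is just the assumption from the example above): capping the initial rate keeps the first batch near what the workers can actually handle.

    // Cap the receiving rate for the first batch only; afterwards,
    // backpressure adapts the rate from observed processing times.
    // Note: initialRate is records per second, so the first batch holds
    // roughly initialRate * batch interval records per receiver/partition.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "10000")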

You might also want to check out spark.streaming.kafka.maxRatePerPartition, as well as Jeroen van Wilgenburg's research and suggestions on these properties, with a real-life example, on his blog.


In addition to the answers above: the batch size is the product of three parameters

  • batchDuration: the time interval at which streaming data will be divided into batches (in seconds).
  • spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages per partition per second.
  • The number of partitions in the Kafka topic.

For a better sense of how this product behaves when backpressure is enabled versus disabled (spark.streaming.kafka.maxRatePerPartition applies to streams created with createDirectStream), see the sketch below.
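As an illustration of the product (the broker address, topic name, group id, and all numbers are assumptions): a 10-second batch interval over an 8-partition topic, capped at 1,000 records per partition per second, is bounded by 10 × 1,000 × 8 = 80,000 records per batch.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    // Static per-partition cap: 1,000 records per second per partition.
    val conf = new SparkConf()
      .setAppName("kafka-batch-size-cap") // illustrative name
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    // batchDuration = 10 s; with an 8-partition topic each batch is
    // bounded by 10 * 1000 * 8 = 80,000 records.
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // assumed broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group"   // assumed group id
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Set("events"), kafkaParams) // assumed topic
    )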

