Spark Streaming with Kafka - createDirectStream vs createStream

We have been using Spark Streaming with Kafka for a while, and so far we have used the createStream method from KafkaUtils.

We just started to explore createDirectStream and we like it, for two reasons:

1) Better / easier "exactly once" semantics

2) Better correlation of Kafka partitions with RDD partitions

I noticed that createDirectStream is marked as experimental. I have a question (sorry if this is not very specific):

Should we explore the createDirectStream method if exactly-once semantics are very important to us? It would be great if you could share your experience with it. Do we run the risk of running into other issues, such as reliability, etc.?
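
For reference, this is roughly what the two calls look like (a minimal sketch against the spark-streaming-kafka 0.8 connector; the ZooKeeper/broker addresses, group id, and topic names are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-example"), Seconds(10))

// Receiver-based stream: the high-level consumer tracks offsets in ZooKeeper
val receiverStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))

// Direct stream: Spark tracks offsets itself; one RDD partition per Kafka partition
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "kafka-broker:9092"), Set("my-topic"))
```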

apache-spark apache-kafka spark-streaming
1 answer

The creator of the direct approach (Cody) has written a great, extensive blog post here.

Generally speaking, reading the Kafka delivery semantics section, the last part reads:

So effectively Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system, but Kafka provides the offset which makes implementing this straightforward.

This basically means "we give you at-least-once out of the box; if you want exactly-once, that's on you". Furthermore, the blog post talks about the "exactly once" semantics guarantee you get from Spark with both approaches (direct and receiver-based, emphasis mine):

Second, understand that Spark does not guarantee exactly-once semantics for output actions. When the Spark Streaming guide talks about exactly-once, it is only referring to a given item in an RDD being included in a calculated value once, in a purely functional sense. Any side-effecting output operations (i.e. anything you do in foreachRDD to save the result) may be repeated, because any stage of the process might fail and be retried.

In addition, this is what the Spark documentation says about receiver-based processing:

The first approach (Receiver-based) uses Kafka's high-level API to store consumed offsets in ZooKeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write-ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures.

Essentially, this means that if you use the Receiver-based stream with Spark, you may still end up with duplicated data in case an output operation fails and is retried, i.e. at-least-once.
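
One common way to cope with replayed output operations, regardless of which stream you use, is to make the writes idempotent. A minimal sketch, where stream is either of the Kafka DStreams and upsertIntoStore is a hypothetical keyed write against your external store (not a real Spark or Kafka API):

```scala
// Idempotent output: replayed batches overwrite the same rows instead of
// appending duplicates. upsertIntoStore is a placeholder for a keyed write
// against whatever external store you use.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (key, value) =>
      upsertIntoStore(key, value)   // writing the same key twice leaves the same final state
    }
  }
}
```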

In my project, I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly-once semantics, you can store the offsets along with the data in a transaction-like manner: if one fails, the other fails as well.
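
A rough sketch of that idea with the direct stream, using the HasOffsetRanges interface from the 0.8 connector; beginTransaction, saveResults and saveOffsets are placeholders for whatever your destination store provides (for instance a JDBC transaction), not a real API:

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // offset ranges must be read on the driver, before any shuffle/repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.map { case (_, value) => value.toUpperCase }.collect()  // stand-in processing

  val tx = beginTransaction()          // placeholder for your store's transaction API
  try {
    saveResults(tx, results)           // write the computed data ...
    saveOffsets(tx, offsetRanges)      // ... and the consumed offsets in the same transaction
    tx.commit()                        // either both are stored, or neither is
  } catch {
    case e: Exception => tx.rollback(); throw e
  }
}
```

On restart you would read the stored offsets back and pass them to the createDirectStream variant that accepts starting offsets, so processing resumes exactly where the last committed transaction left off.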

I recommend reading the blog post (link above) and the delivery semantics section of the Kafka documentation. To conclude, I definitely recommend you look into the direct streaming approach.
