Kafka Streams concurrency?

I have basic Kafka Streams code that reads records from one topic, does some processing, and writes records to another topic.
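
Roughly like this (a simplified sketch; the topic names and the mapValues step stand in for my real processing):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class MyStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic"); // read from one topic
        input.mapValues(value -> value.toUpperCase())                  // do some processing
             .to("output-topic");                                      // write to another topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```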

How does Kafka Streams handle concurrency? Does everything run in a single thread? I can't find this in the documentation.

If it is single-threaded, I would like options for multi-threaded processing so I can handle large volumes of data.

If it is multi-threaded, I need to understand how it works and how resources, such as connections to SQL databases, should be shared across the different processing threads.

Is the built-in Kafka streaming API (Kafka Streams) not recommended for high-volume scenarios compared to the other options (Spark, Akka, Samza, Storm, etc.)?

+8
apache-kafka apache-kafka-streams
2 answers

How does Kafka Streams handle concurrency? Does everything run in a single thread? I can't find this in the documentation.

This is described in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model . I don't want to copy-paste it here verbatim, but I want to highlight that, IMHO, the key concept to understand is that of partitions (cf. Kafka's topic partitions, which Kafka Streams generalizes to "stream partitions", since not all the data streams being processed will go through Kafka), because a partition is currently what determines the parallelism of both Kafka itself (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).
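
In practical terms, the partition count of your input topic(s) is the upper bound on your application's parallelism. As a hedged illustration (the topic name and broker address are placeholders), you could check that count with the standard AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("input-topic"))
                                         .all().get().get("input-topic");
            // The partition count bounds how many stream tasks (and therefore how many
            // threads/instances) can process this topic in parallel.
            System.out.println("input-topic has " + desc.partitions().size() + " partitions");
        }
    }
}
```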

If it is single-threaded, I would like options for multi-threaded processing so I can handle large volumes of data.

A partition will always be processed by exactly one "thread", which guarantees that you do not run into concurrency issues. But...

If it is multi-threaded, I need to understand how it works and how resources, such as connections to SQL databases, should be shared across the different processing threads.

... because Kafka allows a topic to have many partitions, you get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat simplified: up to 100 different machines, each running an instance of your application) can process that topic in parallel. Again, every stream task has exclusive access to exactly 1 partition, which it then processes.
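
Regarding resources such as SQL database connections: because each task is driven by a single thread, a common pattern is not to share a connection across threads at all, but to give every task its own. The sketch below is hypothetical (the class, JDBC URL, and enrichment logic are not from the question); the supplier hands a new transformer, and therefore a new connection, to each task:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.kstream.ValueTransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical enrichment step: the supplier's get() is called once per stream task,
// so every task gets its own transformer instance and, with it, its own JDBC
// connection. No connection is ever shared between processing threads.
public class DbEnrichTransformer implements ValueTransformer<String, String> {
    private Connection connection;

    @Override
    public void init(ProcessorContext context) {
        try {
            // Placeholder JDBC URL; the connection is opened once per task.
            connection = DriverManager.getConnection("jdbc:postgresql://localhost/mydb");
        } catch (SQLException e) {
            throw new RuntimeException("Could not open database connection", e);
        }
    }

    @Override
    public String transform(String value) {
        // Hypothetical lookup/enrichment using this task's private connection.
        return value;
    }

    @Override
    public void close() {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (SQLException e) {
            // Log and ignore during shutdown.
        }
    }
}

// Wiring it into the topology; the supplier returns a new instance per task:
// builder.stream("input-topic")
//        .transformValues((ValueTransformerSupplier<String, String>) DbEnrichTransformer::new)
//        .to("output-topic");
```

If you prefer a connection pool instead, it should be a thread-safe pool shared per application instance; each individual record is still processed by only one thread at a time.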

Is the built-in Kafka streaming API (Kafka Streams) not recommended for high-volume scenarios compared to the other options (Spark, Akka, Samza, Storm, etc.)?

The Kafka Streams processing engine is definitely recommended, and it is actually being used in practice for large-volume scenarios. Work on comparative benchmarking is still in progress, but in many cases a Kafka Streams based application turns out to be faster. See the LINE engineering blog post "Applying Kafka Streams to an Internal Messaging Pipeline" from LINE Corp, one of Asia's largest social platforms (220M+ users), for a description of how they use Kafka and the Kafka Streams API in production to process millions of events per second.

+14

The Kafka Streams config num.stream.threads allows you to raise the number of stream threads per instance above the default of 1. However, it is often preferable to simply run multiple instances of your streaming application, all configured with the same application.id (and therefore the same consumer group). That way you can scale out to as many instances as you need, and the partitions will be distributed across them automatically.
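
A hedged sketch of both knobs (the application.id, broker address, topic names, and thread count are placeholders): num.stream.threads scales up within one JVM, and starting several copies of the same program with the same application.id scales out across machines.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ScaledStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Every instance started with this same application.id joins the same
        // consumer group, so the topic's partitions are spread across all of them.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Run more than one processing thread inside this single instance (default: 1).
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```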

+4
