Difference between Apache Samza and Apache Kafka Streams (focus on parallelism and communication)

In Samza and Kafka Streams, data stream processing is performed over a sequence/graph of processing steps (the graph is called a "dataflow graph" in Samza and a "topology" in Kafka Streams; the steps are called "jobs" in Samza and "processors" in Kafka Streams). In the remainder of this question I will refer to these two concepts as workflow and worker.

Suppose we have a very simple workflow consisting of a worker A, which consumes sensor measurements and filters out all values below 50, followed by a worker B, which receives the remaining measurements and filters out all values above 80.

Input (Kafka topic X) → (Worker A) → (Worker B) → Output (Kafka topic Y)
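
For concreteness, here is a minimal sketch of what this workflow could look like in the Kafka Streams DSL, assuming integer-valued measurements and the topic names from the diagram (the application id, broker address, and serde configuration are placeholders, not part of the question):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SensorFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-filter");     // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker address
        // (default String/Integer serde configuration omitted for brevity)

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Integer> measurements = builder.stream("topicX");
        measurements
            .filter((key, value) -> value >= 50)  // Worker A: drop values below 50
            .filter((key, value) -> value <= 80)  // Worker B: drop values above 80
            .to("topicY");                        // final result goes to topic Y

        new KafkaStreams(builder.build(), props).start();
    }
}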

If I understand correctly, both Samza and Kafka Streams use the concept of topic partitioning to replicate the workflow/workers and thus parallelize processing for scalability.

But:

  • Samza replicates each worker (i.e., job) separately across multiple tasks (one for each partition of the input stream). That is, a task is a replica of a single worker of the workflow.

  • Kafka Streams replicates the entire workflow (i.e., topology) at once across multiple tasks (one for each partition of the input stream). That is, a task is a replica of the entire workflow.

This brings me to my questions:

  • Suppose there is only one partition: is it correct that it is not possible to deploy workers (A) and (B) on two different machines in Kafka Streams, while this is possible in Samza? (In other words: is it impossible in Kafka Streams to split a single task (i.e., a topology replica) across two machines, regardless of whether there are multiple partitions or not?)

  • How do two subsequent processors in a Kafka Streams topology (within the same task) communicate? (I know that in Samza all communication between two subsequent workers (i.e., jobs) is done via Kafka topics, but since in Kafka Streams you have to explicitly "mark" in code which streams should be published as Kafka topics, this cannot be the case here.)

  • Is it true that Samza automatically publishes all intermediate streams as Kafka topics (and thus makes them available to potential consumers), while Kafka Streams publishes only those intermediate and final streams that are marked explicitly (with addSink in the low-level API, and to or through in the DSL)? (See the sketch below for the low-level API case.)
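
To make the addSink case concrete, here is a rough sketch of how the same workflow could be wired with the low-level Processor API (class names as of the Kafka Streams 1.x/2.x API; FilterProcessor is an assumed helper, not part of Kafka Streams). Only the stream routed to addSink ends up in a Kafka topic; the edge between workerA and workerB is an in-memory forward inside the task:

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.AbstractProcessor;

import java.util.function.IntPredicate;

public class ProcessorApiSketch {

    // Assumed helper: forwards a record downstream only if the value passes the predicate.
    static class FilterProcessor extends AbstractProcessor<String, Integer> {
        private final IntPredicate keep;
        FilterProcessor(IntPredicate keep) { this.keep = keep; }

        @Override
        public void process(String key, Integer value) {
            if (keep.test(value)) {
                context().forward(key, value); // in-memory hand-off, no Kafka topic involved
            }
        }
    }

    static Topology build() {
        Topology topology = new Topology();
        topology.addSource("source", "topicX")
                .addProcessor("workerA", () -> new FilterProcessor(v -> v >= 50), "source")
                .addProcessor("workerB", () -> new FilterProcessor(v -> v <= 80), "workerA")
                .addSink("sink", "topicY", "workerB"); // only this stream is published to a Kafka topic
        return topology;
    }
}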

(I know that Samza can use other message queues than Kafka, but this is not very relevant for my questions.)

apache-kafka apache-kafka-streams

1 answer

First of all, in both Samza and Kafka Streams you can choose whether or not to have an intermediate topic between these two tasks (processors); that is, the topology can be either:

Input (Kafka topic X) → (Worker A) → (Worker B) → Output (Kafka topic Y)

or

Input (Kafka topic X) → (Worker A) → Intermediate (Kafka topic Z) → (Worker B) → Output (Kafka topic Y)

In both Samza and Kafka Streams, in the former case you have to deploy Worker A and B together, while in the latter case Worker A and B do not have to be deployed together (they run as separate tasks), since in either framework tasks communicate with each other only through intermediate topics; there are no TCP-based communication channels between tasks.

In Samza, for the former case you need to code the two filters within a single job, and for the latter case you need to specify the input and output topics for each job: for Worker A the input is X and the output is Z, for Worker B the input is Z and the output is Y, and you can start/stop the deployed jobs independently.
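
For illustration, here is a rough sketch of what Worker A could look like as its own Samza job in the latter case, using Samza's classic low-level StreamTask API (class and topic names are assumptions taken from the example; the input topic would be wired up via task.inputs=kafka.topicX in the job's configuration file):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Worker A as a standalone Samza job: reads from topic X (configured outside the code)
// and writes every measurement of at least 50 to the intermediate topic Z.
public class WorkerATask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "topicZ");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        int value = (Integer) envelope.getMessage(); // assumes integer-valued measurements
        if (value >= 50) {                           // drop everything below 50
            collector.send(new OutgoingMessageEnvelope(OUTPUT, value));
        }
    }
}

Worker B would be an analogous job reading from topic Z and writing to topic Y, which is what makes it possible to start and stop the two jobs independently.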

In Kafka Streams, for the former case you can simply chain ("concatenate") these processors, for example:

stream1.filter(..).filter(..)

and as a result, as Lucas noted, each output record of the first filter is immediately passed on to the second filter (you can think of each input record from topic X traversing the topology in depth-first order, with no buffering between any directly connected processors);

And for the latter case, you can indicate that the intermediate stream should be "materialized" to another topic, i.e.:

stream1.filter(..).through("topicZ").filter(..)

and each result of the first filter will be sent to topic Z, from which it is then piped to the second filter. In this case, these two filters can potentially be deployed on different hosts, or on different threads within the same host.
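
Putting that together, the second variant could look roughly like this in the DSL (imports and configuration as in the sketch near the top; note that newer Kafka Streams versions deprecate through() in favour of repartition(), but the idea is the same):

StreamsBuilder builder = new StreamsBuilder();
builder.<String, Integer>stream("topicX")
       .filter((key, value) -> value >= 50) // Worker A
       .through("topicZ")                   // results are written to topic Z and read back,
                                            // splitting the topology into two sub-topologies/tasks
       .filter((key, value) -> value <= 80) // Worker B, now in a separate task
       .to("topicY");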
