What is the difference between a mini-batch and a real-time stream in practice (rather than in theory)?

What is the difference between mini-batching and real-time streaming in practice (rather than in theory)? In theory, I understand that a mini-batch processes whatever arrives within a given time period, while real-time streaming processes data as it comes in. But my biggest question is: why not use a mini-batch with an epsilon-sized time interval (say, one millisecond)? I would like to understand why one of these would be a more effective solution than the other.

I recently came across an example where Apache Spark mini-batching is used to detect fraud, and real-time streaming (Apache Flink) is used to prevent fraud. Someone also commented that mini-batches would not be an effective solution for fraud prevention (since the goal is to prevent the transaction before it happens, not to react after it has happened). Now I wonder why mini-batching with Spark is not effective here. Why is it inefficient to run a mini-batch with a delay of 1 millisecond? Batching is a technique used everywhere, including in the OS and in the kernel's TCP/IP stack, where data going to disk or over the network really is buffered, so what is the deciding factor here that makes one approach more efficient than the other?

batch-processing apache-spark apache-flink data-processing stream-processing
3 answers

Disclaimer: I am a committer and PMC member of Apache Flink. I am familiar with the general design of Spark Streaming, but I do not know its internals in detail.

The mini-batch stream processing model implemented by Spark Streaming works as follows:

  • Incoming stream records are collected in a buffer (the mini-batch).
  • Periodically, the collected records are processed by a regular Spark job. This means that for each mini-batch, a complete distributed batch-processing job is scheduled and executed.
  • While that job is running, records are collected for the next batch.
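The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not Spark's actual scheduler; the class and method names are made up for the example:

```python
# Toy micro-batch loop (illustrative only, NOT Spark internals):
# records accumulate in a buffer; on each trigger the buffer is swapped
# out and handed to a "job", while new records keep arriving.

from collections import deque

class MicroBatcher:
    def __init__(self, process_batch):
        self.buffer = deque()           # step 1: records collect here
        self.process_batch = process_batch  # stands in for a scheduled Spark job
        self.results = []

    def receive(self, record):
        # Step 1: incoming stream records are buffered.
        self.buffer.append(record)

    def trigger(self):
        # Step 2: periodically, the buffered records run as one batch job.
        batch = list(self.buffer)
        self.buffer.clear()
        if batch:
            self.results.append(self.process_batch(batch))
        # Step 3: while that job runs, new records land in self.buffer.

mb = MicroBatcher(process_batch=sum)
for r in [1, 2, 3]:
    mb.receive(r)
mb.trigger()          # first mini-batch: 1 + 2 + 3 = 6
for r in [4, 5]:
    mb.receive(r)
mb.trigger()          # second mini-batch: 4 + 5 = 9
print(mb.results)     # [6, 9]
```

The key point the sketch makes is that processing happens per batch, on a trigger, never per record.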

So, why is it inefficient to run a mini-batch every 1 ms? Simply because it would mean scheduling a distributed batch job every millisecond. Although Spark is very fast at scheduling jobs, that would be too much. It would also significantly reduce the achievable throughput. Batching techniques used in the OS or in TCP likewise perform poorly when their batches become too small.
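To make the scheduling-overhead argument concrete, here is a hedged back-of-the-envelope model. The 5 ms overhead figure is an assumption for illustration, not a measured Spark number; the point is only that a fixed per-batch cost dominates as the batch interval shrinks toward it:

```python
# Toy cost model (assumed numbers, NOT Spark measurements): every
# mini-batch pays a fixed scheduling overhead before processing records.

def useful_fraction(batch_interval_ms, scheduling_overhead_ms):
    """Fraction of wall-clock time left for actual record processing."""
    total = batch_interval_ms + scheduling_overhead_ms
    return batch_interval_ms / total

# Hypothetical 5 ms of fixed scheduling work per job:
for interval in [1, 10, 100, 1000]:
    f = useful_fraction(interval, scheduling_overhead_ms=5)
    print(f"{interval:>5} ms batches -> {f:.1%} of time doing useful work")
```

With these assumed numbers, 1 ms batches spend only about 17% of the time doing useful work, while 1-second batches spend about 99.5%, which is the throughput collapse the answer describes.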


I know that one answer has already been accepted, but I think another is needed to answer this question completely. I think an answer like "Flink real-time streaming is faster/better" is incorrect, because it depends heavily on what you want to do.

The Spark mini-batch model has, as the previous answer noted, the disadvantage that a new job must be created for each mini-batch.

However, Spark Structured Streaming has a default processing-time trigger of 0, which means that new data is read as quickly as possible. That means:

  • a query is started;
  • if new data arrives while that query is still running,
  • it is processed immediately after the first query completes.

The latency in such cases is very small.
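A processing-time trigger of 0 can be sketched as a loop in which each micro-batch starts as soon as the previous one finishes, picking up whatever arrived in the meantime. This is a toy simulation of the idea, not Structured Streaming's actual scheduler:

```python
# Toy simulation of a trigger interval of 0 (illustrative only):
# batches run back to back, each consuming everything that arrived
# while the previous batch was running.

arrivals = [[1], [2, 3], [], [4]]   # data arriving during each batch run
pending = []
batches = []

for incoming in arrivals:
    pending.extend(incoming)            # data accumulates while a query runs
    if pending:
        batches.append(list(pending))   # next query starts immediately
        pending = []

print(batches)   # [[1], [2, 3], [4]]
```

Note that no batch waits for a timer to fire: a lull in arrivals (the empty list) simply produces no batch, and the next data is picked up right away.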

One big advantage over Flink is that Spark has unified APIs for batch and stream processing, precisely because of this mini-batch model. You can easily convert a batch job into a streaming job, and join streaming data with old data from a batch. Doing this with Flink is not possible. Flink also does not let you run interactive queries on the data you have received.

As mentioned earlier, the use cases for micro-batching and real-time streaming are different:

  • For very low latencies, Flink or a compute grid like Apache Ignite would be a good fit. They are suitable for processing with very low latency, but not for very complex computations.
  • For medium and high latencies, Spark's more unified API lets you perform more complex computations in the same way as batch jobs, precisely because of that unification.

For more on Structured Streaming, see this blog post.


This is something I think about a lot, because an answer that works for both technical and non-technical people is always difficult to formulate.

I will try to answer this part:

Why is it inefficient to run a mini-batch with a 1 millisecond delay?

I believe the problem is not the model itself, but how Spark implements it. There is empirical evidence that shrinking the mini-batch window too far degrades performance. In fact, a window of at least 0.5 seconds was suggested to prevent this kind of degradation, and on large volumes even that window size proved too small. I never had the opportunity to test this in production, but then I never had a hard real-time requirement either.
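A simple latency model makes that 0.5-second floor concrete. The overhead and processing numbers below are illustrative assumptions, not benchmarks: a record arriving at a random moment waits on average half the batch interval before its batch is even triggered, then pays per-batch scheduling and processing time on top.

```python
# Toy end-to-end latency model (assumed numbers, not measurements).

def avg_latency_ms(batch_interval_ms, overhead_ms, processing_ms):
    """Average latency for a record: half an interval of waiting for the
    trigger, plus per-batch scheduling overhead, plus processing time."""
    return batch_interval_ms / 2 + overhead_ms + processing_ms

# With a 500 ms window and hypothetical 50 ms scheduling + 100 ms processing:
print(avg_latency_ms(500, overhead_ms=50, processing_ms=100))  # 400.0
```

Under these assumed numbers, a 0.5-second window already implies hundreds of milliseconds of average latency, far above what transaction-blocking fraud prevention would tolerate.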

I know Spark better than Flink, so I don't really know Flink's internals, but I believe the overhead incurred in launching a batch job is irrelevant if your batch takes at least several seconds to process; it becomes heavy when it introduces a fixed delay below which you cannot go. To understand the nature of this overhead, I think you would have to dig into the Spark documentation, code, and open issues.

It is now recognized in the industry that a different model is needed, which is why many streaming-first engines are growing right now, with Flink in the lead. I don't think this is just buzzwords and hype, because the use cases for this kind of technology are, at least for the moment, quite limited. Basically, if you need to make an automated decision in real time on large, complex data, you need a fast real-time processing engine. In every other case, even near-real-time ones, real-time streaming is overkill and mini-batching is fine.

