This is something I think about a lot, because the answer is always difficult to formulate for both technical and non-technical people.
I will try to answer this part:
Why is it inefficient to run a micro-batch with a 1 millisecond interval?
I believe the problem is not in the model itself, but in how Spark implements it. There is empirical evidence that shrinking the micro-batch window too far degrades performance. In fact, a batch interval of at least 0.5 seconds was suggested as the minimum to prevent this kind of degradation, and on large volumes even that window was too small. I never had the opportunity to verify it in production, but then I never had a hard real-time requirement either.
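To make the interval concrete, here is a minimal Spark Streaming sketch in Scala. The batch interval is fixed when the StreamingContext is created; the socket source, host/port, and the 1-second interval are illustrative values I chose, not anything from the original question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")

    // The micro-batch interval is fixed here, at context creation.
    // Pushing it down towards Milliseconds(1) lets the per-batch scheduling
    // cost dominate, and the scheduling delay starts to grow.
    val ssc = new StreamingContext(conf, Seconds(1))

    ssc.socketTextStream("localhost", 9999)   // illustrative source
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```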
I know Flink better than Spark, so I can't really speak to Spark's internals, but I believe the overhead incurred in setting up each batch is irrelevant if your batch takes at least several seconds to process, yet becomes heavy once it introduces a fixed delay you cannot go below. To understand the nature of this overhead, I think you will have to dig into the Spark documentation, code, and open issues.
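A back-of-envelope sketch of that "fixed delay" argument, with made-up numbers (the 50 ms figure below is an assumption, not a measured Spark cost): whatever the fixed per-batch cost is, it sets a floor on end-to-end latency that no batch interval can get under.

```scala
object LatencyFloorSketch {
  def main(args: Array[String]): Unit = {
    val perBatchOverheadMs = 50.0  // hypothetical fixed setup/scheduling cost per batch
    for (intervalMs <- Seq(1.0, 500.0, 5000.0)) {
      // Rough floor: a record waits up to one interval, then pays the fixed cost.
      val latencyFloorMs = intervalMs + perBatchOverheadMs
      val overheadShare  = perBatchOverheadMs / latencyFloorMs * 100
      println(f"interval $intervalMs%6.0f ms -> latency floor >= $latencyFloorMs%6.0f ms ($overheadShare%3.0f%% overhead)")
    }
  }
}
```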
It is now recognized in the industry that a different model is needed, and that is why many streaming-first engines are growing right now, with Flink leading. I don't think this is just buzzwords and hype, because the use cases for this kind of technology are, at least for the moment, fairly limited. Basically, if you need to make automated decisions in real time on big, complex data, you need a fast real-time data pipeline. In every other case, real-time streaming is overkill and micro-batching is perfectly fine.
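For contrast, a minimal Flink DataStream sketch in Scala (again with an illustrative socket source I made up): there is no batch interval to tune, records flow through the operators one at a time.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingFirstSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // No micro-batch window: each record is processed as it arrives,
    // so latency is bounded by the work itself, not by a batching interval.
    env.socketTextStream("localhost", 9999)   // illustrative source
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1L))
      .keyBy(_._1)
      .sum(1)
      .print()

    env.execute("streaming-first sketch")
  }
}
```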
Chobeat