I am currently working on the same implementation. I use Kafka, HBase, Spark and Spark Streaming.
There are many things to consider when using these technologies, and there is probably no simple answer.
The main caveats with Spark Streaming are that you get a minimum latency of roughly 100 ms per micro-batch, and, a bigger problem for me, the ordering of the data consumed by the streaming job is a mess. Combined with potential stragglers, this leaves me with no confidence that I'm processing the data even in partial order (as far as I can tell, at least). Storm apparently solves these problems, but I can't vouch for that since I haven't used it.
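To illustrate the ordering problem, here is a minimal plain-Python sketch (no Spark, hypothetical data): within one Kafka partition order is preserved, but a consumer reading several partitions sees an interleaving, so timestamps are not globally monotonic unless you merge explicitly.

```python
import heapq

# Hypothetical events as (timestamp, payload) from two Kafka partitions.
# Order holds within each partition, but not across them.
partition_a = [(1, "a1"), (4, "a2"), (7, "a3")]
partition_b = [(2, "b1"), (3, "b2"), (9, "b3")]

# Naive concatenation: timestamps come out as 1, 4, 7, 2, 3, 9.
naive = partition_a + partition_b

# Re-establishing a total order requires an explicit merge by timestamp.
ordered = list(heapq.merge(partition_a, partition_b))
print([t for t, _ in ordered])  # → [1, 2, 3, 4, 7, 9]
```

With stragglers in the mix, entire micro-batches can additionally finish late, which is why even this per-partition guarantee is hard to rely on end to end.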
As for the batch layer, Spark is definitely better than MapReduce, since it is faster and more flexible.
Then there is the problem of synchronizing the batch and speed layers, i.e. knowing where the batch job's data ends and the speed layer's begins. I solve this by making my speed layer the component that writes the data into HBase before doing its processing on it.
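The batch/speed handoff can be sketched at query time roughly like this (a minimal illustration, not my actual code; all names are hypothetical): the batch view covers everything up to a recorded watermark timestamp, the speed view keeps only events after it, and a query merges the two.

```python
# Batch view: counts covering everything up to the watermark.
batch_view = {"clicks": 100}
# Speed view: counts for events strictly after the watermark.
speed_view = {"clicks": 7}
# Timestamp up to which the last batch run has processed the data.
batch_watermark = 1_700_000_000

def merged_count(key, batch_view, speed_view):
    # The watermark convention ensures no event is counted twice,
    # as long as the speed layer drops events once a batch run
    # has covered them.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(merged_count("clicks", batch_view, speed_view))  # → 107
```

The design choice that matters is agreeing on a single cutoff (the watermark) that both layers respect; without it you either double-count or drop events at the boundary.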
These are just a few scattered points, but I hope some of them help.