Lambda architecture with Apache Spark

I am trying to implement the Lambda architecture using the following tools: Apache Kafka to ingest all the data, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.

In addition, all the data received is associated with a user session, so for batch processing I am only interested in processing the data once the session ends. Since I use Kafka, the only way I see to solve this problem (assuming all data points are stored in the same topic) is for the batch job to read all messages in the topic and then ignore those that correspond to sessions that have not finished yet.
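Roughly, what I have in mind for the batch side is something like the following sketch; the HDFS path, the "sessionId,eventType,payload" line format, and the "end" event marker are all just placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object FinishedSessionsBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("finished-sessions"))

        // Assumed layout: one event per line, "sessionId,eventType,payload",
        // previously dumped from the Kafka topic to HDFS.
        val events = sc.textFile("hdfs:///data/sessions/raw")
          .map(_.split(",", 3))
          .map(parts => (parts(0), (parts(1), parts(2))))

        // Keep only sessions that contain an explicit "end" event; everything
        // else belongs to a session that has not finished yet and is ignored.
        val finishedSessions = events.groupByKey()
          .filter { case (_, evs) => evs.exists { case (eventType, _) => eventType == "end" } }

        finishedSessions.saveAsObjectFile("hdfs:///data/sessions/finished")
        sc.stop()
      }
    }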

So, I would like to ask:

  • Is this a good approach to implementing the Lambda architecture? Or should Hadoop and Storm be used instead? (I cannot find information about people using Kafka and Apache Spark for batch processing / MapReduce.)
  • Is there a better approach to solving the problem of user sessions?

Thanks.

3 answers

This is a good approach. Using Spark for both the speed and batch layers lets you write the logic once and use it in both contexts.

As for your session problem: since you are doing it in batch mode anyway, why not just ingest the data from Kafka into HDFS or Cassandra and then write queries against full sessions there? You can use Spark Streaming's direct Kafka connector for the ingestion.
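A rough sketch of that ingestion with the direct connector (Spark 1.x spark-streaming-kafka API; the broker address, topic name, and output path are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SessionIngest {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("session-ingest"), Seconds(10))

        // Direct (receiver-less) stream: one RDD partition per Kafka partition.
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc,
          Map("metadata.broker.list" -> "broker1:9092"), // placeholder broker
          Set("session-events"))                         // placeholder topic

        // Land the raw events in HDFS; the batch layer then queries full sessions there.
        stream.map { case (_, value) => value }
          .foreachRDD { (rdd, time) =>
            if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///data/sessions/raw/${time.milliseconds}")
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }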


I am currently working on the same implementation. I use Kafka, HBase, Spark and Spark Streaming.

There are many things to consider when using these technologies, and there is probably no simple answer.

The pain points of Spark Streaming for me are that you get a minimum latency of around 100 ms for streaming data and, the other big problem for me, that the ordering of the data consumed by the streaming job is a mess. Combined with potential stragglers, this leaves me with no confidence that I am processing the data in even partial order (as far as I know, at least). Storm apparently solves these problems, but I cannot vouch for that since I have not used it.

In terms of the batch layer, Spark is definitely better than MapReduce, since it is faster and more flexible.

Then there is the problem of synchronization between the batch and speed layers: knowing where the batch job's data ends and the speed layer's begins. I solve this by having my speed layer be the one that puts the data into HBase before doing any processing on it.

These are just a bunch of random points; I hope some of them help.


I will echo Dean Wampler in noting that this is a good approach, especially if you have no specific requirements that would steer you away from Spark as the tool of choice for both the Batch and Speed layers. To add:

You do not need to re-consume all of a session's data from the topic before you can process it, assuming that the operation you perform on it (your reduction) is associative. Even if it is not associative (as with unique users), you may still be fine with a highly accurate estimate that can be computed iteratively, such as HyperLogLog. Most likely you will use some kind of stateful aggregation. In Spark you can do this either with updateStateByKey or, preferably, with mapWithState.
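A rough sketch of the mapWithState variant (the "sessionId,bytes" event format and the 30-minute idle timeout are assumptions, and a socket source stands in for Kafka to keep the example short):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

    object SessionAggregation {
      // Fold one event's byte count into the running per-session total.
      // Addition is associative, so events can be folded in one at a time.
      def updateSession(sessionId: String, bytes: Option[Long], state: State[Long]): (String, Long) = {
        val total = state.getOption.getOrElse(0L) + bytes.getOrElse(0L)
        state.update(total)
        (sessionId, total)
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("session-agg"), Seconds(10))
        ssc.checkpoint("hdfs:///tmp/session-checkpoints") // mapWithState requires checkpointing

        // Placeholder source with lines like "sessionId,bytes"; in practice this
        // would be the Kafka direct stream.
        val events = ssc.socketTextStream("localhost", 9999)
          .map(_.split(","))
          .map(parts => (parts(0), parts(1).toLong))

        val spec = StateSpec.function(updateSession _).timeout(Minutes(30)) // drop idle sessions

        events.mapWithState(spec).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }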

If you are looking for concrete examples with exactly the technologies and use cases you mention, I will point you to the Pluralsight course where you can learn all about it and practice it: Applying the Lambda Architecture with Spark, Kafka, and Cassandra.

I will also note that if your processing is reasonably straightforward, and since you are already using Kafka, you may want to consider Kafka Connect for HDFS persistence and Kafka Streams for the streaming side. You can even use Kafka Streams to stream data right back into Kafka, and use Kafka Connect to pipe it out to multiple destinations such as Cassandra and Elasticsearch. I mention Kafka Streams because it also has the ability to hold some state in memory and perform simple streaming operations.
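A minimal Kafka Streams sketch of that write-back pattern, using the Java API from Scala (topic names, the broker address, and the trivial "enrichment" are placeholders; Scala 2.12+ assumed for the lambda conversions):

    import java.util.Properties
    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

    object SessionEnricher {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "session-enricher")
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

        val builder = new StreamsBuilder()

        // Read raw events, apply a simple stateless transformation, and write the
        // result back to Kafka; Kafka Connect sinks can then fan the enriched topic
        // out to Cassandra, Elasticsearch, and so on.
        builder.stream[String, String]("session-events")
          .mapValues(value => value.toUpperCase) // placeholder "enrichment"
          .to("session-events-enriched")

        val streams = new KafkaStreams(builder.build(), props)
        streams.start()
        sys.addShutdownHook(streams.close())
      }
    }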

Good luck!

