Best practices for running Spark Streaming and Spark batch jobs simultaneously in the same cluster

I am currently deploying a Spark / Kafka / Cassandra application, and I have run into a problem that has several possible solutions, so I am here to ask for your advice.

  • I have a long-running Spark Streaming application that processes Avro messages from Kafka. Depending on the nature of the message it does a few different things and finally saves a record to Cassandra, so it is a fairly standard use case for these technologies.

  • I have a second job, a Spark batch job, that reads some data from Cassandra and performs some transformations ... I have not yet decided how often it should run, but it will be somewhere between once per hour and once per day; in any case it is a big batch job (a rough sketch of the shape of both jobs follows this list).
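
For concreteness, here is a minimal sketch of the shape of both jobs, assuming the spark-streaming-kafka-0-10 direct stream API and the DataStax spark-cassandra-connector. The topic, keyspace and table names, the Event record and the decodeAvro helper are placeholders for illustration, not the real application:

    import com.datastax.spark.connector._                        // saveToCassandra / cassandraTable
    import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    case class Event(id: String, kind: String, payload: String)  // placeholder record type

    object StreamingIngest {
      def decodeAvro(bytes: Array[Byte]): Event =
        Event(bytes.length.toString, "unknown", "")               // placeholder: real Avro decoding goes here

      // Wires the Kafka -> decode -> Cassandra flow onto an existing StreamingContext.
      def attachPipeline(ssc: StreamingContext): Unit = {
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[ByteArrayDeserializer],
          "group.id"           -> "avro-ingest")

        KafkaUtils.createDirectStream[String, Array[Byte]](
            ssc, PreferConsistent, Subscribe[String, Array[Byte]](Seq("events"), kafkaParams))
          .map(record => decodeAvro(record.value()))
          .foreachRDD(_.saveToCassandra("my_keyspace", "events"))
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-avro-to-cassandra"), Seconds(10))
        attachPipeline(ssc)
        ssc.start()
        ssc.awaitTermination()
      }
    }

    object HourlyBatch {
      // Batch job: read back from Cassandra, transform, write the result to another table.
      def run(sc: SparkContext): Unit =
        sc.cassandraTable[Event]("my_keyspace", "events")
          .map(e => e.copy(payload = e.payload.toUpperCase))      // placeholder transformation
          .saveToCassandra("my_keyspace", "events_derived")
    }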

So I am looking for the best practice for running my batch job. Since the Spark Streaming application takes all the resources of the cluster while it is running, I see two solutions:

  • Embed the Spark batch logic in the Spark Streaming application, as a "micro" batch with an hourly interval for example (see the sketch after this list).
    Pros: easy to do, resource allocation stays optimized.
    Cons: not very clean, and a very large interval for a micro-batch (what is Spark's behavior in that case?)

  • Reserve resources in the cluster for the Spark batch job.
    Pro: clean separation.
    Cons: resource allocation is not optimized, because some cores will do nothing for long stretches of time.
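
For the first option, rather than literally setting the streaming batch interval to one hour, a variant I would consider is to keep a normal micro-batch interval and trigger the hourly batch from a background timer inside the same application, on the shared SparkContext. A rough sketch, reusing the hypothetical StreamingIngest and HourlyBatch objects from the earlier sketch:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CombinedApp {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("streaming-plus-hourly-batch"), Seconds(10)) // keep a small interval
        StreamingIngest.attachPipeline(ssc)                  // streaming flow from the earlier sketch

        // Run the batch transformation once per hour from a background thread,
        // reusing the SparkContext already held by the streaming application.
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = HourlyBatch.run(ssc.sparkContext)
        }, 1, 1, TimeUnit.HOURS)

        ssc.start()
        ssc.awaitTermination()
      }
    }

With a shared context the two workloads compete inside one application, so you would probably also want the fair scheduler (spark.scheduler.mode=FAIR) so the hourly job does not starve the streaming micro-batches. For the second option, the static equivalent is simply capping the streaming application's resources (for example spark.cores.max on standalone/Mesos, or spark.executor.instances and spark.executor.cores on YARN) and leaving the rest of the cluster for the batch job.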

So I would be very interested in your advice and in any experience you have had with this kind of setup.

1 answer

You can use dynamic allocation on YARN and on Mesos so that your jobs consume resources only when they need them.
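
A minimal sketch of what that can look like, assuming Spark on YARN or Mesos with the external shuffle service deployed; the executor counts and timeout are arbitrary example values:

    import org.apache.spark.SparkConf

    // Dynamic allocation: executors are requested while tasks are pending and
    // released again after sitting idle, so the batch job only holds resources
    // while it is actually running.
    val conf = new SparkConf()
      .setAppName("cassandra-batch-transform")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")            // external shuffle service must be running
      .set("spark.dynamicAllocation.minExecutors", "0")        // example values, tune for your cluster
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

The same properties can also be passed as --conf flags to spark-submit or set in spark-defaults.conf.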
