Best practices for running Spark Streaming and Spark batch jobs simultaneously in the same cluster

I am currently deploying a Spark / Kafka / Cassandra application, and I have run into a problem that has several possible solutions, so I am here to ask for your advice.

  • I have a long-running Spark Streaming application that processes Avro messages from Kafka. Depending on the nature of the message it does a few different things and finally saves a record to Cassandra, so it is a fairly standard use case for these technologies.

  • I have a second job, a Spark batch job, that reads some data from Cassandra and performs some transformations ... I have not yet decided how often it should run, but it will be somewhere between once per hour and once per day; in any case it is a big batch job (a rough sketch of the shape of both jobs follows this list).
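
For concreteness, here is a minimal sketch of the shape of both jobs, assuming the spark-streaming-kafka-0-10 direct stream API and the DataStax spark-cassandra-connector. The topic, keyspace and table names, the Event record and the decodeAvro helper are placeholders for illustration, not the real application:

    import com.datastax.spark.connector._                        // saveToCassandra / cassandraTable
    import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    case class Event(id: String, kind: String, payload: String)  // placeholder record type

    object StreamingIngest {
      def decodeAvro(bytes: Array[Byte]): Event =
        Event(bytes.length.toString, "unknown", "")               // placeholder: real Avro decoding goes here

      // Wires the Kafka -> decode -> Cassandra flow onto an existing StreamingContext.
      def attachPipeline(ssc: StreamingContext): Unit = {
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[ByteArrayDeserializer],
          "group.id"           -> "avro-ingest")

        KafkaUtils.createDirectStream[String, Array[Byte]](
            ssc, PreferConsistent, Subscribe[String, Array[Byte]](Seq("events"), kafkaParams))
          .map(record => decodeAvro(record.value()))
          .foreachRDD(_.saveToCassandra("my_keyspace", "events"))
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-avro-to-cassandra"), Seconds(10))
        attachPipeline(ssc)
        ssc.start()
        ssc.awaitTermination()
      }
    }

    object HourlyBatch {
      // Batch job: read back from Cassandra, transform, write the result to another table.
      def run(sc: SparkContext): Unit =
        sc.cassandraTable[Event]("my_keyspace", "events")
          .map(e => e.copy(payload = e.payload.toUpperCase))      // placeholder transformation
          .saveToCassandra("my_keyspace", "events_derived")
    }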

So I am looking for the best practice for running my batch job. Since the Spark Streaming application takes all the resources of the cluster while it is running, I see two solutions:

  • Embed the Spark batch logic in the Spark Streaming application, as a "micro" batch with an hourly interval for example (see the sketch after this list).
    Pros: easy to do, resource allocation stays optimized.
    Cons: not very clean, and a very large interval for a micro-batch (what is Spark's behavior in that case?)

  • Reserve resources in the cluster for the Spark batch job.
    Pro: clean separation.
    Cons: resource allocation is not optimized, because some cores will do nothing for long stretches of time.
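
For the first option, rather than literally setting the streaming batch interval to one hour, a variant I would consider is to keep a normal micro-batch interval and trigger the hourly batch from a background timer inside the same application, on the shared SparkContext. A rough sketch, reusing the hypothetical StreamingIngest and HourlyBatch objects from the earlier sketch:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CombinedApp {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("streaming-plus-hourly-batch"), Seconds(10)) // keep a small interval
        StreamingIngest.attachPipeline(ssc)                  // streaming flow from the earlier sketch

        // Run the batch transformation once per hour from a background thread,
        // reusing the SparkContext already held by the streaming application.
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        scheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = HourlyBatch.run(ssc.sparkContext)
        }, 1, 1, TimeUnit.HOURS)

        ssc.start()
        ssc.awaitTermination()
      }
    }

With a shared context the two workloads compete inside one application, so you would probably also want the fair scheduler (spark.scheduler.mode=FAIR) so the hourly job does not starve the streaming micro-batches. For the second option, the static equivalent is simply capping the streaming application's resources (for example spark.cores.max on standalone/Mesos, or spark.executor.instances and spark.executor.cores on YARN) and leaving the rest of the cluster for the batch job.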

So I would be very interested in your advice and in any experience you have had with this kind of setup.

1 answer

You can use dynamic allocation on YARN and on Mesos so that your jobs consume resources only when they need them.
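
A minimal sketch of what that can look like, assuming Spark on YARN or Mesos with the external shuffle service deployed; the executor counts and timeout are arbitrary example values:

    import org.apache.spark.SparkConf

    // Dynamic allocation: executors are requested while tasks are pending and
    // released again after sitting idle, so the batch job only holds resources
    // while it is actually running.
    val conf = new SparkConf()
      .setAppName("cassandra-batch-transform")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")            // external shuffle service must be running
      .set("spark.dynamicAllocation.minExecutors", "0")        // example values, tune for your cluster
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

The same properties can also be passed as --conf flags to spark-submit or set in spark-defaults.conf.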
