I am currently deploying a Spark / Kafka / Cassandra application, and I have run into a problem that has several possible solutions, so I am here to ask your advice.
I have a Spark Streaming app that has been running for a long time; it processes Avro messages from Kafka. Depending on the nature of the message, it does several different things and finally saves the record in Cassandra, so it's just a basic use case for these technologies.
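For context, here is a minimal sketch of what that pipeline looks like. The keyspace, table, topic, and host names are made up, and `decodeAvro` is a placeholder for whatever Avro deserialization is in use; it assumes spark-streaming-kafka-0-10 and the DataStax spark-cassandra-connector:

```scala
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import com.datastax.spark.connector.streaming._

// Target row type; the Cassandra table columns are assumed to match these fields.
case class Event(id: String, kind: String, payload: String)

object StreamingJob {
  // Placeholder: decode the Avro-encoded Kafka value into an Event.
  def decodeAvro(bytes: Array[Byte]): Event = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-avro-to-cassandra")
      .set("spark.cassandra.connection.host", "cassandra-host") // assumed host

    val ssc = new StreamingContext(conf, Seconds(5)) // short micro-batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-host:9092", // assumed broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id"           -> "streaming-job"
    )

    val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, Array[Byte]](Seq("events"), kafkaParams)
    )

    // Decode each message and persist the resulting record.
    stream
      .map(record => decodeAvro(record.value()))
      .saveToCassandra("my_keyspace", "events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```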
I have a second job, a plain Spark batch job: it reads some data from Cassandra and performs some transformations ... I have not yet settled on the frequency, but it will be somewhere between once per hour and once per day; in any case it is a big batch job.
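To frame the discussion, the batch side is essentially the following (same hypothetical names as above; the aggregation is a stand-in for my real transformations):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object BatchJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-batch-job")
      .set("spark.cassandra.connection.host", "cassandra-host") // assumed host

    val sc = new SparkContext(conf)

    // Read what the streaming job has written, aggregate, write back.
    sc.cassandraTable("my_keyspace", "events")   // RDD[CassandraRow]
      .map(row => (row.getString("kind"), 1L))   // stand-in transformation
      .reduceByKey(_ + _)
      .saveToCassandra("my_keyspace", "events_by_kind", SomeColumns("kind", "count"))

    sc.stop()
  }
}
```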
So I am looking for the best practice for running my batch job. Since the Spark Streaming job takes all the resources in the cluster while it is running, I see two solutions (sketches for both follow the list):
1. Embed the batch logic in the Spark Streaming app as a "micro" batch with a very long interval, e.g. one hour.
Pros: easy to do; resource allocation stays optimized.
Cons: not very clean, and the micro-batch interval is huge (what is Spark's behavior in this case?).
2. Reserve resources in the cluster for the Spark batch job.
Pros: clean separation.
Cons: resource allocation is not optimized, since some cores will sit idle most of the time.
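To make option 1 concrete, here is a minimal sketch of the pattern I have in mind, assuming the DStream API: a `ConstantInputDStream` is used purely as a trigger that fires once per interval, and the batch logic runs inside `foreachRDD`. Whether Spark behaves well with such a long interval is exactly my question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import com.datastax.spark.connector._

object HourlyMicroBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hourly-micro-batch")
      .set("spark.cassandra.connection.host", "cassandra-host") // assumed host

    val ssc = new StreamingContext(conf, Minutes(60)) // one "micro" batch per hour

    // A constant one-element DStream, used only to fire once per interval.
    val trigger = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(())))

    trigger.foreachRDD { _ =>
      // The actual batch logic, executed once per hour.
      ssc.sparkContext
        .cassandraTable("my_keyspace", "events")
        .map(row => (row.getString("kind"), 1L))
        .reduceByKey(_ + _)
        .saveToCassandra("my_keyspace", "events_by_kind", SomeColumns("kind", "count"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

For option 2, I believe the relevant knob (at least on a standalone cluster) is capping what the streaming app may take, so the batch job always has headroom; something like this in the streaming app's configuration (values are made up):

```scala
import org.apache.spark.SparkConf

// Cap the streaming app's share of the cluster; the rest stays free for the batch job.
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.cores.max", "8")        // max total cores for this app (standalone/Mesos)
  .set("spark.executor.memory", "2g") // per-executor memory cap
```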
So I would be very interested in your advice and any experience you may have with such cases.