I would like to use server-side selection and filtering with the Cassandra Spark connector. We have many sensors that each send a value every 1 s, and we want to aggregate this data by month, day, hour, etc. I proposed the following data model:
CREATE TABLE project1 (year int, month int, load_balancer int, day int, hour int, estimation_time timestamp, sensor_id int, value double, ... PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id));
We then wanted to aggregate the data for December 2014 using load_balancer IN (0,1,2,3), so the query covers 4 different partitions.
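To make the selection concrete, here is a rough CQL sketch of the December 2014 filter against the table above. The selected columns are only illustrative, and the statement assumes a Cassandra version that accepts IN on the last partition key column:

```sql
-- Sketch only: restricts the read to the four (year, month, load_balancer)
-- partition keys for December 2014. The column list is illustrative.
SELECT day, hour, sensor_id, value
FROM project1
WHERE year = 2014
  AND month = 12
  AND load_balancer IN (0, 1, 2, 3);
```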
We use the Cassandra Spark connector version 1.1.1, and we ran a query that aggregates all values by hour.
For 4,341,390 tuples, Spark takes 11 minutes to return the result. The problem is that we have 5 nodes, but Spark uses only one worker to complete the task. Could you suggest a change to the query or the data model to improve performance?