Spark Cassandra Aggregation

I would like to use server-side selection and filtering with the Cassandra Spark connector. We have many sensors that each send a value every second, and we want to aggregate this data by month, day, hour, etc. I proposed the following data model:

    CREATE TABLE project1 (
        year int,
        month int,
        load_balancer int,
        day int,
        hour int,
        estimation_time timestamp,
        sensor_id int,
        value double,
        ...
        PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
    );

We then wanted to aggregate the data for December 2014 across load_balancer IN (0, 1, 2, 3), i.e. across 4 different partitions.
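A minimal sketch of how such a query might be written with the connector, pushing the partition-key restriction down to Cassandra via `where()`. The keyspace name `sensors_ks` and the contact point are assumptions, not from the question:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Restrict the scan server-side to the four December 2014 partitions.
val conf = new SparkConf()
  .setAppName("sensor-aggregation")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
val sc = new SparkContext(conf)

val december2014 = sc
  .cassandraTable("sensors_ks", "project1") // keyspace name is hypothetical
  .where("year = ? AND month = ? AND load_balancer IN (0, 1, 2, 3)", 2014, 12)
```

Note that when the full partition key is pinned down like this, the connector can end up issuing only a handful of direct queries, which is consistent with the single-worker behaviour described below.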

We use Cassandra Spark connector version 1.1.1, and we ran a query so that all values are aggregated by hour.

For 4,341,390 tuples, Spark takes 11 minutes to return the result. The problem is that we have 5 nodes, but Spark uses only one worker to complete the task. Could you suggest a change to the query or data model to improve performance?
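For reference, the hourly aggregation itself is simple; this cluster-free sketch uses plain Scala collections to show the shape of the computation (on Spark the same logic would be a `map` + `reduceByKey` over the RDD). The `Reading` case class is a stand-in for the table's columns:

```scala
// One sensor reading, mirroring the relevant columns of the table.
case class Reading(day: Int, hour: Int, sensorId: Int, value: Double)

// Average all readings per (day, hour) bucket.
def hourlyAverages(readings: Seq[Reading]): Map[(Int, Int), Double] =
  readings
    .groupBy(r => (r.day, r.hour))
    .map { case (key, rs) => key -> rs.map(_.value).sum / rs.size }

val sample = Seq(
  Reading(1, 0, 42, 2.0),
  Reading(1, 0, 43, 4.0),
  Reading(1, 1, 42, 10.0)
)
val result = hourlyAverages(sample)
// result((1, 0)) == 3.0 and result((1, 1)) == 10.0
```

The aggregation is cheap; the 11 minutes are dominated by how the data is read, which is what the answer below addresses.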

1 answer

The Spark Cassandra Connector has this feature: SPARKC-25. You can create an arbitrary RDD of key values and use it as a source of keys to retrieve data from the Cassandra table; in other words, you join an arbitrary RDD with a Cassandra RDD. In your case, that arbitrary RDD would contain 4 tuples with the different load_balancer values. See the documentation for more details. SCC 1.2 was recently released, and it is probably compatible with Spark 1.1 (it targets Spark 1.2, though).
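A sketch of the SPARKC-25 approach with connector 1.2: parallelize the four partition keys and join them against the table with `joinWithCassandraTable`, so the keys are spread across workers and each is fetched with a direct query. The keyspace name and contact point are assumptions:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("join-with-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
val sc = new SparkContext(conf)

// One tuple per target partition key: (year, month, load_balancer).
val partitionKeys = sc.parallelize(Seq(
  (2014, 12, 0), (2014, 12, 1), (2014, 12, 2), (2014, 12, 3)
))

// By default the join matches the tuple against the table's partition key.
val december = partitionKeys
  .joinWithCassandraTable("sensors_ks", "project1") // keyspace is hypothetical
```

Because the key RDD is partitioned by Spark before the join, the four fetches can run on different workers instead of a single one.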


Source: https://habr.com/ru/post/1215512/

