Creating a materialized view from a table with a large amount of data in Cassandra

We have a Cassandra cluster with three pods on Google Kubernetes Engine. Our Cassandra version is 3.9, and we use Google's images.

I had a problem when I tried to create a materialized view from a table.

The table layout looks something like this:

 CREATE TABLE environmental_data (
     block_id int,
     timestamp timestamp,
     device_id int,
     sensor_id int,
     ...
     PRIMARY KEY (block_id, timestamp, device_id, sensor_id)
 );

I want to create a view keyed by device_id, so I tried this:

 CREATE MATERIALIZED VIEW environmental_data_by_device AS
     SELECT block_id, timestamp, device_id, sensor_id, ...
     FROM environmental_data
     WHERE block_id IS NOT NULL AND timestamp IS NOT NULL
       AND device_id IS NOT NULL AND sensor_id IS NOT NULL
     PRIMARY KEY ((device_id), timestamp, sensor_id, block_id)
     WITH CLUSTERING ORDER BY (timestamp DESC);

Locally, with very little data, everything went well. But in production, with 80 million rows, two pods crashed and Cassandra kept logging this error:

An unknown exception occurred while trying to update MaterializedView! environmental_data

java.lang.IllegalArgumentException: XXXX byte mutation is too large for maximum size XXXX

There were also a lot of java.lang.OutOfMemoryError: Java heap space errors.

What can I do to make sure the next attempt succeeds? Taking down production Cassandra a second time is simply not an option.

I have already managed to create a materialized view on a table before, but that table was not as big.

cassandra kubernetes google-cloud-platform google-kubernetes-engine
1 answer

I can give you some hints from an infrastructure point of view, since I don't know Cassandra in depth. Given that you saw a lot of java.lang.OutOfMemoryError, if I were responsible for the infrastructure I would verify that the deployments are configured correctly, so that:

  1. Pods are scheduled on nodes that are capable of supporting their workload, and the scheduler is informed of how much memory the containers need. For this, you should set a memory request.

This is often overlooked, but it can be a problem: if you have 3 nodes with 3 GB of RAM each and 2 pods that each consume 2 GB without a memory request, they may end up scheduled on the same node, which will then be killed after some time. Setting a memory request prevents this.

  2. Pods do not consume more memory than expected, and, in the event of a memory leak, healthy pods are not killed because of a pod that consumes too much memory. For this, you should set a memory limit (a minimal example of both settings is sketched below).
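As a rough sketch of both settings (the container name, image, and memory/CPU values here are placeholders, not taken from the question), the resources section of a Cassandra StatefulSet could look like this:

 # Hypothetical excerpt from a Cassandra StatefulSet manifest; the names,
 # image, and sizes are placeholders and must be adapted to your cluster.
 apiVersion: apps/v1
 kind: StatefulSet
 metadata:
   name: cassandra
 spec:
   serviceName: cassandra
   replicas: 3
   selector:
     matchLabels:
       app: cassandra
   template:
     metadata:
       labels:
         app: cassandra
     spec:
       containers:
         - name: cassandra
           image: gcr.io/google-samples/cassandra:v13  # placeholder image
           resources:
             requests:
               memory: "4Gi"   # tells the scheduler how much memory the pod needs
               cpu: "500m"
             limits:
               memory: "4Gi"   # pod is OOM-killed if it exceeds this, protecting its neighbours
               cpu: "1"

Keep in mind that the Cassandra JVM heap (plus its off-heap structures) has to fit within the memory limit with some headroom, otherwise the container will keep being OOM-killed.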

Moreover, you can find an interesting article on Java heap memory based autoscaling in Kubernetes.

You can check how much memory and CPU the pods consume with the following command:

 $ kubectl top pods --namespace=xxxx 

And to see whether the nodes themselves are under pressure:

 $ kubectl top nodes 