Determining how full a Cassandra cluster is

I just imported a lot of data into a 9-node Cassandra cluster, and before I create a new ColumnFamily with even more data, I would like to determine how full my cluster is (in terms of memory usage). I'm not too sure what I need to look at. I don't want to import another 20-30 GB of data only to find out that I should have added another 5-6 nodes first.

In short, I have no idea if I have too few / many nodes right now for what's in the cluster.

Any help would be greatly appreciated :)

$ nodetool -h 192.168.1.87 ring
Address       DC          Rack   Status  State   Load     Owns    Token
                                                                  151236607520417094872610936636341427313
192.168.1.87  datacenter1 rack1  Up      Normal  7.19 GB  11.11%  0
192.168.1.86  datacenter1 rack1  Up      Normal  7.18 GB  11.11%  18904575940052136859076367079542678414
192.168.1.88  datacenter1 rack1  Up      Normal  7.23 GB  11.11%  37809151880104273718152734159085356828
192.168.1.84  datacenter1 rack1  Up      Normal  4.2 GB   11.11%  56713727820156410577229101238628035242
192.168.1.85  datacenter1 rack1  Up      Normal  4.25 GB  11.11%  75618303760208547436305468318170713656
192.168.1.82  datacenter1 rack1  Up      Normal  4.1 GB   11.11%  94522879700260684295381835397713392071
192.168.1.89  datacenter1 rack1  Up      Normal  4.83 GB  11.11%  113427455640312821154458202477256070485
192.168.1.51  datacenter1 rack1  Up      Normal  2.24 GB  11.11%  132332031580364958013534569556798748899
192.168.1.25  datacenter1 rack1  Up      Normal  3.06 GB  11.11%  151236607520417094872610936636341427313


# nodetool -h 192.168.1.87 cfstats
Keyspace: stats
  Read Count: 232
  Read Latency: 39.191931034482764 ms.
  Write Count: 160678758
  Write Latency: 0.0492021849459404 ms.
  Pending Tasks: 0
    Column Family: DailyStats
    SSTable count: 5267
    Space used (live): 7710048931
    Space used (total): 7710048931
    Number of Keys (estimate): 10701952
    Memtable Columns Count: 4401
    Memtable Data Size: 23384563
    Memtable Switch Count: 14368
    Read Count: 232
    Read Latency: 29.047 ms.
    Write Count: 160678813
    Write Latency: 0.053 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0.00000
    Bloom Filter Space Used: 115533264
    Key cache capacity: 200000
    Key cache size: 1894
    Key cache hit rate: 0.627906976744186
    Row cache: disabled
    Compacted row minimum size: 216
    Compacted row maximum size: 42510
    Compacted row mean size: 3453


[default@stats] describe;
Keyspace: stats:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:3]
  Column Families:
    ColumnFamily: DailyStats (Super)
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds / keys to save: 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
      Key cache size / save period in seconds: 200000.0/14400
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Built indexes: []
      Column Metadata: (removed)
      Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
1 answer

There are two kinds of storage to think about here: disk and RAM. I assume you are asking about disk space.

First you need to find out how much space each node is using. Check the disk usage of the Cassandra data directory (by default /var/lib/cassandra/data ) with this command: du -ch /var/lib/cassandra/data Then compare that with the size of the disk itself, which you can find with df -h . Only consider the df entry for the disk your Cassandra data is written to, by checking the "Mounted on" column.
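Putting those two commands together, here is a small sketch; the data directory path is the common package default, so pass your own if it differs:

```shell
#!/bin/sh
# Report how much space Cassandra's data directory uses and how full
# the filesystem holding it is. The default path is an assumption;
# pass your actual data directory as the first argument.
DATA_DIR="${1:-/var/lib/cassandra/data}"

# Total size of the data directory (SSTables, indexes, etc.)
du -sh "$DATA_DIR"

# Use% of the filesystem the data directory lives on. -P forces
# one-line-per-filesystem POSIX output so awk sees a single record.
df -P "$DATA_DIR" | awk 'NR == 2 { sub("%", "", $5); print "disk " $5 "% full" }'
```

Run it as, for example, `sh diskcheck.sh /var/lib/cassandra/data` on each node.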

Using those statistics, you can calculate what percentage of the Cassandra data partition is full. As a rule, you do not want to get too close to 100%, because Cassandra's normal compaction processes temporarily use extra disk space. If you run short, a node can get caught with a full disk, which can be painful to resolve (as an aside, I sometimes keep a "ballast" file of a few gigabytes around that I can delete if I need to free up extra space in a hurry). I have generally found that staying under 70% disk usage is on the safe side for the 0.8 series.
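The "ballast" trick can be as simple as preallocating a file you can delete in an emergency; the path and size here are illustrative, not prescriptive:

```shell
#!/bin/sh
# Reserve ~2 GB of emergency headroom on the Cassandra data disk.
# If the disk fills up during compaction, deleting this file buys
# room to recover. Path and size are assumptions for illustration.
BALLAST="${1:-/var/lib/cassandra/ballast}"
SIZE_MB="${2:-2048}"

dd if=/dev/zero of="$BALLAST" bs=1M count="$SIZE_MB" 2>/dev/null
ls -lh "$BALLAST"
```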

If you are using a newer version of Cassandra, I would recommend giving the Leveled Compaction strategy a try to reduce temporary disk usage. Instead of potentially needing as much free space as the data being compacted, the new strategy will in most cases only need about 10x a small, fixed sstable size (5 MB by default).
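For reference, switching a column family to leveled compaction from cassandra-cli looks roughly like this (1.0-era syntax, so verify against your version; note that the describe output above shows DailyStats already uses it):

```
[default@stats] update column family DailyStats
    with compaction_strategy = 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy';
```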

You can learn more about how compaction temporarily increases disk usage in this excellent Datastax blog post, which also explains the compaction strategies: http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra

So, to do a little capacity planning, you can work out how much space you will need. With a replication factor of 3 (which you are using above), adding 20-30 GB of raw data will add 60-90 GB after replication. Split among 9 nodes, that is roughly 7-10 GB more per node. Does adding that much disk usage per node get too close to a full disk? If so, you may want to add more nodes to the cluster.
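The arithmetic above, as a one-off sketch you can adapt to your own numbers:

```shell
#!/bin/sh
# Back-of-the-envelope capacity math from the paragraph above.
RAW_GB=30   # upper end of the new raw data to load
RF=3        # replication factor of the keyspace
NODES=9     # current cluster size

TOTAL_GB=$((RAW_GB * RF))
PER_NODE_GB=$((TOTAL_GB / NODES))
echo "after replication: ${TOTAL_GB} GB"   # prints 90 GB
echo "added per node: ~${PER_NODE_GB} GB"  # prints ~10 GB
```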

One more note: the load on your nodes is quite uneven, ranging from 2 GB to 7 GB. If you are using the ByteOrderedPartitioner instead of the random one, that can cause uneven load and hot spots in your ring. You should use the RandomPartitioner if at all possible. Another possibility is that you have extra data hanging around that needs to be taken care of (hints and snapshots come to mind). Consider cleaning that up by running nodetool repair and nodetool cleanup on each node, one at a time (be sure to read up on what they do first!).
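A sketch of running those two maintenance commands node by node; the node list is taken from the ring output above, and the DRY_RUN guard (my addition) just prints each command so you can review the sequence before running it for real:

```shell
#!/bin/sh
# Run repair, then cleanup, on every node one at a time.
# DRY_RUN=1 (the default here) only prints the commands; set
# DRY_RUN=0 to actually invoke nodetool against each host.
NODES="192.168.1.87 192.168.1.86 192.168.1.88 192.168.1.84 192.168.1.85 \
192.168.1.82 192.168.1.89 192.168.1.51 192.168.1.25"
DRY_RUN="${DRY_RUN:-1}"

for host in $NODES; do
  for cmd in repair cleanup; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "nodetool -h $host $cmd"
    else
      nodetool -h "$host" "$cmd"
    fi
  done
done
```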

Hope this helps.
