What is the best practice when developing a Cassandra data model?

What pitfalls should be avoided? Are there any deal-breakers? For example, I've heard that it is very difficult to export/import Cassandra data, which makes me wonder whether syncing production data to a development environment will be painful.

By the way, it is very difficult to find good tutorials on Cassandra; the only one I have, http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model , is still quite basic.

Thanks.

+60
cassandra nosql database-design
01 Oct '09 at 8:51
5 answers

For me, the main thing is to decide whether to use OrderedPartitioner or RandomPartitioner.

If you use the RandomPartitioner, range scans are not possible. This means that you must know the exact key for any activity, INCLUDING CLEANING UP OLD DATA.

So if you have a lot of churn, unless you have some magical way of knowing exactly which keys you have inserted data under, using the random partitioner you can easily "lose" data, which causes a disk space leak and will eventually consume all your storage.

On the other hand, with the ordered partitioner you can ask "what keys do I have in column family X between A and B?" and it will tell you. You can then clean them up.

However, there are also disadvantages. Since Cassandra does not do automatic load balancing, if you use the ordered partitioner, then in all likelihood all your data will end up on just one or two nodes and none on the others, which means you will waste resources.

I have no easy answer for this, except that you can get the "best of both worlds" in some cases by putting a short hash value (one you can enumerate easily from other data sources) at the beginning of your keys - for example, a 16-bit hash of the user ID, which gives you 4 hexadecimal digits, followed by whatever key you actually wanted to use.

Then, if you have a list of recently deleted users, you can simply hash their IDs and range-scan to clean up everything related to them. A minimal sketch of this prefixing scheme is below.
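Here is the idea in plain Python; the prefixed_key helper and the colon-separated key layout are illustrative assumptions, not anything Cassandra prescribes:

    import hashlib

    def prefixed_key(user_id, real_key):
        # Take the first 4 hex digits of an md5 of the user ID; this
        # spreads rows around the ring even under the OrderedPartitioner,
        # while keeping every key for one user enumerable by prefix.
        prefix = hashlib.md5(user_id.encode()).hexdigest()[:4]
        return f"{prefix}:{user_id}:{real_key}"

    # All keys for one user share a predictable prefix, so a range scan
    # starting at "<prefix>:alice:" finds everything to delete.
    print(prefixed_key("alice", "profile"))  # e.g. "6384:alice:profile"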

The next tricky bit is secondary indexes - Cassandra doesn't have any - so if you need to look up X by Y, you need to insert the data under both keys, or insert a pointer. Likewise, these pointers may need to be cleaned up when the thing they point to is deleted, but there is no easy way to query data on that basis, so your application just has to remember; a sketch of such dual writes is below.
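A hedged sketch of dual writes, using the modern DataStax Python driver for illustration (the users and users_by_email tables are hypothetical, and the original answer predates CQL, so treat this as the idea rather than the era-accurate API):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    db = cluster.connect("demo")

    def save_user(user_id, email):
        # Primary record, keyed by user id.
        db.execute("INSERT INTO users (user_id, email) VALUES (%s, %s)",
                   (user_id, email))
        # Manual "secondary index": the same fact keyed by email,
        # so we can look up the user (X) by the email (Y).
        db.execute("INSERT INTO users_by_email (email, user_id) VALUES (%s, %s)",
                   (email, user_id))

    def delete_user(user_id, email):
        # The application must remember to clean up the pointer row too;
        # Cassandra will not do it for us.
        db.execute("DELETE FROM users WHERE user_id = %s", (user_id,))
        db.execute("DELETE FROM users_by_email WHERE email = %s", (email,))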

And application bugs can leave orphaned keys that you have forgotten about, and you have no easy way of finding them, unless you write a garbage collector that periodically scans every key in the database (this will take a while - but you can do it in chunks) checking for ones that are no longer needed.

None of this is based on actual use, just what I have figured out during research. We do not use Cassandra in production.

EDIT: Cassandra now has secondary indexes in trunk.

+41
03 Oct '09 at 6:26

This is too long to add as a comment, so to clarify some misconceptions from the answer that lists problems:

  • Any client can connect to any node; if the first node that you pick (or that you reach through a load balancer) is down, simply connect to another. In addition, a "fat client" API is available with which the client can route the writes itself; an example is at http://wiki.apache.org/cassandra/ClientExamples

  • Timing out requests when a server is unresponsive, rather than hanging indefinitely, is a feature most people who have dealt with overloaded RDBMS systems have wished for. The Cassandra RPC timeout is configurable; if you want, you can set it to several days and deal with hanging indefinitely instead. :)

  • It is true that there is no multi-delete or truncate support yet, but there are patches for both in review.

  • Obviously, there is a trade-off in keeping load balanced across the cluster nodes: the more perfectly balanced you try to keep things, the more data movement you will do, which is not free. By default, new nodes in a Cassandra cluster will move to the optimal position in the token ring to minimize unevenness. In practice, this has been shown to work well, and the larger your cluster is, the less true it is that doubling is optimal. This is described at http://wiki.apache.org/cassandra/Operations

+17
Dec 17 '09 at 15:16

Are there any deal-breakers? Not necessarily deal-breakers, but things to be aware of:

  • The client connects to a single node (the nearest one, whose address it must know in advance), and all communication with all the other Cassandra nodes is proxied through it. (a) Read/write traffic is distributed unevenly among the nodes: some nodes proxy more data than they host themselves. (b) If that node goes down, the client is helpless; it cannot read or write anywhere in the cluster.

  • Although Cassandra claims that "writes never fail", they do fail, at least at the moment they are performed. If the target data node becomes sluggish, the request times out and the write fails. There are many reasons for a node to become unresponsive: the garbage collector kicks in, a compaction process runs, whatever... In all such cases, all write/read requests fail. In a conventional database these requests would just become proportionally slow, but in Cassandra they simply fail.

  • There is multi-get, but no multi-delete, and you cannot truncate a ColumnFamily either.

  • If a new, empty node joins the cluster, only a portion of the data from one neighbouring node on the key ring will be transferred to it. This leads to uneven data distribution and uneven load. You can fix it by always doubling the number of nodes. One should also keep track of tokens manually and select them wisely; a sketch of computing balanced tokens follows this list.
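For what it's worth, here is a sketch of picking balanced initial tokens by hand, using the even-spacing formula from the Cassandra operations wiki for the RandomPartitioner's 0..2**127 token space (the node count is illustrative):

    def balanced_tokens(node_count):
        # Space the nodes' initial tokens evenly around the ring.
        return [i * (2 ** 127) // node_count for i in range(node_count)]

    for i, token in enumerate(balanced_tokens(4)):
        print(f"node {i}: initial_token = {token}")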

+7
Nov 05 '09 at 22:36

I think this deserves an update, since Cassandra 1.2 was released recently.

I have been using Cassandra in production for the past 18 months for social games.

My take is that you should use Cassandra for its strengths. So you need a good understanding of what it does and how it does it in order to choose a data model, or even to decide whether another database solution would be more useful to you.

OrderedPartitioner is useful only if your application relies on key-range queries, but in exchange you give up one of Cassandra's most powerful features: automatic sharding and load balancing. Instead of querying ranges of row keys, try to implement the same functionality you need using ranges of column names within the same row. TL;DR: reads/writes will NOT be balanced across nodes if you use it.

RandomPartitioner (MD5 hashing) and Murmur3Partitioner (Murmur hashing, which is better and faster) are the way YOU SHOULD go if you want to support big data and high access frequencies. The only thing you give up is key-range queries. Everything on the same row is still on the same node in the cluster, and you can use the comparator and column-range queries. TL;DR: USE THEM FOR PROPER BALANCING; you won't give up anything essential. A sketch of a column-range query within a single row follows.
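A hedged sketch of the column-range pattern (in CQL 3 terms, a clustering-column range) with the DataStax Python driver; the events_by_day table and its columns are hypothetical:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    db = cluster.connect("demo")

    # One row per day (partition key "day"); events are ordered inside
    # the row by the clustering column "ts", so this range query is
    # served by a single replica set even under the Murmur3Partitioner.
    rows = db.execute(
        "SELECT ts, payload FROM events_by_day "
        "WHERE day = %s AND ts >= %s AND ts < %s",
        ("2013-04-04", 1365033600000, 1365037200000))
    for row in rows:
        print(row.ts, row.payload)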




Things you should know about Cassandra:

CASSANDRA IS EVENTUALLY CONSISTENT. Cassandra has chosen to trade consistency for high availability and excellent partition tolerance ( http://en.wikipedia.org/wiki/CAP_theorem ). BUT you can get consistency out of Cassandra; it is all about your consistency policy when you read and write. This is quite an important and complex topic when talking about using Cassandra, but you can read about it in detail at http://www.datastax.com/docs/1.2/dml/data_consistency .

As a rule of thumb (and to keep it simple), I read and write at QUORUM ConsistencyLevel (since in my applications reads happen at the same order of frequency as writes). If your application is hugely write-heavy and reads are much rarer, then write at ONE and read at ALL. Or, if your use case is the opposite (writes are much rarer than reads), you can try reading at ONE and writing at ALL. Using ANY as the consistency level for writes is not a great idea if consistency is what you are trying to achieve, since it guarantees that the mutation has reached the cluster but not that it has been written anywhere. This is the only case in which I have had writes fail silently on Cassandra.

These are simple rules of thumb that make it easier to get started with Cassandra. To get as much consistency and performance as possible out of a production cluster, you should study this topic in depth and understand it yourself. A minimal sketch of setting per-request consistency levels follows.
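Using the DataStax Python driver; the scores table and the demo keyspace are hypothetical:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    db = cluster.connect("demo")

    # QUORUM reads and QUORUM writes overlap, so a read always sees
    # the latest acknowledged write.
    write = SimpleStatement(
        "INSERT INTO scores (player, score) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    db.execute(write, ("alice", 9001))

    read = SimpleStatement(
        "SELECT score FROM scores WHERE player = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    print(db.execute(read, ("alice",)).one())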

If you need a human-readable data model with complex relationships between entities (tables), then I don't think Cassandra is for you. MySQL, or perhaps a NewSQL solution, may be more helpful for your use case.

It is good to know, roughly speaking, how Cassandra stores and reads data. Whenever you write (a delete is actually the write of a tombstone value in Cassandra), the system puts the new value, with a timestamp, in a new physical location.

When you read, Cassandra tries to pull out all the writes for a specific key/column_name location and returns the most recent one it can find (the one with the highest timestamp, which is supplied by the client). So the memory a node needs depends directly on the write frequencies. There is a compaction process in Cassandra that takes care of cleaning up old mutations. Cassandra also has an internal cache that is updated on reads with the latest value for the location. A toy model of this read-time reconciliation is sketched below.
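Purely illustrative, in plain Python, with no real Cassandra API involved:

    TOMBSTONE = object()  # a delete is just another timestamped write

    versions = [
        (1000, "v1"),       # first write
        (1500, TOMBSTONE),  # a delete...
        (2000, "v2"),       # ...superseded by a newer overwrite
    ]

    # The read path returns the value with the highest client timestamp.
    ts, value = max(versions, key=lambda v: v[0])
    print("deleted" if value is TOMBSTONE else value)  # -> "v2"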

Merging/compacting SSTables on disk (the data structures that persist the data) can be triggered by reads, but it is better not to rely on that. Cleaning up tombstones and expired columns (written with the time-to-live functionality) is a different mechanism, managed by the garbage collection grace period (see the GC grace seconds setting for more details).




This brings me to the last point I want to make: make sure that your writes and reads are balanced across your cluster!

Suppose all of your users need to update a single location very often.
DO NOT MAP that theoretical single location to just one row key! That would make all of your writes land on only one node in your cluster. Even if it doesn't bring everything down (because you have rockstar sysops), it will at least badly cripple the cluster's performance.
My advice is to bucket your writes over several different row keys, spread across all the nodes of the cluster. To retrieve all the data for that single theoretical location, use a multi_get over all the "bucket keys".

Example:
I want a list of all active http sessions (each of which is assigned a uuid). Do not save them all under a single "sessions" row key. What I use as row keys for my 6-node cassandra cluster are buckets of the form [0-9a-f]_sessions, one per leading hex digit of the session uuid. Then a small 16-key multi_get retrieves all active sessions, and I can still tell whether a single session is active with a simple get (if I know its uuid, of course). If your cluster is much bigger, you may need a hash function to generate the bucket keys. A sketch of this bucketing scheme follows.
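A hedged sketch of the bucketing scheme: plain Python for the key layout, the DataStax Python driver for I/O; the active_sessions table and the bucket format are my assumptions:

    import uuid
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    db = cluster.connect("demo")

    def bucket_key(session_uuid):
        # 16 buckets, one per leading hex digit of the uuid, so writes
        # spread across the whole ring instead of hammering one row.
        return f"{session_uuid.hex[0]}_sessions"

    def mark_active(session_uuid):
        db.execute(
            "INSERT INTO active_sessions (bucket, session_id) VALUES (%s, %s)",
            (bucket_key(session_uuid), session_uuid))

    def all_active_sessions():
        # The "multi_get": one query per bucket, 16 in total.
        for digit in "0123456789abcdef":
            rows = db.execute(
                "SELECT session_id FROM active_sessions WHERE bucket = %s",
                (f"{digit}_sessions",))
            for row in rows:
                yield row.session_id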

+5
Apr 04 '13 at 14:33


