Cassandra Internal Storage

I am trying to understand what exactly happens internally at the storage engine level when rows are inserted into a CQL table.

    CREATE TABLE log_date (
        userid bigint,
        time timeuuid,
        category text,
        subcategory text,
        itemid text,
        count int,
        price int,
        PRIMARY KEY ((userid), time)                                           -- #1
        PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price)  -- #2
    );

Suppose I have a table as above.

In case #1, a CQL row will generate 6 (or 5?) columns in the storage engine.
In case #2, a CQL row will generate a single composite column in the storage engine.

I am wondering which is the more efficient way to store logs in Cassandra.
Please focus on just the two cases I posed.
I do not need real-time reads. Just writes.



If you want to suggest other options, please consider the following. The reasons I chose Cassandra to store logs are:

  • Linear scalability, and good for write-heavy workloads.
  • CQL has a schema. I really prefer having a schema.
  • It seems to support Spark quite well. The DataStax spark-cassandra-connector seems to be data-locality aware.
1 answer

I am trying to understand what exactly happens internally at the storage engine level when rows are inserted into a CQL table.

Let's say that I build tables with both of your PRIMARY KEYs and INSERT some data.
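For reference, here is the DDL I am assuming for the two variants (the names log_date1 and log_date2 match the query output below), plus a reconstruction of one of the INSERTs; the now() call generates the timeuuid:

    CREATE TABLE log_date1 (
        userid bigint,
        time timeuuid,
        category text,
        subcategory text,
        itemid text,
        count int,
        price int,
        PRIMARY KEY ((userid), time)
    );

    CREATE TABLE log_date2 (
        userid bigint,
        time timeuuid,
        category text,
        subcategory text,
        itemid text,
        count int,
        price int,
        PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price)
    );

    -- one of the five sample rows, written to both tables
    INSERT INTO log_date1 (userid, time, category, subcategory, itemid, count, price)
    VALUES (1002, now(), 'Books', 'Novels', '678-2-44398-312-9', 1, 798);

Querying both tables back: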

    aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;

     userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
    --------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
       1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
       1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
       1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
       1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
       1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

    (5 rows)

    aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;

     userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
    --------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
       1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
       1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
       1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
       1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
       1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

    (5 rows)

Both look about the same through cqlsh. So let's take a look from cassandra-cli, and query all of the rows for userid 1002:

    RowKey: 1002
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)

Simple enough, right? We see userid 1002 as the RowKey, and our clustering column time as the column key. Following that are all of the columns for each column key (time). And I believe your first case generates 6 columns, since I'm fairly sure that includes the placeholder for the column key (the cell above with the empty name suffix and empty value), because your PRIMARY KEY could point to an empty value (as your second example key does).
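You can verify that placeholder behavior yourself. A hypothetical example (userid 1003 is not from the data above): in log_date1 a row can exist with only its key columns bound, because the storage engine still writes the empty placeholder cell. In log_date2 the same statement would be rejected, since every remaining column there is a clustering key:

    -- log_date1: only the PRIMARY KEY columns are bound; this succeeds,
    -- and the non-key columns simply come back as null
    INSERT INTO log_date1 (userid, time) VALUES (1003, now());
    SELECT * FROM log_date1 WHERE userid = 1003;

    -- log_date2: this fails, because category, subcategory, itemid,
    -- count and price are clustering columns and cannot be left out
    INSERT INTO log_date2 (userid, time) VALUES (1003, now());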

But what about the second version for userid 1002?

    RowKey: 1002
    => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
    => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)

Two columns are returned for RowKey 1002, one for each unique combination of our clustering columns, each with an empty value (as described above).
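A side effect worth noting: because the entire clustering combination is the column name, writing the same combination twice does not create a second column; it just overwrites (upserts) the first one. A quick illustration, reusing one of the rows above:

    -- running this INSERT a second time targets the exact same cell
    -- name, so the partition still holds one column for this entry;
    -- only the cell timestamp changes
    INSERT INTO log_date2 (userid, time, category, subcategory, itemid, count, price)
    VALUES (1002, 15d0fd20-f589-11e4-ade7-21b264d4c94d,
            'Audio', 'Headphones', '228-5-44343-344-5', 1, 4799);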

So what does all this mean to you? Well, a few things:

  • This should show you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you ever decide that you need to update key values like category or subcategory (second example), you really can't without deleting and re-creating the row. Although in terms of logging, that's probably a good thing.
  • Cassandra stores all of the data for a particular partition/row key (userid) together, sorted by the clustering columns. If you are concerned about querying and sorting your data, it is important to understand that you will have to query by a specific userid for that sort order to make any difference.
  • The biggest problem I see is that you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your second example will help you the most in that regard. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never log more than 2 billion entries per year, or whatever period fits); a sketch of this follows the list.
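To make the last two points concrete, here is a minimal sketch of a date-bucketed version; the table name log_date3 and the monthbucket column are my own illustration, not from the original schema:

    CREATE TABLE log_date3 (
        userid bigint,
        monthbucket text,   -- e.g. '2015-05'; part of the partition key
        time timeuuid,
        category text,
        subcategory text,
        itemid text,
        count int,
        price int,
        PRIMARY KEY ((userid, monthbucket), time, category, subcategory, itemid, count, price)
    );

    -- queries address one user's bucket, and the rows come back sorted
    -- by the clustering columns within that partition
    SELECT * FROM log_date3
    WHERE userid = 1002 AND monthbucket = '2015-05'
    ORDER BY time DESC;

This caps each partition at one user-month of entries, and it also shows the second point: the sort order only applies within the partition you query.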

It sounds to me like your second option might be the better choice. But honestly, for what you are doing, either one will probably work just fine.
