I am trying to understand what exactly happens internally at the core of the repository when the row (s) are inserted into the CQL style sheet.
Let's say that I am building tables with both of your PRIMARY KEYS and INSERT some data:
aploetz@cql sh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1; userid | time | dateof(time) | category | subcategory | itemid | count | price --------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+------- 1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798 1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799 1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200 1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798 1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599 (5 rows) aploetz@cqlsh :stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2; userid | time | dateof(time) | category | subcategory | itemid | count | price --------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+------- 1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798 1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799 1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200 1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798 1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599 (5 rows)
It looks about the same through cqlsh . So let's take a look from cassandra-cli and request all the lines of foor userid 1002:
RowKey: 1002 => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568) => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568) => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568) => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568) => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568) => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)
Simple enough, right? We see userid 1002 as RowKey and our clustered time column as the column key. After that, all columns for each column ( time ). And I believe that your first instance generates 6 columns, since I'm sure it includes a placeholder for the column key, because your PRIMARY KEY can point to a null value (like your second example key).
But what about the second version for userid 1002?
RowKey: 1002 => (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994) => (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)
For RowKey 1002, two columns are returned, one for each unique combination of our columns (clusters), with an empty value (as mentioned above).
So what does all this mean to you? Well, a few things:
- This should tell you that the KEY KEYS in Kassandra provide uniqueness. Therefore, if you decide that you need to update key values, such as
category or subcategory (second example), that you really cannot, unless you delete and re-create the line. Although in terms of journaling, perhaps a good thing. - Cassandra stores all the data for a particular section / row key (
userid ) together, sorted by column (clustering). If you were concerned about querying and sorting your data, it would be important to understand that you would need to request a sort order for each specific userid order to make a difference. - The biggest problem that I see is that now you are setting yourself up for unlimited column growth. Partition / row keys can support a maximum of 2 billion columns, so your second example will help you as much as possible. If you think that some of your
userid may exceed this, you can implement the "date bucket" as an additional key for the section (say, if you knew that userid would never exceed more than 2 billion a year or something else).
It seems to me that your second option may be the best choice. But honestly, for what you do, any of them will probably work fine.