How to query Cassandra by date

I have a Cassandra ColumnFamily (0.6.4) that will receive new entries from users. I would like to query Cassandra for these new entries so that I can process the data on another system.

I figured I could use TimeUUIDType as the key for my records, and then request a KeyRange that starts either with "" as the start key or with the last key I retrieved. Is this the correct approach?

How does get_range_slice actually build a range? Shouldn't it need to know the key data type? The key data type is not declared anywhere. In the storage_conf.xml file you specify the type of the columns, but not of the keys. Is the key supposed to be the same type as the columns? Or does it do some kind of magic sniffing to guess?

I also saw reference implementations in which people store TimeUUIDType values in columns. However, this seems to have scaling issues, since that particular key will become hot: every change needs to update it.

Any pointers in this case will be appreciated.

+7
cassandra nosql
3 answers

When sorting data, only the column keys matter. The data stored is irrelevant, and so is the auto-generated timestamp. The CompareWith attribute is what matters here. If you set CompareWith to UTF8Type, the keys are interpreted as UTF8Types. If you set CompareWith to TimeUUIDType, the keys are automatically interpreted as timestamps. You do not need to specify a data type. See the definitions of SlicePredicate and SliceRange at http://wiki.apache.org/cassandra/API which is a good place to start. You may also find this article useful: http://www.sodeso.nl/?p=80 In the third part or so, the author talks about using slice ranges in his queries.
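To illustrate what "interpreted as timestamps" means, here is a minimal sketch using only Python's standard uuid module, not a Cassandra client. A version 1 (time-based) UUID stores its timestamp split across several fields, so plain byte or string order is not generally chronological; a time-aware comparison has to look at the embedded timestamp instead. The helpers time_uuid and unix_time are hypothetical names introduced for this example, and the node value is arbitrary.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def time_uuid(unix_seconds, clock_seq=0, node=0x123456789ABC):
    """Build a version 1 UUID for a given Unix time (hypothetical helper)."""
    ticks = unix_seconds * 10_000_000 + UUID_EPOCH_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # set version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def unix_time(u):
    """Recover the Unix time (whole seconds) embedded in a version 1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET) // 10_000_000


# Column names inserted out of order.
names = [time_uuid(t) for t in (1_300_000_000, 1_000_000_000, 1_200_000_000)]

# A time-aware comparator orders them by the embedded timestamp.
by_time = sorted(names, key=lambda u: u.time)

if __name__ == "__main__":
    for u in by_time:
        print(u, unix_time(u))
```

Sorting with key=str instead would order by the low 32 bits of the timestamp first, which is why the comparator has to be chosen explicitly.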

+2

Arc,

Writing to the same column family can sometimes create a hot spot if you use the order-preserving partitioner, but not if you use the random partitioner, which is the default (unless a subset of users creates much more data than all the other users!).

If you sort your rows by time (using the order-preserving partitioner), you are likely to create hotspots, since you will be adding rows sequentially and a single node is responsible for each range of the key space.
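The effect can be sketched with a toy simulation. This is not how Cassandra assigns tokens internally; it assumes a hypothetical four-node cluster, splits the key range into equal contiguous slices for the order-preserving case, and uses the MD5 hash modulo the node count as a stand-in for the random partitioner.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def ordered_node(key, lo, hi):
    """Order-preserving placement: equal contiguous slices of [lo, hi)."""
    return min((key - lo) * NODES // (hi - lo), NODES - 1)


def random_node(key):
    """Hash placement: MD5 of the key, reduced modulo the node count."""
    token = int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)
    return token % NODES


# The key space covers one year of second-resolution timestamps,
# but all current writes fall into the most recent hour.
lo, hi = 0, 365 * 24 * 3600
recent = range(hi - 3600, hi)

ordered_load = Counter(ordered_node(k, lo, hi) for k in recent)
random_load = Counter(random_node(k) for k in recent)

if __name__ == "__main__":
    print("order-preserving:", dict(ordered_load))
    print("random:          ", dict(random_load))
```

With time-ordered keys, every recent write lands on the node that owns the newest slice, while hashing spreads the same writes over all nodes.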

0

Columns and keys can be of any type, since the row key is just the first column. In fact, the cluster is a hash ring, and the keys are hashed by the partitioner to distribute them across the cluster.

Beware of using dates as row keys, since even the randomization of the standard RandomPartitioner is limited, and you can end up with your data clustered together.

What's more, if this date changes, you will have to delete the previous row, since you can only do inserts in C*.

Here is what we know:

  • A slice range is a range of columns in a row with a start value and an end value. It is used mainly for wide rows, since columns are kept in order. Known column names defined in the CF are indexed, so you can retrieve them by specifying their names.
  • A key slice is a key associated with the slice range of columns returned by Cassandra.
  • The equivalent of a WHERE clause uses secondary indexes. You can use inequality operators there, but your statement must have at least one equality condition (also see https://issues.apache.org/jira/browse/CASSANDRA-1599 ).
  • Using a range of keys is inefficient with the RandomPartitioner, since the MD5 hash of your key does not preserve lexical ordering.
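The last point can be checked with a short sketch using Python's hashlib. The token function is a simplification introduced for this example (the raw MD5 digest read as an integer), and the date-like keys are made up.

```python
import hashlib


def token(key):
    """Simplified token: the MD5 digest of the key as a big integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


# Date-like row keys that sort chronologically as strings.
keys = [f"2010-09-{day:02d}" for day in range(1, 11)]

by_key = sorted(keys)               # lexical order
by_token = sorted(keys, key=token)  # order in which the ring would hold them

if __name__ == "__main__":
    for k in by_token:
        print(k, hex(token(k))[:12])
```

Because neighbouring keys end up at unrelated positions on the ring, a key range under this kind of partitioner walks tokens, not dates.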

What you want is a column-based index using a wide row: CompositeType(TimeUUID | UserID). To prevent this row from becoming hot, add a leading meaningful key (a "shard key") that spreads the data across nodes, such as user type or region.
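A minimal in-memory sketch of that layout, assuming a shard key such as a region name as the row key and (timestamp, user_id) pairs as composite column names. TimelineIndex and its methods are hypothetical names, plain integers stand in for TimeUUIDs, and no Cassandra client is involved.

```python
import bisect
from collections import defaultdict


class TimelineIndex:
    """Stand-in for a column family: row key -> sorted composite column names."""

    def __init__(self):
        self.rows = defaultdict(list)

    def insert(self, shard, timestamp, user_id):
        # The row key is the shard; the column name is (timestamp, user_id).
        # insort keeps the columns of each row in sorted order.
        bisect.insort(self.rows[shard], (timestamp, user_id))

    def since(self, shard, last_seen):
        """Columns strictly after last_seen, or the whole row if it is None."""
        columns = self.rows.get(shard, [])
        if last_seen is None:
            return list(columns)
        start = bisect.bisect_right(columns, last_seen)
        return columns[start:]


if __name__ == "__main__":
    index = TimelineIndex()
    index.insert("eu", 200, "bob")
    index.insert("eu", 100, "alice")
    batch = index.since("eu", None)        # first poll reads the whole row
    index.insert("eu", 300, "dave")
    print(index.since("eu", batch[-1]))    # later polls resume after the last column
```

The consumer remembers the last column it processed per shard and polls each shard in turn, which corresponds to a slice range starting at that column.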

Having more data than strictly necessary in Cassandra is not a problem; that is how it is designed. So you should ask yourself "what do I need to query?" and then design a column family for it, rather than trying to fit everything into one CF as in an RDBMS.

0