Best approach to Cassandra (+ Spark?) for continuous queries?

We are currently using Hazelcast ( http://hazelcast.org/ ) as an in-memory distributed data grid. It has served us well, but in-memory-only storage has run its course for our use case, and we are considering moving the application to persistent NoSQL storage. After the usual comparisons and evaluations, we are on the verge of choosing Cassandra, plus eventually Spark for analytics.

However, there is one architectural requirement we still don't know how to solve in Cassandra (with or without Spark): Hazelcast lets us register a continuous query, i.e. whenever a row is added to / removed from / modified in the query's result set, Hazelcast calls us back with a notification. We use this to continuously push new/changed rows to clients via AJAX streaming.
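For reference, this is roughly what we do with Hazelcast today (a minimal sketch against the Hazelcast 3.x API; the map contents, the ShipPosition type, the predicate, and pushToClients are all made up for illustration):

    import com.hazelcast.core.{EntryAdapter, EntryEvent, Hazelcast}
    import com.hazelcast.query.{Predicate, SqlPredicate}

    case class ShipPosition(shipId: String, lat: Double, lon: Double, region: String)

    def pushToClients(e: EntryEvent[String, ShipPosition]): Unit = ??? // our AJAX push

    val positions = Hazelcast.newHazelcastInstance().getMap[String, ShipPosition]("positions")

    // Continuous query: the listener fires only for entries matching the predicate.
    positions.addEntryListener(new EntryAdapter[String, ShipPosition] {
      override def entryAdded(e: EntryEvent[String, ShipPosition]): Unit = pushToClients(e)
      override def entryUpdated(e: EntryEvent[String, ShipPosition]): Unit = pushToClients(e)
      override def entryRemoved(e: EntryEvent[String, ShipPosition]): Unit = pushToClients(e)
    }, new SqlPredicate("region = 'california_coast'").asInstanceOf[Predicate[String, ShipPosition]], true)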

This is probably a conceptual mismatch on our side, hence the question: how is this use case best solved in Cassandra (with or without Spark's help)? Is there something in the API that allows continuous queries on key/query changes (we did not find it)? Is there any other way to get a stream of key/query updates? Any events?

I know we could poll Cassandra periodically, but in our use case a client is potentially interested in a large number of notifications on a table (think "all position changes of ships off the California coast"), and polling the store for each of them would quickly become a scalability bottleneck.

So the magic question is: what are we missing? Is Cassandra the wrong tool for the job? Are we unaware of some part of the API, or some external library inside or outside the Apache ecosystem, that would make this possible?

Thanks so much for any help!

Hugo

+7
events cassandra apache-spark
4 answers

I'm not a Spark expert, so take this with a grain of salt, but perhaps you could use this approach:

  • Use Spark Streaming for real-time analytics of the incoming data stream and to push updated location data to clients in real time.

  • Use Cassandra for persistent storage, and for the cached views and summary data from which clients pull data.

So you would write a Spark Streaming application that connects to your incoming data stream, which presumably reports vessel positions at regular intervals. When it receives a ship's position, it looks up the ship's last known position in Cassandra (stored in a time series of positions clustered by ship ID and sorted by timestamp descending, so the latest position is the first row). If the ship's position has changed, the Spark application inserts a new time-series row into Cassandra and pushes the new position to your clients in real time.
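Such a time-series table could be created and queried like this (just a sketch: the keyspace, table, and column names are invented, and sparkConf is your Spark configuration):

    import com.datastax.spark.connector.cql.CassandraConnector

    CassandraConnector(sparkConf).withSessionDo { session =>
      // Clustering by timestamp DESC makes the latest position the first
      // row of each ship's partition.
      session.execute(
        """CREATE TABLE IF NOT EXISTS tracking.ship_positions (
          |  ship_id text,
          |  ts      timestamp,
          |  lat     double,
          |  lon     double,
          |  PRIMARY KEY (ship_id, ts)
          |) WITH CLUSTERING ORDER BY (ts DESC)""".stripMargin)

      // Last known position = first row of the partition.
      val last = session.execute(
        "SELECT ts, lat, lon FROM tracking.ship_positions WHERE ship_id = ? LIMIT 1",
        "some-ship-id").one()
    }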

Spark would also write other updates to Cassandra for things clients might want to know about, e.g. rollup tables with the number of vessels currently in San Francisco Bay. When a client clicks on the bay, the rollup table is queried to pull that data for display. Anything that requires a fast response time on the client should be pre-computed by Spark Streaming and stored in Cassandra for quick retrieval.
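One way to implement such a rollup is a Cassandra counter table that the streaming job updates as vessels enter and leave a region; clients then read the pre-computed count with a single-row SELECT (again a sketch with invented names):

    import com.datastax.spark.connector.cql.CassandraConnector

    CassandraConnector(sparkConf).withSessionDo { session =>
      session.execute(
        """CREATE TABLE IF NOT EXISTS tracking.vessels_by_region (
          |  region       text PRIMARY KEY,
          |  vessel_count counter
          |)""".stripMargin)

      // Increment when a vessel enters the region; decrement on exit.
      session.execute(
        "UPDATE tracking.vessels_by_region SET vessel_count = vessel_count + 1 WHERE region = ?",
        "san_francisco_bay")
    }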

When a new client starts up, it first queries (pulls) Cassandra for the current position of all ships, and from then on real-time updates to that data are pushed to it by the Spark application.

+1

Use Spark Streaming. When an update arrives, perform two operations:

  • Do a saveToCassandra, which updates the Cassandra data for future queries (see the sketch after the code below).
  • Push the change to clients using whatever mechanism you already use. You can send the AJAX notification from Spark Streaming directly if your AJAX push code can live inside the streaming code. Otherwise, send a message to a proxy server that relays it to the AJAX clients.

Your code might look something like this:

    val notifications = ssc.whateverSourceYouHaveThatGivesADstream(...)

    notifications.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        cassandraConnector.withSessionDo { session =>
          partition.foreach { update =>
            // use session to update Cassandra
            // broadcast via AJAX or send to a proxy to broadcast
          }
        }
      }
    }
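For the first bullet, the Spark Cassandra Connector also lets you call saveToCassandra directly on a DStream. A minimal sketch, assuming a hypothetical positions stream and a matching table (all names invented for illustration):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    // positions: a DStream[(String, java.util.Date, Double, Double)] of
    // (ship_id, ts, lat, lon) tuples.
    positions.saveToCassandra("tracking", "ship_positions",
      SomeColumns("ship_id", "ts", "lat", "lon"))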

Hope this helps.

+1

Check out the Spark Job Server

You probably want to take a look at the Spark Job Server

It allows you to do things like share Spark contexts, and thus cache RDDs, between different jobs.

It also provides a RESTful Spark API for near real-time queries (depending on how often you refresh your caches).
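A job submitted to the Job Server implements its SparkJob trait, which is what lets it run inside a shared, long-lived context. Roughly like this (a sketch against the 0.x API; the job logic is invented):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    object ShipPositionsJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      override def runJob(sc: SparkContext, config: Config): Any = {
        // Because the context is shared, RDDs persisted by earlier jobs are
        // still cached here and can be reused across REST calls.
        sc.getPersistentRDDs.size
      }
    }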

0

The Spark Cassandra Connector may help. It supports streaming from a Cassandra table:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._

    val ssc = new StreamingContext(sparkConf, Seconds(1))

    val rdd = ssc.cassandraTable("streaming_test", "key_value")
      .select("key", "value")
      .where("fu = ?", 3)
0
