Cassandra + Solr / Hadoop / Spark - choosing the right tools

I am currently investigating how to store and analyze enriched time-based data with up to 1000 columns per row. At the moment Cassandra, together with Solr, Hadoop, or Spark as offered by DataStax Enterprise, seems to meet my rough requirements. But the devil is in the details.

Of the 1000 columns, about 60 are used in real-time queries (a web interface: the user submits a form and expects a quick response). These queries are more or less GROUP BY statements that count values or occurrences.
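For illustration, here is a minimal sketch of the kind of grouped count I mean, written over a plain Scala collection; the Event fields are made-up stand-ins (the real rows have about 1000 columns):

```scala
// Hypothetical shape of one enriched row
case class Event(userId: String, country: String, device: String)

object GroupByCountSketch {
  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("u1", "DE", "mobile"),
      Event("u2", "DE", "desktop"),
      Event("u3", "FR", "mobile"))

    // Equivalent of: SELECT country, device, COUNT(*) ... GROUP BY country, device
    val counts = events.groupBy(e => (e.country, e.device)).mapValues(_.size)
    counts.foreach { case ((country, device), n) => println(s"$country/$device: $n") }
  }
}
```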

Since Cassandra itself does not provide the required analytical capabilities (no GROUP BY), I am left with these alternatives:

  • Query Cassandra coarsely and filter the result set in self-written application code (see the sketch after this list)
  • Index data using Solr and run facet.pivot queries
  • Use either Hadoop or Spark and run the queries
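To make the first alternative concrete, here is a rough sketch using the DataStax Java driver from Scala. The keyspace, table, and column names are invented, and real code would also have to page through large result sets:

```scala
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object ClientSideGroupBy {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace") // hypothetical keyspace

    // Coarse query: fetch one time bucket, then group and count client-side
    val rows = session.execute(
      "SELECT category FROM events WHERE day = '2015-06-01'").asScala

    val counts = rows
      .map(_.getString("category"))
      .toSeq
      .groupBy(identity)
      .mapValues(_.size)

    counts.foreach { case (category, n) => println(s"$category: $n") }
    cluster.close()
  }
}
```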

The first approach seems cumbersome and error-prone ... Solr does have some analytic functions, but without multi-level grouping I kept getting stuck in corners; facet.pivot (option two above) would provide that grouping, though I don't know whether it is a good or performant approach ... And last but not least: Hadoop and Spark. The former is known not to be the best fit for real-time queries; the latter is rather new and possibly not production-ready.
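For reference, a facet.pivot query in SolrJ might look roughly like the following; the core name and field names are invented, and I am not sure this is the idiomatic way to do it under DSE Search:

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer // SolrJ 4.x style client

object PivotFacetSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical core; in DSE Search the core is usually named keyspace.table
    val solr = new HttpSolrServer("http://localhost:8983/solr/my_keyspace.events")

    val q = new SolrQuery("*:*")
    q.setRows(0)                            // only facet counts, no documents
    q.setFacet(true)
    q.add("facet.pivot", "country,device")  // two-level grouping: country, then device

    val response = solr.query(q)
    println(response.getFacetPivot)         // nested tree of value -> count
  }
}
```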

So which way to go? There is probably no single right answer, but before I commit to one path I would like to get some feedback. Maybe I am overcomplicating things, or my expectations are too high :S

Thanks in advance,

Arman

2 answers

Where I work, we currently have a similar set of technical requirements, and the solution is Cassandra + Solr + Spark, in exactly that order.

So: if a query can be "covered" by Cassandra's indices, fine; if not, it is covered by Solr. For testing and less frequent queries we use Spark (Scala; no Spark SQL because of its old version: it is a bank, everything has to be tested and matured, from cognac to software, argh).
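As a rough idea of what the Spark leg looks like in Scala, via the DataStax spark-cassandra-connector; the keyspace, table, and column names here are invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object OfflineCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("offline-counts")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Full-table scan, grouped and counted in Spark: fine for offline
    // and ad-hoc work, too heavy for the web-request path
    val counts = sc.cassandraTable("my_keyspace", "events") // hypothetical table
      .map(_.getString("category"))
      .countByValue()

    counts.foreach { case (category, n) => println(s"$category: $n") }
    sc.stop()
  }
}
```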

In general, I agree with this solution, although sometimes I feel that some customer requests should NOT be taken seriously, which would save us from many strange queries :)


I would recommend Spark: if you dig into the list of companies using it, you will find names like Amazon, eBay, and Yahoo!. In addition, as you noted in the comments, it is becoming a mature tool.

You have already given arguments against Cassandra and Solr, so I will focus on explaining why Hadoop MapReduce would not do as well as Spark for real-time queries.

Hadoop and MapReduce were designed around the hard drive, under the assumption that for big data the IO cost is negligible. As a result, data is read and written at least twice: once in the map stage and once in the reduce stage. This makes it possible to recover from failures, since partial results are persisted to disk, but it is not what you want when serving real-time queries.

Spark was designed not only to fix the shortcomings of MapReduce, but also with exactly the kind of interactive data analysis you describe in mind. It achieves this mainly by keeping data in RAM, and the results are striking: Spark jobs are often 10-100 times faster than their MapReduce equivalents.

The only caveat is the amount of memory you have. Most probably your data will either fit into the RAM you can provide, or you can fall back on sampling. Usually there is no real need for MapReduce when working with data interactively, and that seems to be your case.
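To illustrate both points, here is a minimal Spark sketch in Scala showing caching and sampling; the input path and field position are invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InteractiveCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("interactive-counts"))

    val categories = sc.textFile("hdfs:///data/events.csv") // hypothetical input
      .map(_.split(',')(3))  // hypothetical: 4th field is the category

    // Keep the working set in RAM so repeated queries never touch the disk again
    categories.cache()

    // Exact counts over the cached data...
    println(categories.countByValue())

    // ...or approximate answers from a 1% sample if the data outgrows RAM
    val sample = categories.sample(withReplacement = false, fraction = 0.01)
    println(sample.countByValue().mapValues(_ * 100)) // scale counts back up

    sc.stop()
  }
}
```

Note that cache() is lazy: the first action pays the loading cost, and every query after that runs from memory.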

