Do you want to stay purely open source? If you are planning to move into an enterprise setting at some point, many of Hadoop's commercial distributions include Spark analytics.
I have a bias, but there is also the DataStax Enterprise product, which integrates Cassandra, Hadoop, Spark, Apache Solr, and other components. It is used by many large Internet companies, in particular for the kinds of applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
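To give a feel for that kind of integration, here is a minimal sketch of Spark querying a Cassandra table through the open-source spark-cassandra-connector (the host, keyspace, and table names are hypothetical placeholders, and the connector package must be on the classpath):

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark reading Cassandra via the spark-cassandra-connector.
# The host, keyspace, and table below are hypothetical placeholders.
spark = (SparkSession.builder
         .appName("cassandra-spark-sketch")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .getOrCreate())

users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="my_keyspace", table="users")
         .load())

# Once loaded, the Cassandra table is an ordinary DataFrame for Spark analytics.
users.groupBy("country").count().show()
```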
You will also want to think about how you will host this.
If you stay in the cloud, you do not have to choose: depending on your cloud environment (with AWS, for example) you can use Spark for continuous stream processing, Hadoop MapReduce for long-timeline batch jobs (analysis of data accumulated over a long period), and so on, since storage is separated from collection and processing. Put the data in S3, then process it later with whatever engine you need.
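As a sketch of that decoupling, assuming a Spark cluster with S3 access configured via hadoop-aws (the bucket and paths here are hypothetical), ingestion lands raw data in S3 and any engine can pick it up later:

```python
from pyspark.sql import SparkSession

# Minimal sketch: storage (S3) is decoupled from compute, so the same data
# can be processed later by any engine. Bucket and paths are hypothetical,
# and the s3a:// scheme assumes the hadoop-aws module is configured.
spark = (SparkSession.builder
         .appName("s3-decoupled-processing")
         .getOrCreate())

# An ingestion step elsewhere writes raw events to S3, e.g. as JSON.
events = spark.read.json("s3a://my-data-bucket/raw/events/")

# Process later with whichever engine fits; here, a Spark batch aggregation.
daily_counts = events.groupBy("event_date").count()
daily_counts.write.parquet("s3a://my-data-bucket/derived/daily_counts/")
```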
If you host your own equipment, building a Hadoop cluster gives you the ability to mix hardware (heterogeneous hardware is supported by the infrastructure), a reliable and flexible storage platform, and a combination of analysis tools including HBase and Hive, and it has ports for most of the other things you mentioned, such as Spark on Hadoop (not a port, actually; that was Spark's original design). This is probably the most versatile platform, and it can be deployed and grown cheaply, since the hardware does not have to be identical for each node.
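For example, on such a cluster Spark can run directly against the Hadoop stack and query Hive tables in place. A minimal sketch, assuming a working Hive metastore and submission to the cluster (the table name is a hypothetical placeholder):

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark running against an existing Hadoop/Hive deployment.
# Submitted with spark-submit --master yarn, it uses the cluster's resources;
# the web_logs table below is a hypothetical placeholder.
spark = (SparkSession.builder
         .appName("spark-on-hadoop-sketch")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

# Query a Hive-managed table in place with Spark SQL.
spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show()
```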
If you self-host, choosing one of the other cluster options instead will tie you to hardware requirements that can be difficult to scale later.