Social networks: Hadoop, HBase, Spark over MongoDB or Postgres?

I am working on the architecture of a social network that includes various features, many of which involve data-intensive workloads on large volumes of data (such as machine learning). For example: recommender systems, search engines, and time series.

Given that I currently have fewer than 5 users, but want to anticipate significant growth, which indicators should I use to decide between:

  • Spark (with / without HBase over Hadoop)
  • MongoDB or Postgres

I am looking at Postgres as a way to reduce the data-transfer friction between it and Spark (using an SQL abstraction layer that works on both). Spark seems quite interesting: it can serve ML, SQL, and Graph queries and answer them quickly. MongoDB is what I usually use, but I have found its scaling and map-reduce features very limited.
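To make the Postgres idea concrete, here is roughly the abstraction layer I have in mind (just a sketch; the connection details and table names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-abstraction-sketch").getOrCreate()

    // Expose a Postgres table to Spark as a DataFrame via JDBC
    // (URL, credentials, and table name are placeholders).
    val users = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/socialnet")
      .option("dbtable", "users")
      .option("user", "app")
      .option("password", "secret")
      .load()

    users.createOrReplaceTempView("users")

    // The same SQL would run unchanged if "users" were instead loaded
    // from Parquet on HDFS, which is the abstraction I am after.
    spark.sql("SELECT country, count(*) AS n FROM users GROUP BY country").show()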

+7
postgresql mongodb hadoop bigdata apache-spark
6 answers

I think you are heading in the right direction: looking for a stack / software architecture that can:

  • handle various types of workloads: batch, real-time computing, etc.
  • scale in size and speed as the business grows
  • be a fully supported, actively maintained software stack
  • have common library support for domain-specific computing such as machine learning

On those criteria, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in batch mode. It provides robust, scalable storage (HDFS) and computation (MapReduce/YARN). Adding Spark lets you keep that storage (HDFS) while gaining the real-time, in-memory computation that Spark brings.

In terms of development, both systems support Java/Scala. Library support and performance-tuning advice are abundant here on Stack Overflow and elsewhere. There are at least a couple of machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
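For illustration, a minimal sketch (the path and column names are invented) of MLlib training directly against data in HDFS, so storage and ML computation share the same cluster:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

    // Raw activity data stored on HDFS by the batch layer.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/user_activity.csv")

    // MLlib estimators expect a single vector column of features.
    val features = new VectorAssembler()
      .setInputCols(Array("likes", "comments", "shares"))
      .setOutputCol("features")
      .transform(df)

    // Cluster users, e.g. as a first cut at recommender segments.
    val model = new KMeans().setK(10).setFeaturesCol("features").fit(features)
    model.clusterCenters.foreach(println)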

For deployment, AWS and other cloud providers offer hosted Hadoop/Spark solutions. No problem there either.

+5

I think you need to separate data storage from data processing. In particular, "Spark or MongoDB?" is not a good question to ask; better questions are "Spark or Hadoop or Storm?" and, separately, "MongoDB or Postgres or HDFS?"

In any case, I would refrain from doing the heavy processing inside the database.

+1

I have to admit that I'm a little biased, but if you want to learn something new, have serious free time, are ready to read a lot, and have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of real-time atomic counters.

* Along with Hadoop, Hive, Spark ...
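To give a taste of those counters, a minimal sketch with the plain HBase client API (the table, family, and qualifier names are made up):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.util.Bytes

    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("user_metrics"))

    // Atomically add 1 to user42's "likes" counter, server-side:
    // no read-modify-write race, which is what makes billions of
    // real-time counters practical.
    val likes = table.incrementColumnValue(
      Bytes.toBytes("user42"), // row key
      Bytes.toBytes("m"),      // column family
      Bytes.toBytes("likes"),  // qualifier
      1L)

    println(s"likes is now $likes")
    table.close()
    connection.close()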

+1

In my opinion, this depends more on your requirements and the volume of data you will have than on the number of users (which is also a requirement). Hadoop (meaning the ecosystem: Hive/Impala, HBase, MapReduce, Spark, etc.) works great with large volumes of data (GB/TB per day) and scales horizontally very well.

In the Big Data environments I have worked with, I always used Hadoop HDFS to store the raw data and analyzed it on the distributed file system with Apache Spark. The results were stored in a database system such as MongoDB to serve low-latency queries or fast aggregates to many concurrent users. Then we used Impala for on-demand analytics. The main issue when using so many technologies is scaling the infrastructure and the resources given to each of them well. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it is not a good idea to place a MongoDB instance on the same machine.
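A rough sketch of that kind of pipeline (not our production code; it assumes the MongoDB Spark connector is on the classpath, and the format and option names vary between connector versions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()

    // 1. Raw data lives on HDFS.
    val events = spark.read.parquet("hdfs:///raw/events")

    // 2. The heavy aggregation runs distributed in Spark, not in the database.
    val daily = events.groupBy("userId", "day").count()

    // 3. Only the small, query-ready result lands in MongoDB, where the
    //    application can read it with low latency.
    daily.write
      .format("mongodb") // "mongo" or "com.mongodb.spark.sql" in older connectors
      .option("connection.uri", "mongodb://mongo-host")
      .option("database", "analytics")
      .option("collection", "daily_counts")
      .mode("overwrite")
      .save()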

I would also suggest a graph database, since you are building a social network; but I have no experience with those...

+1

Do you want to stay purely open source? If you plan to go enterprise at some point, many of the enterprise distributions of Hadoop include Spark analytics bundled in.

I have a bias, but there is also the DataStax Enterprise product, which integrates Cassandra, Hadoop, Spark, Apache Solr, and other components. It is used by many large Internet companies, particularly for the kinds of applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

You will want to think about how you will host this as well.

If you stay in the cloud, you do not have to choose; you can (depending on your cloud environment, but with AWS, for example) use Spark for continuous batch processing, Hadoop MapReduce for long-timeline jobs (analysis of data accumulated over a long period), and so on, since storage is decoupled from collection and processing. Put the data in S3 and process it later with whatever engine you need.
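For example (a sketch; the bucket is a placeholder, and it assumes S3A credentials are configured), the same data in S3 can later be read by whichever engine fits the job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3-sketch").getOrCreate()

    // Ingestion wrote raw events to S3 earlier; Spark, MapReduce, Hive,
    // etc. can each process them later, independently of collection.
    val events = spark.read.json("s3a://my-bucket/events/2015/06/")
    events.groupBy("type").count().show()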

If you host your own hardware, building a Hadoop cluster gives you the ability to mix hardware (heterogeneous hardware is supported by the framework), a reliable and flexible storage platform, and a mix of analysis tools, including HBase and Hive, plus ports of most of the other things you mentioned, such as Spark on Hadoop (not a port, actually Spark's original design). It is probably the most versatile platform and can be deployed and expanded cheaply, since the hardware does not have to be the same for every node.

If you self-host, going with one of the other cluster options will tie you to hardware requirements that can be difficult to scale up later.

+1

We use Spark + HBase + Apache Phoenix + Kafka + ElasticSearch, and scaling has been easy so far.

* Phoenix is a JDBC driver for HBase; it lets you use java.sql with HBase, Spark (via JdbcRDD), and ElasticSearch (through the JDBC river), which really simplifies integration.
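A minimal sketch of what that looks like (the ZooKeeper host and table are invented; it assumes the Phoenix client jar is on the classpath):

    import java.sql.DriverManager

    // Phoenix exposes HBase tables through a standard JDBC URL that
    // points at the ZooKeeper quorum backing the HBase cluster.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT user_id, likes FROM user_metrics LIMIT 10")
    while (rs.next()) {
      println(s"${rs.getString("user_id")} -> ${rs.getLong("likes")}")
    }
    rs.close(); stmt.close(); conn.close()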

+1
