Is HDFS required for Spark workloads?

HDFS does not seem to be strictly required, but it is recommended in several places.

To help decide whether the effort of setting up and running HDFS is worth it:

What are the benefits of using HDFS for Spark workloads?

+3
hadoop hdfs apache-spark mesos mesosphere
4 answers

Spark is a distributed processing engine, and HDFS is a distributed storage system.

If HDFS is not an option, Spark can use an alternative such as Apache Cassandra or Amazon S3.

Here is a quick comparison:

S3 - good for non-urgent batch jobs. S3 fits fairly specific use cases where data locality is not critical.

Cassandra - ideal for streaming data analysis, but overkill for batch jobs.

HDFS - great for batch jobs without sacrificing data locality.
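To make the comparison concrete, here is a minimal sketch of how the storage choice shows up in Spark code. It assumes a SparkSession is available and that the hadoop-aws module and spark-cassandra-connector are on the classpath; the namenode address, bucket, keyspace, and table names are placeholders, not anything from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-backends")
  .getOrCreate()

// HDFS: namenode address and path are placeholders.
val fromHdfs = spark.read.text("hdfs://namenode:8020/data/events")

// S3: needs the hadoop-aws module and credentials configured for the s3a:// scheme.
val fromS3 = spark.read.text("s3a://my-bucket/data/events")

// Cassandra: needs the spark-cassandra-connector; keyspace and table are hypothetical.
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events"))
  .load()
```

The point is that only the source changes; the downstream Spark transformations stay the same regardless of which backend you pick.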

When to use HDFS as a storage engine for distributed Spark processing?

  • If you already have a large Hadoop cluster and want to run analytics on that data, Spark can use the existing Hadoop cluster. This reduces development time.

  • Spark is an in-memory compute engine. Since the data does not always fit in memory, for some operations it must be spilled to disk (see the sketch after this list). In that case, Spark benefits from HDFS. Spark's record-setting sort benchmark used HDFS as the storage layer.

  • HDFS is a scalable, reliable, and fault-tolerant distributed file system (highly available starting with Hadoop 2.x). Thanks to data locality, processing speed improves.

  • Best for batch processing jobs.
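As a hedged illustration of the spill-to-disk point above, the following sketch persists a dataset with a storage level that falls back to local disk when partitions do not fit in memory. The input path is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("spill-example").getOrCreate()

// Hypothetical input; it could live on HDFS, S3, or a local path.
val logs = spark.read.text("hdfs://namenode:8020/data/large-logs")

// MEMORY_AND_DISK keeps partitions in memory when possible and
// spills the rest to local disk instead of recomputing them.
logs.persist(StorageLevel.MEMORY_AND_DISK)

println(logs.count())
```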

+2

HDFS (or any distributed file system) greatly simplifies distributing your data. With a local file system, you would have to split and copy the data manually onto the separate nodes and keep track of that distribution yourself when running your jobs. HDFS also handles node failures. And thanks to the integration between Spark and HDFS, Spark knows how the data is distributed and tries to schedule tasks on the same nodes where the required data is located.
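One way to see that locality-aware scheduling in action is to ask an HDFS-backed RDD where its partitions prefer to run. This is only a sketch; the namenode address and path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("locality-example").getOrCreate()
val sc = spark.sparkContext

// Each partition of an HDFS-backed RDD corresponds to one or more HDFS blocks.
val rdd = sc.textFile("hdfs://namenode:8020/data/events")

// preferredLocations reports the hosts holding the block replicas,
// which the scheduler uses to place tasks close to the data.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}
```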

Secondly: what problems exactly did you run into with the instructions?

BTW: If you're just looking for an easy setup on AWS, DC/OS lets you install HDFS with a single command ...

0

The shortest answer is: "No, you do not need it." You can analyze data even without HDFS, but of course you then need to copy the data onto all of your nodes.
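For example, here is a minimal sketch of running without HDFS: the file is read via the local file:// scheme, which only works if the same path exists on the driver and on every worker node (the path below is made up).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-hdfs").getOrCreate()

// file:// paths are resolved on each executor, so the file must be
// copied to the same location on every node beforehand.
val data = spark.read.text("file:///data/shared/input.txt")
println(data.count())
```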

The long answer is rather inconsistent, and I'm still trying to figure it out with the Stack Overflow community.

Edit: fixed the local vs. HDFS error.

0

So, you can go with the Cloudera or Hortonworks distribution and easily install the entire stack. CDH ships with YARN, and I find it harder to configure Mesos on CDH; Hortonworks is much easier to set up.

HDFS is excellent because of its DataNodes and data locality (processing runs where the data is), since shuffling and transferring data is very expensive. HDFS also naturally splits files into blocks, which Spark can use as partitions (128 MB blocks by default; you can change this).
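As a rough sketch of how that block splitting shows up on the Spark side (the path is illustrative, and the actual block size is controlled by HDFS, e.g. via dfs.blocksize): each HDFS block of the input typically becomes one input partition.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("blocks-to-partitions").getOrCreate()
val sc = spark.sparkContext

// A file stored as 128 MB HDFS blocks is split into roughly one partition per block.
val rdd = sc.textFile("hdfs://namenode:8020/data/big-file")
println(s"number of partitions: ${rdd.getNumPartitions}")
```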

You can use S3 and Redshift.

See here: https://github.com/databricks/spark-redshift
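A hedged sketch based on that project's README: it reads a Redshift table through the connector, staging the data in S3. The JDBC URL, table name, and temp directory are placeholders, and the exact option names should be checked against the connector version you use.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-example").getOrCreate()

// Requires the spark-redshift package plus a Redshift JDBC driver on the classpath.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example.cluster:5439/db?user=user&password=pass")
  .option("dbtable", "my_table")            // table to read; a query can be used instead
  .option("tempdir", "s3a://my-bucket/tmp") // S3 staging area used for the unload
  .load()

df.show(5)
```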

-1
