Using Apache Spark with HDFS and other distributed storage

The Spark Frequently Asked Questions section says that you don't need to use HDFS:

Do I need Hadoop to run Spark?

No, but if you run a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of file system, you can just deploy Spark in standalone mode.

So, what are the advantages / disadvantages of using Apache Spark with HDFS versus other distributed file systems (e.g. NFS) if I do not plan to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for node storage (for checkpointing, shuffle spill, etc.)?
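For concreteness, this is roughly how I would point Spark at either backend; a minimal sketch, where the master URL, hostnames and paths are all made up:

```scala
import org.apache.spark.sql.SparkSession

object StorageBackends {
  def main(args: Array[String]): Unit = {
    // Standalone Spark cluster; the master URL and all paths below are hypothetical.
    val spark = SparkSession.builder()
      .appName("nfs-vs-hdfs")
      .master("spark://master-host:7077")
      .getOrCreate()

    // Shared filesystem (e.g. NFS mounted at the same path on every node):
    // plain file:// URLs work as long as every worker sees the same mount point.
    val fromNfs = spark.read.textFile("file:///mnt/shared/input/events.log")

    // HDFS: same API, only the URL scheme and namenode address change.
    val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/input/events.log")

    println(s"NFS lines: ${fromNfs.count()}, HDFS lines: ${fromHdfs.count()}")
    spark.stop()
  }
}
```

So the read/write API is the same either way; the question is what the cluster gains or loses underneath.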

nfs apache-spark
2 answers

After several months of experience with both NFS and HDFS, I can now answer my own question:

NFS allows you to view and modify files on remote machines as if they were stored on the local machine. HDFS can do this too, but it is distributed (unlike NFS), as well as fault tolerant and scalable.

The advantage of NFS is its ease of configuration, so I would probably use it for QA environments or small clusters. The advantage of HDFS is, of course, its fault tolerance, but the bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark worker nodes, which gives better performance for checkpointing, shuffle spill, etc.
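To illustrate where that locality shows up, here is a minimal sketch of pointing the checkpoint directory at HDFS versus a shared NFS mount; hostnames and paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointLocality {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-locality")
      .getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data is written to whatever filesystem this directory lives on.
    // With HDFS co-located with the Spark workers, blocks land largely on local
    // disks and later reads can be scheduled for locality; with NFS, every read
    // and write goes over the network to the filer.
    sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")
    // Alternative for a small or QA cluster with a shared NFS mount (hypothetical path):
    // sc.setCheckpointDir("file:///mnt/shared/spark/checkpoints")

    val rdd = sc.textFile("hdfs://namenode:8020/data/input/events.log")
      .map(_.toLowerCase)
    rdd.checkpoint()     // truncates the lineage; data goes to the checkpoint dir
    println(rdd.count()) // action that materializes (and checkpoints) the RDD

    spark.stop()
  }
}
```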


Scale-out NAS file systems, such as Qumulo, have even more advanced availability features than HDFS: they scale like HDFS, use raw capacity more efficiently thanks to erasure coding rather than simple replication, and offer snapshots, failover / failback, a backup API, etc.

