Using Apache Spark with HDFS and other distributed storage

The Spark Frequently Asked Questions section says that you don't need to use HDFS:

Do I need Hadoop to run Spark?

No, but if you run a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of file system, you can just deploy Spark in standalone mode.

So, what are the advantages / disadvantages of using Apache Spark with HDFS versus other distributed file systems (e.g. NFS) if I do not plan to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for node storage (for checkpointing, shuffle spill, etc.)?
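For concreteness, this is roughly how I would point Spark at either backend; a minimal sketch, where the master URL, hostnames and paths are all made up:

```scala
import org.apache.spark.sql.SparkSession

object StorageBackends {
  def main(args: Array[String]): Unit = {
    // Standalone Spark cluster; the master URL and all paths below are hypothetical.
    val spark = SparkSession.builder()
      .appName("nfs-vs-hdfs")
      .master("spark://master-host:7077")
      .getOrCreate()

    // Shared filesystem (e.g. NFS mounted at the same path on every node):
    // plain file:// URLs work as long as every worker sees the same mount point.
    val fromNfs = spark.read.textFile("file:///mnt/shared/input/events.log")

    // HDFS: same API, only the URL scheme and namenode address change.
    val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/input/events.log")

    println(s"NFS lines: ${fromNfs.count()}, HDFS lines: ${fromHdfs.count()}")
    spark.stop()
  }
}
```

So the read/write API is the same either way; the question is what the cluster gains or loses underneath.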

nfs apache-spark
2 answers

After several months of experience with both NFS and HDFS, I can now answer my own question:

NFS allows you to view and modify files on remote machines as if they were stored on the local machine. HDFS can do this too, but it is distributed (unlike NFS), as well as fault tolerant and scalable.

The advantage of NFS is its ease of configuration, so I would probably use it for QA environments or small clusters. The advantage of HDFS is, of course, its fault tolerance, but the bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark worker nodes, which gives better performance for checkpointing, shuffle spill, etc.
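To illustrate where that locality shows up, here is a minimal sketch of pointing the checkpoint directory at HDFS versus a shared NFS mount; hostnames and paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointLocality {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-locality")
      .getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data is written to whatever filesystem this directory lives on.
    // With HDFS co-located with the Spark workers, blocks land largely on local
    // disks and later reads can be scheduled for locality; with NFS, every read
    // and write goes over the network to the filer.
    sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")
    // Alternative for a small or QA cluster with a shared NFS mount (hypothetical path):
    // sc.setCheckpointDir("file:///mnt/shared/spark/checkpoints")

    val rdd = sc.textFile("hdfs://namenode:8020/data/input/events.log")
      .map(_.toLowerCase)
    rdd.checkpoint()     // truncates the lineage; data goes to the checkpoint dir
    println(rdd.count()) // action that materializes (and checkpoints) the RDD

    spark.stop()
  }
}
```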


Scale-out NAS file systems, such as Qumulo, have even more advanced availability features than HDFS: they scale like HDFS, use raw capacity more efficiently thanks to erasure coding rather than simple replication, and offer snapshots, failover / failback, a backup API, etc.

