Spark is a distributed processing engine, and HDFS is a distributed storage system.
If HDFS is not an option, Spark can use an alternative store such as Apache Cassandra or Amazon S3.
Here is a quick comparison:

S3 - suited to non-urgent batch jobs; a good fit for use cases where data locality is not critical.
Cassandra - ideal for streaming data analysis, but overkill for plain batch jobs.
HDFS - a great fit for batch jobs, without sacrificing data locality.
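Whichever store you pick, the choice mostly surfaces in your Spark job as a different URI scheme or data source. A minimal sketch (host names, bucket, keyspace, and table names are placeholders; the Cassandra read assumes the DataStax spark-cassandra-connector package is on the classpath, and the S3 read assumes the s3a connector is configured):

```scala
import org.apache.spark.sql.SparkSession

object StorageBackends {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("storage-backends-sketch")
      .getOrCreate()

    // HDFS: reads are data-local when executors run on the datanodes
    val fromHdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

    // S3: same API, but every read goes over the network
    val fromS3 = spark.read.parquet("s3a://my-bucket/data/events")

    // Cassandra: via the spark-cassandra-connector data source
    val fromCassandra = spark.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "analytics")
      .option("table", "events")
      .load()

    spark.stop()
  }
}
```

Because the read API is uniform, switching backends later is mostly a matter of changing the path or format, not rewriting the job.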
When should you use HDFS as the storage layer for distributed Spark processing?
If you already have a large Hadoop cluster and want analytics on the data it holds, Spark can run on that existing cluster. This reduces development time.
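Reusing an existing Hadoop cluster usually just means pointing Spark at the cluster's configuration and submitting to YARN. A sketch (the config path, class name, and jar name are placeholders):

```shell
# Let Spark pick up the cluster's HDFS and YARN settings
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Submit the job to the existing YARN resource manager
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar
```

No separate Spark cluster is needed; YARN allocates the executors on the Hadoop nodes, which is also what makes data-local reads from HDFS possible.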
Spark is an in-memory compute engine. Since the data cannot always fit in memory, some operations must spill to disk, and in that case Spark benefits from running alongside HDFS. Spark's record-setting Terasort benchmark run, for example, used HDFS for storage.
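This spill behavior is visible directly in the caching API: a dataset persisted with MEMORY_AND_DISK keeps the partitions that fit in memory and writes the rest to disk. A minimal sketch (the row count is arbitrary, chosen only to illustrate a dataset that may exceed memory):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-sketch")
      .master("local[*]")
      .getOrCreate()

    // Partitions that do not fit in memory are spilled to disk
    // instead of failing the job or being recomputed from scratch
    val big = spark.range(0, 100000000L)
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(big.count())
    spark.stop()
  }
}
```

With the default MEMORY_ONLY level, partitions that don't fit are simply dropped and recomputed when needed, so MEMORY_AND_DISK trades disk I/O for recomputation.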
HDFS is a scalable, reliable, and fault-tolerant distributed file system (highly available starting with Hadoop 2.x). Thanks to the data-locality principle, processing speed is improved.
It is best suited for batch processing workloads.
Ravindra babu