In Hadoop, data is saved to disk between stages, so a typical multi-step job looks something like this:
hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs
This is a brilliant design, and it makes perfect sense when you are batch processing files that fit the map-reduce pattern well. But for some workloads it can be extremely slow, and iterative algorithms are hit especially hard. You have spent time building some data structure (a graph, for instance), and all you want to do in each step is update a score. Persisting the entire graph to disk and reading it back on every step will slow your job down.
Spark uses a more general engine that supports cyclic data flows, and it will try to keep things in memory between job steps. This means that if you can create a data structure and partitioning strategy where your data does not shuffle around between steps of your job, you can update it efficiently without serializing and writing everything to disk between steps. That is why Spark has a chart on its front page showing a 100x speedup on logistic regression.
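To make the difference concrete, here is a toy sketch in plain Python. It does not use the Hadoop or Spark APIs; the function names (`update_scores`, `run_with_disk_round_trips`, `run_in_memory`) and the tiny three-node graph are made up for illustration. It only shows the access pattern: one loop serializes and reloads the whole graph on every step, the other keeps it in memory and only updates the scores.

```python
import os
import pickle
import tempfile


def update_scores(graph, scores):
    """One iteration: a node's new score is the mean of its neighbours' scores."""
    return {
        node: sum(scores[n] for n in neighbours) / len(neighbours)
        for node, neighbours in graph.items()
    }


def run_with_disk_round_trips(graph, scores, steps):
    """Hadoop-like pattern: persist all state after each step, reload it before the next."""
    bytes_written = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.pkl")
        for _ in range(steps):
            scores = update_scores(graph, scores)
            # The whole graph is written out, even though only the scores changed.
            with open(path, "wb") as f:
                pickle.dump((graph, scores), f)
            bytes_written += os.path.getsize(path)
            with open(path, "rb") as f:
                graph, scores = pickle.load(f)
    return scores, bytes_written


def run_in_memory(graph, scores, steps):
    """Spark-like pattern: the graph stays in memory, only the scores are replaced."""
    for _ in range(steps):
        scores = update_scores(graph, scores)
    return scores


# Hypothetical example data: adjacency lists and initial scores.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
initial = {"a": 1.0, "b": 0.0, "c": 0.0}

disk_scores, bytes_written = run_with_disk_round_trips(graph, initial, steps=10)
memory_scores = run_in_memory(graph, initial, steps=10)
```

Both loops compute the same scores; the only difference is that the first one pays for serializing and re-reading the full state on every step, which is the cost that grows painful once the graph is large.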
If you write a Spark job that simply computes a value from each input line in your dataset and writes it back to disk, Hadoop and Spark will be pretty much equal in terms of performance (startup time is faster in Spark, but that hardly matters when you spend hours processing data in a single step).
If Spark cannot hold an RDD in memory between steps, it will spill it to disk, much like Hadoop does. But remember that Spark is not a silver bullet, and there will be corner cases where you have to fight Spark's in-memory nature and end up with OutOfMemory problems, where Hadoop would just write everything to disk.
I personally like to think of it this way: on your cluster of 500 machines with 64 GB of memory each, Hadoop is designed to batch process your 500 TB job faster by distributing disk reads and writes. Spark takes advantage of the fact that 500 * 64 GB = 32 TB of memory can likely solve quite a few of your other problems entirely in memory!
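The arithmetic behind that intuition, as a quick check. The variable names are mine, and I am assuming decimal units (1 TB = 1000 GB), which is what makes 500 * 64 GB come out at 32 TB:

```python
machines = 500
ram_per_machine_gb = 64
dataset_tb = 500

# Aggregate cluster memory, assuming decimal units (1 TB = 1000 GB).
total_ram_tb = machines * ram_per_machine_gb / 1000

# Share of the 500 TB batch dataset that could sit in memory at once.
fraction_in_memory = total_ram_tb / dataset_tb

print(f"total cluster memory: {total_ram_tb} TB")
print(f"share of the 500 TB job that fits in memory: {fraction_in_memory:.1%}")
```

So the 500 TB job itself is far too big to hold in memory (only a few percent would fit), but any working set up to roughly 32 TB is a candidate for staying in memory across steps.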
jkgeyti