What does storing data in memory mean in the context of Apache Spark?

I read that Apache Spark stores data in memory. However, Apache Spark is designed to analyze huge amounts of data (i.e., big data analytics). In this context, what does storing data in memory really mean? Is the amount of data it can work with limited by the available RAM? How does its storage compare with Apache Hadoop's use of HDFS?

hadoop apache-spark
1 answer

In Hadoop, data is saved to disk between stages, so a typical multi-step job looks something like this:

hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs

This is a brilliant design, and it makes sense when you are batch-processing files that fit the MapReduce model well. But for some workloads it can be very slow; iterative algorithms are hit especially hard. You have taken the time to build some data structure (for example, a graph), and all you want to do on each iteration is update a score. Writing the entire graph to disk and reading it back on every pass will slow your job down.

Spark uses a more general execution engine that supports cyclic data flows and will try to keep things in memory between the steps of a job. That means that if you can design a data structure and partitioning strategy so that your data is not shuffled between steps, you can update it efficiently without serializing and writing everything to disk in between. This is why Spark's front page features a chart showing a 100x speed-up on logistic regression.
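Here is a minimal Scala sketch (not from the original answer) of that iterative pattern, assuming a hypothetical edge list at hdfs:///data/edges with "src dst" lines. The point is that cache() keeps the graph in memory and co-partitioning keeps the large links RDD from being shuffled on every iteration:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object IterativeScores {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-scores"))

    // Build the graph once, partition it, and cache it so every iteration
    // reuses the same in-memory copy instead of re-reading it from disk.
    val partitioner = new HashPartitioner(100)
    val links = sc.textFile("hdfs:///data/edges")   // hypothetical input path
      .map { line => val Array(src, dst) = line.split("\\s+"); (src, dst) }
      .groupByKey(partitioner)
      .cache()

    var ranks = links.mapValues(_ => 1.0)           // initial score per node

    for (_ <- 1 to 10) {
      // links and ranks share the same partitioner, so this join does not
      // shuffle the large cached graph between steps.
      val contribs = links.join(ranks).values.flatMap { case (neighbours, rank) =>
        neighbours.map(n => (n, rank / neighbours.size))
      }
      ranks = contribs.reduceByKey(partitioner, _ + _).mapValues(v => 0.15 + 0.85 * v)
    }

    ranks.saveAsTextFile("hdfs:///data/ranks")
    sc.stop()
  }
}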

If you write a Spark job that simply computes a value from each input line of your dataset and writes it back to disk, Hadoop and Spark will be roughly equal in performance (Spark's start-up time is faster, but that hardly matters when you spend hours processing the data in a single pass).
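For concreteness, a one-pass job of that kind might look like the sketch below (the paths and the transformation are made up for illustration); each line is read once, transformed once, and written once, so there is nothing for Spark to keep in memory:

import org.apache.spark.{SparkConf, SparkContext}

object OnePassJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("one-pass"))

    // A single read -> map -> write pass: no data is reused, so caching
    // buys nothing and an equivalent MapReduce job performs about the same.
    sc.textFile("hdfs:///data/input")
      .map(_.toLowerCase)
      .saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}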

If Spark cannot hold an RDD in memory between steps, it can spill it to disk, much as Hadoop does. But remember that Spark is not a silver bullet: there are corner cases where you have to fight Spark's in-memory nature, which causes OutOfMemory problems where Hadoop would simply write everything to disk.
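If you want Spark to spill a cached dataset to local disk rather than keep it memory-only, you can choose the storage level explicitly. A small sketch, assuming a hypothetical dataset at hdfs:///data/huge:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillableCache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spillable-cache"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), where
    // partitions that do not fit in RAM are recomputed when needed.
    // MEMORY_AND_DISK spills those partitions to local disk instead,
    // trading some speed for predictability on data larger than memory.
    val records = sc.textFile("hdfs:///data/huge")
      .map(line => (line.take(8), line))          // hypothetical key: first 8 chars
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(records.count())                      // first action materialises the cache
    println(records.keys.distinct().count())      // second action reuses the persisted data

    sc.stop()
  }
}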

I personally like to think of it this way: on your cluster of 500 machines with 64 GB of RAM each, Hadoop is built to batch-process your 500 TB job efficiently by spreading the disk reads and writes across the cluster. Spark takes advantage of the fact that 500 * 64 GB = 32 TB of memory lets it solve many of your problems entirely in memory!
