What are the options specific to the Executor, Driver, and RDD (regarding Spark's storage levels)?
From the Spark documentation:
Performance impact
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory, Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
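To make the quoted behavior concrete, here is a minimal sketch of a shuffle-heavy job. The input path and word-count layout are hypothetical; the point is that reduceByKey builds in-memory aggregation structures on the map side before records cross the network, and those tables are what get spilled when they no longer fit in memory.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a map-side-combining shuffle (hypothetical input path).
object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-example")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical word-count style workload: each record becomes a (key, 1) pair.
    val pairs = sc.textFile("hdfs:///tmp/input")   // assumed path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))

    // reduceByKey aggregates partially on the map side before shuffling;
    // if the in-memory tables do not fit, Spark spills them to disk.
    val counts = pairs.reduceByKey(_ + _)
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```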
I'm interested in the memory/CPU-core limits for a Spark job vs. the memory/CPU-core limits for Map and Reduce tasks.
Key Hadoop parameters to compare:
yarn.nodemanager.resource.cpu-vcores
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.reduce.shuffle.memory.limit.percent
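A minimal sketch of how these per-task limits can be set programmatically (the values are illustrative, not recommendations); the same keys can equally go into mapred-site.xml or be passed with -D on the command line:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Sketch: setting MapReduce memory/vcore limits in code (illustrative values).
object MapReduceLimits {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("mapreduce.map.memory.mb", "2048")                       // container size per map task
    conf.set("mapreduce.reduce.memory.mb", "4096")                    // container size per reduce task
    conf.set("mapreduce.map.cpu.vcores", "1")                         // vcores per map task
    conf.set("mapreduce.reduce.cpu.vcores", "2")                      // vcores per reduce task
    conf.set("mapreduce.reduce.shuffle.memory.limit.percent", "0.25") // cap on a single in-memory shuffle fetch

    val job = Job.getInstance(conf, "limits-example")
    // ... set mapper/reducer classes and input/output paths here ...
  }
}
```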
Key Spark parameters to compare against the Hadoop ones for equivalence:
spark.driver.memory
spark.driver.cores
spark.executor.memory
spark.executor.cores
spark.memory.fraction
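A minimal sketch of the Spark-side equivalents (values are illustrative). In practice these are usually passed to spark-submit (--driver-memory, --executor-memory, --executor-cores, ...); note that spark.driver.memory only takes effect if set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf, not from inside the application.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: Spark memory/core limits set via SparkConf (illustrative values).
object SparkLimits {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("limits-example")
      .set("spark.driver.memory", "2g")    // driver heap (effective only if set before JVM start)
      .set("spark.driver.cores", "1")      // driver cores (cluster mode)
      .set("spark.executor.memory", "4g")  // heap per executor JVM
      .set("spark.executor.cores", "2")    // concurrent task slots per executor
      .set("spark.memory.fraction", "0.6") // share of heap for execution + storage

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... job logic here ...
    spark.stop()
  }
}
```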
These are just some of the key parameters. See the full list of Spark configuration properties and Hadoop MapReduce parameters for details.
Without the right set of parameters, we cannot compare the performance of jobs across the two technologies.
Ravindra babu