Understanding parallelism in Spark and Scala

I am confused about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from disk, modify / process certain columns, and then write them back to disk.

In my experiments, if I use SparkContext's parallelize method alone, it does not seem to affect performance at all. However, simply using Scala's parallel collections (through .par) cuts the time almost in half.

I run my experiments in local mode with local[2] as the master for the SparkContext.

My question is: when should I use Scala parallel collections, and when should I use SparkContext's parallelize?
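For concreteness, here is a minimal sketch of the two approaches I am comparing. The file names and the per-file transformation are placeholders, not my actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  // Placeholder per-file work: read a CSV, transform it, write it back out.
  def processFile(path: String): Unit = {
    val lines = scala.io.Source.fromFile(path).getLines().toList
    val transformed = lines.map(_.toUpperCase) // stand-in for the real column logic
    val out = new java.io.PrintWriter(path + ".out")
    try transformed.foreach(out.println) finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val files: List[String] = List("a.csv", "b.csv", "c.csv") // placeholder input list

    // Option 1: Scala parallel collections -- one JVM, multiple threads.
    // (On Scala 2.13+ this needs the separate scala-parallel-collections module.)
    files.par.foreach(processFile)

    // Option 2: Spark -- the same work distributed over local cores (or a cluster).
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
    sc.parallelize(files).foreach(processFile)
    sc.stop()
  }
}
```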

+7
scala parallel-processing apache-spark
2 answers

SparkContext's parallelize makes your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again you probably pay too much overhead for running Spark's task scheduler and all that machinery. Of course, Scala's parallel collections should be faster on a single machine.

See http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections : are your files large enough to be automatically split into several slices? Have you tried setting the number of partitions manually?
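For example, both entry points accept an explicit partition hint (the path and counts below are just placeholders, and `sc` is assumed to be your existing SparkContext):

```scala
// Tune the partition count to the number of cores/tasks you want in flight.
val fromFiles  = sc.textFile("data/*.csv", minPartitions = 8)    // file-based input
val fromMemory = sc.parallelize(1 to 1000000, numSlices = 8)     // in-memory collection
```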

Have you tried running the same Spark job on one core and then on two cores?
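A rough way to compare, assuming a timing helper and made-up paths (this is a sketch, not a proper benchmark harness):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Crude timing helper; a real benchmark should warm up the JVM and repeat runs.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// Same job, first on one core, then on two (paths are placeholders).
for (master <- Seq("local[1]", "local[2]")) {
  val sc = new SparkContext(new SparkConf().setMaster(master).setAppName("core-comparison"))
  timed(master) { sc.textFile("data/*.csv").map(_.toUpperCase).count() }
  sc.stop()
}
```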

Expect the best results from Spark with one really big, uniformly structured file rather than with several smaller files.

+3

SparkContext does additional processing in order to support generality across multiple nodes. This overhead is constant with respect to data size, so it may be negligible for huge data sets. On one node, the overhead will make it slower than Scala's parallel collections.

Use Spark when:

  • You have more than one node
  • You want your job to be ready to scale to multiple nodes
  • The Spark overhead on one node is negligible because the data is huge, so you might as well choose the richer framework
+3
