Understanding parallelism in Spark and Scala

I am confused about parallelism in Spark and Scala. I am running an experiment in which I have to read many (csv) files from disk, modify / process certain columns, and then write them back to disk.

In my experiments, if I use SparkContext's parallelize method alone, it does not seem to affect performance at all. However, simply using Scala's parallel collections (through .par) cuts the time almost in half.

I run my experiments in local mode with local[2] as the master for the SparkContext.

My question is: when should I use Scala parallel collections, and when should I use SparkContext's parallelize?
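For concreteness, here is a minimal sketch of the two approaches I am comparing. The file names and the per-file transformation are placeholders, not my actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  // Placeholder per-file work: read a CSV, transform it, write it back out.
  def processFile(path: String): Unit = {
    val lines = scala.io.Source.fromFile(path).getLines().toList
    val transformed = lines.map(_.toUpperCase) // stand-in for the real column logic
    val out = new java.io.PrintWriter(path + ".out")
    try transformed.foreach(out.println) finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val files: List[String] = List("a.csv", "b.csv", "c.csv") // placeholder input list

    // Option 1: Scala parallel collections -- one JVM, multiple threads.
    // (On Scala 2.13+ this needs the separate scala-parallel-collections module.)
    files.par.foreach(processFile)

    // Option 2: Spark -- the same work distributed over local cores (or a cluster).
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
    sc.parallelize(files).foreach(processFile)
    sc.stop()
  }
}
```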

+7
scala parallel-processing apache-spark
2 answers

SparkContext's parallelize makes your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again you probably pay too much overhead for running Spark's task scheduler and all that machinery. Of course, Scala's parallel collections should be faster on a single machine.

See http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections : are your files large enough to be automatically split into several slices? Have you tried setting the number of partitions manually?
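For example, both entry points accept an explicit partition hint (the path and counts below are just placeholders, and `sc` is assumed to be your existing SparkContext):

```scala
// Tune the partition count to the number of cores/tasks you want in flight.
val fromFiles  = sc.textFile("data/*.csv", minPartitions = 8)    // file-based input
val fromMemory = sc.parallelize(1 to 1000000, numSlices = 8)     // in-memory collection
```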

Have you tried running the same Spark job on one core and then on two cores?
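A rough way to compare, assuming a timing helper and made-up paths (this is a sketch, not a proper benchmark harness):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Crude timing helper; a real benchmark should warm up the JVM and repeat runs.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// Same job, first on one core, then on two (paths are placeholders).
for (master <- Seq("local[1]", "local[2]")) {
  val sc = new SparkContext(new SparkConf().setMaster(master).setAppName("core-comparison"))
  timed(master) { sc.textFile("data/*.csv").map(_.toUpperCase).count() }
  sc.stop()
}
```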

Expect the best results from Spark with one really big, uniformly structured file rather than with several smaller files.

+3

SparkContext does additional processing in order to support generality across multiple nodes. This overhead is constant with respect to data size, so it may be negligible for huge data sets. On one node, the overhead will make it slower than Scala's parallel collections.

Use Spark when:

  • You have more than one node
  • You want your job to be ready to scale to multiple nodes
  • The Spark overhead on one node is negligible because the data is huge, so you might as well choose the richer framework
+3
