I am confused about parallelism in Spark and Scala. I am running an experiment in which I have to read many CSV files from disk, modify/process certain columns, and then write them back to disk.
In my experiments, if I only use SparkContext's parallelize method, it does not seem to affect performance. However, simply using Scala parallel collections (via .par) cuts the time almost in half.
I run my experiments in local mode, passing local[2] as the master for the Spark context.
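For context, here is a minimal sketch of the two approaches I am comparing (the file paths and the column transformation are placeholders, and on Scala 2.13 the parallel collections need the separate scala-parallel-collections module):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source
import java.io.PrintWriter

object ParallelCsvDemo {
  // Placeholder per-file transformation: upper-case the second column of each row
  // and write the result next to the input file.
  def processFile(path: String): Unit = {
    val src = Source.fromFile(path)
    val transformed =
      try src.getLines().toList.map { line =>
        val cols = line.split(",", -1)
        if (cols.length > 1) cols.updated(1, cols(1).toUpperCase).mkString(",") else line
      } finally src.close()
    val writer = new PrintWriter(path + ".out")
    try transformed.foreach(writer.println) finally writer.close()
  }

  def main(args: Array[String]): Unit = {
    val files = Seq("a.csv", "b.csv", "c.csv", "d.csv") // placeholder paths

    // Approach 1: Scala parallel collections -- each file is processed by a
    // thread from the default thread pool inside the driver JVM.
    files.par.foreach(processFile)

    // Approach 2: SparkContext.parallelize -- the list of paths becomes an RDD
    // and each partition's files are processed by a Spark task.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("csv-demo"))
    sc.parallelize(files, numSlices = 2).foreach(path => processFile(path))
    sc.stop()
  }
}
```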
My question is: when should I use Scala parallel collections, and when should I use the Spark context?
scala parallel-processing apache-spark
MARK