Efficiency of flatMap vs. map followed by reduce in Spark

I have a text file sherlock.txt containing several lines of text. I load it into the Spark shell using:

val textFile = sc.textFile("sherlock.txt")

My goal is to count the number of words in the file. I came across two alternative ways to do this.

The first uses flatMap:

textFile.flatMap(line => line.split(" ")).count()

The second uses map followed by reduce:

textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)

Both give the correct result. I want to know the differences in time and space complexity between the two implementations above, if there really are any.

Does Scala translate both into the most efficient form?


For what it's worth, I would use map with sum instead:

textFile.map(_.split(" ").size).sum
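One detail to be aware of: sum on an RDD of numbers returns a Double, so if you want an integral count you can convert it back, e.g. (illustrative variable name):

// sum on an RDD yields a Double; convert if you want a Long word count.
val wordCount: Long = textFile.map(_.split(" ").size).sum.toLong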

Regarding complexity, there is no significant difference: both versions have to read the whole input, and line.split(" ") allocates an Array of tokens in both cases, so the per-line work is the same.
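To make the comparison concrete, here is the same pair of pipelines on plain Scala collections (ignoring Spark's distribution; the input is illustrative):

val lines = Seq("to be or not to be", "that is the question")

// flatMap + count: materialize every token, then count the tokens.
lines.flatMap(_.split(" ")).size                 // 10

// map + reduce: count tokens per line, then add the per-line counts.
lines.map(_.split(" ").length).reduce(_ + _)     // 10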

count is defined as:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum  

where Utils.getIteratorSize is essentially a linear traversal of the Iterator that counts elements one by one, and sum over the resulting per-partition counts is equivalent to:

_.fold(0.0)(_ + _)
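For intuition, a minimal sketch of what an iterator-size helper of that kind boils down to (a simplification, not the exact Spark source):

def getIteratorSize[T](iterator: Iterator[T]): Long = {
  // Walk the iterator once, counting elements: O(n) time, O(1) extra space.
  var count = 0L
  while (iterator.hasNext) {
    iterator.next()
    count += 1L
  }
  count
}

So in both of your versions the cost is dominated by splitting each line and making one linear pass over the results.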
