Run reduceByKey on huge data in sparks

Question

Run reduceByKey on huge data in sparks

I run reduceByKey in sparks. My program is the simplest example of a spark:

val counts = textFile.flatMap(line => line.split(" ")).repartition(20000). .map(word => (word, 1)) .reduceByKey(_ + _, 10000) counts.saveAsTextFile("hdfs://...")

but he always runs out of memory ...

I use 50 servers, 35 artists per server, 140 GB of memory per server.

volume of documents: 8 TB documents, 20 billion documents, a total of 1,000 billion words. and the words after the reduction will be about 100 million.

I wonder how to set the spark configuration?

I wonder what value these parameters should have?

 1. the number of the maps ? 20000 for example? 2. the number of the reduces ? 10000 for example? 3. others parameters?

+5

apache-spark

user2848932 Jun 30 '15 at 16:58

source share

2 answers

Holden · Answer 1 · 2015-06-30T21:48:40+0000

It would be useful if you placed the logs, but one of them is to specify a larger number of sections when reading in the original text file (for example, sc.textFile(path, 200000) ), rather than re-splitting after reading. Another important thing is to make sure your input file is shared (some compression options make it not shared, in which case Spark might have to read it on the same machine that calls OOM).

Some other parameters, since you are not caching any data, would reduce the amount of memory that Spark allocates for caching (controlled by spark.storage.memoryFraction ), since you only work with string tuples, I would recommend using the org.apache.spark.serializer. KryoSerializer serializer org.apache.spark.serializer. KryoSerializer org.apache.spark.serializer. KryoSerializer .

KyBe · Answer 2 · 2017-05-07T16:09:05+0000

You tried to use partionner , this can help reduce the number of keys on a node, if we assume that the weight of keywords is on average 1ko, it means 100Go of memory exclusively for keys on a node.With separation, you can approximately reduce the number of keys on a node by the number of nodes, reducing accordingly the necessary amount of memery on node. The spark.storage.memoryFraction option mentioned by @Holden is also a key factor.

Run reduceByKey on huge data in sparks

More articles: