I am running reduceByKey in Spark. My program is the simplest Spark example:
val counts = textFile
  .flatMap(line => line.split(" "))
  .repartition(20000)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")
but it always runs out of memory ...
I use 50 servers, with 35 executors per server and 140 GB of memory per server.
Data volume: 8 TB of documents, 20 billion documents, about 1,000 billion (1 trillion) words in total. After the reduction there will be roughly 100 million distinct words.
How should I set the Spark configuration, and what values should these parameters have?
1. The number of map tasks? 20000, for example?
2. The number of reduce tasks? 10000, for example?
3. Other parameters?
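For context, here is a minimal, self-contained sketch of where those knobs would sit in the job, assuming the application is launched through spark-submit (which supplies the master URL). The names and values used here (the WordCount object, 4g of executor memory, spark.default.parallelism = 10000) are placeholders derived from the numbers in the question, not tuning recommendations.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Placeholder settings mirroring the numbers in the question (not recommendations):
    val conf = new SparkConf()
      .setAppName("WordCount")
      // 140 GB per server / 35 executors per server = 4 GB per executor
      .set("spark.executor.memory", "4g")
      // Default partition count for shuffles where none is specified explicitly
      .set("spark.default.parallelism", "10000")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile("hdfs://...")

    val counts = textFile
      .flatMap(line => line.split(" "))
      .repartition(20000)            // "number of maps": 20000 partitions over ~8 TB is roughly 400 MB each
      .map(word => (word, 1))
      .reduceByKey(_ + _, 10000)     // "number of reduces": 10000 output partitions

    counts.saveAsTextFile("hdfs://...")
    sc.stop()
  }
}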