Spark-ing a large file

This may be a dumb question. I want to make sure that I get it right.

When you read a large file (400GB) into a cluster where the collective executor memory is only about 120GB, Spark seems to read forever: it does not crash, but it never starts the first map task either.

My guess at what is happening: Spark reads the large file as a stream and starts discarding old lines once the executors run out of memory. That would obviously be a problem when the .map code starts executing, because the executor JVM would have to read the file again from the beginning. What I am wondering is whether Spark instead spills the data onto disk, similar to the shuffle spill mechanism.

Note that I am not referring to caching. This is about the initial read using sc.textFile(filename).
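
For concreteness, a minimal PySpark sketch of the kind of job I mean (the file path and the map function are just placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="large-file-read")

# Hypothetical ~400GB input; the path and parsing logic are illustrative only.
lines = sc.textFile("hdfs:///data/huge_input.txt")
parsed = lines.map(lambda line: line.split(","))

# From my side, the job only appears to sit "reading forever" once an action like this runs.
parsed.count()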

+7
1 answer

sc.textFile does not start any reading. It simply defines a driver-resident data structure that can be used for further processing.

It is not until an action is invoked on the RDD that Spark builds a strategy to perform all the required transformations (including the read) and then returns the result.
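
A quick way to see this laziness from a PySpark shell (the path is a placeholder):

rdd = sc.textFile("hdfs:///data/huge_input.txt")  # returns immediately; nothing is read yet
rdd.getNumPartitions()                            # only inspects split metadata
rdd.count()                                       # the actual scan of the file happens here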

If an action is called that runs the sequence, and the next transformation after the read is a map, then Spark only needs to read a small section of the file's lines at a time (according to the partitioning strategy and the number of cores) and can start mapping it immediately, until it has to return a result to the driver or shuffle before the next stage of transformations.

If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the Java representation of a partition (an InputSplit in HDFS terms) is bigger than the available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing the file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:

sc.textFile(file, numPartitions)
  .count()  
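
One way to arrive at numPartitions under that sizing rule (all numbers here are illustrative assumptions, not recommendations):

file_size_bytes = 400 * 1024**3          # ~400GB input
target_partition_bytes = 128 * 1024**2   # aim for ~128MB per partition, leaving headroom
num_partitions = file_size_bytes // target_partition_bytes   # 3200 partitions here

sc.textFile("hdfs:///data/huge_input.txt", num_partitions).count()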

Also see this related question: when running reduceByKey, will Spark execute the map at the same time, or wait until the map has finished?

+12
