What does "Stage" mean in the Spark logs?

When I start a job with Spark, I get logs like the following:

[Stage 0:> (0 + 32) / 32]

Here, 32 corresponds to the number of RDD partitions I requested.
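
For reference, here is a minimal sketch (names are illustrative, not my exact code) of the kind of job that produces that progress bar by asking for 32 partitions:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-demo")  # illustrative app name

    # Ask for 32 partitions; Spark runs one task per partition, so the
    # console progress bar reports (finished + running) / total tasks.
    rdd = sc.parallelize(range(1000000), numSlices=32)

    rdd.count()  # triggers a job; the console shows [Stage 0:> (0 + 32) / 32]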

However, I do not understand why there are several stages and what exactly happens at each stage.

Each stage seems to take a lot of time. Is it possible to avoid having the work split into several stages?

mapreduce apache-spark pyspark apache-spark-sql
1 answer

A stage in Spark is a segment of the computation DAG whose tasks can run without moving data between executors. Stages are split at operations that require shuffling data, so in the Spark UI you will see each stage labeled with the operation that ends it. If you are using Spark 1.4+, you can even visualize this in the UI's DAG visualization section:

[Spark UI screenshot: DAG visualization showing the job split into two stages at reduceByKey]

Note that the split occurs at reduceByKey, which requires a shuffle to complete the full computation.
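
As a concrete illustration (a minimal word-count sketch, not necessarily the job from the question), reduceByKey forces a shuffle, so the job below runs as two stages; toDebugString() prints the lineage where the boundary falls:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-boundary-demo")  # illustrative app name

    words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=4)

    # map is a narrow transformation: it is pipelined together with
    # parallelize inside the first stage, with no data movement.
    pairs = words.map(lambda w: (w, 1))

    # reduceByKey must gather all values for each key onto one partition,
    # so Spark inserts a shuffle here and starts a second stage.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # The lineage shows a ShuffledRDD at the stage boundary.
    print(counts.toDebugString())
    print(counts.collect())  # e.g. [('a', 3), ('b', 2), ('c', 1)]

Anything chained after the shuffle (another map, a filter, and so on) is pipelined into that second stage; only a new shuffle starts a third one.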
