How Spark works internally

I have been working with Spark. Spark programs can be written and run in Scala, as well as in Python and Java, and RDDs are used to store the data. But I cannot understand the Spark architecture and how it works internally.

Can someone please explain the Spark architecture, and also how it works internally?

+20
apache-spark
Jun 07 '15 at 8:08
1 answer

I also searched the Internet to learn about the internal components of Spark; below is what I found out, and I thought I would share it here.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
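As a quick illustration (a minimal sketch using the standard RDD API; the file name is just a placeholder), map below is a transformation that only describes a new RDD, while reduce is an action that actually runs the computation and returns a value to the driver:

 val lines = sc.textFile("data.txt")              // base RDD (nothing is read yet)
 val lineLengths = lines.map(line => line.length) // transformation: lazily defines a new RDD
 val totalLength = lineLengths.reduce(_ + _)      // action: triggers the job and returns a value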

Spark converts the chain of RDD transformations into something called a DAG (Directed Acyclic Graph) and then starts the execution.

At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.

  • The DAG scheduler divides operators into stages of tasks. A stage is composed of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; e.g. many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.

  • The stages are passed on to the task scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone / YARN / Mesos). The task scheduler doesn't know about dependencies among the stages.

  • The Worker executes the tasks on the slave.

Let's see how Spark creates a DAG.

At a high level, there are two transformations that can be applied to an RDD, namely narrow transformations and wide transformations. Wide transformations essentially result in stage boundaries.

Narrow transformation - does not require the data to be shuffled across partitions, e.g. map, filter, etc.

Wide transformation - requires the data to be shuffled, e.g. reduceByKey, etc.
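A small sketch to contrast the two (the data and identifiers here are made up purely for illustration):

 val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

 // Narrow transformations: each output partition depends on a single parent partition,
 // so no data needs to move between partitions
 val filtered = pairs.filter { case (_, v) => v > 0 }
 val doubled  = filtered.map { case (k, v) => (k, v * 2) }

 // Wide transformation: values with the same key must be brought together,
 // so a shuffle (and hence a stage boundary) is introduced
 val summed = doubled.reduceByKey(_ + _)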

Let's look at an example of counting how many log messages appear at each severity level,

The following is the log file, where each line starts with the severity level,

 INFO I'm Info message
 WARN I'm a Warn message
 INFO I'm another Info message

and the following Scala code extracts the same,

 val input = sc.textFile("log.txt")
 val splitedLines = input.map(line => line.split(" "))
                         .map(words => (words(0), 1))
                         .reduceByKey{(a,b) => a + b}

This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a; that is the lineage.
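For instance (a tiny sketch, names chosen only for illustration):

 val a = sc.parallelize(1 to 10)  // parent RDD
 val b = a.map(_ * 2)             // b keeps a narrow (one-to-one) dependency on its parent a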

To display the lineage of an RDD, Spark provides a debug method toDebugString(). For example, executing toDebugString() on the splitedLines RDD will output the following,

 (2) ShuffledRDD[6] at reduceByKey at <console>:25 []
  +-(2) MapPartitionsRDD[5] at map at <console>:24 []
     |  MapPartitionsRDD[4] at map at <console>:23 []
     |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
     |  log.txt HadoopRDD[0] at textFile at <console>:21 []

The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). Below is a more diagrammatic view of the DAG graph created from the given RDD.

RDD DAG graph

After building the DAG, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages; the stages are created based on the transformations. The narrow transformations will be grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:

Stages
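Note that nothing has actually executed so far, because everything above is a transformation. A sketch of triggering the two-stage job (collect is just one possible action) and the result for the sample log above:

 val counts = splitedLines.collect()  // action: builds the DAG, schedules both stages, runs the tasks
 counts.foreach(println)              // prints (INFO,2) and (WARN,1) for the sample log above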

The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in textFile. For example, say we have 4 partitions in this example; then 4 sets of tasks will be created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in more detail.

Task execution
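To see (or influence) how many tasks each stage gets, you can inspect or request the number of partitions; a small sketch (the second argument to textFile is the minimum number of partitions):

 val input = sc.textFile("log.txt", 4)  // ask for at least 4 partitions
 println(input.partitions.length)       // number of partitions = number of tasks per stage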

For more information, I suggest you watch the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and the lifetime.

+64
Jun 07 '15 at 8:46


