What are shuffle read and shuffle write in Apache Spark?

Below is a screenshot of the Spark web UI running on port 8080:

[Screenshot of the Spark web UI]

The "Shuffle Read" and "Shuffle Write" options are always empty for this code:

    import org.apache.spark.SparkContext

    object first {
      println("Welcome to the Scala worksheet")

      val conf = new org.apache.spark.SparkConf()
        .setMaster("local")
        .setAppName("distances")
        .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
        .set("spark.executor.memory", "2g")
      val sc = new SparkContext(conf)

      case class User(name: String, features: Vector[Double])

      // Euclidean distance between two users' feature vectors.
      def euclDistance(userA: User, userB: User) = {
        val subElements = (userA.features zip userB.features) map {
          m => (m._1 - m._2) * (m._1 - m._2)
        }
        val summed = subElements.sum
        val sqRoot = Math.sqrt(summed)
        println("value is " + sqRoot)
        ((userA.name, userB.name), sqRoot)
      }

      // Parse a CSV line "id,f1,f2,..." into a User.
      def createUser(data: String) = {
        val splitLine = data.split(",")
        val id = splitLine(0)
        val distanceVector = (splitLine.toList match {
          case h :: t => t
        }).map(m => m.toDouble).toVector
        User(id, distanceVector)
      }

      val dataFile = sc.textFile("c:\\data\\example.txt")
      val users = dataFile.map(m => createUser(m))
      val cart = users.cartesian(users)
      val distances = cart.map(m => euclDistance(m._1, m._2))
      //> distances : org.apache.spark.rdd.RDD[((String, String), Double)] = MappedRDD[4] at map at first.scala:46
      val d = distances.collect
      d.foreach(println)
      //> ((a,a),0.0)
      //| ((a,b),0.0)
      //| ((a,c),1.0)
      //| ((a,),0.0)
      //| ((b,a),0.0)
      //| ((b,b),0.0)
      //| ((b,c),1.0)
      //| ((b,),0.0)
      //| ((c,a),1.0)
      //| ((c,b),1.0)
      //| ((c,c),0.0)
      //| ((c,),0.0)
      //| ((,a),0.0)
      //| ((,b),0.0)
      //| ((,c),0.0)
      //| ((,),0.0)
    }

Why are the "Shuffle Read" and "Shuffle Write" fields empty? Can the code be configured to fill in these fields to understand how

+8
scala apache-spark
2 answers

I believe you need to run the application in cluster/distributed mode to see any Shuffle Read or Shuffle Write values. Typically a shuffle is triggered by a subset of Spark transformations (for example, groupByKey, join, etc.).
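
For example, something like the following (a minimal sketch reusing the sc from the question; the data is made up for illustration) aggregates records by key, which forces a shuffle and should populate those fields:

    // reduceByKey repartitions records by key, which forces a shuffle, so
    // the job splits into two stages and the UI reports Shuffle Write
    // (end of stage 1) and Shuffle Read (start of stage 2).
    import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1)))
    val counts = pairs.reduceByKey(_ + _)    // shuffle boundary
    counts.collect().foreach(println)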

+2

Shuffling means redistributing data across the stages of a Spark job. "Shuffle Write" is the sum of all serialized data written on all executors before transmitting (normally at the end of a stage), and "Shuffle Read" means the sum of serialized data read on all executors at the beginning of a stage.

Your program has only a single stage, triggered by the "collect" operation. No shuffling is required, because you have only a sequence of consecutive map operations, which are pipelined into a single stage.
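
As an illustration (a hedged sketch building on the distances RDD from your code; the grouping key is arbitrary), inserting a transformation that must repartition data by key, such as groupByKey, splits the job into two stages and makes the shuffle columns non-zero:

    // Key each distance record by the first user's name, then group.
    // groupByKey must co-locate all values for a key, so Spark inserts a
    // shuffle: Shuffle Write at the end of the map stage, Shuffle Read at
    // the beginning of the grouping stage.
    import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x

    val byUser = distances
      .map { case ((nameA, nameB), dist) => (nameA, (nameB, dist)) }
      .groupByKey()
    byUser.collect().foreach(println)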

Take a look at these slides: http://de.slideshare.net/colorant/spark-shuffle-introduction

It may also help to read section 5 of the original Spark paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

+17