Understanding Spark Serialization

In Spark, how do you know which objects are created on the driver and which are created on the executor, and, therefore, how to determine which classes need to be implemented Serializable?

+23
lambda serialization apache-spark spark-dataframe
source share
2 answers

Serializing an object means converting its state into a byte stream so that the byte stream can be returned back to the copy of the object. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subordinate interface, java.io.Externalizable.

The class is never serialized, only the class object is serialized. Serialization of objects is necessary if the object must be stored or transferred over the network.

Class Component Serialization instance variable yes Static instance variable no methods no Static methods no Static inner class no local variables no 

Take a Spark Code Example and Consider Various Scenarios

 public class SparkSample { public int instanceVariable =10 ; public static int staticInstanceVariable =20 ; public int run(){ int localVariable =30; // create Spark conf final SparkConf sparkConf = new SparkConf().setAppName(config.get(JOB_NAME).set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); // create spark context final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); // read DATA JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD(); // Anonymous class used for lambda implementation JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() { @Override public Iterator<String> call(String s) { // How will the listed varibles be accessed in RDD across driver and Executors System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable); return Arrays.asList(SPACE.split(s)).iterator(); }); // SAVE OUTPUT words.saveAsTextFile(OUTPUT_PATH)); } // Inner Static class for the funactional interface which can replace the lambda implementation above public static class MapClass extends FlatMapFunction<String, String>() { @Override public Iterator<String> call(String s) { System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable); return Arrays.asList(SPACE.split(s)).iterator(); }); public static void main(String[] args) throws Exception { JavaWordCount count = new JavaWordCount(); count.run(); } } 

Availability and Serializability of an instance variable from an external class inside internal class objects

  Inner class | Instance Variable (Outer class) | Static Instance Variable (Outer class) | Local Variable (Outer class) Anonymous class | Accessible And Serialized | Accessible yet not Serialized | Accessible And Serialized Inner Static class | Not Accessible | Accessible yet not Serialized | Not Accessible 

Rule of thumb when understanding Spark:

  • All lambda functions written inside the RDD are created in the driver, and the objects are serialized and sent to the executors.

  • If any external class instance variables are accessed in the inner class, the compiler uses different logic to access them, so the outer class becomes serialized or independent of what you are accessing.

  • In terms of Java, all the considerations concern the Outer class of the vs Inner class and how access to links and variables of the outer class leads to serialization problems.

Various scenarios:

External class The variables available in the Anonymous class are:


Instance variable (outer class)

The default compiler inserts a constructor in byte code

An anonymous class with a reference to an object of the Outer class.

An external class object is used to access an instance variable.

Anonymous Class () {

  final Outer-class reference; Anonymous-class( Outer-class outer-reference){ reference = outer-reference; } 

}

The outer class is serialized and sent along with the serialized object of the inner anonymous class


Static instance variable (outer class)

Because static variables are not serializable, the outer class of the object is still inserted into the constructor of the anonymous class.

The value of a static variable is taken from the state of the class.

is present on this artist.


Local variable (outer class)

The default compiler inserts a constructor in byte code

An anonymous class with a reference to the Outer class object and the local variable refrence.

An external class object is used to access an instance variable.

Anonymous Class () {

  final Outer-class reference; final Local-variable localRefrence ; Anonymous-class( Outer-class outer-reference, Local-variable localRefrence){ reference = outer-reference; this.localRefrence = localRefrence; } 

}

The outer class is serialized, and the local variable object is also

serialized and sent along with the serialized object of an internal anonymous class

Because a local variable becomes an instance member inside an anonymous class, it must be serialized. From an external class perspective, a local variable can never be serialized

----------

External class variables accessed with a static inner class.

Instance variable (outer class)

impossible to access


Local variable (outer class)

impossible to access


Static instance variable (outer class)

Since static variables are not serializable, therefore, an object of the outer class is not serializable.

The value of a static variable is taken from the state of the class.

is present on this artist.

The outer class is not serializable and dispatched with the serialized Static inner class


Glasses for thought:

  • Java Serialization rules are applied to select which class object should be serialized.

  • Use javap -p -c "abc.class" to expand the byte code and see the code generated by the compiler

  • Depending on what you are trying to get in the inner class of the outer class, the compiler generates different byte codes.

  • You do not need to make classes that implement serialization, access to which is possible only on the driver.

  • Any anonymous / static class (all lambda functions are an anonymous class) used in RDD will be created on the driver.

  • Any class / variable used inside the RDD will be instantiated to the driver and sent to the executors.

  • Any variable declared by an instance variable will not be serialized in the driver.

    1. By default, anonymous classes will force you to make the outer class serializable.
    2. Any local variable / object does not have to be serialized.
    3. Only if a local variable is used inside the Anonymous class, should it be serialized
    4. You can create a singleton inside the call () method for a pair, the mapToPair function, so that it never gets initialized on the driver
    5. static variables are never serialized, so they are never sent from the driver to the executors
if you need any service that should be executed only on the executor, make them static fields inside the lambda function or make them transient and singelton and check if there is a null condition for creating them
+38
source share

there are many very well written blogs that explain this very well, for example this one: the challenge of serialization .

but in short, we can conclude as follows (only Spark, not the JVM in general):

  1. because of the JVM, only objects are serialized (functions are objects)
  2. if an object must be serialized, its parent must also be serialized
  3. any Spark operations, such as (map, flatMap, filter, foreachPartition, mapPartition, etc.), if the inner part has a reference to the outer part object, this object must be serialized. Because the objects of the external part are in the Driver, and not in the Executors. And serialization policy refers to my point number 2.
  4. the link to the Scala object (also known as the Scala singleton) will not be serialized (only for mapPartition and foreachPartition, UDF will always receive serde from the driver to the executor). artists will refer directly to their local JVM object since they will exist in the artists JVM. This means that the driver mutation on its local object will not be visible from the executors.
+2
source share

All Articles