Serializing an object means converting its state into a byte stream so that the byte stream can later be converted back into a copy of the object. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface, java.io.Externalizable.
A class itself is never serialized; only an object (instance) of the class is serialized. Serialization is necessary when an object must be stored or transferred over the network.
Class Component | Serialization
Instance variable | yes
Static instance variable | no
Methods | no
Static methods | no
Static inner class | no
Local variables | no
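To make the first two rows of the table concrete, here is a minimal sketch in plain Java, without Spark. The class names FieldSerializationDemo and Counter are made up for this illustration. It serializes an object, changes the static variable afterwards, and then deserializes: the instance variable comes back from the byte stream, while the static variable simply reflects the current class state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class FieldSerializationDemo {

    // Hypothetical class used only for this illustration
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int instanceVariable = 10;       // object state: written to the byte stream
        static int staticVariable = 20;  // class state: never written to the byte stream
    }

    // Serializes an object into a byte array
    static byte[] serialize(Object obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds an object from a byte array
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {instanceVariable, staticVariable} as seen after the round trip
    static int[] roundTrip() {
        Counter original = new Counter();
        original.instanceVariable = 11;
        Counter.staticVariable = 21;
        byte[] data = serialize(original);

        // Change the class state after the object has been serialized
        Counter.staticVariable = 99;

        Counter copy = (Counter) deserialize(data);
        return new int[] { copy.instanceVariable, Counter.staticVariable };
    }

    public static void main(String[] args) {
        int[] result = roundTrip();
        // prints "instance=11 static=99"
        System.out.println("instance=" + result[0] + " static=" + result[1]);
    }
}
```

The value 21 that the static variable held at serialization time is lost, because it was never part of the stream.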
Take a Spark Code Example and Consider Various Scenarios
    import java.io.Serializable;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.regex.Pattern;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // Serializable because the anonymous class below holds a reference to this object.
    // config, JOB_NAME and OUTPUT_PATH are assumed to be defined elsewhere in the application.
    public class SparkSample implements Serializable {

        private static final Pattern SPACE = Pattern.compile(" ");

        public int instanceVariable = 10;
        public static int staticInstanceVariable = 20;

        public void run(String inputPath) {
            int localVariable = 30;

            // Create the Spark conf
            final SparkConf sparkConf = new SparkConf()
                    .setAppName(config.get(JOB_NAME))
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

            // Create the Spark context
            final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

            // Read the data
            JavaRDD<String> lines = sparkContext.textFile(inputPath);

            // Anonymous class used in place of a lambda
            JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterator<String> call(String s) {
                    // How are these variables made available across the driver and the executors?
                    System.out.println("Output :" + instanceVariable + " " + staticInstanceVariable + " " + localVariable);
                    return Arrays.asList(SPACE.split(s)).iterator();
                }
            });

            // Save the output
            words.saveAsTextFile(OUTPUT_PATH);
        }

        // Static inner class implementing the functional interface; it can replace the anonymous class above
        public static class MapClass implements FlatMapFunction<String, String> {
            @Override
            public Iterator<String> call(String s) {
                // Only the static variable is accessible here; instanceVariable and localVariable would not compile
                System.out.println("Output :" + staticInstanceVariable);
                return Arrays.asList(SPACE.split(s)).iterator();
            }
        }

        public static void main(String[] args) throws Exception {
            SparkSample sample = new SparkSample();
            sample.run(args[0]);
        }
    }
Accessibility and serializability of outer-class variables inside inner-class objects
Inner class | Instance Variable (Outer class) | Static Instance Variable (Outer class) | Local Variable (Outer class)
Anonymous class | Accessible and serialized | Accessible yet not serialized | Accessible and serialized
Inner Static class | Not accessible | Accessible yet not serialized | Not accessible
Rule of thumb when understanding Spark:
All lambda functions passed to RDD operations are instantiated on the driver; the resulting objects are then serialized and sent to the executors.
If an inner class accesses variables of the outer class, the compiler generates different code to reach them, so whether the outer class gets serialized depends on what you access.
In Java terms, the whole discussion is about the outer class vs. the inner class, and how accessing references and variables of the outer class leads to serialization problems.
Various scenarios:
Outer-class variables accessed from an anonymous class:
Instance variable (outer class)
By default, the compiler inserts into the bytecode of the anonymous class a constructor that takes a reference to the outer-class object.
The outer-class object is then used to access the instance variable.
    class AnonymousClass {
        final OuterClass reference;

        AnonymousClass(OuterClass outerReference) {
            reference = outerReference;
        }
    }
The outer-class object is serialized and sent along with the serialized object of the anonymous inner class.
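The consequence can be reproduced without Spark. The following is a small sketch under stated assumptions: Task is a made-up stand-in for a serializable Spark function interface, and Outer is deliberately not Serializable. Because the anonymous class reads an instance variable, it keeps a reference to the Outer object, and Java serialization then fails on that object.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class OuterCaptureDemo {

    // Hypothetical stand-in for a serializable Spark function interface
    interface Task extends Serializable {
        int call();
    }

    // Deliberately NOT Serializable
    static class Outer {
        int instanceVariable = 10;

        Task makeTask() {
            // Reads an instance variable, so the anonymous class keeps a reference to this Outer
            return new Task() {
                @Override
                public int call() {
                    return instanceVariable;
                }
            };
        }
    }

    // Returns the exception message (the offending class name) or null if serialization succeeded
    static String trySerialize(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Task task = new Outer().makeTask();
        System.out.println("Could not serialize: " + trySerialize(task));
    }
}
```

Making Outer implement Serializable lets the write succeed, at the price of shipping the whole outer object together with the function.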
Static instance variable (outer class)
Static variables are not serialized; the outer-class object is nevertheless still passed into the constructor of the anonymous class.
The value of the static variable is taken from the class state that is present on that executor.
Local variable (outer class)
By default, the compiler inserts into the bytecode of the anonymous class a constructor that takes a reference to the outer-class object and a reference to the local variable.
The outer-class object is used to access instance variables; the local variable is accessed through its own copied reference.
    class AnonymousClass {
        final OuterClass reference;
        final LocalVariable localReference;

        AnonymousClass(OuterClass outerReference, LocalVariable localReference) {
            reference = outerReference;
            this.localReference = localReference;
        }
    }
The outer-class object is serialized, and the local variable object is also serialized and sent along with the serialized object of the anonymous inner class.
Because a local variable becomes an instance member inside the anonymous class, it must be serialized. From the outer class's perspective, a local variable can never be serialized.
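One way to see the captured members without reading bytecode is reflection. The sketch below is illustrative only: CapturedFieldsDemo and Outer are made-up names, and java.util.function.Supplier stands in for a Spark function interface. It lists the types of the fields the compiler added to an anonymous class that uses both an instance variable and a local variable.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class CapturedFieldsDemo {

    static class Outer {
        int instanceVariable = 10;

        Supplier<String> makeSupplier(String localVariable) {
            // Uses both an instance variable and a local variable of the enclosing method
            return new Supplier<String>() {
                @Override
                public String get() {
                    return instanceVariable + " " + localVariable;
                }
            };
        }
    }

    // Collects the simple type names of the fields declared by the anonymous class
    static List<String> capturedFieldTypes() {
        Supplier<String> supplier = new Outer().makeSupplier("local");
        List<String> types = new ArrayList<>();
        for (Field field : supplier.getClass().getDeclaredFields()) {
            types.add(field.getType().getSimpleName());
        }
        return types;
    }

    public static void main(String[] args) {
        // Expect one field of type Outer and one of type String (order is not guaranteed)
        System.out.println(capturedFieldTypes());
    }
}
```

With javac these fields are typically named this$0 and val$localVariable, but the names are a compiler detail. Note also that a local variable declared final with a constant initializer may be inlined by the compiler instead of being captured as a field.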
----------
Outer-class variables accessed from a static inner class:
Instance variable (outer class)
impossible to access
Local variable (outer class)
impossible to access
Static instance variable (outer class)
Since static variables are not serialized, no outer-class object is serialized either.
The value of the static variable is taken from the class state that is present on that executor.
The outer class is not serialized and is not sent along with the serialized static inner-class object.
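The static inner class case can also be checked in plain Java. In this sketch (StaticNestedDemo and its members are made-up names), Outer is not Serializable, yet the static nested MapClass serializes without problems because it holds no reference to an Outer object. Changing the static variable between writing and reading stands in, within a single JVM, for the different class state an executor would have.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class StaticNestedDemo {

    // Deliberately NOT Serializable
    static class Outer {
        static int staticVariable = 20;

        // Static nested class: holds no reference to an Outer instance
        static class MapClass implements Serializable {
            private static final long serialVersionUID = 1L;

            int call() {
                return staticVariable;
            }
        }
    }

    // Serializes a MapClass, changes the class state, deserializes and calls the copy
    static int roundTrip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Outer.MapClass());
            }

            // Simulate a different class state on the receiving side
            Outer.staticVariable = 99;

            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                Outer.MapClass copy = (Outer.MapClass) in.readObject();
                return copy.call();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints 99: the value comes from the current class state, not from the stream
        System.out.println(roundTrip());
    }
}
```

On a real cluster each executor is a separate JVM, so the static variable there holds whatever that JVM initialized it to, regardless of assignments made on the driver.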
Points to think about:
- Java serialization rules determine which class objects need to be serialized.
- Use javap -p -c "abc.class" to disassemble the bytecode and inspect the code generated by the compiler.
- Depending on what the inner class accesses from the outer class, the compiler generates different bytecode.
- Classes that are accessed only on the driver do not need to implement Serializable.
- Any anonymous / static class (lambda functions are treated like anonymous classes here) used in an RDD operation is instantiated on the driver.
- Any class / variable used inside an RDD operation is instantiated on the driver and sent to the executors.
- Any instance variable declared transient will not be serialized on the driver.
- By default, anonymous classes will force you to make the outer class serializable.
- A local variable / object does not need to be serializable.
- A local variable needs to be serializable only if it is used inside the anonymous class.
- You can create a singleton inside the call() method of the function passed to mapToPair, so that it never gets initialized on the driver.
- Static variables are never serialized, so they are never sent from the driver to the executors.
- If you need a service that should run only on the executors, make it a static field inside the lambda function, or make it a transient singleton and check for null before instantiating it.
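The last point can be sketched in plain Java as follows. The names LazyServiceDemo, MapTask and ExpensiveService are hypothetical; MapTask stands in for a Spark function object and ExpensiveService for a non-serializable service. The transient field is skipped during serialization, so the deserialized copy starts with null and creates the service on its first call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class LazyServiceDemo {

    // Hypothetical service that is not serializable and should live only where call() runs
    static class ExpensiveService {
        String process(String input) {
            return "processed:" + input;
        }
    }

    // Hypothetical stand-in for a Spark function object
    static class MapTask implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: skipped during serialization, so it is null after deserialization
        private transient ExpensiveService service;

        String call(String input) {
            if (service == null) {
                // Created on first use, on the side that actually executes call()
                service = new ExpensiveService();
            }
            return service.process(input);
        }

        boolean isServiceInitialized() {
            return service != null;
        }
    }

    // Serializes and deserializes the task, as would happen between driver and executor
    static MapTask roundTrip(MapTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (MapTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MapTask copy = roundTrip(new MapTask());
        System.out.println("initialized before call: " + copy.isServiceInitialized());
        System.out.println(copy.call("spark"));
        System.out.println("initialized after call: " + copy.isServiceInitialized());
    }
}
```

A static field initialized lazily inside call() would behave similarly, with one instance per executor JVM rather than one per deserialized function object.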