How does the Scala compiler handle unused variable values?

Using Scala and Spark, I have the following construction:

    val rdd1: RDD[String] = ...
    val rdd2: RDD[(String, Any)] = ...

    val rdd1pairs = rdd1.map(s => (s, s))
    val result = rdd2.join(rdd1pairs)
                     .map { case (_: String, (e: Any, _)) => e }

The purpose of mapping rdd1 to a PairRDD is to join it with rdd2 in the next step. However, I am only interested in the values of rdd2, hence the map step on the last line, which discards the keys. In effect, this is the intersection between rdd2 and rdd1, performed with Spark's join() for efficiency reasons.

My question concerns the keys of rdd1pairs: they are created purely for syntactic reasons (to allow the join) in the first map step and are subsequently discarded without ever being used. How does the compiler handle this? Does it matter in terms of memory consumption that I use s (as in the example)? Should I replace it with null or 0 to save a little memory? Does the compiler actually create and store these objects (references), or does it notice that they are never used?

1 answer

In this case it seems to me that it is Spark, not the compiler, that determines the outcome. Whether Spark can optimize the execution pipeline to avoid creating the redundant copy of s, I am not sure, but I believe Spark will materialize rdd1pairs in memory as written.

Instead of mapping to (String, String) you could use (String, Unit):

    rdd1.map(s => (s, ()))
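For context, here is a sketch of what the full pipeline from the question might look like with Unit keys (it reuses the rdd1 and rdd2 names from the question and is otherwise an untested sketch):

    // Same join as in the question, but carrying () instead of a second copy of s
    val rdd1pairs = rdd1.map(s => (s, ()))            // RDD[(String, Unit)]
    val result = rdd2.join(rdd1pairs)                 // RDD[(String, (Any, Unit))]
                     .map { case (_, (e, _)) => e }   // keep only the rdd2 values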

What you are doing is basically a filter of rdd2 based on rdd1. If rdd1 is significantly smaller than rdd2, another option would be to represent the rdd1 data as a broadcast variable rather than an RDD, and simply filter rdd2. This avoids any shuffle or reduce phase, so it may be faster, but it only works if the data in rdd1 is small enough to fit on each node.
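A rough sketch of that broadcast-based filter (sc stands for the SparkContext; the variable names are made up for illustration, and this assumes rdd1's contents fit in driver and executor memory):

    // Collect the small RDD to the driver and broadcast it as a Set to all executors
    val keySet = sc.broadcast(rdd1.collect().toSet)
    // Filter rdd2 locally on each node (no shuffle), then keep only the values
    val result = rdd2.filter { case (k, _) => keySet.value.contains(k) }
                     .map(_._2)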

EDIT:

To see how using Unit rather than String saves space, consider the following examples:

    object size extends App {
      (1 to 1000000).map(i => ("foo" + i, ()))
      val input = readLine("prompt> ")
    }

and

    object size extends App {
      (1 to 1000000).map(i => ("foo" + i, "foo" + i))
      val input = readLine("prompt> ")
    }

Using the jstat command, as described in this question (How do I test the heap of a running JVM from the command line?), the first version uses significantly less heap than the second.

EDIT 2:

Unit is in fact a singleton object with no contents, so logically it should not require any serialization. The fact that the type definition contains Unit tells you everything you need in order to deserialize a structure that has a field of type Unit.

By default, Spark uses Java serialization. Consider the following:

    object Main extends App {
      import java.io.{ObjectOutputStream, FileOutputStream}

      case class Foo(a: String, b: String)
      case class Bar(a: String, b: String, c: Unit)

      val str = "abcdef"
      val foo = Foo("abcdef", "xyz")
      val bar = Bar("abcdef", "xyz", ())

      val fos = new FileOutputStream("foo.obj")
      val fo  = new ObjectOutputStream(fos)
      val bos = new FileOutputStream("bar.obj")
      val bo  = new ObjectOutputStream(bos)

      fo writeObject foo
      bo writeObject bar
    }

The two files are the same size:

    sr Main$Foo3 , z \ L at Ljava/lang/String;L bq ~ xpt abcdeft xyz 

and

    sr Main$Bar+a!N  b L at Ljava/lang/String;L bq ~ xpt abcdeft xyz 