In this case, I think it is the Spark driver, not the compiler, that determines the outcome: it depends on whether Spark can optimize its execution pipeline to avoid creating a redundant copy of s. I'm not sure, but I think Spark will create rdd1pairs in memory.
Instead of mapping to (String, String), you can map to (String, Unit):
rdd1.map(s => (s,()))
What you are doing is essentially filtering rdd2 based on rdd1. If rdd1 is significantly smaller than rdd2, another approach is to represent the rdd1 data as a broadcast variable rather than an RDD and simply filter rdd2. This avoids any shuffle or reduce phase, so it may be faster, but it only works if the rdd1 data is small enough to fit on each node.
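A minimal sketch of the broadcast approach, assuming both RDDs hold String elements and that the distinct values of rdd1 fit comfortably in memory on the driver and on each executor. The object name, the method name filterByBroadcast, the local master, and the sample data are all hypothetical and only there for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastFilter {
  // Keep only the elements of rdd2 that also occur in rdd1.
  // Assumption: rdd1 is small enough to collect onto the driver.
  def filterByBroadcast(sc: SparkContext,
                        rdd1: RDD[String],
                        rdd2: RDD[String]): RDD[String] = {
    // Collect the small RDD and ship it to every executor once.
    val keys = sc.broadcast(rdd1.collect().toSet)
    // filter is a narrow transformation, so no shuffle is involved.
    rdd2.filter(s => keys.value.contains(s))
  }

  def main(args: Array[String]): Unit = {
    // Local master and sample data are placeholders.
    val conf = new SparkConf().setAppName("broadcast-filter").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("a", "b", "c"))
      val rdd2 = sc.parallelize(Seq("a", "x", "c", "y", "a"))
      filterByBroadcast(sc, rdd1, rdd2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Collecting into a Set also removes duplicates from rdd1, which is what a membership test needs. If rdd1 is too large to collect, this approach does not apply and the pair-RDD join remains the option.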
EDIT:
To see how using Unit rather than String saves space, consider the following examples:
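import scala.io.StdIn.readLine

object size extends App {
  (1 to 1000000).map(i => ("foo" + i, ()))
  // Wait for input so the heap can be inspected while the JVM is running.
  val input = readLine("prompt> ")
}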
and
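import scala.io.StdIn.readLine

object size extends App {
  (1 to 1000000).map(i => ("foo" + i, "foo" + i))
  // Wait for input so the heap can be inspected while the JVM is running.
  val input = readLine("prompt> ")
}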
Inspecting both with the jstat command, as described in the question "How do I test the heap of a running JVM from the command line?", shows that the first version uses significantly less heap than the second.
Edit 2:
Unit is effectively a singleton: it has exactly one value, (), and no content, so logically it does not need to be serialized. The type definition alone, by saying that a field is of type Unit, tells you everything you need to deserialize a structure containing that field.
By default, Spark uses Java serialization. Consider the following:
object Main extends App {
  import java.io.{ObjectOutputStream, FileOutputStream}

  case class Foo(a: String, b: String)
  case class Bar(a: String, b: String, c: Unit)

  val str = "abcdef"
  val foo = Foo("abcdef", "xyz")
  val bar = Bar("abcdef", "xyz", ())

  val fos = new FileOutputStream("foo.obj")
  val fo = new ObjectOutputStream(fos)
  val bos = new FileOutputStream("bar.obj")
  val bo = new ObjectOutputStream(bos)

  fo writeObject foo
  bo writeObject bar

  // Close the streams so that buffered bytes are flushed to disk.
  fo.close()
  bo.close()
}
The two files are the same size:
sr Main$Foo3 , z \ L at Ljava/lang/String;L bq ~ xpt abcdeft xyz
and
sr Main$Bar+a!N b L at Ljava/lang/String;L bq ~ xpt abcdeft xyz
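If you would rather compare the sizes without writing files, the sketch below serializes both values into an in-memory buffer and prints the byte counts. The object name SerializedSize and the helper serializedSize are hypothetical names chosen for this illustration. The exact numbers depend on the class names and on the Scala version, so none are quoted here:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSize {
  case class Foo(a: String, b: String)
  case class Bar(a: String, b: String, c: Unit)

  // Serialize a value with Java serialization and return the number of bytes written.
  def serializedSize(value: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value)
    finally out.close() // close() flushes buffered data into `bytes`
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    println(s"Foo: ${serializedSize(Foo("abcdef", "xyz"))} bytes")
    println(s"Bar: ${serializedSize(Bar("abcdef", "xyz", ()))} bytes")
  }
}
```

Both class names have the same length, so any difference in the printed sizes would come from the extra Unit field rather than from the class descriptor.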