Since B is small, I think the best way to do this is with a broadcast variable and a user-defined function.
    import org.apache.spark.sql.functions.udf
    // Outside spark-shell, also import sqlContext.implicits._ for toDF and $.

    // However you get the data...
    case class BType(A2: Int, B2: Int, C2: Int, D2: String)
    val B = Seq(BType(1, 1, 1, "B111"), BType(1, 1, 2, "B112"), BType(2, 0, 0, "B200"))

    val A = sc.parallelize(Seq(
      (1, 1, 1, "DATA"), (1, 1, 2, "DATA"), (2, 0, 0, "DATA"),
      (2, 0, 1, "NONE"), (3, 0, 0, "NONE")
    )).toDF("A1", "B1", "C1", "OTHER")

    // Broadcast B so all nodes have a copy of it.
    val Bbroadcast = sc.broadcast(B)

    // A user-defined function to find the value for D2: try an exact match on
    // (A2, B2, C2) first, then fall back to (A2, B2), then to "NA".
    // This I'm sure could be improved by whacking it into maps. But this is a small example.
    val findD = udf { (a: Int, b: Int, c: Int) =>
      Bbroadcast.value.find(x => x.A2 == a && x.B2 == b && x.C2 == c)
        .getOrElse(Bbroadcast.value.find(x => x.A2 == a && x.B2 == b)
          .getOrElse(BType(0, 0, 0, "NA")))
        .D2
    }

    // Use the UDF in a select
    A.select($"A1", $"B1", $"C1", $"OTHER", findD($"A1", $"B1", $"C1").as("D")).show
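For what it's worth, here is a rough sketch of the "whack it into maps" idea mentioned in the comment, written in plain Scala so it can be tried without a cluster. The names `LookupSketch`, `exact`, `partial` and `lookupD` are made up for illustration and are not part of the code above. I'm assuming the fallback should keep the "first matching row wins" behaviour of `find`, which is why the sequence is reversed before building each map.

```scala
object LookupSketch {
  case class BType(A2: Int, B2: Int, C2: Int, D2: String)

  val B = Seq(BType(1, 1, 1, "B111"), BType(1, 1, 2, "B112"), BType(2, 0, 0, "B200"))

  // Exact-match index keyed by (A2, B2, C2).
  // toMap keeps the last value seen for a duplicate key, so reversing first
  // makes the earliest row in B win, mirroring what `find` returns.
  val exact: Map[(Int, Int, Int), String] =
    B.reverse.map(x => (x.A2, x.B2, x.C2) -> x.D2).toMap

  // Fallback index keyed by (A2, B2) only.
  val partial: Map[(Int, Int), String] =
    B.reverse.map(x => (x.A2, x.B2) -> x.D2).toMap

  // Hypothetical helper: exact match first, then the (A2, B2) fallback, then "NA".
  def lookupD(a: Int, b: Int, c: Int): String =
    exact.get((a, b, c)).orElse(partial.get((a, b))).getOrElse("NA")
}
```

In Spark you would presumably broadcast the two maps instead of the raw `Seq` and do the same lookup on their `.value` inside the `udf`. That replaces a linear scan per row with two hash lookups, which only starts to matter once B has more than a handful of entries.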