UPDATE: OK, now I see your problem: for some crazy reason you're turning your arrays into RDDs with return jsc.parallelize(ra);. Why are you doing that? Why are you creating an RDD of RDDs? Why not leave them as arrays? When you do the reduce you can simply concatenate the arrays. An RDD is a Resilient Distributed Dataset; it makes no logical sense to have a distributed set of distributed datasets. I'm surprised your job even runs and doesn't crash! In any case, that is why your job is so slow.
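To make that concrete, here is a minimal Scala sketch (made-up data, and assuming a SparkContext sc as in spark-shell) of keeping plain arrays as RDD elements and concatenating them in the reduce, rather than calling parallelize on each array:

    import org.apache.spark.rdd.RDD

    // Hypothetical data: each element is a plain Array, not a nested RDD.
    val parts: RDD[Array[Int]] = sc.parallelize(Seq(Array(1, 2), Array(3, 4)))
    // Concatenate the arrays during the reduce; no RDD of RDDs needed.
    val combined: Array[Int] = parts.reduce(_ ++ _)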
In any case, in Scala, after your map you would simply call flatMap(identity), and that would concatenate all your lists together.
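For example, a toy sketch with made-up data:

    // An RDD whose elements are lists...
    val lists: RDD[List[Int]] = sc.parallelize(Seq(List(1, 2), List(3, 4)))
    // ...flattened into a single RDD of the elements themselves.
    val flat: RDD[Int] = lists.flatMap(identity)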
I really don't understand why you need to do the reduce at all; perhaps that is where something inefficient is happening. Here is my code for reading an HBase table (it's generalized, i.e. it works for any schema). The one thing to make sure of when you read the HBase table is that the number of partitions is suitable (usually you want a lot); a quick check for that is sketched after the code below.
    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.rdd.RDD

    // Map(CF -> Map(column qualifier -> Map(timestamp -> value)))
    type HBaseRow = java.util.NavigableMap[Array[Byte],
      java.util.NavigableMap[Array[Byte],
        java.util.NavigableMap[java.lang.Long, Array[Byte]]]]

    type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]

    // Convert the Java NavigableMaps returned by HBase into immutable Scala Maps.
    def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
      navMap.asScala.toMap.map(cf =>
        (cf._1, cf._2.asScala.toMap.map(col =>
          (col._1, col._2.asScala.toMap.map(elem =>
            (elem._1.toLong, elem._2))))))

    // Read an entire HBase table as an RDD of (row key, row contents).
    def readTableAll(table: String): RDD[(Array[Byte], CFTimeseriesRow)] = {
      val conf = HBaseConfiguration.create()
      conf.set(TableInputFormat.INPUT_TABLE, table)
      sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
        classOf[org.apache.hadoop.hbase.client.Result])
        .map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
    }
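As for partitions, a quick way to check what you got is something like this (the table name is a placeholder, and the repartition count depends on your cluster):

    val rdd = readTableAll("my_table")  // "my_table" is a hypothetical name
    println(rdd.partitions.length)      // one partition per HBase region by default
    // If that number is small, spread the work out before heavy processing:
    val rebalanced = rdd.repartition(128)  // 128 is an arbitrary example value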
As you can see, there is no need for a reduce in my code. The methods are pretty self-explanatory. I could dig further into your code, but I lack the patience to read Java, as it's so epically verbose.
I have some more code specifically designed for extracting the most recent elements from a row (rather than the entire history). Let me know if you want to see it.
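That code isn't shown here, but the rough idea (my sketch, not the withheld code) is to keep, for each column qualifier, only the value with the highest timestamp:

    // Sketch only: collapse each column's version history to its latest value.
    def latestPerColumn(row: CFTimeseriesRow): Map[Array[Byte], Map[Array[Byte], Array[Byte]]] =
      row.map { case (cf, cols) =>
        (cf, cols.map { case (qual, versions) =>
          (qual, versions.maxBy(_._1)._2)  // the max timestamp wins
        })
      }

In practice you could also configure the scan itself to return only the latest version, but a post-hoc map like this works on the generic CFTimeseriesRow.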
Finally, I'd recommend you look into using Cassandra over HBase, as DataStax is partnering with Databricks.