I map the HBase table, generating one RDD element per HBase row. However, sometimes the string has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.
I have my original mapper returning Option to indicate that it returns 0 or 1 element, then filters for Some , then gets the contained value:
// myRDD is RDD[(ImmutableBytesWritable, Result)] val output = myRDD. map( tuple => getData(tuple._2) ). filter( {case Some(y) => true; case None => false} ). map( _.get ). // ... more RDD operations with the good data def getData(r: Result) = { val key = r.getRow var id = "(unk)" var x = -1L try { id = Bytes.toString(key, 0, 11) x = Long.MaxValue - Bytes.toLong(key, 11) // ... more code that might throw exceptions Some( ( id, ( List(x), // more stuff ... ) ) ) } catch { case e: NullPointerException => { logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e) None } } }
Is there a more idiomatic way to make this shorter? I feel this looks pretty dirty, both in getData() and in the map.filter.map dance that I do.
Maybe flatMap might work (generate 0 or 1 element in Seq ), but I don't want it to smooth out the tuples that I create in the map function, just delete the empty containers.
source share