How to get all data from an HBase table using Spark

I have a large table in HBase named UserAction with three column families (song, album, singer). I need to get all of the data from the song column family as a JavaRDD. I tried the code below, but it is not efficient. Is there a better solution?

    static SparkConf sparkConf = new SparkConf().setAppName("test")
            .setMaster("local[4]");
    static JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    static void getRatings() {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "UserAction");
        conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");

        JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
                .newAPIHadoopRDD(
                        conf,
                        TableInputFormat.class,
                        org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
                        org.apache.hadoop.hbase.client.Result.class);

        JavaRDD<Rating> count = hBaseRDD
                .map(new Function<Tuple2<ImmutableBytesWritable, Result>, JavaRDD<Rating>>() {
                    @Override
                    public JavaRDD<Rating> call(
                            Tuple2<ImmutableBytesWritable, Result> t)
                            throws Exception {
                        Result r = t._2;
                        int user = Integer.parseInt(Bytes.toString(r.getRow()));
                        ArrayList<Rating> ra = new ArrayList<>();
                        for (Cell c : r.rawCells()) {
                            int product = Integer.parseInt(
                                    Bytes.toString(CellUtil.cloneQualifier(c)));
                            double rating = Double.parseDouble(
                                    Bytes.toString(CellUtil.cloneValue(c)));
                            ra.add(new Rating(user, product, rating));
                        }
                        return jsc.parallelize(ra);
                    }
                })
                .reduce(new Function2<JavaRDD<Rating>, JavaRDD<Rating>, JavaRDD<Rating>>() {
                    @Override
                    public JavaRDD<Rating> call(JavaRDD<Rating> r1,
                            JavaRDD<Rating> r2) throws Exception {
                        return r1.union(r2);
                    }
                });
        jsc.stop();
    }

Column family schema design:

 RowKey = userID, columnQualifier = songID and value = rating. 
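
To make that concrete, here is a hypothetical write matching this schema (assuming the HBase 1.x client API; the IDs, the rating, and the open Table handle named table are made up for illustration):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical row: user 42 rated song 1001 as 4.5. Everything is
    // stored as a string, which is why the read side parses values with
    // Integer.parseInt / Double.parseDouble over Bytes.toString.
    Put put = new Put(Bytes.toBytes("42"));          // row key = userID
    put.addColumn(Bytes.toBytes("song"),             // column family
            Bytes.toBytes("1001"),                   // qualifier = songID
            Bytes.toBytes("4.5"));                   // value = rating
    table.put(put);                                  // `table` is an open Table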
java hbase mapreduce bigdata apache-spark
1 answer

UPDATE: OK, now I see your problem: for some crazy reason you are turning your arrays into RDDs with return jsc.parallelize(ra);. Why are you doing that?? Why are you creating an RDD of RDDs? Why not leave them as arrays? When you do the reduce, you can just concatenate the arrays. An RDD is a Resilient Distributed Dataset; it makes no logical sense to have a distributed dataset of distributed datasets. I am surprised your job even runs and does not crash! Anyway, that is why your job is so slow.

Anyhow, in Scala, after your map you would simply do flatMap(identity), and that would concatenate all your lists together.
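
In Java the same idea looks like the sketch below. This is a minimal, untested sketch against the question's code, assuming the Spark 1.x Java API (where FlatMapFunction.call returns an Iterable; in Spark 2.x it returns an Iterator instead):

    // Same imports as the question's code, plus:
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    JavaRDD<Rating> ratings = hBaseRDD.flatMap(
            new FlatMapFunction<Tuple2<ImmutableBytesWritable, Result>, Rating>() {
                @Override
                public Iterable<Rating> call(
                        Tuple2<ImmutableBytesWritable, Result> t) {
                    Result r = t._2();
                    int user = Integer.parseInt(Bytes.toString(r.getRow()));
                    List<Rating> ra = new ArrayList<>();
                    for (Cell c : r.rawCells()) {
                        int product = Integer.parseInt(
                                Bytes.toString(CellUtil.cloneQualifier(c)));
                        double rating = Double.parseDouble(
                                Bytes.toString(CellUtil.cloneValue(c)));
                        ra.add(new Rating(user, product, rating));
                    }
                    return ra; // plain list; Spark does the flattening
                }
            });

Each per-row result stays a plain collection and Spark flattens them into one RDD, so there are no nested RDDs and no reduce at all.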

I really don't see why you need to do the reduce at all; perhaps that is where your inefficiency lies. Here is my code for reading HBase tables (it's generalized, i.e. it works for any schema). The only thing to make sure of when you read an HBase table is that the number of partitions is suitable; you usually want lots (there is a sketch of this below).

    import scala.collection.JavaConverters._

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.rdd.RDD

    type HBaseRow = java.util.NavigableMap[Array[Byte],
      java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]

    // Map(CF -> Map(column qualifier -> Map(timestamp -> value)))
    type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]

    def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
      navMap.asScala.toMap.map(cf =>
        (cf._1, cf._2.asScala.toMap.map(col =>
          (col._1, col._2.asScala.toMap.map(elem =>
            (elem._1.toLong, elem._2))))))

    def readTableAll(table: String): RDD[(Array[Byte], CFTimeseriesRow)] = {
      val conf = HBaseConfiguration.create()
      conf.set(TableInputFormat.INPUT_TABLE, table)
      sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
          classOf[org.apache.hadoop.hbase.client.Result])
        .map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
    }

As you can see, I have no need for a reduce. The methods are pretty self-explanatory. I could dig further into your code, but I lack the patience to read Java, as it is so epically verbose.
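
On making the number of partitions suitable: newAPIHadoopRDD gives you one Spark partition per HBase region, so a table with only a few regions reads with very little parallelism. A minimal sketch against the question's Java code; the scanner-caching and partition-count values are illustrative, not tuned:

    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

    // Before calling newAPIHadoopRDD: fetch more rows per scanner RPC.
    conf.set(TableInputFormat.SCAN_CACHEDROWS, "500");

    // After reading: one partition per region is often too few, so
    // spread the parsing work across more tasks.
    JavaPairRDD<ImmutableBytesWritable, Result> spread = hBaseRDD.repartition(64);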

I also have some code aimed specifically at fetching the most recent elements from each row (rather than the entire history). Let me know if you want to see it.

Finally, I recommend you look into using Cassandra over HBase, as DataStax is partnering with Databricks.
