The HBaseConfiguration class is essentially a pool of connections to the HBase servers. Obviously, it cannot be serialized and sent to the worker nodes. Since HTable uses this pool to communicate with the HBase servers, it cannot be serialized either.
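For reference, a minimal sketch of the pattern that runs into this (using the asker's theData RDD; the table and column family names are illustrative): the closure passed to the transformation captures the HTable created on the driver, and Spark cannot serialize it.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()        // created on the driver, not serializable
val myTable = new HTable(hbaseConf, "my_table")    // holds the connection pool

theData.map { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)    // the closure captures myTable, which Spark cannot serialize
  a
}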
Basically, there are three ways to solve this problem:
Open a connection on each of the worker nodes.
Note the use of the foreachPartition method:
val tableName = prop.getProperty("hbase.table.name")
<......>
theData.foreachPartition { iter =>
  // One configuration and one table (i.e. one connection pool) per partition,
  // created on the worker node itself, so nothing has to be serialized
  val hbaseConf = HBaseConfiguration.create()
  <... configure HBase ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
  myTable.close()  // flush buffered puts and release the connection
}
Please note that each of the worker nodes must have access to the HBase servers and must have the required HBase jars pre-installed or provided through ADD_JARS.
Also note that since a connection pool is opened for each of the partitions, it would be advisable to reduce the number of partitions to roughly the number of worker nodes (using the coalesce function, as sketched below). It is also possible to share a single HTable instance on each of the worker nodes, but that is not so trivial.
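A minimal sketch of the coalesce idea (the worker count of 10 is illustrative and should match your cluster):

val numWorkers = 10                      // illustrative: roughly the number of worker nodes
theData
  .coalesce(numWorkers)                  // fewer partitions => fewer connection pools opened
  .foreachPartition { iter =>
    // open HBaseConfiguration / HTable and write, exactly as in the snippet above
  }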
Serialize all data to a single machine and write it to HBase
You can write all the data from an RDD from a single computer, even if the data does not fit in memory. The details are explained in this answer: Spark: Best Practice for Retrieving Big Data from RDD to a Local Computer
Of course, this will be slower than a distributed write, but it is simple, avoids the serialization problems, and may be the better approach if the data size is reasonable.
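A hedged sketch of this single-machine approach, reusing tableName and hbaseColFamily from above and using toLocalIterator (the technique that answer describes), so that only one partition at a time has to fit in the driver's memory:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// configure HBase connection properties here
val myTable = new HTable(hbaseConf, tableName)   // opened once, on the driver only

theData.toLocalIterator.foreach { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)
}
myTable.close()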
Use HadoopOutputFormat
You can create a custom HadoopOutputFormat for HBase or use an existing one. I'm not sure if there is something that fits your needs, but Google should help here.
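For what it's worth, here is a sketch with the TableOutputFormat that ships with HBase (rather than a custom format); it assumes the same tableName, hbaseColFamily and theData as above:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Convert each record into the (key, Put) pairs the output format expects,
// then let Spark write them through the Hadoop output machinery
theData
  .map { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    (new ImmutableBytesWritable(Bytes.toBytes(a(0))), p)
  }
  .saveAsNewAPIHadoopDataset(job.getConfiguration)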
P.S. By the way, the map call does not fail because it does not get evaluated: RDDs are not evaluated until you invoke a function with side effects. For example, if you called theData.map(....).persist, it would not fail either.
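A small sketch of that laziness point, reusing theData from above:

val mapped = theData.map(a => a)   // transformation only: nothing is executed yet
mapped.persist()                   // still lazy, just marks the RDD for caching
mapped.count()                     // an action: only now are tasks serialized and run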