The HBaseConfiguration class is essentially a pool of connections to the HBase servers. Obviously, it cannot be serialized and sent to the worker nodes. Since HTable uses this pool to communicate with the HBase servers, it cannot be serialized either.
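For reference, a minimal sketch of the pattern that runs into this (using the asker's theData RDD; the table and column family names are illustrative): the closure passed to the transformation captures the HTable created on the driver, and Spark cannot serialize it.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()        // created on the driver, not serializable
val myTable = new HTable(hbaseConf, "my_table")    // holds the connection pool

theData.map { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)    // the closure captures myTable, which Spark cannot serialize
  a
}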
Basically, there are three ways to solve this problem:
Open a connection on each of the worker nodes.
Note the use of the foreachPartition method:
val tableName = prop.getProperty("hbase.table.name")
<......>
theData.foreachPartition { iter =>
  // One configuration and one table (i.e. one connection pool) per partition,
  // created on the worker node itself, so nothing has to be serialized
  val hbaseConf = HBaseConfiguration.create()
  <... configure HBase ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
  myTable.close()  // flush buffered puts and release the connection
}
Please note that each of the worker nodes must have access to the HBase servers and must have the required HBase jars pre-installed or provided through ADD_JARS.
Also note that since a connection pool is opened for each of the partitions, it would be advisable to reduce the number of partitions to roughly the number of worker nodes (using the coalesce function, as sketched below). It is also possible to share a single HTable instance on each of the worker nodes, but that is not so trivial.
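A minimal sketch of the coalesce idea (the worker count of 10 is illustrative and should match your cluster):

val numWorkers = 10                      // illustrative: roughly the number of worker nodes
theData
  .coalesce(numWorkers)                  // fewer partitions => fewer connection pools opened
  .foreachPartition { iter =>
    // open HBaseConfiguration / HTable and write, exactly as in the snippet above
  }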
Serialize all data to a single machine and write it to HBase
You can write all the data from an RDD from a single computer, even if the data does not fit in memory. The details are explained in this answer: Spark: Best Practice for Retrieving Big Data from RDD to a Local Computer
Of course, this will be slower than a distributed write, but it is simple, avoids the serialization problems, and may be the better approach if the data size is reasonable.
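A hedged sketch of this single-machine approach, reusing tableName and hbaseColFamily from above and using toLocalIterator (the technique that answer describes), so that only one partition at a time has to fit in the driver's memory:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// configure HBase connection properties here
val myTable = new HTable(hbaseConf, tableName)   // opened once, on the driver only

theData.toLocalIterator.foreach { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)
}
myTable.close()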
Use HadoopOutputFormat
You can create a custom HadoopOutputFormat for HBase or use an existing one. I'm not sure if there is something that fits your needs, but Google should help here.
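For what it's worth, here is a sketch with the TableOutputFormat that ships with HBase (rather than a custom format); it assumes the same tableName, hbaseColFamily and theData as above:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Convert each record into the (key, Put) pairs the output format expects,
// then let Spark write them through the Hadoop output machinery
theData
  .map { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    (new ImmutableBytesWritable(Bytes.toBytes(a(0))), p)
  }
  .saveAsNewAPIHadoopDataset(job.getConfiguration)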
P.S. By the way, the map call does not fail because it does not get evaluated: RDDs are not evaluated until you invoke a function with side effects. For example, if you called theData.map(....).persist, it would not fail either.
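A small sketch of that laziness point, reusing theData from above:

val mapped = theData.map(a => a)   // transformation only: nothing is executed yet
mapped.persist()                   // still lazy, just marks the RDD for caching
mapped.count()                     // an action: only now are tasks serialized and run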