Scanning multiple HBase tables in one MapReduce job

I am considering the following scenario. A data file is sent to me daily, and I load it into HBase as a table named after the file with a -yyyyMMdd suffix. So over time I end up with many tables, for example:

tempdb-20121220 tempdb-20121221 tempdb-20121222 tempdb-20121223 tempdb-20121224 tempdb-20121225 

Now, for a specific date range, I want to get the list of tables that fall within that range so that I can build indexes from them. I am using hbase-0.90.6.
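The date-range selection described above can be sketched in plain Java. This is only an illustration under a few assumptions: the table names are already available as a list of strings, every daily table ends in an 8-digit yyyyMMdd suffix, and the class name DailyTableSelector and its select method are hypothetical helpers, not part of HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: picks the daily tables whose
// -yyyyMMdd suffix falls inside an inclusive date range.
final class DailyTableSelector {

    // Returns the names that start with prefix + "-" and whose 8-digit suffix
    // lies between startDay and endDay (both inclusive, both in yyyyMMdd form).
    static List<String> select(List<String> allTables, String prefix,
                               String startDay, String endDay) {
        String head = prefix + "-";
        List<String> selected = new ArrayList<String>();
        for (String name : allTables) {
            if (!name.startsWith(head)) {
                continue;
            }
            String day = name.substring(head.length());
            if (!day.matches("\\d{8}")) {
                continue; // not a daily table
            }
            // yyyyMMdd strings sort chronologically, so string comparison is enough.
            if (day.compareTo(startDay) >= 0 && day.compareTo(endDay) <= 0) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList(
                "tempdb-20121220", "tempdb-20121221", "tempdb-20121222",
                "tempdb-20121223", "tempdb-20121224", "tempdb-20121225");
        System.out.println(select(tables, "tempdb", "20121222", "20121224"));
    }
}
```

The resulting list is what a loop over tables, or a multi-table input format, would then be fed with.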

From what I have found so far, TableMapReduceUtil.initTableMapperJob takes only one table name:

    TableMapReduceUtil.initTableMapperJob(
        tableName,                 // input HBase table name
        scan,                      // Scan instance to control CF and attribute selection
        HBaseIndexerMapper.class,  // mapper
        null,                      // mapper output key
        null,                      // mapper output value
        job);

I managed to get the list of tables and run the job in a loop, one table at a time. What I actually want is to go through all the tables in a single job (scan them, or something similar) so that in the end I get merged / combined results for indexing.

Any direction to achieve this would be great and helpful.

2 answers

OK, check out the HBase 0.94.6 sources (they seem to be the closest to your version). There you will find the MultiTableInputFormat class (its JavaDoc includes an example), which does what you need. Just a few days ago I added this class to a project based on HBase 0.94.2 (actually CDH 4.2.1), and it worked.

This looks like exactly what you need, and it is very efficient. The only catch is that a single mapper class processes the data from all the tables. To distinguish between tables, you will probably need to take the TableSplit class from 0.94.6, rename it, and port it so that it does not break your environment. Also check the differences in TableMapReduceUtil: you will need to configure the scans manually so that the input format understands their configuration.

Also consider simply upgrading to HBase 0.94.6. That is much simpler, although I was not able to do it myself. It took me about 12 working hours to understand the problems here, investigate the solutions, understand my issue with CDH 4.2.1, and port everything. The good news for me: Cloudera intends to move to 0.94.6 in CDH 4.3.0.

UPDATE1: CDH 4.3.0 is available and includes HBase 0.94.6 with all the necessary infrastructure.

UPDATE2: I have switched to another solution: a custom input format that combines several HBase tables, interleaving their rows by key. It turned out to be very useful, especially with a proper key design, because you get whole aggregates in a single mapper. I am considering publishing this code on GitHub.
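To illustrate what interleaving rows by key means, here is a minimal plain-Java sketch of a k-way merge over sources that are already sorted by key. It is not the custom input format mentioned above and does not touch HBase at all: the sorted string lists stand in for per-table scanners, String ordering stands in for HBase's byte-wise row key ordering, and the class name KeyOrderedMerge is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: merges several key-sorted sources into one key-ordered stream.
final class KeyOrderedMerge {

    // One cursor per source: the current row key plus the rest of that source.
    private static final class Cursor implements Comparable<Cursor> {
        final int source;
        final Iterator<String> rest;
        String key;

        Cursor(int source, Iterator<String> rest) {
            this.source = source;
            this.rest = rest;
            this.key = rest.next();
        }

        // Order by key first; break ties by source index to keep the output stable.
        public int compareTo(Cursor other) {
            int byKey = key.compareTo(other.key);
            return byKey != 0 ? byKey : Integer.compare(source, other.source);
        }
    }

    // Returns "key@sourceIndex" entries in key order, so that rows sharing a key
    // end up next to each other regardless of which source they came from.
    static List<String> merge(List<List<String>> sortedSources) {
        PriorityQueue<Cursor> heap = new PriorityQueue<Cursor>();
        for (int i = 0; i < sortedSources.size(); i++) {
            Iterator<String> it = sortedSources.get(i).iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(i, it));
            }
        }
        List<String> merged = new ArrayList<String>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.key + "@" + c.source);
            if (c.rest.hasNext()) {
                c.key = c.rest.next(); // safe: the cursor is outside the heap here
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> sources = new ArrayList<List<String>>();
        sources.add(Arrays.asList("row1", "row3"));
        sources.add(Arrays.asList("row1", "row2"));
        System.out.println(merge(sources));
    }
}
```

With keys designed so that related rows in different tables share a key or a key prefix, this ordering is what lets one mapper see all the rows belonging to the same aggregate together.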


Passing a List<Scan> is another way to do it. I agree with the MultiTableInputFormat suggestion:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;

    public class TestMultiScan extends Configured implements Tool {

        @Override
        public int run(String[] arg0) throws Exception {
            List<Scan> scans = new ArrayList<Scan>();

            // Each Scan names its source table through the table-name attribute.
            Scan scan1 = new Scan();
            scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("table1ddmmyyyy"));
            System.out.println(Bytes.toString(scan1.getAttribute("scan.attributes.table.name")));
            scans.add(scan1);

            Scan scan2 = new Scan();
            scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("table2ddmmyyyy"));
            System.out.println(Bytes.toString(scan2.getAttribute("scan.attributes.table.name")));
            scans.add(scan2);

            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJarByClass(TestMultiScan.class);

            // MultiTableMappter and MultiTableReducer are the job's own classes (not shown here).
            TableMapReduceUtil.initTableMapperJob(
                scans, MultiTableMappter.class, Text.class, IntWritable.class, job);
            TableMapReduceUtil.initTableReducerJob(
                "xxxxx", MultiTableReducer.class, job);

            job.waitForCompletion(true);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            TestMultiScan runJob = new TestMultiScan();
            runJob.run(args);
        }
    }

This is how we met our multi-tenancy requirements with HBase namespace tables. For example: DEV1:TABLEX (data ingested by DEV1) and UAT1:TABLEX (data consumed by UAT1). In the mapper we want to compare the tables from both namespaces before proceeding.

Internally, this uses MultiTableInputFormat, as can be seen in TableMapReduceUtil.java.

(Screenshot: TableMapReduceUtil internals using MultiTableInputFormat)
