This is certainly a workable approach, but if you do 2 random reads per scanned row, your throughput will drop sharply. If you are filtering rows out heavily or have a small data set in A, this may not be a problem.
Sort-Merge Join
However, the best approach, which will be available in HBase 0.96, is the MultipleTableInput method. With it, the job scans table A and writes its output under a key that allows it to be matched against table B.
E.g. table A emits (b_id, a_info) and table B emits (b_id, b_info), and the two are merged together in the reducer.
This is an example of a sort-merge join.
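To make the data flow concrete, here is a minimal in-memory sketch of that reduce-side join. It does not use the HBase or Hadoop APIs. The function names, the "A"/"B" tags and the tuple layouts are all assumptions made for illustration, and the shuffle step only imitates what the framework does between map and reduce.

```python
from collections import defaultdict

def map_table_a(rows):
    # Hypothetical layout for A: (a_row_key, b_id, a_info).
    # Emit under the join key b_id, tagged with the source table.
    for _a_key, b_id, a_info in rows:
        yield b_id, ("A", a_info)

def map_table_b(rows):
    # Hypothetical layout for B: (b_id, b_info).
    for b_id, b_info in rows:
        yield b_id, ("B", b_info)

def shuffle(*mapped_streams):
    # Stand-in for the framework's sort/shuffle: group values by key
    # and hand the groups over in key order.
    groups = defaultdict(list)
    for stream in mapped_streams:
        for key, value in stream:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_join(key, values):
    # One reducer call sees every value for one b_id, from both tables.
    a_side = [v for tag, v in values if tag == "A"]
    b_side = [v for tag, v in values if tag == "B"]
    for a_info in a_side:
        for b_info in b_side:
            yield key, a_info, b_info

def sort_merge_join(table_a, table_b):
    joined = []
    for key, values in shuffle(map_table_a(table_a), map_table_b(table_b)):
        joined.extend(reduce_join(key, values))
    return joined
```

Keys that appear in only one of the two tables produce no output here, so this behaves like an inner join.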
Nested Loop Join
If you are joining on the row key, or the join attribute is sorted in the same order as table B, each task can hold a scanner instance that reads table B sequentially until it finds what it is looking for.
E.g. table A row key = "companyId" and table B row key = "companyId_employeeId". Then for each company in table A you can get all of its employees using the nested-loop algorithm.
Pseudocode:
    for company in TableA:
        for employee in TableB:
            if employee.company_id == company.id:
                emit(company.id, employee)
This is an example of a nested-loop join.
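Because table B is sorted by row key, the inner loop does not have to restart for every company. A rough sketch of the single forward scanner idea follows, using plain Python lists in place of HBase tables. The function name and the tuple layouts are hypothetical, and it assumes both inputs are sorted and that company IDs are fixed width, so that the order of table A matches the order of the prefixes in table B.

```python
def join_companies_with_employees(table_a, table_b):
    """Join each company in A with its employees in B using one forward scan.

    table_a: list of (company_id, company_info), sorted by company_id.
    table_b: list of (row_key, employee_info), sorted by row_key,
             where row_key = company_id + "_" + employee_id.
    Assumes fixed-width company IDs (e.g. "c01", "c02").
    """
    scanner = iter(table_b)          # stands in for a table B scanner
    current = next(scanner, None)
    for company_id, _company_info in table_a:
        prefix = company_id + "_"
        # Skip B rows that sort before this company's key range.
        while current is not None and current[0] < prefix:
            current = next(scanner, None)
        # Emit every employee whose row key starts with the prefix.
        while current is not None and current[0].startswith(prefix):
            yield company_id, current[1]
            current = next(scanner, None)
```

Each row of table B is read at most once, so the cost is one sequential pass over B rather than one pass per company.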
More detailed join algorithms are given here:
Bryan