Your expression
```java
DataFrame df = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .options(phoenixInfoMap)
    .load();
```
loads the entire table into memory. You have not provided any filter for Phoenix to push down to HBase, which would reduce the number of rows read.
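For example (the column name and value below are placeholders for your actual key column), attaching a filter to the load gives Phoenix a predicate it can push down to HBase, so far fewer rows are scanned:

```java
// Hypothetical sketch: the filter predicate can be pushed down to HBase
// by the Phoenix connector, so only matching rows are read. "ID" and
// 'some-id' are placeholders, not part of any real schema.
DataFrame df = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .options(phoenixInfoMap)
    .load()
    .filter("ID = 'some-id'");
```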
If you join against a data source other than HBase, for example a flat file, then all the records from the HBase table will be read first. Records that do not match the second data source will not be kept in the new DataFrame, but the initial full read still happens.
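For example, a join like the following (the file path, column names, and the spark-csv package are assumptions) still scans the whole HBase table before Spark discards the non-matching rows:

```java
// Hypothetical sketch: join the Phoenix-backed DataFrame with ids read
// from a flat file. Spark drops non-matching rows only *after* the full
// HBase scan; nothing here tells Phoenix to restrict the scan itself.
DataFrame ids = sqlContext.read()
    .format("com.databricks.spark.csv")   // assumes the spark-csv package is on the classpath
    .option("header", "true")
    .load("/path/to/ids.csv");

DataFrame joined = df.join(ids, df.col("ID").equalTo(ids.col("ID")));
```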
Update: a potential approach would be to preprocess the file first, that is, extract the ids you need. Store the results in a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
The rationale for this approach is to move the computation to the data. The bulk of the data lives in HBase, so move the small data (the ids from the files) to it.
I am not familiar with Phoenix directly, beyond the fact that it provides a SQL layer on top of HBase. Presumably it could then perform such a join and save the result in a separate HBase table...? That separate table could then be loaded into Spark for use in subsequent computations.
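A sketch of that idea via Phoenix's JDBC driver follows; the ZooKeeper quorum and all table and column names (FILE_IDS, BIG_TABLE, RESULT_TABLE, ID) are assumptions, not anything Phoenix prescribes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class PhoenixJoinSketch {
    // idsFromFile: the ids extracted from the flat file in the preprocessing step
    static void joinInsideHBase(List<String> idsFromFile) throws Exception {
        // FILE_IDS and RESULT_TABLE must already exist (CREATE TABLE ...).
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            // 1. Upsert each id into a small lookup table.
            try (PreparedStatement upsert =
                     conn.prepareStatement("UPSERT INTO FILE_IDS (ID) VALUES (?)")) {
                for (String id : idsFromFile) {
                    upsert.setString(1, id);
                    upsert.executeUpdate();
                }
            }
            conn.commit();

            // 2. Let Phoenix perform the join inside HBase and materialize
            //    the matching rows into a separate table.
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                    "UPSERT INTO RESULT_TABLE SELECT t.* FROM BIG_TABLE t " +
                    "JOIN FILE_IDS f ON t.ID = f.ID");
            }
            conn.commit();
        }
    }
}
```

The separate result table could then be read back into Spark the same way as the original load (the table name and zkUrl are again placeholders):

```java
// Hypothetical: load the pre-joined RESULT_TABLE back into Spark.
DataFrame result = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .option("table", "RESULT_TABLE")
    .option("zkUrl", "zk-host:2181")
    .load();
```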