Your expression
```java
DataFrame df = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .options(phoenixInfoMap)
    .load();
```
loads the entire table into memory. You have not provided any filter for Phoenix to push down to HBase, which would reduce the number of rows read.
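For example (the column name and value below are placeholders for your actual key column), attaching a filter to the load gives Phoenix a predicate it can push down to HBase, so far fewer rows are scanned:

```java
// Hypothetical sketch: the filter predicate can be pushed down to HBase
// by the Phoenix connector, so only matching rows are read. "ID" and
// 'some-id' are placeholders, not part of any real schema.
DataFrame df = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .options(phoenixInfoMap)
    .load()
    .filter("ID = 'some-id'");
```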
If you join against a data source other than HBase, for example a flat file, then all the records from the HBase table will be read first. Records that do not match the second data source will not be kept in the new DataFrame, but the initial full read still happens.
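For example, a join like the following (the file path, column names, and the spark-csv package are assumptions) still scans the whole HBase table before Spark discards the non-matching rows:

```java
// Hypothetical sketch: join the Phoenix-backed DataFrame with ids read
// from a flat file. Spark drops non-matching rows only *after* the full
// HBase scan; nothing here tells Phoenix to restrict the scan itself.
DataFrame ids = sqlContext.read()
    .format("com.databricks.spark.csv")   // assumes the spark-csv package is on the classpath
    .option("header", "true")
    .load("/path/to/ids.csv");

DataFrame joined = df.join(ids, df.col("ID").equalTo(ids.col("ID")));
```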
Update: a potential approach would be to preprocess the file first, that is, extract the ids you need. Store the results in a new HBase table. Then perform the join directly in HBase via Phoenix, not Spark.
The rationale for this approach is to move the computation to the data. The bulk of the data lives in HBase, so move the small data (the ids from the files) to it.
I am not familiar with Phoenix directly, beyond the fact that it provides a SQL layer on top of HBase. Presumably it could then perform such a join and save the result in a separate HBase table...? That separate table could then be loaded into Spark for use in subsequent computations.
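A sketch of that idea via Phoenix's JDBC driver follows; the ZooKeeper quorum and all table and column names (FILE_IDS, BIG_TABLE, RESULT_TABLE, ID) are assumptions, not anything Phoenix prescribes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class PhoenixJoinSketch {
    // idsFromFile: the ids extracted from the flat file in the preprocessing step
    static void joinInsideHBase(List<String> idsFromFile) throws Exception {
        // FILE_IDS and RESULT_TABLE must already exist (CREATE TABLE ...).
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            // 1. Upsert each id into a small lookup table.
            try (PreparedStatement upsert =
                     conn.prepareStatement("UPSERT INTO FILE_IDS (ID) VALUES (?)")) {
                for (String id : idsFromFile) {
                    upsert.setString(1, id);
                    upsert.executeUpdate();
                }
            }
            conn.commit();

            // 2. Let Phoenix perform the join inside HBase and materialize
            //    the matching rows into a separate table.
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                    "UPSERT INTO RESULT_TABLE SELECT t.* FROM BIG_TABLE t " +
                    "JOIN FILE_IDS f ON t.ID = f.ID");
            }
            conn.commit();
        }
    }
}
```

The separate result table could then be read back into Spark the same way as the original load (the table name and zkUrl are again placeholders):

```java
// Hypothetical: load the pre-joined RESULT_TABLE back into Spark.
DataFrame result = sqlContext.read()
    .format("org.apache.phoenix.spark")
    .option("table", "RESULT_TABLE")
    .option("zkUrl", "zk-host:2181")
    .load();
```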