Reading DataFrame from Partitioned Parquet File

How do I read a partitioned parquet file into a dataframe with a condition?

The following works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*") 

There is a partition for each of day=1 to day=30. Is it possible to read only some of them, something like day=5 to day=6, or day=5 and day=6?

 val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*") 

If I put *, it gives me all the data for all 30 days, and that is too big.

+7
scala apache-spark spark-dataframe parquet
3 answers

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass the two paths, for example:

 val dataframe = sqlContext.read.parquet(
   "file:///your/path/data=jDD/year=2015/month=10/day=5/",
   "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, say country=XX, then country will automatically be added as a column in the dataframe.
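For illustration, a sketch of what that looks like, assuming a hypothetical nested layout with a country=XX level under each day (the paths are placeholders):

```scala
// Hypothetical layout: .../day=5/country=SE/part-00000.parquet
val df = sqlContext.read.parquet(
  "file:///your/path/data=jDD/year=2015/month=10/day=5/")

// The nested partition key shows up as a regular column,
// usable in selects and filters like any other:
df.printSchema()                      // includes a "country" field
df.select("country").distinct().show()
```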

EDIT: As of Spark 1.6, you must provide a "basePath" option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a dataframe with the columns "data", "year", "month" and "day":

 val dataframe = sqlContext.read
   .option("basePath", "file:///your/path/")
   .parquet(
     "file:///your/path/data=jDD/year=2015/month=10/day=5/",
     "file:///your/path/data=jDD/year=2015/month=10/day=6/")
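A related alternative (my own sketch, not part of the answer above, with placeholder paths): read the root of the partitioned tree with basePath and restrict the days with a filter on the generated column. Spark pushes such predicates down to partition pruning, so only the matching day directories are scanned:

```scala
// Read the whole table root; "day" becomes a dataframe column.
val df = sqlContext.read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/")

// Partition pruning: only day=5 and day=6 directories are read.
val subset = df.filter("day >= 5 AND day <= 6")
```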
+28

You need to provide the mergeSchema = true option, as shown below (this is from 1.6.0):

 val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD") 

This will read all of the parquet files into the dataframe, and also creates the year, month and day columns in it.

Link: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging

+4

If you want to read several days, for example day=5 and day=6, and want to specify the range in the path itself, you can use wildcards:

 val dataframe = sqlContext.read
   .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")

Wildcards can also be used to specify a range of days:

 val dataframe = sqlContext.read
   .parquet("file:///your/path/data=jDD/year=2015/month=10/day=[5-9]/*")

This matches days 5 through 9. Note that [...] is a character class, so it matches a single character; [5-10] would not match days 5 through 10. For a range that crosses into double digits, use an explicit list such as day={5,6,7,8,9,10}.
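If wildcard ranges get awkward (multi-digit days in particular), one option is to build the list of paths programmatically and pass it to parquet, which accepts varargs (a sketch; the base path is a placeholder):

```scala
// Build one path per day and splat them into parquet(...).
val days = 5 to 10
val paths = days.map(d =>
  s"file:///your/path/data=jDD/year=2015/month=10/day=$d/")
val dataframe = sqlContext.read.parquet(paths: _*)
```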

+1
