Reading DataFrame from Partitioned Parquet File

How do I read a partitioned parquet file into a dataframe with a condition?

The following works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*") 

There is a partition for each of day=1 to day=30. Is it possible to read only some of them, something like day=5 to day=6, or day=5 and day=6?

 val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*") 

If I put *, it gives me all the data for all 30 days, and that is too big.

+7
scala apache-spark spark-dataframe parquet
3 answers

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass the two paths, for example:

 val dataframe = sqlContext.read.parquet(
   "file:///your/path/data=jDD/year=2015/month=10/day=5/",
   "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, say country=XX, then country will automatically be added as a column in the dataframe.
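For illustration, a sketch of what that looks like, assuming a hypothetical nested layout with a country=XX level under each day (the paths are placeholders):

```scala
// Hypothetical layout: .../day=5/country=SE/part-00000.parquet
val df = sqlContext.read.parquet(
  "file:///your/path/data=jDD/year=2015/month=10/day=5/")

// The nested partition key shows up as a regular column,
// usable in selects and filters like any other:
df.printSchema()                      // includes a "country" field
df.select("country").distinct().show()
```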

EDIT: As of Spark 1.6, you must provide a "basePath" option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a dataframe with the columns "data", "year", "month" and "day":

 val dataframe = sqlContext.read
   .option("basePath", "file:///your/path/")
   .parquet(
     "file:///your/path/data=jDD/year=2015/month=10/day=5/",
     "file:///your/path/data=jDD/year=2015/month=10/day=6/")
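A related alternative (my own sketch, not part of the answer above, with placeholder paths): read the root of the partitioned tree with basePath and restrict the days with a filter on the generated column. Spark pushes such predicates down to partition pruning, so only the matching day directories are scanned:

```scala
// Read the whole table root; "day" becomes a dataframe column.
val df = sqlContext.read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/")

// Partition pruning: only day=5 and day=6 directories are read.
val subset = df.filter("day >= 5 AND day <= 6")
```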
+28

You need to provide the mergeSchema = true option, as shown below (this is from 1.6.0):

 val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD") 

This will read all of the parquet files into the dataframe, and also creates the year, month and day columns in it.

Link: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging

+4

If you want to read several days, for example day=5 and day=6, and want to specify the range in the path itself, you can use wildcards:

 val dataframe = sqlContext.read
   .parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")

Wildcards can also be used to specify a range of days:

 val dataframe = sqlContext.read
   .parquet("file:///your/path/data=jDD/year=2015/month=10/day=[5-9]/*")

This matches days 5 through 9. Note that [...] is a character class, so it matches a single character; [5-10] would not match days 5 through 10. For a range that crosses into double digits, use an explicit list such as day={5,6,7,8,9,10}.
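If wildcard ranges get awkward (multi-digit days in particular), one option is to build the list of paths programmatically and pass it to parquet, which accepts varargs (a sketch; the base path is a placeholder):

```scala
// Build one path per day and splat them into parquet(...).
val days = 5 to 10
val paths = days.map(d =>
  s"file:///your/path/data=jDD/year=2015/month=10/day=$d/")
val dataframe = sqlContext.read.parquet(paths: _*)
```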

+1
