Reading parquet files from multiple directories in PySpark

I need to read parquet files from several paths that are not parent or child directories.

e.g.:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2

Right now I am reading each directory separately and merging the dataframes with unionAll. Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancier way of doing it with unionAll?
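For reference, what I am doing today looks roughly like this (the paths are just placeholders):

df1 = sqlContext.read.parquet('/dir1/dir1_2')
df2 = sqlContext.read.parquet('/dir2/dir2_1')
df = df1.unionAll(df2)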

Thanks.

+5
3 answers

A bit late, but I found this while I was looking, and it might help someone else ...

You can also unpack a list of paths into the arguments of spark.read.parquet():

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is useful if you want to pass several glob patterns to the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

Setting basePath tells Spark where the partitioned directory structure starts, so the partition columns are still inferred even though you are passing several glob patterns.
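As a purely illustrative sketch (reusing the basePath and paths variables from above), the inferred partition columns then behave like any other column:

df = spark.read.option("basePath", basePath).parquet(*paths)
df.printSchema()   # partition_value1 and partition_value2 show up as columns
df.select('partition_value1', 'partition_value2').distinct().show()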

+13

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader accept multiple paths, so either of these works:

# older SQLContext API
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

# DataFrameReader API
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
+5

If the files are scattered around HDFS (I was running this from Jupyter with PySpark), you can collect the paths first and pass them all in one call:

from hdfs import InsecureClient
# WebHDFS client pointed at the namenode's HTTP port
client = InsecureClient('http://localhost:50070')

import posixpath as psp
# Walk the directory tree and build fully qualified hdfs:// paths for every file
fpaths = [
  psp.join("hdfs://localhost:9000" + dpath, fname)
  for dpath, _, fnames in client.walk('/eta/myHdfsPath')
  for fname in fnames
]
# At this point fpaths contains all files under the directory

parquetFile = sqlContext.read.parquet(*fpaths)


import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
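One caveat, assuming the data was written by Spark: client.walk returns every file under the directory, including _SUCCESS markers and other non-parquet files, so it can be safer to filter down to the actual parquet part-files before reading:

# Keep only real parquet part-files; skip _SUCCESS and other metadata files
parquet_paths = [p for p in fpaths if p.endswith('.parquet')]
parquetFile = sqlContext.read.parquet(*parquet_paths)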
+2
