Spark does not use Parquet's HDFS partitioning

Question

I am writing a Parquet file to HDFS using the following command: df.write.mode(SaveMode.Append).partitionBy("id").parquet(path)

Then I read and filter the file as follows:

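val file = sqlContext.read.parquet(folder)
// Column 4 is the partition column; it becomes field 0 (thingId) of the new Row
val data = file.map(r => Row(r.getInt(4).toString, r.getString(0), r.getInt(1),
    r.getLong(2), r.getString(3)))

// Field 0 of each mapped Row holds thingId as a String
val filteredData = data.filter(x => x.getString(0).equals("1"))
filteredData.collect()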

I would expect Spark to use the file's partitioning and read only the partition "thingId = 1". In fact, Spark reads every partition of the file, not just the filtered one (the partition with thingId = 1). If I look in the logs, I see that it reads everything:

03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=1/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=42/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=17/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/0833/id=33/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=26/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
03/16/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=12/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet

Why is that? I would like Spark to read only the partition with thingId = 1. Am I missing something?

+4

hadoop hdfs bigdata apache-spark parquet

AlexL · asked Mar 21 '16 at 8:59

Tzach Zohar · Accepted Answer · 2016-03-21T10:00:08+0000

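There are a couple of possible reasons why Spark does not skip the other partitions here (some details are missing from the question):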

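Spark version: Spark has an option that enables filter pushdown for Parquet (spark.sql.parquet.filterPushdown). It is turned on by default only as of Spark 1.5.0, so if you are running an older version, this may be the cause.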
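
The setting can be inspected and switched on from code. This is a minimal sketch, assuming `sqlContext` is the existing SQLContext from the question (Spark 1.x API); the fallback string `"<not set>"` is only an illustrative placeholder:

```scala
// Assumes `sqlContext` is the SQLContext already used in the question.
// Print the current value; the second argument is returned if the key is unset.
println(sqlContext.getConf("spark.sql.parquet.filterPushdown", "<not set>"))

// Explicitly enable Parquet filter pushdown for this context.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```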

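The filter is "opaque" to Spark. Look at how you filter: you first map the data (into what?), and only then apply filter, using a function. Spark cannot look "inside" that function; from Spark's point of view it is just a Row => Boolean, ...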
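
Why a function-valued filter prevents pruning can be sketched with a toy model. This is not Spark's implementation; `OpaqueFilterSketch`, `EqualTo`, `prune` and `pruneOpaque` are hypothetical names made up for illustration, and the directory names only mimic the layout seen in the logs:

```scala
// A toy model, not Spark's real classes: every name here is hypothetical.
object OpaqueFilterSketch {
  // A declarative predicate is plain data, so a planner can inspect it.
  final case class EqualTo(column: String, value: String)

  // Partition directories in the style written by partitionBy("id").
  val partitionDirs: Seq[String] = Seq("/path/id=1", "/path/id=42", "/path/id=17")

  // With a declarative predicate, directories that cannot match are skipped up front.
  def prune(dirs: Seq[String], p: EqualTo): Seq[String] =
    dirs.filter(_.endsWith(s"/${p.column}=${p.value}"))

  // With an arbitrary function there is nothing to inspect, so every directory
  // has to be read and the function applied to each row afterwards.
  def pruneOpaque(dirs: Seq[String], f: Map[String, String] => Boolean): Seq[String] =
    dirs

  def main(args: Array[String]): Unit = {
    println(prune(partitionDirs, EqualTo("id", "1")))
    println(pruneOpaque(partitionDirs, row => row("id") == "1"))
  }
}
```

The first call keeps only the `id=1` directory, while the second has to keep all three, which mirrors the behaviour described in the question.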

So, to fix this, apply the filter before the map, in a form that Spark can understand:

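import org.apache.spark.sql.functions.col  // needed for the col(...) variant

// assuming the relevant column name is "id" in the parquet structure
val filtered = file.filter("id = 1")

// or, equivalently (use one of the two):
// val filtered = file.filter(col("id") === 1)

// and only then:
val data = filtered.map(r => Row(...))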
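
One way to check how the rewritten filter is handled is to print the query plan. This is a minimal sketch, assuming `file` is the DataFrame from the question and `id` is the partition column; the name `pruned` is illustrative, and the exact plan output depends on the Spark version:

```scala
// Assumes `file` is the DataFrame loaded with sqlContext.read.parquet(folder).
val pruned = file.filter("id = 1")

// Print the logical and physical plans; the filter on the partition column
// is visible to the planner here instead of being hidden inside a closure.
pruned.explain(true)
```

Comparing the log output before and after the change shows whether only files under the id=1 directory are still being read.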
