Iterating / looping over Spark parquet files in a script results in a memory error / memory build-up (using Spark SQL queries)

I've been trying to figure out how to keep Spark from crashing with memory problems when I iterate over parquet files and several post-processing functions. Sorry for the wall of text, but it isn't one very specific error (I'm using PySpark). Apologies if this breaks proper form!

The main pseudocode:

#fileNums are the file-number partitions of the parquet file
#I read each one in as a separate file from its "<column>=<value>" subdirectory
for counter in fileNums:
  sparkDataFrame = sqlContext.read.parquet(counter)
  sparkDataFrame.registerTempTable("data")  #register it so the SQL queries below can see it
  summaryReportOne = sqlContext.sql("SELECT.....")
  summaryReportOne.write.partitionBy("id").parquet("/")
  summaryReportTwo = sqlContext.sql("SELECT....")
  summaryReportTwo.write.partitionBy("id").parquet("/")
  #several more queries, several involving joins, etc....

This code uses Spark SQL queries, so I haven't been able to create one wrapper function holding all of the SQL queries/functions and pass it to foreach (whose function can't take a sparkContext or sqlContext as input), as opposed to a standard for loop; a driver-side sketch of what I mean is below.
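For what it's worth, this is a stripped-down sketch of the kind of wrapper I mean; process_one_partition and the report paths are made-up names, and the queries are placeholders. It runs fine in a plain Python loop, because it only ever executes on the driver, but it is exactly the sort of function foreach won't accept:

#hypothetical driver-side wrapper; sqlContext lives on the driver, so this
#cannot be shipped to executors via foreach/map
def process_one_partition(sqlContext, counter):
  df = sqlContext.read.parquet(counter)      #one "=" subdirectory
  df.registerTempTable("data")               #Spark 1.6 API
  reportOne = sqlContext.sql("SELECT ...")   #placeholder query
  reportOne.write.partitionBy("id").parquet("/reportOne")  #placeholder path
  reportTwo = sqlContext.sql("SELECT ...")
  reportTwo.write.partitionBy("id").parquet("/reportTwo")

for counter in fileNums:
  process_one_partition(sqlContext, counter)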

Technically this is one large parquet file with partitions, but it's far too big to read in as a whole and run the queries over all of it at once; I need to run the functions on each partition. So I just run a regular Python for loop in PySpark, where each iteration processes one parquet partition (subdirectory) and writes out the relevant summary reports.

I'm also not sure whether wrapping all of this code in one big mapPartition() would work, given the size of the whole file. Would it?

After a couple of loop iterations, though, the script crashes with a memory error, specifically a Java heap space error. (I've checked that there's nothing unusual about the particular files it crashes on; it fails on whichever file happens to be read a few iterations in.)

Caused by: com.google.protobuf.ServiceException:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:244)
at com.sun.proxy.$Proxy9.delete(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:526)
... 42 more
Caused by: java.lang.OutOfMemoryError: Java heap space

I realize Spark isn't meant to be run in a loop like this, but these SQL queries are too complex for the standard packaged Spark SQL functions, and we write out multiple summary reports for each file, on different aggregations.

Is there a way to essentially clear the memory at the end of each loop iteration? Dropping any registered temp tables with sqlContext.dropTempTable() and clearing the cache with sqlContext.clearCache() hasn't helped (a sketch of that cleanup is below). If I instead try to stop the sparkContext and restart it in each loop, I also get errors, because some processes haven't "wrapped up" yet (it seems like you used to be able to stop the context "gracefully", but I couldn't find this in the current PySpark documentation.)
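Concretely, the end-of-iteration cleanup I've tried looks roughly like this (the temp table name "data" is just a placeholder):

for counter in fileNums:
  sparkDataFrame = sqlContext.read.parquet(counter)
  sparkDataFrame.registerTempTable("data")  #placeholder temp table name
  #... run the SQL queries and write the summary reports ...
  sqlContext.dropTempTable("data")          #drop the registered temp table
  sqlContext.clearCache()                   #clear any cached tables/dataframes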

I should also note that I don't call unpersist() on the dataframes in the loop once I'm done with them, but then I also never call persist() on them; I just re-assign the same dataframe names on each iteration (which may be part of the problem).
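For reference, explicitly unpersisting a report after writing it would look like this (query and output path are placeholders); I'm not doing this today, since I never persist them in the first place:

summaryReportOne = sqlContext.sql("SELECT ...")        #placeholder query
summaryReportOne.write.partitionBy("id").parquet("/")  #placeholder output path
summaryReportOne.unpersist()                           #release any cached blocks before re-assigning the name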

I'm working with our engineering team to tweak the memory settings, but we know we've already allocated far more memory than this script should need (and the individual queries run fine outside of a loop).

Any advice of any kind would be helpful, including tools that might suit this use case better than Spark. I'm using Spark version 1.6.1.


Try upgrading to 2.0.

We ran into the same java heap errors with 4G of memory allocated, on 1.6.2.

On 2.0, using SparkSession, the same kind of loop ran through for us without heap problems.
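As a rough sketch (the query and output path are placeholders, and fileNums is the list from the question), the loop on 2.0 looks something like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquetReports").getOrCreate()

for counter in fileNums:
  df = spark.read.parquet(counter)
  df.createOrReplaceTempView("data")           #replaces registerTempTable in 2.0
  report = spark.sql("SELECT ...")             #placeholder query
  report.write.partitionBy("id").parquet("/")  #placeholder output path
  spark.catalog.dropTempView("data")
  spark.catalog.clearCache()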


: unpersist() , SQL- , , ..clearCache() , , . , , , sparkSQL, RDD.

, persist() RDD, Spark, , SQL- .
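A minimal sketch of pairing those calls, with a placeholder query:

result = sqlContext.sql("SELECT ...")  #placeholder query that several reports reuse
result.persist()                       #cache it only because it is reused below
#... write out the reports that reuse `result` ...
result.unpersist()                     #release the cached blocks before the next loop iteration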

