I am trying to figure out how to keep Spark from crashing due to memory problems when I iterate over parquet files and several post-processing functions. Sorry for the wall of text, but it is not one very specific bug (I'm using PySpark). Apologies if this breaks proper question form!
The main pseudocode:
for counter in fileNums:
    sparkDataFrame = sqlContext.read.parquet(counter)
    summaryReportOne = sqlContext.sql("SELECT .....")
    summaryReportOne.write.partitionBy("id").parquet("/")
    summaryReportTwo = sqlContext.sql("SELECT ....")
    summaryReportTwo.write.partitionBy("id").parquet("/")
This code uses Spark SQL queries, so I have not been able to create a wrapper function with all of the SQL queries/functions and pass it to foreach (which cannot take sparkContext or sqlContext as input), as opposed to the standard for loop above.
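To make that concrete, here is a minimal sketch of the foreach version that does not work (the function name processOneFile is made up); it fails because sqlContext/sparkContext only exist on the driver and cannot be referenced inside a function that is shipped to the executors:

def processOneFile(counter):
    # fails at runtime: sqlContext is a driver-only object
    df = sqlContext.read.parquet(counter)
    report = sqlContext.sql("SELECT .....")
    report.write.partitionBy("id").parquet("/")

sc.parallelize(fileNums).foreach(processOneFile)  # raises an error instead of running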
In theory, the dataframe variables are simply overwritten on each pass through the loop, so I assumed the old data would be freed; perhaps Python/PySpark garbage collection never actually releases it.
Or would mapPartitions() be a better way to do this?
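For reference, my understanding of mapPartitions is that it runs a plain Python function over the rows of each partition on the executors, roughly like the sketch below (the column name amount is purely illustrative), so it has the same restriction: no sqlContext calls inside it:

def summarizePartition(rows):
    # plain Python over one partition's rows; no Spark contexts allowed here
    total = 0
    for row in rows:
        total += row.amount  # "amount" is a made-up column for illustration
    yield total

partitionTotals = sparkDataFrame.rdd.mapPartitions(summarizePartition).collect()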
The script runs fine for a while, but eventually dies with the Java error below. (It does not fail on any one particular file; it gets through several before crashing.)
Caused by: com.google.protobuf.ServiceException:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:244)
at com.sun.proxy.$Proxy9.delete(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:526)
... 42 more
Caused by: java.lang.OutOfMemoryError: Java heap space
My guess is that Spark is not actually releasing memory between iterations, e.g. because the SQL queries leave temp tables or cached data behind in Spark SQL that keep accumulating.
Should I be calling sqlContext.dropTempTable() and/or sqlContext.clearCache()? The trouble is that I cannot see anything being explicitly "cached" (perhaps, though, caching happens "automatically" behind the scenes in PySpark).
I also tried calling unpersist() on the dataframes, but since I never call persist() on anything, it made no difference (which I suppose is expected).
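If explicit cleanup is the way to go, I assume it would look roughly like this at the end of each iteration (the temp table name currentFile is made up, since the real table referenced by the "SELECT ....." queries isn't shown above):

for counter in fileNums:
    sparkDataFrame = sqlContext.read.parquet(counter)
    sparkDataFrame.registerTempTable("currentFile")  # hypothetical table name
    summaryReportOne = sqlContext.sql("SELECT ..... FROM currentFile")
    summaryReportOne.write.partitionBy("id").parquet("/")
    # explicit cleanup before moving on to the next file
    sqlContext.dropTempTable("currentFile")
    sqlContext.clearCache()
    sparkDataFrame.unpersist()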
Another option would be to relaunch the script separately for each file (but that feels like a hack).
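If I did go that route, I would probably drive it from a small wrapper like the sketch below (the script name processOneFile.py and its argument handling are made up); each spark-submit gets a fresh JVM and therefore a fresh heap:

import subprocess

for counter in fileNums:  # same list of file numbers as above
    # one short-lived Spark application per file
    subprocess.check_call(["spark-submit", "processOneFile.py", str(counter)])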
Is there a standard or recommended way of doing this kind of loop in Spark? I am on Spark 1.6.1.