I am trying to multiply against a large matrix that is stored in Parquet format, so I am careful not to store the RDD in memory, but I get an OOM error from the Parquet reader:
15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0 (TID 28398, 172.31.34.233): java.lang.OutOfMemoryError: Java heap space
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    ...
In particular, the matrix is a dense 46752-by-54843120 matrix of 32-bit floats stored in Parquet format (each row is about 1.7 GB uncompressed).
The following code loads this matrix as a Spark IndexedRowMatrix and multiplies it by a random vector (each row is stored together with its corresponding string label, and the floats have to be converted to doubles, since IndexedRow only works with doubles):
val rows = { sqlContext.read.parquet(datafname).rdd.map {
  case SQLRow(rowname: String, values: WrappedArray[Float]) =>
    ...
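(For reference, here is a self-contained sketch of roughly what the full code does; zipWithIndex() is only a stand-in for my real string-label-to-index mapping, which I have omitted, and sqlContext/datafname are the same as above.)

import scala.collection.mutable.WrappedArray

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.{Row => SQLRow}

// Load each Parquet row (string label + array of floats) as an IndexedRow.
// zipWithIndex() stands in for the real label-to-index mapping, which is omitted here.
val rows = sqlContext.read.parquet(datafname).rdd.zipWithIndex().map {
  case (SQLRow(rowname: String, values: WrappedArray[Float]), idx) =>
    // IndexedRow only accepts doubles, so widen the 32-bit floats
    IndexedRow(idx, Vectors.dense(values.toArray.map(_.toDouble)))
}

val numRows = 46752L
val numCols = 54843120
val mat = new IndexedRowMatrix(rows, numRows, numCols)

// Multiply by a random numCols-by-1 local matrix, i.e. a random vector
val x = Matrices.rand(numCols, 1, new java.util.Random())
val result = mat.multiply(x)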
At startup I use the following options:
--driver-memory 220G --num-executors 203 --executor-cores 4 --executor-memory 25G --conf spark.storage.memoryFraction=0
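(The same settings expressed as Spark configuration keys, in case that is clearer; this is for illustration only, since driver memory and executor sizing really have to be set via spark-submit before the JVM starts, not from inside a running application.)

import org.apache.spark.SparkConf

// Configuration keys equivalent to the flags above (illustration only)
val conf = new SparkConf()
  .set("spark.driver.memory", "220g")
  .set("spark.executor.instances", "203")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "25g")
  .set("spark.storage.memoryFraction", "0")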
There are 25573 partitions in the Parquet file, so the uncompressed Float values in each partition should be well under 4 GB; I would expect this to mean that the current executor memory is much more than sufficient (and I cannot raise the executor memory setting any further).
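(A quick way to confirm that partition count, assuming the same sqlContext and datafname as above:)

// Count the partitions of the Parquet scan (same sqlContext/datafname as above)
val df = sqlContext.read.parquet(datafname)
println(s"number of partitions: ${df.rdd.partitions.length}")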
Any idea why these OOM errors are happening and how to fix them? As far as I can tell, nothing about this use of the Parquet reader should require it to hold much in memory.
mapreduce bigdata apache-spark parquet