Out of memory when reading Parquet

I am trying to multiply a large matrix that is stored in Parquet format, taking care not to persist the RDD in memory, but I get an OOM error from the Parquet reader:

 15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0 (TID 28398, 172.31.34.233): java.lang.OutOfMemoryError: Java heap space
   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
   at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
   ...

In particular, the matrix is a dense matrix of 46752 by 54843120 32-bit floats, stored in Parquet format (each row is about 1.7 GB uncompressed).

The following code loads this matrix as a Spark IndexedRowMatrix and multiplies it by a random vector (each row is stored together with its string row label, and the floats must be converted to doubles since IndexedRows can only hold doubles):

 import scala.collection.mutable.WrappedArray
 import org.apache.spark.sql.{Row => SQLRow}
 import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
 import breeze.linalg.{DenseVector => BDV}

 // datafname, sqlContext and the indexLUT row-label lookup are defined elsewhere
 val rows = {
   sqlContext.read.parquet(datafname).rdd.map {
     case SQLRow(rowname: String, values: WrappedArray[Float]) =>
       // DenseVectors have to be doubles
       val vector = new DenseVector(values.toArray.map(v => v.toDouble))
       new IndexedRow(indexLUT(rowname), vector)
   }
 }
 val nrows: Long = 46752
 val ncols = 54843120
 val A = new IndexedRowMatrix(rows, nrows, ncols)
 A.rows.unpersist() // doesn't help avoid OOM
 val x = new DenseMatrix(ncols, 1, BDV.rand(ncols).data)
 A.multiply(x).rows.collect
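As a side note, one quick diagnostic I would consider here (my own sketch, not part of the original code) is counting how many rows land in each partition of `rows`, since the memory pressure on each task depends directly on that:

 // Diagnostic sketch: count the IndexedRows per partition to check whether
 // any single partition holds noticeably more rows (and hence bytes) than
 // the average. `rows` is the RDD built above.
 val rowsPerPartition: Array[Int] =
   rows.mapPartitions(it => Iterator(it.size)).collect()

 println(s"partitions: ${rowsPerPartition.length}, " +
   s"max rows in one partition: ${rowsPerPartition.max}")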

At startup, I use the following options:
 --driver-memory 220G --num-executors 203 --executor-cores 4 --executor-memory 25G --conf spark.storage.memoryFraction=0 
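For completeness, a minimal sketch of what an equivalent programmatic configuration might look like (assuming a YARN deployment and Spark 1.5-era settings; note that spark.driver.memory only takes effect if set before the driver JVM starts, so in practice it stays on the command line):

 import org.apache.spark.{SparkConf, SparkContext}

 // Illustrative equivalent of the spark-submit flags above (assumed YARN
 // deployment; spark.storage.memoryFraction is the legacy pre-1.6 setting).
 val conf = new SparkConf()
   .set("spark.executor.instances", "203")
   .set("spark.executor.cores", "4")
   .set("spark.executor.memory", "25g")
   .set("spark.storage.memoryFraction", "0")
 val sc = new SparkContext(conf)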

There are 25573 partitions in the Parquet file, so the uncompressed Float values for each partition should come out to well under 4 GB; I expect this to mean that the current executor memory is much more than sufficient (I cannot raise the executor memory setting).
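For what it is worth, the back-of-the-envelope arithmetic behind that estimate (my own calculation from the figures stated above, assuming rows are spread evenly across partitions) looks like this:

 // Rough per-partition size estimate from the figures above:
 // ~1.7 GB per uncompressed row, 46752 rows, 25573 partitions.
 val bytesPerRow   = 1.7e9
 val nRows         = 46752L
 val nPartitions   = 25573L

 val totalBytes        = bytesPerRow * nRows        // ~79 TB uncompressed
 val bytesPerPartition = totalBytes / nPartitions   // ~3.1 GB per partition

 println(f"~${bytesPerPartition / 1e9}%.1f GB of raw Float data per partition")
 // Note: mapping the Array[Float] to Array[Double] for the DenseVectors
 // doubles the per-element size (4 bytes -> 8 bytes) once rows are in memory.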

Any ideas why these OOM errors are happening and how to fix them? As far as I can tell, there is no reason for the Parquet reader to be keeping much of anything in memory.

mapreduce bigdata apache-spark parquet

No one has answered this question yet.
