"GC upper limit exceeded" in the cache of a large dataset to spark memory (via sparklyr & RStudio)

I am very new to the Big Data technologies I am trying to work with, but so far I have managed to configure sparklyr in RStudio to connect to a standalone Spark cluster. The data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.

However, recently I have been having trouble bringing one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200 GB of RAM) to handle a dataset of its size.

I thought that by limiting the caching to just a few columns of interest I could get around the problem (using the answer code from my previous question here), but that is not the case. What happens is that the Java process on my local machine ramps up to take over all of the local RAM and CPU resources, the whole process freezes, and the cluster's executors keep getting dropped and re-added. Strangely, this happens even when I select only 1 row for caching (which should make this dataset much smaller than other datasets I have cached into Spark memory without any problems).
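
For context, the column-limited caching I am attempting looks roughly like this (a minimal sketch; the keyspace, table, column names, and master URL are placeholders, and the config file is the yml shown further down):

 library(sparklyr)
 library(dplyr)

 # Connect using the yml config file shown further down
 conf <- spark_config(file = "config.yml")
 sc <- spark_connect(master = "spark://<master_ip>:7077", config = conf)

 # Map the Cassandra table without caching it yet (memory = FALSE keeps it lazy)
 big_tbl <- spark_read_source(
   sc,
   name = "big_table",
   source = "org.apache.spark.sql.cassandra",
   options = list(keyspace = "my_keyspace", table = "big_table"),
   memory = FALSE
 )

 # Keep only the columns of interest, then cache just that projection
 big_tbl %>%
   select(col_a, col_b, col_c) %>%
   sdf_register("big_table_subset")

 tbl_cache(sc, "big_table_subset")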

I have looked through the logs, and these seem to be the only informative errors/warnings early on in the process:

 17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
 17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
 ...
 17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
 17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
 17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
 17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running

And after 20 minutes or so, the entire job fails with:

 java.lang.OutOfMemoryError: GC overhead limit exceeded 

I have changed my connection configuration to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and I have seen how to increase memoryOverhead by changing the settings on a YARN cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
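
(For reference, these connection settings can also be adjusted programmatically rather than in the yml file; a minimal sketch, with the master URL as a placeholder:)

 library(sparklyr)

 # Load the yml config, then override/add individual Spark properties
 conf <- spark_config(file = "config.yml")
 conf$spark.executor.heartbeatInterval <- "180s"
 conf$spark.network.timeout <- 300

 sc <- spark_connect(master = "spark://<master_ip>:7077", config = conf)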

In my configuration file, I have experimented by adding each of the following settings one at a time (none of them worked):

 spark.memory.fraction: 0.3
 spark.executor.extraJavaOptions: '-Xmx24g'
 spark.driver.memory: "64G"
 spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
 spark.driver.extraJavaOptions: '-XX:+UseG1GC'

UPDATE: my full current yml configuration file is as follows:

 default:
   # local settings
   sparklyr.sanitize.column.names: TRUE
   sparklyr.cores.local: 3
   sparklyr.shell.driver-memory: "8G"
   # remote core/memory settings
   spark.executor.memory: "32G"
   spark.executor.cores: 5
   spark.executor.heartbeatInterval: '180s'
   spark.ext.h2o.nthreads: 10
   spark.cores.max: 30
   spark.memory.storageFraction: 0.6
   spark.memory.fraction: 0.3
   spark.network.timeout: 300
   spark.driver.extraJavaOptions: '-XX:+UseG1GC'
   # other configs for spark
   spark.serializer: org.apache.spark.serializer.KryoSerializer
   spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
   # cassandra settings
   spark.cassandra.connection.host: <cassandra_ip>
   spark.cassandra.auth.username: <cassandra_login>
   spark.cassandra.auth.password: <cassandra_pass>
   spark.cassandra.connection.keep_alive_ms: 60000
   # spark packages to load
   sparklyr.defaultPackages:
   - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
   - "com.databricks:spark-csv_2.11:1.3.0"
   - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
   - "com.amazonaws:aws-java-sdk-pom:1.10.34"

So my questions are:

  • Does anyone have any ideas on what to do in this case?
  • Can I change the settings to help with this problem?
  • Alternatively, is there a way to import the Cassandra data in batches, with RStudio/sparklyr as the driver?
  • Or, alternatively, is there a way to munge/filter/edit the data as it is being cached, so that the resulting table is smaller (similar to using SQL queries, but with the more complex dplyr syntax)? A rough sketch of what I mean follows this list.
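
This is the kind of filter-while-caching I have in mind, reusing the lazily mapped big_tbl handle from the sketch above (the column names and cutoff value are made up):

 # Filter and project on the Cassandra-backed table first, then materialise
 # only the reduced result in Spark memory via compute()
 small_tbl <- big_tbl %>%
   filter(event_date >= "2017-01-01") %>%   # hypothetical filter
   select(col_a, col_b) %>%
   compute("big_table_filtered")            # stores the result as a cached temporary table
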
r cassandra apache-spark sparklyr
1 answer

Well, I finally managed to make this work!

At first I tried @user6910411's suggestion to reduce the Cassandra input split size, but that failed in the same way. After playing around with other things, today I tried changing that setting in the opposite direction:

 spark.cassandra.input.split.size_in_mb: 254 

INCREASING the split size meant fewer Spark tasks, and therefore less overhead and fewer calls to the GC. It worked!
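
For anyone else connecting through sparklyr, this just goes into the same yml config file alongside the other spark.cassandra.* settings (a minimal excerpt, everything else omitted):

 default:
   # larger input splits -> fewer Spark tasks -> less scheduling overhead and GC pressure
   spark.cassandra.connection.host: <cassandra_ip>
   spark.cassandra.input.split.size_in_mb: 254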

