I am experimenting with loading a dataset that is larger than the memory available to H2O.
The H2O blog mentions:
"A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, ie, you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression."
Here is the R code I used to connect to H2O 3.6.0.8:
library(h2o)
h2o.init(max_mem_size = '60m')
This gives:
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

.Successfully connected to http://127.0.0.1:54321/

R is connected to the H2O cluster:
    H2O cluster uptime:         2 seconds 561 milliseconds
    H2O cluster version:        3.6.0.8
    H2O cluster name:           H2O_started_from_R_RILITS-HWLTP_tkn816
    H2O cluster total nodes:    1
    H2O cluster total memory:   0.06 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  2
    H2O cluster healthy:        TRUE

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

IP Address: 127.0.0.1
Port      : 54321
Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75
Key Count : 3
I then tried to load a 169 MB CSV file into H2O:
dat.hex <- h2o.importFile('dat.csv')
This failed with the following error:
Error in .h2o.__checkConnectionHealth() : H2O connection has been severed. Cannot connect to instance at http:
This suggests that the H2O instance ran out of memory.
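For scale, here is a rough sketch in base R of how the raw file size compares with the heap requested above. `heap_bytes` and `data_to_heap_ratio` are hypothetical helpers written for this illustration, not part of the h2o package, and the sketch assumes that "169 MB" means 169 * 1024^2 bytes. It only compares raw sizes; it says nothing about how much memory H2O actually needs while parsing.

```r
# Hypothetical helper (not part of the h2o package): convert a JVM-style
# size string such as "60m" or "2g" into a number of bytes.
heap_bytes <- function(size) {
  units <- c(k = 1024, m = 1024^2, g = 1024^3)
  unit  <- tolower(substr(size, nchar(size), nchar(size)))
  value <- as.numeric(substr(size, 1, nchar(size) - 1))
  stopifnot(unit %in% names(units), !is.na(value))
  value * units[[unit]]
}

# Ratio of raw data size to heap size. A value above 1 means the raw
# file alone is larger than the heap.
data_to_heap_ratio <- function(data_bytes, max_mem_size) {
  data_bytes / heap_bytes(max_mem_size)
}

# Figures from this question: a 169 MB CSV and max_mem_size = '60m'.
# Assumption: 169 MB is taken as 169 * 1024^2 bytes.
csv_bytes <- 169 * 1024^2
ratio <- data_to_heap_ratio(csv_bytes, "60m")
print(round(ratio, 2))
```

With a real file, `file.size("dat.csv")` can be passed as `data_bytes`. Here the file is nearly three times the size of the heap, which is the situation the blog post describes.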
Question: If H2O promises to load a dataset larger than its memory (via the swap-to-disk mechanism described in the blog quote above), is this the right way to load the data?
java garbage-collection r out-of-memory h2o
talegari