Loading data larger than memory size in h2o

I am experimenting with loading data larger than the memory size in h2o.

The H2O blog mentions: A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.

Here is the R code for connecting to h2o 3.6.0.8:

 h2o.init(max_mem_size = '60m')  # allotting 60 MB to h2o; R is running on an 8 GB RAM machine

gives

 java version "1.8.0_65"
 Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
 Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
 
 Successfully connected to http://127.0.0.1:54321/
 
 R is connected to the H2O cluster:
     H2O cluster uptime:        2 seconds 561 milliseconds
     H2O cluster version:       3.6.0.8
     H2O cluster name:          H2O_started_from_R_RILITS-HWLTP_tkn816
     H2O cluster total nodes:   1
     H2O cluster total memory:  0.06 GB
     H2O cluster total cores:   4
     H2O cluster allowed cores: 2
     H2O cluster healthy:       TRUE
 
 Note: As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)
 
 IP Address: 127.0.0.1
 Port      : 54321
 Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75
 Key Count : 3

I then tried to load a 169 MB CSV into h2o:

 dat.hex <- h2o.importFile('dat.csv') 

which threw an error:

 Error in .h2o.__checkConnectionHealth() :
   H2O connection has been severed. Cannot connect to instance at http://127.0.0.1:54321/
   Failed to connect to 127.0.0.1 port 54321: Connection refused

which suggests the H2O instance ran out of memory and died.

Question: If H2O promises to load a dataset bigger than its memory size (the swap-to-disk mechanism mentioned in the blog above), is this the right way to load the data?

java garbage-collection r out-of-memory h2o
1 answer

Swap-to-disk was disabled by default a while ago, because the performance was so bad. The bleeding edge (not the latest stable release) has a flag to enable it: "--cleaner" (for "memory cleaner").
Note that your cluster has an EXTREMELY tiny memory: H2O cluster total memory: 0.06 GB. That's 60 MB! Barely enough to start a JVM with, much less run H2O. I would be surprised if H2O could come up properly at all with that, never mind swap-to-disk. Swapping is limited to swapping the data alone. If you are trying to run a swap test, raise the JVM heap to 1 or 2 GB, and then load datasets that together sum to more than that.
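
To illustrate the suggested swap test, a minimal R sketch might look like this (the 2 GB heap size and the file names are placeholders; any set of files summing to more than the heap will do):

 library(h2o)
 
 # Give the JVM a workable heap (1-2 GB) instead of 60 MB
 h2o.init(max_mem_size = '2G', nthreads = -1)
 
 # Load datasets that together exceed the 2 GB heap,
 # e.g. several large CSVs (placeholder file names)
 part1.hex <- h2o.importFile('big_part1.csv')
 part2.hex <- h2o.importFile('big_part2.csv')
 part3.hex <- h2o.importFile('big_part3.csv')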

Cliff

