R and shared memory for parallel::mclapply

I am trying to make use of a quad-core machine by parallelizing an expensive operation that runs on a list of about 1000 items.

I am currently using R's parallel::mclapply function:

res = rbind.fill(parallel::mclapply(lst, fun, mc.cores=3, mc.preschedule=T)) 

This works, but the problem is that each additional subprocess that is spawned has to allocate a large chunk of memory:

(screenshot: each subprocess showing a large memory allocation of its own)

Ideally, I would like each core to access shared memory from the parent R process, so that increasing the number of cores used in mclapply does not also push me into RAM limits.

I do not currently understand how to debug this problem. All the large data structures accessed by each process are global (currently). Is this somehow a problem?

I increased the OS's maximum shared-memory settings to 20 GB (the available RAM):

    $ cat /etc/sysctl.conf
    kern.sysv.shmmax=21474836480
    kern.sysv.shmall=5242880
    kern.sysv.shmmin=1
    kern.sysv.shmmni=32
    kern.sysv.shmseg=8
    kern.maxprocperuid=512
    kern.maxproc=2048

I thought this would fix things, but the problem still occurs.

Any other ideas?

+7
3 answers

Linux and macOS have a copy-on-write mechanism when forking, which means that memory pages are not actually copied but shared until the first write. mclapply is based on fork(), so probably (unless you write to your large shared data) the memory you see reported for each subprocess in the process list is not memory that is actually allocated separately.
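A quick way to check this (a sketch, assuming a Linux box where the free command is available; on macOS vm_stat gives similar numbers): per-process figures count the shared copy-on-write pages once per child, so look at system-wide usage while the workers run instead:

    system("free -m")   # system-wide usage before starting the workers
    res <- parallel::mclapply(lst, fun, mc.cores = 3, mc.preschedule = TRUE)
    # while this runs, `free -m` in another terminal shows the real "used" memory;
    # if pages are shared it should grow far less than 3 x the parent's reported size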

But when collecting the results, the master process will have to allocate memory for each returned mclapply result.
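For example (a rough sketch, not your actual fun: expensive_computation and the returned columns are made up), returning only a compact summary from each call keeps the memory the master has to allocate for the collected results small:

    library(plyr)      # rbind.fill
    library(parallel)

    fun_small <- function(item) {
      out <- expensive_computation(item)           # hypothetical expensive step
      data.frame(id = item$id, score = mean(out))  # return only what you actually need
    }

    res <- rbind.fill(mclapply(lst, fun_small, mc.cores = 3, mc.preschedule = TRUE))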

To help you further, we will need to learn more about your fun function.

+4

I too would have thought this would not use additional memory, because of copy-on-write. I am guessing that the items in the list are large? Perhaps when R passes each item to fun() it actually makes a copy of the list item instead of relying on copy-on-write. If so, the following may work better:

    fun <- function(itemNumber){
      myitem <- lst[[itemNumber]]
      # now do your computations
    }
    res = rbind.fill(parallel::mclapply(1:length(lst), fun, mc.cores=3, mc.preschedule=T))

Or use lst[[itemNumber]] directly inside your function. If R / Linux / macOS is not smart enough to use copy-on-write with the way you wrote your original function, it may be with this modified approach.
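For example (a sketch of that variant; some_computation is just a placeholder), the workers only ever read lst through the fork:

    fun <- function(itemNumber) {
      # read lst[[itemNumber]] in place, without copying it into a local variable first
      some_computation(lst[[itemNumber]])
    }
    res <- rbind.fill(parallel::mclapply(seq_along(lst), fun, mc.cores = 3, mc.preschedule = TRUE))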

Edit: I am assuming that you are not modifying the items in the list. If you do, R will make copies of the data.

+1

A pointer to what may be going on, from R-devel Digest, Volume 149, Issue 22.

Radford Neal's response, July 26, 2015:

When mclapply forks to start a new process, the memory is initially shared with the parent process. However, a memory page has to be copied whenever either process writes to it. Unfortunately, R's garbage collector writes to each object to mark and unmark it whenever a full garbage collection is done, so it is quite possible that every R object will be duplicated in each process, even though many of them are never actually changed (from the point of view of the R programs).
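A small sketch to see this effect (assuming Linux with free available; big just stands in for your global data): forcing a full garbage collection inside a forked child writes to every object's header, so the shared pages get copied and system-wide used memory jumps even though the child never touches big itself:

    library(parallel)

    big <- as.list(runif(5e6))   # a few hundred MB of many small objects held by the parent

    system("free -m")            # baseline before forking
    invisible(mclapply(1:2, function(i) {
      gc()                       # a full GC in the child marks/unmarks every object
      system("free -m")          # printed from inside the child, after the GC
      NULL
    }, mc.cores = 2))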

+1
