Lazy sequences in Clojure's math.combinatorics result in an OutOfMemoryError (OOM)

The math.combinatorics documentation states that all functions return lazy sequences.

However, if I run subsets on a larger input,

(last (combinatorics/subsets (range 20))) ;OutOfMemoryError Java heap space clojure.lang.RT.cons (RT.java:559) 

I get an OutOfMemory error.

Running

 (last (range)) 

spins the CPU indefinitely, but it does not throw an error.

Clojure doesn't seem to be holding onto the head of the sequence, as described in this Stack Overflow question.
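For reference, here is a minimal illustration of head retention (my own example, not from the linked question):

    ;; last only needs the remaining tail, so already-realized elements of
    ;; (range) can be garbage collected -- this spins forever in constant memory:
    (last (range))

    ;; Keeping another reference to the sequence retains its head, so every
    ;; realized element stays reachable and the heap eventually fills up:
    (let [r (range)]
      [(last r) (first r)])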

Why is this happening and how can I use large ranges in subsets?

Update

It seems to work on some people's machines, as the comments show, so I'll post my system configuration.

I'm running OS X (10.8.3) and installed Clojure (1.5.1) with Homebrew.

My Java version:

    % java -version
    java version "1.6.0_45"
    Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
    Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)

I have not changed any default settings. I also reinstalled all the dependencies by deleting ~/.m2 .

My project.clj .

And I used this command

    % lein repl
    nREPL server started on port 61774
    REPL-y 0.1.10
    Clojure 1.5.1
    => (require 'clojure.math.combinatorics)
    nil
    => (last (clojure.math.combinatorics/subsets (range 20)))
    OutOfMemoryError Java heap space  clojure.lang.RT.cons (RT.java:570)

or

    OutOfMemoryError Java heap space  clojure.math.combinatorics/index-combinations/fn--1148/step--1164 (combinatorics.clj:64)

I also tested this on a colleague's laptop and it had the same problem, but that was a Mac too.

+7
5 answers

The problem is that subsets uses mapcat, and mapcat is not lazy enough, because it uses apply, which realizes and holds onto some of the elements that will be concatenated. See a very good explanation here. Using the lazier mapcat version from that link in subsets should fix the problem:

    (defn my-mapcat [f coll]
      (lazy-seq
        (if (not-empty coll)
          (concat (f (first coll))
                  (my-mapcat f (rest coll))))))

    (defn subsets
      "All the subsets of items"
      [items]
      (my-mapcat (fn [n] (clojure.math.combinatorics/combinations items n))
                 (range (inc (count items)))))

    (last (subsets (range 50))) ;; this will take hours to compute, good luck with it!
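For a quicker sanity check than (range 50), walking a smaller input shows the difference (my own REPL check, assuming the definitions above):

    ;; With the lazier my-mapcat, elements are realized strictly one at a time:
    (first (subsets (range 20)))        ;; => () -- returns immediately
    (nth (subsets (range 20)) 100000)   ;; realizes only the subsets it walks past
    ;; (last (subsets (range 20))) still has to walk all 2^20 subsets,
    ;; but it should now do so without exhausting the heap.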
+3

Do you want to compute the power set of a set with 1000 elements? You know it will have 2^1000 elements, right? That is so large that I can't even find a good way to describe how huge it is. If you try to work with such a set, even lazily, your problem will not be memory: it will be computation time. Say you have a supercomputer with infinite memory, capable of processing a trillion items per nanosecond: that is 10^21 items processed per second, or about 10^29 items per year. Even this supercomputer would take far longer than the lifetime of the universe to walk through the elements of (subsets (range 1000)).
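To put a rough number on it, a quick REPL check (my own, using plain Java BigInteger arithmetic):

    ;; 2^1000 written out in decimal has 302 digits:
    (count (str (.pow (biginteger 2) 1000)))  ;; => 302
    ;; For comparison, the observable universe contains roughly 10^80 atoms.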

So I would say: stop worrying about memory usage for this collection, and work on an algorithm that does not involve walking sequences with more elements than there are atoms in the universe.

+3

The problem is not related to apply, nor to concat, nor to mapcat.

dAni's answer, which reimplements mapcat, does accidentally fix the problem, but the reasoning behind it is incorrect. In addition, his answer points to an article where the author says: "I believe the problem is with apply." This is clearly wrong, as I will discuss below. Finally, the problem is not related to this other question, where non-lazy evaluation really is caused by apply.

If you look closely, both dAni and the author of this article implement mapcat without using the map function. In the following example, I will show that the problem is related to how the map function is implemented.

To demonstrate that the problem is not related to either apply or concat, see the following mapcat implementation. It uses both concat and apply, but it achieves complete laziness:

    (defn map
      ([f coll]
       (lazy-seq
         (when-let [s (seq coll)]
           (cons (f (first s)) (map f (rest s)))))))

    (defn mapcat [f & colls]
      (apply concat (apply map f colls)))

    (defn range-announce! [x]
      (do (println "Returning sequence of" x)
          (range x)))

    ;; new fully lazy implementation prints only 5 lines
    (nth (mapcat range-announce! (range)) 5)

    ;; clojure.core version still prints 32 lines
    (nth (clojure.core/mapcat range-announce! (range)) 5)

Complete laziness in the above code is achieved by redefining the map function. In fact, mapcat is implemented exactly the same way as in clojure.core, but it works completely lazily. The above map implementation is slightly simplified for the sake of the example, since it supports only one collection argument, but even the full variadic implementation behaves the same way: completely lazily. So we have shown that the problem is neither with apply nor with concat. We have also shown that the real problem must be related to how the map function is implemented in clojure.core. Let's take a look at it:

    (defn map
      ([f coll]
       (lazy-seq
         (when-let [s (seq coll)]
           (if (chunked-seq? s)
             (let [c (chunk-first s)
                   size (int (count c))
                   b (chunk-buffer size)]
               (dotimes [i size]
                 (chunk-append b (f (.nth c i))))
               (chunk-cons (chunk b) (map f (chunk-rest s))))
             (cons (f (first s)) (map f (rest s))))))))

You can see that the clojure.core implementation is exactly the same as our "simplified" version above, except for the true branch of the if (chunked-seq? s). Essentially, clojure.core/map has a special case for handling input sequences that are chunked sequences.

Chunked sequences compromise laziness by evaluating elements in chunks of 32 instead of strictly one at a time. This becomes painfully apparent when evaluating deeply nested chunked sequences, as in the case of subsets. Chunked sequences were introduced in Clojure 1.1, and many core functions were updated to recognize and handle them differently, including map. The main goal of introducing them was to improve performance in certain stream-processing scenarios, but they arguably make it significantly harder to reason about the laziness characteristics of a program. You can read up on chunked sequences here and here. Also check this question here.

The real problem is that range returns a chunked seq and is used inside subsets. The fix recommended by David James patches subsets to unchunk the sequence created by range internally.
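The usual way to do that is with a small helper that rebuilds the sequence one cons cell at a time. This is only a sketch of the idea (the actual patch may differ), reusing range-announce! from above to show the effect:

    ;; Wraps a (possibly chunked) sequence in a plain lazy seq that yields
    ;; elements strictly one at a time:
    (defn unchunk [s]
      (lazy-seq
        (when-let [s (seq s)]
          (cons (first s) (unchunk (rest s))))))

    ;; Chunked input: realizing element 1 forces a whole chunk of 32 calls
    (nth (clojure.core/map range-announce! (range)) 1)            ;; prints 32 lines
    ;; Unchunked input: only elements 0 and 1 are realized
    (nth (clojure.core/map range-announce! (unchunk (range))) 1)  ;; prints 2 lines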

+2

This issue was raised on the project tracker: Clojure JIRA: OutOfMemoryError with combinatorics/subsets. There you can find a patch from Andy Fingerhut. It worked for me. Note that the patch is different from the mapcat variant suggested in another answer.

+1

In the absence of command-line arguments, the JVM's startup heap size parameters are determined by ergonomics.

By default (JDK 6) these are:

    initial heap size: memory / 64
    maximum heap size: min(memory / 4, 1 GB)

but you can force absolute values with the -Xmx and -Xms arguments. You can find more details here.
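Since you are starting the REPL through Leiningen, the simplest place to set these flags is :jvm-opts in project.clj. The project skeleton and values below are just an illustration, not taken from your configuration:

    ;; Hypothetical project.clj -- names and versions are placeholders
    (defproject combinatorics-demo "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.5.1"]
                     [org.clojure/math.combinatorics "0.0.4"]]
      ;; start with a 512 MB heap and allow it to grow to 2 GB
      :jvm-opts ["-Xms512m" "-Xmx2g"])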

0
