Fast way to count the number of pairs in Clojure 1.4

I need to count the number of times that certain pairs of integers occur in some process (heuristic detection of similar images using local sensitive hashing integers is an image, and the โ€œneighborsโ€ below are images with the same hash value, so the count indicates how many different hashes connect this pair of images).

The bills are stored as a card from a (ordered) pair for counting ( matches below).

The input ( nbrs-list below) is a list of a list of integers that are considered neighbors, where each individual (ordered) pair in the internal list ("neighbors") should be considered. So, for example, if nbrs-list is [[1,2,3],[10,8,9]] , then the pairs [1,2],[1,3],[2,3],[8,9],[8,10],[9,10] .

The routine collect is called several times; the matches parameter accumulates results, and nbrs-list is new for each call. The smallest number of neighbors (the size of the internal list) is 1 and the largest ~ 1000. Each integer in nbrs-list occurs only once during any call to collect (this means that no call occurs more than once with each call), and integers cover the range 0-1000000 (therefore, the size of nbrs-list is less than 1,000,000 - since each value occurs only once, and sometimes they occur in groups, but usually more than 100,000, since most images do not have neighbors).

I pulled the routines from a larger piece of code, so they may contain small editing errors, sorry.

 (defn- flatten-1 [list] (apply concat list)) (defn- pairs ([nbrs] (let [nbrs (seq nbrs)] (if nbrs (pairs (rest nbrs) (first nbrs) (rest nbrs))))) ([nbrs a bs] (lazy-seq (let [bs (seq bs)] (if bs (let [b (first bs)] (cons (if (> ab) [[ba] 1] [[ab] 1]) (pairs nbrs a (rest bs)))) (pairs nbrs)))))) (defn- pairs-map [nbrs] (println (count nbrs)) (let [pairs-list (flatten-1 (pairs nbrs))] (apply hash-map pairs-list))) (defn- collect [matches nbrs-list] (let [to-add (map pairs-map nbrs-list)] (merge-with + matches (apply (partial merge-with +) to-add)))) 

Thus, the above code expands each set of neighbors to ordered pairs; creates a card from pairs to 1 ; then merges the maps using adding values.

I would like it to work faster. I donโ€™t see how to avoid decomposing O (n ^ 2) pairs, but I believe that at least I will reduce the overhead by avoiding so many intermediate structures. At the same time, I would like the code to be quite compact and readable ...

Oh, and now I have exceeded the "upper limit of GC". Thus, reducing memory usage / outflow is also a priority: o)

[Maybe this is too specific? I hope that the lessons are general and that there are not many messages about optimizing the "real" clojure code. Also, I can profile, etc., but my code seems so ugly. I hope for an obvious, cleaner approach - especially for expanding pairs.]

+3
source share
4 answers

The above suggestions seemed to help, but were insufficient. I finally got decent performance with:

  • packing pairs into a long value. this works because instances of MAX_LONG > 1e12 and long comparable (they work just as well as hash keys, unlike long[2] ). this significantly reduced memory usage compared to [n1 n2] .

  • using the primitive type of the hash map of the primitive type and mutating it.

  • processing code pairs with nested doseq loops (or nested for loops using immutable data structures).

  • improves my location-sensitive hash. a significant part of the problem was that it was too weak, so finding too many neighbors - if the neighbors from a million images are very limited, then you will get a million million pairs that consume too much memory ...

The inner loop now looks like this:

 (doseq [i (range 1 (count nbrs))] (doseq [j (range i)] (let [pair (pack size (nth nbrs i) (nth nbrs j))] (if-let [value (.get matches pair)] (.put matches pair (byte (inc value))) (.put matches pair (byte 0)))))) 

Where

 (defn- pack [size n1 n2] (+ (* size n1) n2)) (defn- unpack [size pair] [(int (/ pair size)) (mod pair size)]) 
0
source

I think you need the frequency with which each pair occurs?

Try the frequencies function. It uses transients under the hood, which should avoid the GC overhead.

+3
source

(Hope I didnโ€™t get your question wrong)

If you just want to count pairs in lists like this, then [[1,2,3], [8,9,10]]

 (defn count-nbr-pairs [n] (/ (* n (dec n)) 2)) (defn count-pairs [nbrs-list] (apply + (map #(count-nbr-pairs (count %)) nbrs-list))) (count-pairs [[1 2 3 4] [8 9 10]]) ; => 9 

This, of course, assumes that you do not need to remove duplicate pairs.

+1
source
 =>(require '[clojure.math.combinatorics :as c]) =>(map #(c/combinations % 2) [[1 2 3] [8 9 10]]) (((1 2) (1 3) (2 3)) ((8 9) (8 10) (9 10))) 

This is a pretty small library, look at the source

The performance you are looking at is the following number for your use of 1K unique values โ€‹โ€‹under 1M

 => (time (doseq [i (c/combinations (take 1000 (shuffle (range 1000000))) 2)])) "Elapsed time: 270.99675 msecs" 

This includes creating a target set that takes about 100 ms on my machine.

+1
source

All Articles