I need to count the number of times that certain pairs of integers occur in some process (heuristic detection of similar images using local sensitive hashing integers is an image, and the โneighborsโ below are images with the same hash value, so the count indicates how many different hashes connect this pair of images).
The bills are stored as a card from a (ordered) pair for counting ( matches below).
The input ( nbrs-list below) is a list of a list of integers that are considered neighbors, where each individual (ordered) pair in the internal list ("neighbors") should be considered. So, for example, if nbrs-list is [[1,2,3],[10,8,9]] , then the pairs [1,2],[1,3],[2,3],[8,9],[8,10],[9,10] .
The routine collect is called several times; the matches parameter accumulates results, and nbrs-list is new for each call. The smallest number of neighbors (the size of the internal list) is 1 and the largest ~ 1000. Each integer in nbrs-list occurs only once during any call to collect (this means that no call occurs more than once with each call), and integers cover the range 0-1000000 (therefore, the size of nbrs-list is less than 1,000,000 - since each value occurs only once, and sometimes they occur in groups, but usually more than 100,000, since most images do not have neighbors).
I pulled the routines from a larger piece of code, so they may contain small editing errors, sorry.
(defn- flatten-1 [list] (apply concat list)) (defn- pairs ([nbrs] (let [nbrs (seq nbrs)] (if nbrs (pairs (rest nbrs) (first nbrs) (rest nbrs))))) ([nbrs a bs] (lazy-seq (let [bs (seq bs)] (if bs (let [b (first bs)] (cons (if (> ab) [[ba] 1] [[ab] 1]) (pairs nbrs a (rest bs)))) (pairs nbrs)))))) (defn- pairs-map [nbrs] (println (count nbrs)) (let [pairs-list (flatten-1 (pairs nbrs))] (apply hash-map pairs-list))) (defn- collect [matches nbrs-list] (let [to-add (map pairs-map nbrs-list)] (merge-with + matches (apply (partial merge-with +) to-add))))
Thus, the above code expands each set of neighbors to ordered pairs; creates a card from pairs to 1 ; then merges the maps using adding values.
I would like it to work faster. I donโt see how to avoid decomposing O (n ^ 2) pairs, but I believe that at least I will reduce the overhead by avoiding so many intermediate structures. At the same time, I would like the code to be quite compact and readable ...
Oh, and now I have exceeded the "upper limit of GC". Thus, reducing memory usage / outflow is also a priority: o)
[Maybe this is too specific? I hope that the lessons are general and that there are not many messages about optimizing the "real" clojure code. Also, I can profile, etc., but my code seems so ugly. I hope for an obvious, cleaner approach - especially for expanding pairs.]