I am new to Clojure and I have some code that I am trying to optimize. I want to compute co-occurrence counts. The main function is compute-space, and its output is a nested map of the form
{"w1" {"w11" 10, "w12" 31, ...}
 "w2" {"w21" 14, "w22" 1, ...}
 ...
}
This means that "w1" co-occurs with "w11" 10 times, and so on.
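To make the shape of that map concrete, here is a minimal sketch of how a single co-occurrence could be recorded in a plain (non-atom) nested map. The helper name `add-cooccurrence` is hypothetical and is not part of the code further down; it only illustrates the data structure.

```clojure
;; Hypothetical helper (not in the original code): record one co-occurrence
;; of `target` with context word `w` in a nested map of counts.
(defn add-cooccurrence
  [space target w]
  ;; (fnil inc 0) treats a missing count as 0 before incrementing,
  ;; and update-in creates the inner map when the target is new.
  (update-in space [target w] (fnil inc 0)))

(def example-space
  (-> {}
      (add-cooccurrence "w1" "w11")
      (add-cooccurrence "w1" "w11")
      (add-cooccurrence "w2" "w21")))
;; example-space is {"w1" {"w11" 2}, "w2" {"w21" 1}}
```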
compute-space takes a collection of documents (sentences) and a collection of target words, iterates over both, and applies a context-fn, such as a sliding window, to extract the context words. More specifically, I pass in a closure over sliding-window:
(compute-space docs (fn [target doc] (sliding-window target doc 5)) targets)
I tested it with approximately 50 million words (~3 million sentences) and approximately 20,000 targets. This version takes more than a day to complete. I also wrote a parallel version using pmap (pcompute-space), which reduces the computation time to about 10 hours, but I still feel that it should be faster. I have no other code to compare against; this is only my intuition.
(defn compute-space
  ([docs context-fn targets]
   (let [space (atom {})]
     ;; visit every (doc, target) pair
     (doseq [doc docs
             target targets]
       (when-let [contexts (context-fn target doc)]
         (doseq [w contexts]
           ;; increment an existing count, or start it at 1
           (if (get-in @space [target w])
             (swap! space update-in [target w] inc)
             (swap! space assoc-in [target w] 1)))))
     @space)))
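(defn sliding-window
  "Returns up to n words before and after each occurrence of target in s."
  [target s n]
  (loop [todo s seen [] acc []]
    (let [curr (first todo)]
      (cond
        ;; check for the end of input first, so a nil target cannot loop forever
        (empty? todo) acc
        ;; on a hit, collect up to n words on each side
        (= curr target) (recur (rest todo)
                               (cons curr seen)
                               (concat acc (take n seen) (take n (rest todo))))
        :else (recur (rest todo) (cons curr seen) acc)))))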
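As a small worked example of what the window returns, here is a toy call on a five-word sentence. The definition of sliding-window is repeated from above only so that the snippet runs on its own; the sentence and window size are made up for illustration.

```clojure
;; Same logic as sliding-window above, repeated so this snippet is standalone.
(defn sliding-window
  [target s n]
  (loop [todo s seen [] acc []]
    (let [curr (first todo)]
      (cond (empty? todo) acc
            (= curr target) (recur (rest todo)
                                   (cons curr seen)
                                   (concat acc (take n seen) (take n (rest todo))))
            :else (recur (rest todo) (cons curr seen) acc)))))

;; Toy sentence, window of 2 words on each side of "c".
(def toy-contexts (sliding-window "c" ["a" "b" "c" "d" "e"] 2))
;; toy-contexts is ("b" "a" "d" "e"): the left context comes out
;; nearest-first, because `seen` is built with cons.

;; A target that does not occur yields an empty (but truthy) vector.
(def no-contexts (sliding-window "z" ["a" "b" "c" "d" "e"] 2))
```

Since an absent target gives an empty vector rather than nil, the when-let in compute-space still enters its body for such pairs, and the inner doseq simply has nothing to do.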
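(defn pcompute-space
  [docs step context-fn targets]
  (reduce
   ;; assumed merge step: sum the nested counts of the per-chunk maps
   (partial merge-with (partial merge-with +))
   (pmap
    (fn [chunk]
      (tick) ; tick is defined elsewhere (not shown)
      (compute-space chunk context-fn targets))
    (partition-all step docs))))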
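Each chunk processed by pmap yields its own nested map, so the per-chunk maps have to be combined at the end. A minimal sketch of one way to do that, assuming counts for the same word pair should simply be added; `merge-spaces`, `chunk-1` and `chunk-2` are hypothetical names used only for this illustration:

```clojure
;; Hypothetical helper (not in the original code): merge two nested
;; count maps, summing counts for word pairs that appear in both.
(defn merge-spaces
  [a b]
  (merge-with (partial merge-with +) a b))

;; Two made-up per-chunk results.
(def chunk-1 {"w1" {"w11" 2, "w12" 1}})
(def chunk-2 {"w1" {"w11" 3}, "w2" {"w21" 4}})

(def merged (reduce merge-spaces {} [chunk-1 chunk-2]))
;; merged is {"w1" {"w11" 5, "w12" 1}, "w2" {"w21" 4}}
```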
Profiling with jvisualvm, the classes that stand out are clojure.lang.Cons, clojure.lang.ChunkedCons and clojure.lang.ArrayChunk.

SPECS
MacPro 2010, 2.4 GHz Intel Core 2 Duo, 4 GB
Clojure 1.6.0
Java 1.7.0_51, Java HotSpot(TM) 64-Bit Server VM