Clojure lazy-seq performance optimization

I am new to Clojure and I have some code that I am trying to optimize. I want to compute co-occurrence counts. The main function is compute-space, and the output is a nested map of the form

{"w1" {"w11" 10, "w12" 31, ...}
 "w2" {"w21" 14, "w22" 1,  ...}
 ... 
 }

This means that "w1" co-occurs with "w11" 10 times, and so on.

It takes a collection of documents (sentences) and a collection of target words, iterates over both, and finally applies context-fn, such as a sliding window, to extract the context words. More concretely, I pass in a closure over sliding-window:

(compute-space docs (fn [target doc] (sliding-window target doc 5)) targets)

I tested it with approximately 50 million words (~3 million sentences) and approximately 20,000 targets. This version takes more than a day to complete. I also wrote a parallel version using pmap (pcompute-space), which brings the computation time down to about 10 hours. I have no other code to compare against, but my intuition says it should be faster.

(defn compute-space 
  ([docs context-fn targets]
    (let [space (atom {})]
      (doseq [doc docs
              target targets]
        (when-let [contexts (context-fn target doc)]
          (doseq [w contexts]
            (if (get-in @space [target w])
              (swap! space update-in [target w] (partial inc))
              (swap! space assoc-in  [target w] 1)))))
     @space)))
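The inner loop above looks each count up with get-in and then calls one of two swap! forms. As a minimal sketch (not the code from the question), the same counting can be written as a pure reduce with fnil, so that a missing count starts at 0. The function name count-pairs and its input shape, a sequence of [target context-word] vectors, are assumptions made for illustration.

```clojure
;; Sketch only: count-pairs and its input shape are hypothetical.
(defn count-pairs
  "Counts [target context-word] pairs into a nested map."
  [pairs]
  (reduce (fn [space [target w]]
            ;; (fnil inc 0) treats a missing count as 0 before incrementing,
            ;; so no separate get-in check is needed.
            (update-in space [target w] (fnil inc 0)))
          {}
          pairs))

(def example-space
  (count-pairs [["w1" "w11"] ["w1" "w11"] ["w1" "w12"] ["w2" "w21"]]))
;; => {"w1" {"w11" 2, "w12" 1}, "w2" {"w21" 1}}
```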

(defn sliding-window
  [target s n]
  (loop [todo s seen [] acc []]
    (let [curr (first todo)]
      (cond (= curr target) (recur (rest todo) (cons curr seen) (concat acc (take n seen) (take n (rest todo))))
            (empty? todo) acc
            :else (recur (rest todo) (cons curr seen) acc)))))
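To make the behaviour of sliding-window concrete, here is a small usage sketch on a toy sentence (the sentence is made up, not taken from the data in the question). The function body is copied unchanged from above.

```clojure
;; sliding-window exactly as defined in the question.
(defn sliding-window
  [target s n]
  (loop [todo s seen [] acc []]
    (let [curr (first todo)]
      (cond (= curr target) (recur (rest todo) (cons curr seen) (concat acc (take n seen) (take n (rest todo))))
            (empty? todo) acc
            :else (recur (rest todo) (cons curr seen) acc)))))

;; Hypothetical toy sentence.
(def sentence ["a" "b" "c" "d" "e"])

(sliding-window "c" sentence 2)
;; => ("b" "a" "d" "e")   the left context comes back nearest-first

(sliding-window "z" sentence 2)
;; => []                  no match: an empty vector
```

Two things are worth noticing. Every step allocates a new cell with cons, and each match wraps acc in another lazy concat. Also, the empty vector returned for a sentence without the target is truthy, so the when-let in compute-space never skips a (doc, target) pair.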


(defn pcompute-space
  [docs step context-fn targets]
  (reduce
     #(deep-merge-with + %1 %2)
      (pmap
        (fn [chunk]
          (do (tick))
          (compute-space chunk context-fn targets))
        (partition-all step docs))))
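deep-merge-with and tick are not defined in the question. A function named deep-merge-with shipped with the old clojure.contrib.map-utils library; assuming that is what is meant here, a minimal stand-in could look like the sketch below. It merges nested maps and applies f only where two non-map values collide.

```clojure
;; Sketch of a stand-in for deep-merge-with (assumed semantics:
;; recurse into nested maps, combine colliding leaves with f).
(defn deep-merge-with
  [f & maps]
  (apply merge-with
         (fn [a b]
           (if (and (map? a) (map? b))
             (deep-merge-with f a b)
             (f a b)))
         maps))

(def merged
  (deep-merge-with +
                   {"w1" {"w11" 1}}
                   {"w1" {"w11" 2, "w12" 3}, "w2" {"w21" 1}}))
;; => {"w1" {"w11" 3, "w12" 3}, "w2" {"w21" 1}}
```

Because pcompute-space partitions the documents, every chunk can contribute counts for the same target, which is why the partial results have to be merged this deeply.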

Profiling with jvisualvm shows that clojure.lang.Cons, clojure.lang.ChunkedCons and clojure.lang.ArrayChunk dominate the process (see picture).

[Image: jvisualvm memory profile]

SPECS

MacPro 2010, 2.4 GHz Intel Core 2 Duo, 4 GB RAM

Clojure 1.6.0

Java 1.7.0_51, Java HotSpot(TM) 64-Bit Server VM


:

  • 42 ()
  • 105 040 . ()
  • . , , 1 146 190.

All timings were measured with Criterium, which warms up the JIT before sampling.

The original compute-space takes about 22 seconds on this data:

WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 21.989189 sec
    Execution time std-deviation : 471.199127 ms
   Execution time lower quantile : 21.540155 sec ( 2.5%)
   Execution time upper quantile : 23.226352 sec (97.5%)
                   Overhead used : 13.353852 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 9.4329 % Variance is slightly inflated by outliers

A first revision uses frequencies.

, , , context-fn , . , compute-space. , Clojure, .

(defn compute-context-map-f [documents context-fn target]
  (frequencies (mapcat #(context-fn target %) documents)))

With compute-context-map-f in place, compute-space can be rewritten, as compute-space-f here, like this:

(defn compute-space-f [docs context-fn targets]
  (into {} (map #(vector % (compute-context-map-f docs context-fn %)) targets)))
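As a usage sketch, the two functions above can be tried on a toy corpus. The context function others-in-sentence and the documents are hypothetical and stand in for the real sliding window and data.

```clojure
;; The two functions exactly as given above.
(defn compute-context-map-f [documents context-fn target]
  (frequencies (mapcat #(context-fn target %) documents)))

(defn compute-space-f [docs context-fn targets]
  (into {} (map #(vector % (compute-context-map-f docs context-fn %)) targets)))

;; Hypothetical toy context function: every other word of a sentence
;; that contains the target, or nil if the target is absent.
(defn others-in-sentence [target doc]
  (when (some #{target} doc)
    (remove #{target} doc)))

;; Hypothetical toy corpus.
(def toy-docs [["the" "cat" "sat"] ["the" "dog" "sat"] ["a" "cat"]])

(def toy-space (compute-space-f toy-docs others-in-sentence ["cat" "dog" "cow"]))
;; => {"cat" {"the" 1, "sat" 1, "a" 1}, "dog" {"the" 1, "sat" 1}, "cow" {}}
```

One difference from the original compute-space shows up here: a target that never occurs still gets an entry, mapped to an empty map.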

For comparison, this version runs in about 65% of the original time:

WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 14.274344 sec
    Execution time std-deviation : 345.240183 ms
   Execution time lower quantile : 13.981537 sec ( 2.5%)
   Execution time upper quantile : 15.088521 sec (97.5%)
                   Overhead used : 13.353852 ns

Found 3 outliers in 60 samples (5.0000 %)
    low-severe   1 (1.6667 %)
    low-mild     2 (3.3333 %)
 Variance from outliers : 12.5419 % Variance is moderately inflated by outliers

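Since each target's {context-word count, ...} map is computed independently of the others, the work can be split across targets and run in parallel.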

(defn pcompute-space-f [docs step context-fn targets]
  (into {} (pmap #(compute-space-f docs context-fn %) (partition-all step targets))))
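The way the partial results are combined here can be sketched in isolation. In the block below, count-chunk is a hypothetical stand-in for compute-space-f applied to one chunk of targets; only the partitioning and the final into are the point.

```clojure
;; Hypothetical stand-in for compute-space-f on one chunk of targets:
;; it returns one map entry per target, as the real function would.
(defn count-chunk [chunk]
  (into {} (map (fn [t] [t {"ctx" 1}]) chunk)))

(def toy-targets ["w1" "w2" "w3" "w4" "w5"])

;; partition-all keeps the final, shorter chunk.
(def chunks (partition-all 2 toy-targets))
;; => (("w1" "w2") ("w3" "w4") ("w5"))

;; Each chunk yields a map with its own keys, so conj-ing the maps
;; together with into is enough; no merge-with is needed.
(def combined (into {} (pmap count-chunk chunks)))
;; => {"w1" {"ctx" 1}, "w2" {"ctx" 1}, "w3" {"ctx" 1}, "w4" {"ctx" 1}, "w5" {"ctx" 1}}
```

This is the design difference from pcompute-space in the question: partitioning the targets rather than the documents gives disjoint keys per chunk, so the deep merge step disappears.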

For comparison, the parallel version runs in about 16% of the original time:

user> (criterium.core/bench (pcompute-space-f documents 4 #(sliding-window %1 %2 5) keywords))
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 3.623018 sec
    Execution time std-deviation : 83.780996 ms
   Execution time lower quantile : 3.486419 sec ( 2.5%)
   Execution time upper quantile : 3.788714 sec (97.5%)
                   Overhead used : 13.353852 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 11.0038 % Variance is moderately inflated by outliers

  • Mac Pro 2009, 2.66 GHz Quad-Core Intel Xeon, 48 GB RAM.
  • Clojure 1.6.0.
  • Java 1.8.0_40, Java HotSpot(TM) 64-Bit Server VM.

TBD


compute-space


context-fn , . , .


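sliding-window is also expensive: it walks the sentence while building the todo and seen sequences. It is cheaper to turn the sentence s into a vector and take subvecs.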

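Better still, instead of calling context-fn once per target, compute the contexts of every word in the sentence in a single pass. For example, with sliding-windows,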

(defn sliding-windows [w s]
  (let [v (vec s), n (count v)
        window (fn [i] (lazy-cat (subvec v (max (- i w) 0) i)
                                 (subvec v (inc i) (min (inc (+ i w)) n))))]
    (map window (range n))))
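A short usage sketch shows what sliding-windows returns: one context sequence per position in the sentence. The toy sentences are made up for illustration.

```clojure
;; sliding-windows exactly as given above.
(defn sliding-windows [w s]
  (let [v (vec s), n (count v)
        window (fn [i] (lazy-cat (subvec v (max (- i w) 0) i)
                                 (subvec v (inc i) (min (inc (+ i w)) n))))]
    (map window (range n))))

(sliding-windows 1 ["a" "b" "c"])
;; => (("b") ("a" "c") ("b"))

(nth (sliding-windows 2 ["a" "b" "c" "d" "e"]) 2)
;; => ("a" "b" "d" "e")   context of "c", left side in sentence order
```

subvec shares structure with the underlying vector and takes constant time, so each window is built without copying the sentence.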

compute-space can then be defined in terms of such a contexts-fn:

(defn compute-space [docs contexts-fn target?]
  (letfn [(stuff [s] (->> (map vector s (contexts-fn s))
                          (filter (comp target? first))))]
    (reduce
     (fn [a [k m]] (assoc a k (merge-with + (a k) (frequencies m))))
     {}
     (mapcat stuff docs))))
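A usage sketch on a toy corpus, assuming the targets are supplied as a set (a set works as the target? predicate) and the window size is fixed by closing over sliding-windows. The documents and target words are hypothetical.

```clojure
;; Both functions exactly as given above.
(defn sliding-windows [w s]
  (let [v (vec s), n (count v)
        window (fn [i] (lazy-cat (subvec v (max (- i w) 0) i)
                                 (subvec v (inc i) (min (inc (+ i w)) n))))]
    (map window (range n))))

(defn compute-space [docs contexts-fn target?]
  (letfn [(stuff [s] (->> (map vector s (contexts-fn s))
                          (filter (comp target? first))))]
    (reduce
     (fn [a [k m]] (assoc a k (merge-with + (a k) (frequencies m))))
     {}
     (mapcat stuff docs))))

;; Hypothetical toy corpus and target set.
(def toy-docs [["the" "cat" "sat"] ["the" "dog" "sat"]])

(def toy-space
  (compute-space toy-docs #(sliding-windows 1 %) #{"cat" "sat"}))
;; => {"cat" {"the" 1, "sat" 1}, "sat" {"cat" 1, "dog" 1}}
```

Each sentence is scanned once here, however many targets there are; membership in the set decides which positions are kept.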

Notes on stuff:

  • stuff returns a sequence of [target context-sequence] pairs.

500 , : , .

  • 100 000 ,
  • 100 000
  • 10 000

100 .

- 10 000 - 5 .


- Criterium - , , , .
