I am trying to parse a file with a million lines; each line is a JSON document with some information about a book (author, contents, etc.). I am using iota to load the file, since my program throws an OutOfMemoryError if I try to use slurp, and cheshire to parse the JSON strings. The program simply loads the file and counts all the words in all the books.
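For context, here is roughly what a line looks like and how I parse it with cheshire (the record below is made up; the real ones have the same shape but more fields):

    ;; A made-up example line:
    ;; {"author": "Herman Melville", "contents": "Call me Ishmael. Some years ago..."}
    (require '[cheshire.core :as json])

    (json/parse-string
      "{\"author\": \"Herman Melville\", \"contents\": \"Call me Ishmael.\"}"
      true)                                 ; true => keywordize the keys
    ;; => {:author "Herman Melville", :contents "Call me Ishmael."}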
My first attempt used pmap to do the heavy lifting; I figured it would essentially use all of my processor cores.
    (ns multicore-parsing.core
      (:require [cheshire.core :as json]
                [iota :as io]
                [clojure.string :as string]
                [clojure.core.reducers :as r]))

    (defn words-pmap
      [filename]
      (letfn [(parse-with-keywords [str]
                (json/parse-string str true))
              (words [book]
                (string/split (:contents book) #"\s+"))]
        (->> (io/vec filename)
             (pmap parse-with-keywords)
             (pmap words)
             (map count)
             (reduce +))))
While it does seem to use all the cores, each core rarely goes above 50% utilization. My guess is that this has to do with pmap's batch (chunk) size, and while researching it I came across a relatively old question where some comments point to the clojure.core.reducers library.
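My understanding of the grain-size issue is that each individual pmap task (parsing a single line) is too cheap relative to the coordination overhead, so the cores end up waiting. A variant I considered but have not benchmarked yet batches the lines first so that each task does a meaningful amount of work (the batch size of 512 is an arbitrary guess; the aliases are the ones from the ns form above):

    (defn words-pmap-batched
      [filename]
      (letfn [(count-line-words [line]
                (-> (json/parse-string line true)
                    :contents
                    (string/split #"\s+")
                    count))]
        (->> (io/vec filename)
             (partition-all 512)                         ; hand each task a chunk of lines
             (pmap #(reduce + (map count-line-words %))) ; count a whole chunk per task
             (reduce +))))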
Rather than hand-tuning pmap, though, I decided to rewrite the function using reducers/map:
    (defn words-reducers
      [filename]
      (letfn [(parse-with-keywords [str]
                (json/parse-string str true))
              (words [book]
                (string/split (:contents book) #"\s+"))]
        (->> (io/vec filename)
             (r/map parse-with-keywords)
             (r/map words)
             (r/map count)
             (r/fold +))))
But CPU utilization is even worse than with the previous implementation, and it also takes longer:
    multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
    "Elapsed time: 20899.088919 msecs"
    546

    multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
    "Elapsed time: 28790.976455 msecs"
    546
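One thing I have not tried yet is giving r/fold an explicit group size; as far as I understand, it defaults to partitions of 512 elements and accepts separate combine/reduce functions, so something along these lines (untested, 8192 is just a guess) should control how much work each parallel task gets:

    (defn words-reducers-grained
      [filename]
      (letfn [(count-line-words [line]
                (-> (json/parse-string line true)
                    :contents
                    (string/split #"\s+")
                    count))]
        (r/fold 8192                      ; partition size: lines handled per parallel task
                +                         ; combinef: (+) returns 0, so it also supplies the identity
                (fn [acc line]            ; reducef: fold one line's word count into the sum
                  (+ acc (count-line-words line)))
                (io/vec filename))))

I would expect the right group size to depend on how expensive the JSON parse is per line, but I don't know how to pick it.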
What am I doing wrong? Is mmap-ing the file and then processing it in parallel the correct approach when analyzing a large file?
EDIT: this is the file I use.
EDIT2: Here are the timings with iota/seq instead of iota/vec:
    multicore-parsing.core=> (time (words-reducers "./dummy_data.txt"))
    "Elapsed time: 160981.224565 msecs"
    546

    multicore-parsing.core=> (time (words-pmap "./dummy_data.txt"))
    "Elapsed time: 160296.482722 msecs"
    546