Clojure: group-by too slow (13 million lines)

Situation

I have a CSV with 13 million lines on which I want to perform a logistic regression (with Incanter) for each group. My file looks like this (the values are just a sample):

    ID  Max  Probability
    1   1    0.5
    1   5    0.6
    1   10   0.99
    2   1    0.1
    2   7    0.95

So, I first read it using csv-reader, everything is fine.

I have something like this:

 ( {"Id" "1", "Max" 1, "Probability" 0.5} {"Id" "1", "Max" 5, "Probability" 0.6} etc. 

I want to group these values by Id; if I remember correctly, there are about 1.2 million identifiers. (I did it in Python with pandas and it is super fast.)

This is my function for reading and formatting a file (it works great on small data sets):

    (defn read-file
      []
      (let [path        (:path-file @config)
            content-csv (take-csv path \,)]
        (->> (group-by :Id content-csv)
             (map (fn [[k v]]
                    [k {:x (mapv :Max v)
                        :y (mapv :Probability v)}]))
             (into {}))))
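(take-csv is not shown here; roughly, it is something like the following sketch on top of clojure.data.csv. The exact implementation may differ, and numeric parsing of Max/Probability is left out.)

    (require '[clojure.data.csv :as csv]
             '[clojure.java.io :as io])

    ;; Rough sketch of a take-csv helper: reads the whole file into memory
    ;; as a seq of maps keyed by the (keywordized) header row.
    ;; The real helper may differ; values stay as strings here.
    (defn take-csv [path separator]
      (with-open [rdr (io/reader path)]
        (let [[header & rows] (csv/read-csv rdr :separator separator)
              ks (map keyword header)]
          (doall (map #(zipmap ks %) rows)))))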

In the end I want something like this to perform the logistic regression (I am flexible about the exact shape; I do not need vectors for :x and :y, seqs are OK):

 {"1" {:x [1 5 10] :y [0.5 0.6 0.99]} "2" {:x [1 7] :y [0.1 0.95]} etc. 

Problem

My problem is with the group-by step. I tried it separately on the output of the CSV reading, and it either takes forever or dies with a Java heap space error. I thought the problem was my map/format step, but it really is the group-by.

I was thinking about using reduce or reduce-kv, but I do not know how to use these functions for this kind of grouping.
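For what it's worth, a rough, untested sketch of what a single-pass reduce could look like (assuming the keywordized rows produced by read-file):

    ;; Untested sketch: build the grouped map in one pass with reduce,
    ;; instead of group-by. Assumes rows with :Id, :Max and :Probability keys.
    (defn group-with-reduce [rows]
      (reduce (fn [acc {:keys [Id Max Probability]}]
                (-> acc
                    (update-in [Id :x] (fnil conj []) Max)
                    (update-in [Id :y] (fnil conj []) Probability)))
              {}
              rows))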

I don't care about the order of :x and :y (as long as they stay matched with each other, i.e. a given x and y keep the same index, which is not a problem since they come from the same line), nor about the order of the Ids in the final result, and I read that group-by preserves order. Maybe that is what makes the operation expensive?

Here is some sample data in case it helps:

    (def sample
      '({"Id" "1" "Max" 1 "Probability" 0.5}
        {"Id" "1" "Max" 5 "Probability" 0.6}
        {"Id" "1" "Max" 10 "Probability" 0.99}
        {"Id" "2" "Max" 1 "Probability" 0.1}
        {"Id" "2" "Max" 7 "Probability" 0.95}))

Other alternatives

I have other ideas, but I'm not sure they are very "Clojure-friendly":

  • In Python, because the file is already sorted, instead of using group-by I recorded the start and end DataFrame indices of each group, so I only had to select the sub-data directly.

  • I could also load the list of identifiers from elsewhere instead of computing it in Clojure, like:

    (def ids '("1" "2" ...))

So maybe you can start with:

 {"1" {:x [] :y []} "2" {:x [] :y []} etc. 

built from that seq, and then fill it by mapping over the big file for each identifier.

I do not know if this would actually be more efficient.
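For the record, a small sketch of that skeleton idea (the ids list would be loaded from elsewhere; filling it in would then look like the reduce sketch above):

    ;; Sketch: pre-build an empty {:x [] :y []} entry for every known Id.
    (defn empty-groups [ids]
      (into {} (map (fn [id] [id {:x [] :y []}]) ids)))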

I already have all the other functions for the logistic regression; I'm just missing this part. Thanks!

EDIT

Thanks for the answers; I finally ended up with this solution.

In the project.clj file

    :jvm-opts ["-Xmx13g"]
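(For context, that line goes inside the defproject form; the project name, version and dependencies below are just placeholders.)

    ;; Placeholder project definition; only :jvm-opts is the relevant part.
    (defproject my-project "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]]
      :jvm-opts ["-Xmx13g"])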

Code:

    (defn data-group->map [group]
      {(:Id (first group))
       {:x (map :Max group)
        :y (map :Probability group)}})

    (defn prob-cumsum [data]
      (cag/fmap (fn [x] (assoc x :y (reductions + (x :y)))) data))

    (defn process-data-splitter [data]
      (->> (partition-by :Id data)
           (map data-group->map)
           (into {})
           (prob-cumsum)))
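(cag/fmap is not defined in this snippet; it is presumably clojure.algo.generic.functor/fmap from the org.clojure/algo.generic library, required with something like the following.)

    ;; Assumed require for the cag alias; the actual alias and ns may differ.
    (ns my-project.core
      (:require [clojure.algo.generic.functor :as cag]))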

I put all my code together and it works. The splitting takes about 5 minutes, but I don't need blazing speed. Memory usage can climb to all the available memory while reading the file, and then drops for the sigmoid part.

1 answer

If your file is sorted by Id, you can use partition-by instead of group-by.
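To illustrate on a keywordized copy of the sample data (my own copy for illustration, since partition-by here relies on the :Id keyword and on consecutive rows sharing the same Id):

    ;; Keywordized copy of the sample, just for illustration.
    (def sample-k
      [{:Id "1" :Max 1 :Probability 0.5}
       {:Id "1" :Max 5 :Probability 0.6}
       {:Id "1" :Max 10 :Probability 0.99}
       {:Id "2" :Max 1 :Probability 0.1}
       {:Id "2" :Max 7 :Probability 0.95}])

    ;; partition-by splits consecutive runs of equal :Id, so it only
    ;; works because the file is already sorted by Id.
    (partition-by :Id sample-k)
    ;; => (({:Id "1" :Max 1 :Probability 0.5}
    ;;      {:Id "1" :Max 5 :Probability 0.6}
    ;;      {:Id "1" :Max 10 :Probability 0.99})
    ;;     ({:Id "2" :Max 1 :Probability 0.1}
    ;;      {:Id "2" :Max 7 :Probability 0.95}))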

Then your code will look like this:

    (defn data-group->map [group]
      [(:Id (first group))
       {:x (mapv :Max group)
        :y (mapv :Probability group)}])

    (defn read-file []
      (let [path        (:path-file @config)
            content-csv (take-csv path \,)]
        (->> content-csv
             (partition-by :Id)
             (map data-group->map)
             (into {}))))

which should speed it up. Then you can probably make it faster still with transducers:

    (defn read-file []
      (let [path        (:path-file @config)
            content-csv (take-csv path \,)]
        (into {}
              (comp (partition-by :Id)
                    (map data-group->map))
              content-csv)))

Let's do some tests.

First, create huge data like yours:

    (def huge-data
      (doall (mapcat #(repeat 1000000 {:Id % :Max 1 :Probability 10})
                     (range 10))))

That gives ten million items: a million of {:Id 0 :Max 1 :Probability 10}, a million of {:Id 1 :Max 1 :Probability 10}, and so on.

Now the functions to be tested:

    (defn process-data-group-by [data]
      (->> (group-by :Id data)
           (map (fn [[k v]]
                  [k {:x (mapv :Max v)
                      :y (mapv :Probability v)}]))
           (into {})))

    (defn process-data-partition-by [data]
      (->> data
           (partition-by :Id)
           (map data-group->map)
           (into {})))

    (defn process-data-transducer [data]
      (into {}
            (comp (partition-by :Id)
                  (map data-group->map))
            data))

And now the time tests:

    (do (time (dorun (process-data-group-by huge-data)))
        (time (dorun (process-data-partition-by huge-data)))
        (time (dorun (process-data-transducer huge-data))))

    "Elapsed time: 3377.167645 msecs"
    "Elapsed time: 3707.03448 msecs"
    "Elapsed time: 1462.955152 msecs"

Note that partition-by produces a lazy sequence, while group-by has to realize the whole collection. So if you only need your data group by group, rather than as one big map, you can drop the (into {}) and get access faster:

    (defn process-data-partition-by [data]
      (->> data
           (partition-by :Id)
           (map data-group->map)))

check:

    user> (time (def processed-data (process-data-partition-by huge-data)))
    "Elapsed time: 0.06079 msecs"
    #'user/processed-data
    user> (time (let [f (first processed-data)]))
    "Elapsed time: 302.200571 msecs"
    nil
    user> (time (let [f (second processed-data)]))
    "Elapsed time: 500.597153 msecs"
    nil
    user> (time (let [f (last processed-data)]))
    "Elapsed time: 2924.588625 msecs"
    nil
    user.core> (time (let [f (last processed-data)]))
    "Elapsed time: 0.037646 msecs"
    nil
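For example, you could then run the per-group regression while streaming the groups, without ever building the full map (run-regression below is just a placeholder for your Incanter call):

    ;; Hypothetical streaming consumption: run-regression stands in for
    ;; the actual per-group logistic regression.
    (defn process-all [data]
      (doseq [[id {:keys [x y]}] (process-data-partition-by data)]
        (run-regression id x y)))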