Situation
I have a CSV with 13 million lines on which I want to perform a logistic regression (Incanter) for each group. My file looks like this (the values are just a sample):
ID  Max  Probability
1   1    0.5
1   5    0.6
1   10   0.99
2   1    0.1
2   7    0.95
So, I first read it using csv-reader, everything is fine.
I have something like this:
({"Id" "1", "Max" 1, "Probability" 0.5}
 {"Id" "1", "Max" 5, "Probability" 0.6}
 ...)
I want to group these values by Id; if I remember correctly, there are about 1.2 million Ids. (I did it in Python with pandas and it is super fast.)
This is my function for reading and formatting a file (it works great on small data sets):
(defn read-file
  []
  (let [path (:path-file @config)
        content-csv (take-csv path \,)]
    (->> (group-by :Id content-csv)
         (map (fn [[k v]]
                [k {:x (mapv :Max v)
                    :y (mapv :Probability v)}]))
         (into {}))))
In the end, I want something like the following to perform the logistic regression (I am flexible about this; I do not need vectors for :x and :y, seqs are fine):
{"1" {:x [1 5 10] :y [0.5 0.6 0.99]}
 "2" {:x [1 7] :y [0.1 0.95]}
 ...}
Problem
My problem is the group-by step. I tried it on its own on the output of the CSV reading, and it runs forever when it does not die from a Java heap space error. I thought the problem came from my map step, but it is the group-by.
I was thinking of using reduce or reduce-kv, but I do not know how to use these functions for this purpose.
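To give an idea of what I mean, here is a rough sketch of a single reduce pass that accumulates only the numbers per Id instead of keeping whole row maps around (group-with-reduce is a name I made up, and I assume the rows have keyword keys, as after read-file's parsing):

```clojure
;; One pass over the rows: for each row, conj its Max onto that Id's :x
;; vector and its Probability onto that Id's :y vector. (fnil conj [])
;; starts a fresh vector the first time an Id is seen.
(defn group-with-reduce [rows]
  (reduce (fn [acc {:keys [Id Max Probability]}]
            (-> acc
                (update-in [Id :x] (fnil conj []) Max)
                (update-in [Id :y] (fnil conj []) Probability)))
          {}
          rows))
```

This still builds the whole result map eagerly, but the accumulator holds only numbers per Id rather than full row maps, so I imagine it should be lighter than group-by on the raw rows.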
I do not care about the order of :x and :y in the final result (as long as they match each other, i.e. x and y keep the same index, which is no problem since they come from the same line), nor about the order of the Ids, and I read that group-by preserves order. Maybe that is what makes it expensive?
Here is some sample data, in case anyone wants to try this:
(def sample
  '({"Id" "1" "Max" 1 "Probability" 0.5}
    {"Id" "1" "Max" 5 "Probability" 0.6}
    {"Id" "1" "Max" 10 "Probability" 0.99}
    {"Id" "2" "Max" 1 "Probability" 0.1}
    {"Id" "2" "Max" 7 "Probability" 0.95}))
Other alternatives
I have other ideas, but I am not sure they are "Clojure"-friendly.
In Python, because of the nature of the function and because the file is already sorted, instead of using group-by I recorded the start and end line indices of each group, so I just had to select the sub-data directly.
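The Clojure analogue of that trick might be partition-by, which only has to detect where Id changes in the already-sorted seq instead of building one big map of all rows first. A sketch (grouped-lazily is a name I made up; again I assume keyword keys as in read-file):

```clojure
;; partition-by splits the (sorted) row seq into consecutive runs with
;; the same :Id, lazily; each run is then shrunk to just its numbers.
(defn grouped-lazily [rows]
  (for [group (partition-by :Id rows)]
    [(:Id (first group))
     {:x (mapv :Max group)
      :y (mapv :Probability group)}]))
```

Because the result is a lazy seq of [id data] pairs rather than one map, each group could be consumed (e.g. by the regression) and then garbage-collected before the next one is realized.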
I could also load the list of Ids instead of computing it in Clojure, like:

(def ids '("1" "2" ...))
So maybe you can start with:
{"1" {:x [] :y []}
 "2" {:x [] :y []}
 ...}
from the previous seq and then fill it in for each Id by mapping over the big file.
I do not know whether this would actually be more efficient.
I have all the other functions for logistic regression, I just miss this part! Thanks!
EDIT
Thanks for the answers; I finally ended up with this solution.
In the project.clj file:

:jvm-opts ["-Xmx13g"]
Code:
;; cag is an alias for clojure.algo.generic.functor
(require '[clojure.algo.generic.functor :as cag])

(defn data-group->map [group]
  {(:Id (first group))
   {:x (map :Max group)
    :y (map :Probability group)}})

(defn prob-cumsum [data]
  (cag/fmap (fn [x]
              (assoc x :y (reductions + (x :y))))
            data))

(defn process-data-splitter [data]
  (->> (partition-by :Id data)
       (map data-group->map)
       (into {})
       (prob-cumsum)))
With this, all my code runs and it works. The splitting takes about 5 minutes, but I do not need top speed. Memory usage climbs to nearly everything available during the file reading, then drops for the sigmoid part.
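For reference, here is a self-contained check of this pipeline on a keyword-keyed version of the sample data. The fmap below is a minimal stand-in for clojure.algo.generic.functor/fmap, restricted to maps, so the snippet runs without the algo.generic dependency:

```clojure
;; Minimal stand-in for clojure.algo.generic.functor/fmap on maps:
;; applies f to every value, keeping the keys.
(defn fmap [f m]
  (into {} (map (fn [[k v]] [k (f v)])) m))

(defn data-group->map [group]
  {(:Id (first group))
   {:x (map :Max group)
    :y (map :Probability group)}})

;; Replace each group's :y with its cumulative sum.
(defn prob-cumsum [data]
  (fmap (fn [x] (assoc x :y (reductions + (x :y)))) data))

(defn process-data-splitter [data]
  (->> (partition-by :Id data)
       (map data-group->map)
       (into {})
       (prob-cumsum)))

(process-data-splitter
 [{:Id "1", :Max 1, :Probability 0.5}
  {:Id "1", :Max 5, :Probability 0.6}
  {:Id "2", :Max 1, :Probability 0.1}])
;; => {"1" {:x (1 5), :y (0.5 1.1)}, "2" {:x (1), :y (0.1)}}
```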