Processing CSV file in Clojure in parallel

I have a large CSV file that contains independent entries that take a lot of effort to process. I would like to be able to process each entry in parallel. I found some sample code for processing a CSV file on SO here:

Beginner Converts CSV Files to Clojure

The code:

 (use '(clojure.contrib duck-streams str-utils))
 (with-out-writer "coords.txt"
   (doseq [line (read-lines "coords.csv")]
     (let [[x y z p] (re-split #"," line)]
       (println (str-join \space [p x y z])))))

It managed to print the data from my CSV file, which was great - but it only used one processor. I tried various things and eventually ended up with:

 (pmap println (read-lines "foo")) 

This works fine interactively, but does nothing when run from the command line. From a conversation on IRC, this is apparently because stdout isn't available by default to the threads that pmap uses.
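For what it's worth, another likely culprit (an assumption on my part, not something confirmed in the thread) is that pmap returns a lazy sequence: at the REPL the printer forces it, but in a script nothing consumes it, so no work ever happens. Wrapping the call in dorun forces the side effects:

```clojure
;; Assumption: the script does nothing because pmap is lazy and the
;; result is never consumed. dorun walks the sequence purely for its
;; side effects, discarding the results.
(use '(clojure.contrib duck-streams))

(dorun (pmap println (read-lines "foo")))
```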

Really, what I'm looking for is an idiomatic way to apply a function to each line of the CSV file in parallel. I'd also like to be able to print some results to stdout during testing, if at all possible.

Any ideas?

+8
concurrency clojure
3 answers

If you want the output to be in the same order as the input, then printing from inside pmap might not be a good idea. I would recommend creating a (lazy) seq of the input lines, pmap-ing your function over that, and then printing the result of the pmap. Something like this should work:

 (dorun (map println (pmap expensive-computation (read-lines "coords.csv")))) 
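Here expensive-computation stands in for whatever per-line work you need. A hypothetical sketch, reusing the re-split/str-join helpers and the x, y, z, p field names from the question (the actual work is a placeholder):

```clojure
;; Hypothetical per-line function: split the comma-separated fields,
;; then do the heavy work on them. re-split and str-join come from
;; clojure.contrib str-utils, as in the question's original code.
(defn expensive-computation [line]
  (let [[x y z p] (re-split #"," line)]
    ;; placeholder for the real, CPU-heavy work on the fields
    (str-join \space [p x y z])))
```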
+12

If you want this to be fast, you might want to look at this article on how Alex Osborne solved the Wide Finder 2 challenge posed by Tim Bray. Alex covers all aspects of parsing, processing and gathering the results (in the Wide Finder 2 case, the file is a very large Apache log). The actual code used is here.

+7

I would be very surprised if that code could be sped up by using more cores. I'm 99% sure the actual speed limit here is the file I/O, which should be a couple of orders of magnitude slower than any single core you can throw at the problem.

And that's before the overhead you'd introduce by splitting these very minimal tasks over multiple CPUs. pmap isn't entirely free.

If you're sure that disk IO won't be a problem and you have a lot of CSV parsing to do, simply parsing multiple files in their own threads will gain you a lot more for a lot less effort.
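A minimal sketch of that one-thread-per-file approach, assuming a hypothetical parse-file function that reads and parses a single CSV file:

```clojure
;; parse-file is a hypothetical function: reads one CSV file and
;; returns its parsed contents. Each future runs on its own thread.
(def results
  (doall (map #(future (parse-file %))
              ["coords1.csv" "coords2.csv" "coords3.csv"])))

;; deref blocks until the corresponding thread finishes.
(def parsed (map deref results))
```

The doall is important: without it, map's laziness would mean the futures are never actually created.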

+2
