Clojure / Java: The most efficient way to minimize bandwidth consumption when performing complex operations on an Amazon S3 data stream

I am doing a streaming read of an S3 object using a BufferedReader.

I need to do two things with this object:

  • Pass it to the SuperCSV CSV reader
  • Get the raw lines and keep them in a lazy sequence

Currently, I have to use two different BufferedReaders: one as an argument to the SuperCSV CSV reading class and one to initialize the lazy sequence of raw lines. I am effectively downloading the S3 object twice, which is both expensive ($) and slow.

One of my colleagues noted that something like the Unix tee command is what I'm looking for: a BufferedReader that could somehow be "split", loading a chunk of data and handing a copy to both the lazy sequence and the CSV reading function, would be very helpful.
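
A rough sketch of what such a tee-like Reader might look like (illustrative only, not from the question; `source` stands for the S3 reader, and the captured text accumulates in a StringBuilder, so the whole copy still ends up in memory):

    (import '[java.io Reader])

    ;; Delegates every read to the underlying reader and copies the characters
    ;; it hands out into `sink` as a side effect.
    (defn tee-reader [^Reader source ^StringBuilder sink]
      (proxy [Reader] []
        (read
          ([]
            (let [c (.read source)]
              (when-not (neg? c)
                (.append sink (char c)))
              c))
          ([cbuf]
            (.read this cbuf 0 (count cbuf)))
          ([cbuf off len]
            (let [n (.read source cbuf off len)]
              (when (pos? n)
                (.append sink cbuf off n))
              n)))
        (close []
          (.close source))))

SuperCSV could then consume `(tee-reader s3-reader sb)` while the raw text builds up in `sb`; note that this captures one big string rather than a lazy seq of lines.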

I am also currently investigating whether it is possible to wrap a lazy sequence in a BufferedReader and pass that to SuperCSV. I have had Java heap space problems in the past when passing very large lazy sequences to multiple consumers, so I am a bit worried about this approach.

Another solution is to simply download the file locally and then open two streams on it. That, however, removes the original motivation for streaming: being able to start working with the file as soon as data begins to arrive.
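
A rough sketch of that fallback, assuming `s3-input-stream` is the object's InputStream and `parse-csv` is some SuperCSV-based function that consumes a Reader (both names are placeholders):

    (require '[clojure.java.io :as io])

    (let [tmp (java.io.File/createTempFile "s3-object" ".csv")]
      (io/copy s3-input-stream tmp)                ; single download to disk
      (with-open [csv-rdr (io/reader tmp)
                  raw-rdr (io/reader tmp)]
        {:csv       (doall (parse-csv csv-rdr))    ; first reader for SuperCSV
         :raw-lines (doall (line-seq raw-rdr))}))  ; second reader for raw lines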

The final option, which I would consider only if nothing else works, is to implement my own CSV reader that returns both the parsed CSV data and the original unparsed line. If you have used a very robust CSV reader that can return both a Java Hash of parsed CSV data and the original unparsed string, please let me know!
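
Just to illustrate the return shape such a reader could have (a toy sketch: naive comma splitting does not handle quoted fields or embedded separators, which is exactly why a robust library would be preferable):

    (require '[clojure.string :as str])

    (defn parse-line-keeping-raw
      "Returns both the fields parsed against `header` and the raw, unparsed line."
      [header line]
      {:raw    line
       :parsed (zipmap header (str/split line #","))})

    ;; (parse-line-keeping-raw ["id" "name"] "1,Alice")
    ;; => {:raw "1,Alice", :parsed {"id" "1", "name" "Alice"}}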

Thanks!

2 answers

The solution was to use a single BufferedReader for all the accesses and then reset() it each time it is passed into functionality that needs to read from the beginning.
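
A minimal sketch of that pattern (assuming `s3-input-stream` is the object's InputStream and `parse-csv` is a SuperCSV-based function that fully consumes its Reader; note that reset() only works if mark() was called first with a read-ahead limit at least as large as the object, which effectively buffers the whole object in memory):

    (import '[java.io BufferedReader InputStreamReader])

    (defn read-twice [s3-input-stream parse-csv]
      (let [rdr (BufferedReader. (InputStreamReader. s3-input-stream "UTF-8"))]
        ;; the read-ahead limit must be at least the size of the object
        (.mark rdr (int (* 64 1024 1024)))
        (let [csv-data (doall (parse-csv rdr))]  ; first pass, fully realized
          (.reset rdr)                           ; rewind to the mark
          {:csv       csv-data
           :raw-lines (doall (line-seq rdr))}))) ; second pass over buffered data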


I would be inclined to create a seq of lines from the network and then hand that over to however many processes need to work on that seq; persistent data structures are great that way. If you do need to turn the seq of strings into a Reader that you can hand off to the SuperCSV API, this works:

    (import '[java.io Reader StringReader])

    (defn concat-reader
      "Returns a Reader that reads from a sequence of strings."
      [lines]
      (let [srs (atom (map #(StringReader. %) lines))]
        (proxy [Reader] []
          (read
            ([]
              (let [c (.read (first @srs))]
                ;; at the end of one string, advance to the next reader and retry
                (if (and (neg? c) (swap! srs next))
                  (.read this)
                  c)))
            ([cbuf]
              (.read this cbuf 0 (count cbuf)))
            ([cbuf off len]
              (let [actual (.read (first @srs) cbuf off len)]
                (if (and (neg? actual) (swap! srs next))
                  (.read this cbuf off len)
                  actual))))
          (close []))))

For example:

    user=> (def r (concat-reader ["foo" "bar"]))
    #'user/r
    user=> (def cbuf (char-array 2))
    #'user/cbuf
    user=> (.read r cbuf)
    2
    user=> (seq cbuf)
    (\f \o)
    user=> (char (.read r))
    \o
    user=> (char (.read r))
    \b
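
One possible way to wire this into the original problem (again hedged: `s3-reader` stands for the BufferedReader over the S3 object and `parse-csv` is a placeholder for the SuperCSV call; note that line-seq strips the line terminators, so they have to be re-added before the CSV pass):

    (let [lines (line-seq s3-reader)]   ; single pass over the network
      {:raw-lines lines
       :csv       (parse-csv (concat-reader (map #(str % "\n") lines)))})

As the question notes, holding the head of `lines` for both consumers keeps the whole object in memory, so the heap concern still applies.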