How to process big binary data in Clojure?

How to handle large binary data files in Clojure? Suppose the data / files are about 50 MB - small enough for processing in memory (but not with a naive implementation).

The following code correctly removes ^M (carriage return, byte 13) from small files, but it throws an OutOfMemoryError for large files (e.g. 6MB):

    (defn read-bin-file [file]
      (to-byte-array (as-file file)))

    (defn remove-cr-from-file [file]
      (let [dirty-bytes (read-bin-file file)
            clean-bytes (filter #(not (= 13 %)) dirty-bytes)
            changed?    (< (count clean-bytes) (alength dirty-bytes))] ; OutOfMemoryError
        (if changed?
          (write-bin-file file clean-bytes)))) ; writing works fine

It seems that Java byte arrays cannot be treated as seqs, because doing so is extremely inefficient.

On the other hand, solutions with aset, aget and areduce are bloated, ugly and imperative, because you cannot use the Clojure sequence library.

What am I missing? How to handle large binary data files in Clojure?

binary-data clojure
1 answer

Personally, I would probably use aget / aset / areduce here. They may be imperative, but they are useful when working with arrays, and I don't find them particularly ugly. If you want to wrap them in a nice function, then of course you can :-)
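As a rough sketch of what such a wrapper could look like: the function below copies every non-CR byte into a second array with aget / aset and trims the result at the end. The name remove-cr is made up for illustration and is not part of the question's code; it assumes the whole file already fits in memory as a byte array.

```clojure
(defn remove-cr
  "Returns a new byte array with every CR (byte 13) removed from bs.
   Hypothetical helper, not taken from the question."
  ^bytes [^bytes bs]
  (let [n          (alength bs)
        ^bytes out (byte-array n)]        ; worst case: nothing is removed
    (loop [i 0, j 0]
      (if (< i n)
        (let [b (aget bs i)]
          (if (== b 13)
            (recur (inc i) j)             ; skip the CR
            (do (aset out j b)            ; keep every other byte
                (recur (inc i) (inc j)))))
        ;; trim the output to the number of bytes actually kept
        (java.util.Arrays/copyOf out (int j))))))
```

Comparing (alength result) with (alength input) then tells you whether anything changed, without ever building a seq over the array. The cost is one extra array of the same size, so about twice the file size in memory.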

If you decide to use sequences, your problem will be building and traversing the seq, since this requires creating and storing a new seq object for each byte in the array. That is probably ~24 bytes for each byte of the array, so a 50 MB file would need over 1 GB of heap if the whole seq were held at once.

So, the trick is to make it work lazily, in which case earlier objects will be garbage collected before you reach the end of the array. However, for this to work you have to avoid holding a reference to the head of the seq while you traverse it (e.g. with count).

The following may work (untested), but it depends on write-bin-file being implemented in a lazy-friendly way:

    (defn remove-cr-from-file [file]
      (let [dirty-bytes   (read-bin-file file)
            clean-bytes   (filter #(not (= 13 %)) dirty-bytes)
            changed-bytes (count (filter #(not (= 13 %)) dirty-bytes))
            changed?      (< changed-bytes (alength dirty-bytes))]
        (if changed?
          (write-bin-file file clean-bytes))))

Note that this is essentially the same as your code, but it builds a separate lazy sequence just for counting the remaining bytes, so that count does not hold on to the head of clean-bytes.
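The question does not show write-bin-file, so whether it is "lazy-friendly" is unknown. Purely as an illustration, a version that consumes the seq one element at a time might look like the sketch below. It assumes clojure.java.io is available and that the argument is a seq of byte values; it is not the asker's actual implementation.

```clojure
(require '[clojure.java.io :as io])

(defn write-bin-file
  "Writes a (possibly lazy) seq of bytes to file, one element at a time.
   Hypothetical sketch; the original write-bin-file is not shown."
  [file byte-seq]
  (with-open [^java.io.OutputStream out (io/output-stream file)] ; buffered stream
    (doseq [b byte-seq]
      ;; OutputStream.write(int) writes only the low 8 bits
      (.write out (int b)))))
```

Because doseq walks the seq without keeping earlier elements, already written bytes can be collected while the rest is still being produced. A version that first called something like (byte-array byte-seq) or (count byte-seq) would realize the whole seq and bring the memory problem back.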
