How to efficiently process 300+ files at the same time in Scala

Question

I am going to compare about 300 binary files, 4 MB each, byte by byte using Scala. However, judging by what I have done so far, processing 15 files at the same time using java.io.BufferedInputStream took about 90 seconds on my machine, so I don't think my solution will scale to a large number of files.

Ideas and suggestions are much appreciated.

EDIT: The actual task is not just comparing for differences, but processing these files in the same sequence order. Say I have to look at byte i in every file at the same time, then move on to byte i + 1.

+4

scala file-io

Ekkmanz Nov 14 '09 at 4:05

5 answers

Did you notice your hard drive slowly grinding away while you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.

If the files will always be this small (4 MB is small enough to hold in memory), I would read the entire first file into memory and then compare each other file with it, one after another.

I can't comment on solid-state drives, as I have no first-hand experience with their performance.
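A minimal sketch of that idea, assuming the goal is a plain equality check against one reference file. The object and method names (`ReferenceCompare`, `matchesReference`) are made up for illustration:

```scala
import java.nio.file.{Files, Path}
import java.util.Arrays

object ReferenceCompare {
  /** Reads the reference file once, then compares every other file against it, one at a time. */
  def matchesReference(reference: Path, others: Seq[Path]): Seq[(Path, Boolean)] = {
    val refBytes = Files.readAllBytes(reference) // about 4 MB, held in memory once
    others.map { p =>
      // Sequential reads: only one file is being read from disk at any moment.
      p -> Arrays.equals(refBytes, Files.readAllBytes(p))
    }
  }
}
```

Only two files' worth of bytes are in memory at a time, so the footprint stays around 8 MB regardless of how many files there are.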

+6

zildjohn01 Nov 14 '09 at 4:13

Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to get a first-order guess at equality.
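As a rough illustration, the size check could be used as a cheap pre-filter before any bytes are read. `LengthCheck` and `groupBySize` are hypothetical names:

```scala
import java.io.File

object LengthCheck {
  /** Groups files by size; files that land in different groups cannot be byte-identical. */
  def groupBySize(files: Seq[File]): Map[Long, Seq[File]] =
    files.groupBy(_.length())
}
```

Only files that share a group would then need a byte-level comparison.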

Of course, you might want to make a much deeper comparison than just "are these files the same?"

+1

oxbow_lakes Nov 14 '09 at 13:29

If you just want to see whether the files are identical, I would suggest using a hashing algorithm like SHA1. Here is some Java source to make this happen.

Many large-scale systems that handle data use SHA1, including the NSA and git. It is simply more efficient to use a hash instead of a byte compare. The hashes can also be stored for later, to see whether the data has been altered.

Here is a talk by Linus Torvalds about git; he also mentions why he uses SHA1.
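A sketch of the hashing approach in Scala, using the JDK's `java.security.MessageDigest`. The wrapper object `Sha1`, the method `hexDigest`, and the 64 KB buffer size are assumptions made for illustration, not the code the answer linked to:

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.security.MessageDigest

object Sha1 {
  /** Streams the file through a SHA-1 digest and returns the hash as lowercase hex. */
  def hexDigest(path: String): String = {
    val md = MessageDigest.getInstance("SHA-1")
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      val buf = new Array[Byte](64 * 1024)
      var n = in.read(buf)
      while (n != -1) {
        md.update(buf, 0, n) // feed only the bytes actually read
        n = in.read(buf)
      }
    } finally in.close()
    // Mask each byte to 0..255 before formatting as two hex digits.
    md.digest().map(b => f"${b & 0xff}%02x").mkString
  }
}
```

Each file is read once and reduced to a 20-byte digest, so comparing 300 files becomes comparing 300 short strings, for example with `paths.groupBy(Sha1.hexDigest)`.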

+1

Luigimax Nov 14 '09 at 18:13

I would suggest using NIO if possible. "An Introduction to Java NIO and NIO2" seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and comparing byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read chunks of data from a file and then do the comparisons on that.
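One way the ByteBuffer suggestion might look, assuming a pairwise equality check. `NioCompare`, `sameContent`, and the 64 KB default chunk size are illustrative choices, not part of the answer:

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}

object NioCompare {
  /** Fills `buf` from the channel until it is full or EOF, then flips it for reading. */
  private def fill(ch: FileChannel, buf: ByteBuffer): ByteBuffer = {
    buf.clear()
    while (buf.hasRemaining && ch.read(buf) != -1) {}
    buf.flip()
    buf
  }

  /** Compares two files chunk by chunk instead of byte by byte. */
  def sameContent(a: Path, b: Path, chunkSize: Int = 64 * 1024): Boolean = {
    val ca = FileChannel.open(a, StandardOpenOption.READ)
    try {
      val cb = FileChannel.open(b, StandardOpenOption.READ)
      try {
        if (ca.size() != cb.size()) false // different lengths can never match
        else {
          val ba = ByteBuffer.allocate(chunkSize)
          val bb = ByteBuffer.allocate(chunkSize)
          var equal = true
          var done = false
          while (equal && !done) {
            fill(ca, ba)
            fill(cb, bb)
            equal = ba.equals(bb)   // compares the remaining bytes of both buffers
            done = !ba.hasRemaining // an empty chunk means end of file
          }
          equal
        }
      } finally cb.close()
    } finally ca.close()
  }
}
```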

0

faran Nov 14 '09 at 4:31

Daniel C. Sobral · Accepted Answer · 2009-11-14T17:20:42+0000

In fact, you are completely screwed.

Let's see ... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If so, by all means read them all into memory. But to speed things up, you might try the following:

1. Read 512 KB of each file, sequentially. You might try reading from 2 to 8 files at the same time, perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
2. Process those 512 KB chunks using Futures.
3. Go back to step 1, unless you are finished with the files.
4. Get the results back from the processing Futures.

In step number 1, by limiting the parallel reading, you avoid thrashing your I/O subsystem. Push it as hard as you can, maybe a bit less than that, but definitely not more.

By not reading all the files in full in step number 1, you use some of the time spent reading these files to do useful CPU work. You may also experiment with lowering the number of bytes read in step 1.
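The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: `ChunkedPipeline`, `run`, and `readChunk` are hypothetical names, reads are strictly sequential (no 2 to 8 parallel readers), and every file stays open for the whole run. Each "round" holds one chunk per file, all taken from the same offset, which matches the question's need to look at byte i in every file together.

```scala
import java.io.{FileInputStream, InputStream}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ChunkedPipeline {
  /** Reads up to `size` bytes from `in`; returns fewer (possibly zero) bytes at end of file. */
  private def readChunk(in: InputStream, size: Int): Array[Byte] = {
    val buf = new Array[Byte](size)
    var filled = 0
    var eof = false
    while (filled < size && !eof) {
      val n = in.read(buf, filled, size - filled)
      if (n == -1) eof = true else filled += n
    }
    java.util.Arrays.copyOf(buf, filled)
  }

  /** Applies `process` to each round of chunks and returns the per-round results in order. */
  def run[A](paths: Seq[String], chunkSize: Int = 512 * 1024)
            (process: Seq[Array[Byte]] => A)
            (implicit ec: ExecutionContext): Seq[A] = {
    val inputs = paths.toVector.map(p => new FileInputStream(p))
    try {
      val pending = Iterator
        .continually(inputs.map(readChunk(_, chunkSize))) // step 1: read one chunk per file, in turn
        .takeWhile(_.exists(_.nonEmpty))                  // step 3: repeat until every file is drained
        .map(chunks => Future(process(chunks)))           // step 2: CPU work overlaps the next reads
        .toVector
      // Step 4: collect the results. Chunks stay in memory until their Future completes,
      // so slow processing can let unprocessed rounds pile up.
      Await.result(Future.sequence(pending), Duration.Inf)
    } finally inputs.foreach(_.close())
  }
}
```

With `scala.concurrent.ExecutionContext.Implicits.global` in scope, a call such as `ChunkedPipeline.run(paths)(chunks => chunks.map(_.length).sum)` would return one value per round. Whether this beats plain sequential processing depends on the disk and the cost of `process`, so it is worth benchmarking with different chunk sizes.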
