Writing a "compressed" array to improve I/O performance?

I have an int array and a float array, each of size 220 million (fixed). I currently store/load these arrays to/from disk using Java NIO FileChannel and MappedByteBuffer. It works fine, but it takes about 5 seconds (wall clock time) to store or load an array. I want to make this faster.

I should mention that most of these array elements are 0 (almost 52%).

For example:

int arr1[] = {0, 0, 6, 7, 1, 0, 0, ...}

Can someone suggest a good way to improve speed by not saving or loading these 0s? The zeros can be restored on load with Arrays.fill(array, 0).
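For context, a minimal sketch of the MappedByteBuffer store/load approach described above (the file path, class name, and helper names are placeholders, not from the question):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStore {
    // Write the whole int[] through a memory-mapped file.
    static void store(int[] arr, String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            MappedByteBuffer map = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, (long) arr.length * 4);
            map.asIntBuffer().put(arr);
            map.force(); // flush the mapped region to disk
        }
    }

    // Read it back; the caller must know the array length.
    static int[] load(String path, int length) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            MappedByteBuffer map = raf.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, (long) length * 4);
            int[] arr = new int[length];
            map.asIntBuffer().get(arr);
            return arr;
        }
    }
}
```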

+7
source share
4 answers

The following approach requires n / 8 + nz * 4 bytes on disk, where n is the size of the array and nz is the number of non-zero entries. With 52% zero entries, you reduce the storage size by about 52% - 3% = 49%.
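A quick back-of-the-envelope check of that formula (my own arithmetic, not part of the answer) for the 220-million-element array from the question, confirming the roughly 49% saving:

```java
public class SizeEstimate {
    public static void main(String[] args) {
        long n = 220_000_000L;        // array length from the question
        long zeros = n * 52 / 100;    // ~52% zero entries
        long nz = n - zeros;          // non-zero entries
        long raw = n * 4;             // plain int storage, bytes
        long packed = n / 8 + nz * 4; // bitset + non-zero ints, bytes
        System.out.printf("raw: %d MB, packed: %d MB, saved: %.1f%%%n",
                raw >> 20, packed >> 20, 100.0 * (raw - packed) / raw);
    }
}
```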

You can do:

    void write(int[] array) {
        BitSet zeroes = new BitSet();
        for (int i = 0; i < array.length; i++)
            zeroes.set(i, array[i] == 0);
        write(zeroes); // one bit per index
        for (int i = 0; i < array.length; i++)
            if (array[i] != 0)
                write(array[i]);
    }

    int[] read() {
        BitSet zeroes = readBitSet();
        int[] array = new int[zeroes.length()];
        for (int i = 0; i < zeroes.length(); i++) {
            if (zeroes.get(i)) {
                // nothing to do (array[i] was initialized to 0)
            } else {
                array[i] = readInt();
            }
        }
        return array;
    }

Edit: you say this is still a bit slow, which implies the disk is not the bottleneck. You can tune the above approach by writing the bitset as you build it, so you don't have to materialize the whole bitset in memory before writing it to disk. Moreover, by interleaving one bitmap word at a time with the actual data, we make only a single pass over the array, reducing cache misses:

    void write(int[] array) {
        writeInt(array.length);
        for (int i = 0; i < array.length; i += 32) {
            int zeroesMap = 0;
            for (int j = i + 31; j >= i; j--) {
                zeroesMap <<= 1;
                if (array[j] == 0) {
                    zeroesMap |= 1;
                }
            }
            writeInt(zeroesMap);
            for (int j = i; j < i + 32; j++)
                if (array[j] != 0) {
                    writeInt(array[j]);
                }
        }
    }

    int[] read() {
        int[] array = new int[readInt()];
        for (int i = 0; i < array.length; i += 32) {
            int zeroesMap = readInt();
            for (int j = i; j < i + 32; j++) {
                if ((zeroesMap & 1) == 1) {
                    // nothing to do (array[j] was initialized to 0)
                } else {
                    array[j] = readInt();
                }
                zeroesMap >>= 1;
            }
        }
        return array;
    }

(The previous code assumes that array.length is a multiple of 32. If it is not, write the last chunk of the array in whatever way you like.)
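For completeness, here is a self-contained, testable version of the same interleaved-bitmap scheme, using DataOutputStream/DataInputStream in place of the answer's writeInt/readInt helpers (the class and method names are mine):

```java
import java.io.*;

public class BitmapCodec {
    // Encode: for each 32-int chunk, write a zero-bitmap word, then the non-zero values.
    // Assumes array.length is a multiple of 32, as in the answer above.
    static byte[] encode(int[] array) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(array.length);
        for (int i = 0; i < array.length; i += 32) {
            int zeroesMap = 0;
            for (int j = i + 31; j >= i; j--) {
                zeroesMap <<= 1;
                if (array[j] == 0) zeroesMap |= 1; // LSB ends up describing index i
            }
            out.writeInt(zeroesMap);
            for (int j = i; j < i + 32; j++)
                if (array[j] != 0) out.writeInt(array[j]);
        }
        return bos.toByteArray();
    }

    static int[] decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int[] array = new int[in.readInt()];
        for (int i = 0; i < array.length; i += 32) {
            int zeroesMap = in.readInt();
            for (int j = i; j < i + 32; j++) {
                if ((zeroesMap & 1) == 0) array[j] = in.readInt(); // bit clear = value present
                zeroesMap >>>= 1;
            }
        }
        return array;
    }
}
```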

If this doesn't reduce the time either, compression is not the way to go (I don't think any general-purpose compression algorithm will be faster than the above).

+5
source

Depending on the distribution, consider Run-Length Encoding:

Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most useful on data that contains many such runs.
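As a minimal illustration of the idea (mine, not part of the answer), zero-heavy data can be encoded as (value, runLength) pairs:

```java
import java.util.ArrayList;
import java.util.List;

public class Rle {
    // Encode as (value, runLength) pairs - compact when long runs (e.g. zeros) dominate.
    static int[] encode(int[] a) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < a.length;) {
            int v = a[i], run = 0;
            while (i < a.length && a[i] == v) { i++; run++; }
            out.add(v);
            out.add(run);
        }
        int[] r = new int[out.size()];
        for (int i = 0; i < r.length; i++) r[i] = out.get(i);
        return r;
    }

    static int[] decode(int[] enc, int length) {
        int[] a = new int[length];
        int pos = 0;
        for (int i = 0; i < enc.length; i += 2) {
            int v = enc[i], run = enc[i + 1];
            for (int k = 0; k < run; k++) a[pos++] = v;
        }
        return a;
    }
}
```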

It is simple, which is both the good and possibly the bad news here ;-)

+4
source

If you want to write the serialization/deserialization code yourself, instead of storing all the zeros you can store a set of ranges indicating where the zeros are (with a special marker), along with the actual non-zero data.

So, the array in your example: {0, 0, 6, 7, 1, 0, 0 ...} can be stored as:

%0-1, 6, 7, 1, %5-6

When reading this data back, if you hit a %, you know a zero range follows: read the start and end indices and fill that range with zeros. If there is no %, the next token is an actual value.

For a sparse array with long runs of consecutive values, this yields a lot of compression.
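A minimal sketch of this range idea (my own; it uses Integer.MIN_VALUE as the '%' marker and assumes that value never occurs in the real data):

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroRanges {
    // Sentinel marking "a zero range follows"; assumes real data never contains this value.
    static final int MARK = Integer.MIN_VALUE;

    static int[] encode(int[] a) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < a.length;) {
            if (a[i] == 0) {
                int start = i;
                while (i < a.length && a[i] == 0) i++;
                out.add(MARK); out.add(start); out.add(i - 1); // inclusive range of zeros
            } else {
                out.add(a[i++]); // literal non-zero value
            }
        }
        int[] r = new int[out.size()];
        for (int i = 0; i < r.length; i++) r[i] = out.get(i);
        return r;
    }

    static int[] decode(int[] enc, int length) {
        int[] a = new int[length]; // already zero, as Arrays.fill(array, 0) would give
        int pos = 0;
        for (int i = 0; i < enc.length;) {
            if (enc[i] == MARK) {
                pos = enc[i + 2] + 1; // skip past the zero range (slots stay 0)
                i += 3;
            } else {
                a[pos++] = enc[i++];
            }
        }
        return a;
    }
}
```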

+2
source

There are standard compression utilities in Java: java.util.zip. It is a general-purpose library, but thanks to its sheer availability it is a decent solution. Specialized compression/encoding should be investigated if necessary, and zip would rarely be my first choice.

Here is an example of how to drive zip via Deflater/Inflater. Most people know only ZipInputStream/ZipOutputStream (and especially GZip). All of them have drawbacks: extra copies from memory to zlib, and GZip especially is a complete disaster, since its CRC32 calls into native code (calling native code prevents optimization and introduces a few more performance hits).

A few important points: raising the zip compression level does not lead to any performance benefit; of course, you can experiment and find the best ratio between CPU and disk activity for your case.

The code also demonstrates one of the real shortcomings of java.util.zip: it does not support direct buffers. The support would be trivial to add, but no one has bothered. Direct buffers would save several memory copies and reduce the memory footprint.

Last note: there is a pure-Java version of zlib, (j)zlib, and it outperforms the native implementation at compression quite nicely.

    package t1;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.Random;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class ZInt {
        private static final int bucketSize = 1 << 17; // in the real world this should not be a constant
        static final int zipLevel = 2; // feel free to experiment; higher compression (5+) is likely a total waste

        static void write(int[] a, File file, boolean sync) throws IOException {
            byte[] bucket = new byte[Math.min(bucketSize, Math.max(1 << 13, Integer.highestOneBit(a.length >> 3)))]; // 128KB bucket
            byte[] zipOut = new byte[bucket.length];
            final FileOutputStream fout = new FileOutputStream(file);
            FileChannel channel = fout.getChannel();
            try {
                ByteBuffer buf = ByteBuffer.wrap(bucket);
                // unfortunately java.util.zip doesn't support direct buffers - that would be the perfect fit
                ByteBuffer out = ByteBuffer.wrap(zipOut);
                out.putInt(a.length); // write length aka header
                if (a.length == 0) {
                    doWrite(channel, out, 0);
                    return;
                }
                Deflater deflater = new Deflater(zipLevel, false);
                try {
                    for (int i = 0; i < a.length;) {
                        i = put(a, buf, i);
                        buf.flip();
                        deflater.setInput(bucket, buf.position(), buf.limit());
                        if (i == a.length)
                            deflater.finish();
                        // hacking and reusing bucket here is tempting since it's copied twice, but well...
                        for (int n; (n = deflater.deflate(zipOut, out.position(), out.remaining())) > 0;) {
                            doWrite(channel, out, n);
                        }
                        buf.clear();
                    }
                } finally {
                    deflater.end();
                }
            } finally {
                if (sync)
                    fout.getFD().sync();
                channel.close();
            }
        }

        static int[] read(File file) throws IOException, DataFormatException {
            FileChannel channel = new FileInputStream(file).getChannel();
            try {
                byte[] in = new byte[(int) Math.min(bucketSize, channel.size())];
                ByteBuffer buf = ByteBuffer.wrap(in);
                channel.read(buf);
                buf.flip();
                int[] a = new int[buf.getInt()];
                if (a.length == 0)
                    return a;
                int i = 0;
                byte[] inflated = new byte[Math.min(1 << 17, a.length * 4)];
                ByteBuffer intBuffer = ByteBuffer.wrap(inflated);
                Inflater inflater = new Inflater(false);
                try {
                    do {
                        if (!buf.hasRemaining()) {
                            buf.clear();
                            channel.read(buf);
                            buf.flip();
                        }
                        inflater.setInput(in, buf.position(), buf.remaining());
                        buf.position(buf.position() + buf.remaining()); // simulate all read
                        for (;;) {
                            int n = inflater.inflate(inflated, intBuffer.position(), intBuffer.remaining());
                            if (n == 0)
                                break;
                            intBuffer.position(intBuffer.position() + n);
                            intBuffer.flip();
                            for (; intBuffer.remaining() > 3 && i < a.length; i++) { // need at least 4 bytes to form an int
                                a[i] = intBuffer.getInt();
                            }
                            intBuffer.compact();
                        }
                    } while (channel.position() < channel.size() && i < a.length);
                } finally {
                    inflater.end();
                }
                // System.out.printf("read ints: %d - channel.position: %d%n", i, channel.position());
                return a;
            } finally {
                channel.close();
            }
        }

        private static void doWrite(FileChannel channel, ByteBuffer out, int n) throws IOException {
            out.position(out.position() + n);
            out.flip();
            while (out.hasRemaining())
                channel.write(out);
            out.clear();
        }

        private static int put(int[] a, ByteBuffer buf, int i) {
            for (; buf.hasRemaining() && i < a.length;) {
                buf.putInt(a[i++]);
            }
            return i;
        }

        private static int[] generateRandom(int len) {
            Random r = new Random(17);
            int[] n = new int[len];
            for (int i = 0; i < len; i++) {
                n[i] = r.nextBoolean() ? 0 : r.nextInt(1 << 23); // limit bounds to have any sensible compression
            }
            return n;
        }

        public static void main(String[] args) throws Throwable {
            File file = new File("xxx.xxx");
            int[] n = generateRandom(3000000); // {0,2,4,1,2,3};
            long start = System.nanoTime();
            write(n, file, false);
            long elapsed = System.nanoTime() - start; // elapsed will be fairer if sync is true
            System.out.printf("File length: %d, for %d ints, ratio %.2f in %.2fms%n",
                    file.length(), n.length, ((double) file.length()) / 4 / n.length,
                    java.math.BigDecimal.valueOf(elapsed, 6));
            int[] m = read(file);
            // compare; Arrays.equals doesn't report the mismatch position, so it's of limited use here
            for (int i = 0; i < n.length; i++) {
                if (m[i] != n[i]) {
                    System.err.printf("Failed at %d%n", i);
                    break;
                }
            }
            System.out.printf("All done!");
        }
    }

Please note that the code is not a proper benchmark!
The late answer is due to the fact that coding yet another zip example was rather boring; sorry.

+2
source
