Fast compression tool for int[] in Java

In Java, at some point in my program, I have to process gigabytes of int[] arrays in memory. They are sorted and contain only natural numbers (1, 2, 3, 4, ..., up to n) that represent line numbers of files. The number n is the number of lines in the file, and it can be at most 100,000. So each array is simply a subset of the set of all line numbers in a file. As you can calculate, there are millions of such subsets, and some of them can be quite large. The distribution of data within these subsets (let's call them arrays from now on) is completely random: one may be a long array of around 50,000 numbers, another a small one of only 1,500; and each array contains unpredictable sequences, such as [3, 10, 11, 12, 13, 14, 15, 135, 136, ...] or [2, 3, 746, 7889, 7892, 80000, ...].

Since I have many arrays to compress and decompress, I would like to find the solution that is fastest in terms of execution time, so the overhead should be as low as possible.

Which library would you recommend?

+4
3 answers

You can losslessly preprocess the data to improve compression. Leave the first value as is. Replace each subsequent value with the difference between it and the previous value, minus one. Since the arrays are sorted and contain no duplicates, you are assured that these differences are non-negative. Now encode each integer as a variable-length integer using a sequence of bytes. For example: values 0..127 take one byte. If the high bit of the first byte is set (128..255), take its low seven bits as the low seven bits of the integer and read the next byte. Use the entire next byte as the next eight more significant bits if its high bit is zero, or only its low seven bits if its high bit is one. Continue until you reach a byte whose high bit is zero, which marks the end of the integer.
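To make that concrete, here is a minimal sketch of the delta-minus-one transform plus the variable-length byte encoding described above (the class and method names are mine, not part of the answer):

    import java.io.ByteArrayOutputStream;

    public class DeltaVarIntCodec {

        // Encode a sorted array of distinct line numbers: each value is replaced
        // by (value - previous - 1) and written as a variable-length integer,
        // 7 payload bits per byte, high bit set meaning "more bytes follow".
        public static byte[] encode(int[] sorted) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int previous = -1;                    // lets the first value pass through unchanged
            for (int value : sorted) {
                int delta = value - previous - 1; // non-negative, since values are sorted and distinct
                while (delta > 0x7F) {
                    out.write((delta & 0x7F) | 0x80);
                    delta >>>= 7;
                }
                out.write(delta);
                previous = value;
            }
            return out.toByteArray();
        }

        // Decode 'count' line numbers back out of the byte stream.
        public static int[] decode(byte[] bytes, int count) {
            int[] result = new int[count];
            int pos = 0, previous = -1;
            for (int i = 0; i < count; i++) {
                int delta = 0, shift = 0, b;
                do {
                    b = bytes[pos++] & 0xFF;
                    delta |= (b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                previous = previous + delta + 1;
                result[i] = previous;
            }
            return result;
        }
    }

The decoder assumes you store the element count (or the original array length) alongside the bytes; how you do that is up to you.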

Now you have the integers encoded as a sequence of bytes that is likely shorter than encoding each original integer in, say, four or eight bytes. On top of that, you can apply any standard compression technique that works on a sequence of bytes and can expect some additional gain from it. For example, if runs of consecutive line numbers are common, they turn into runs of zero bytes, which compress very well.

For maximum compression and decompression speed at the cost of some compression ratio, look at lz4. If you don't need something quite that fast, check out zlib, where you can trade speed against compression effectiveness via the compression level.
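zlib is available in the JDK through java.util.zip.Deflater, so a rough sketch of compressing the byte stream from the previous step at the fastest level (level 1, the setting used for the numbers below) could look like this; lz4 would need a third-party library such as lz4-java:

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class FastZlib {

        // Compress with zlib's fastest setting (level 1 / BEST_SPEED).
        public static byte[] compress(byte[] input) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(input);
            deflater.finish();

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            deflater.end();
            return out.toByteArray();
        }
    }

Decompression works the same way with java.util.zip.Inflater.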

For your examples, a random selection of 1,500 out of 100,000 results in about 1,720 bytes uncompressed and 1,600 bytes compressed. A random selection of 50,000 out of 100,000 results in about 50,000 bytes uncompressed and 18,600 bytes compressed. The compression was done with zlib's fastest setting, level 1.

Note that in the latter case, where half of the line numbers are used, it would be better to use a bit array, which would be 12,500 bytes uncompressed. In that case the data cannot be compressed further, since the bitmap looks random (half of the bits set, half clear). Fewer or more set bits, e.g. 25,000 or 75,000, both result in bitmaps that compress to about 10,500 bytes.

Compressed bitmaps are smaller for roughly 12,500 line numbers and up, while the differenced variable-length integers are smaller below roughly 12,500 line numbers. That crossover is the point at which both approaches have about the same uncompressed size of 12,500 bytes.
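For the dense case, the bit-array representation is straightforward with java.util.BitSet (100,000 possible lines fit in about 12,500 bytes); a rough sketch, with names of my own choosing:

    import java.util.BitSet;

    public class LineBitmap {

        // One bit per possible line number 1..n; set bits mark the lines present.
        public static byte[] toBitmap(int[] sortedLines, int n) {
            BitSet bits = new BitSet(n + 1);
            for (int line : sortedLines) {
                bits.set(line);
            }
            return bits.toByteArray();   // up to about n/8 bytes, ~12,500 for n = 100,000
        }

        // Recover the sorted line numbers from the bitmap bytes.
        public static int[] fromBitmap(byte[] bytes) {
            BitSet bits = BitSet.valueOf(bytes);
            int[] lines = new int[bits.cardinality()];
            int i = 0;
            for (int b = bits.nextSetBit(0); b >= 0; b = bits.nextSetBit(b + 1)) {
                lines[i++] = b;
            }
            return lines;
        }
    }

The resulting byte[] can then be fed to the same compressor as above.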

+3

I recommend snappy-java, which is a Java port of Google's Snappy.
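If it helps, a minimal usage sketch with the org.xerial.snappy:snappy-java artifact (the int[] still has to be turned into bytes first; a plain ByteBuffer is used here only to keep the example self-contained, though the delta/varint step from the first answer would give Snappy more redundancy to work with):

    import java.nio.ByteBuffer;
    import org.xerial.snappy.Snappy;

    public class SnappyExample {
        public static void main(String[] args) throws Exception {
            int[] lines = {3, 10, 11, 12, 13, 14, 15, 135, 136};

            // Pack the ints into bytes before handing them to Snappy.
            ByteBuffer buf = ByteBuffer.allocate(lines.length * 4);
            for (int line : lines) {
                buf.putInt(line);
            }

            byte[] compressed = Snappy.compress(buf.array()); // very fast, modest ratio
            byte[] restored = Snappy.uncompress(compressed);

            System.out.println(buf.array().length + " raw -> " + compressed.length
                    + " compressed -> " + restored.length + " restored bytes");
        }
    }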

+1

Perhaps this will help you: Compressing an array of integers in java

Do you need to do a lot of calculations on the arrays, or only read them?

Edit:

    // If space is more important than performance, this might work.
    // Note: this might be totally stupid for some cases.
    // The first element should be false, since it represents line 0. ;)
    boolean[] numbers = { false, true, true, true, false, false, true };
    for (int i = 0; i < numbers.length; i++) {
        if (numbers[i]) {
            // ...or do some calculations on/with a copy of i
            System.out.println(i);
        }
    }

Since a boolean array uses 1 byte per element (plus object overhead), a maximum of 100,000 entries means 100,000 bytes ≈ 97 KB per array.
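For completeness, a small sketch (my own naming) of how a sorted int[] of line numbers from the question could be turned into this boolean[] form:

    public class LineFlags {

        // Build the boolean[] representation from a sorted array of line numbers.
        // With n = 100,000 this always costs about 100,000 bytes (~97 KB) per array,
        // no matter how few lines are actually present.
        public static boolean[] toFlags(int[] sortedLines, int n) {
            boolean[] flags = new boolean[n + 1]; // index 0 stays false, lines start at 1
            for (int line : sortedLines) {
                flags[line] = true;
            }
            return flags;
        }
    }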

0
