How can I compress a large number of doubles like these?

I want to store billions (10^9) of double-precision floating-point values in memory while saving space. These values are grouped into thousands of ordered sets (they are time series), and within a set I know that the difference between values is usually small compared to their absolute value. Also, the closer two values are to each other, the more likely it is that their difference is relatively small.

The perfect fit would be delta encoding, which stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on walking through a complete set in sequence. I therefore use deltas against a set-wide baseline, which yields deltas that I expect to be between 10 and 50 percent of the absolute value (most of the time).
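Concretely, the baseline scheme looks roughly like this (a minimal sketch; storing the first value as the baseline is just a placeholder choice, and the deltas themselves still need a compact encoding to actually save space):

    def encode_set(values):
        # One baseline per set plus a delta per value, so any element can be
        # reconstructed without walking its predecessors.
        baseline = values[0]
        deltas = [v - baseline for v in values]
        return baseline, deltas

    def decode_value(baseline, deltas, i):
        # Random access by index within the set.
        return baseline + deltas[i]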

I have considered the following approaches:

  • Divide the smaller value by the larger one, getting a value between 0 and 1 that can be stored as a fixed-precision integer, plus one bit to remember which number was divided by which. This is fairly simple and gives satisfactory compression, but it is not lossless, so it is only a secondary choice.
  • XOR the IEEE 754 binary representations of both values and store the length of the run of leading zeros (spanning the exponent and the start of the mantissa) plus the remaining differing bits (see the sketch after this list). Here I am quite unsure how to estimate the compression ratio, although I think it should be good in most cases.
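For the XOR approach, a rough way to see how many bits actually need to be stored is to XOR the raw 64-bit patterns and count the leading zero bits; a minimal sketch using only the standard library:

    import struct

    def xor_bits(a, b):
        # Reinterpret both doubles as unsigned 64-bit integers and XOR them.
        ai, = struct.unpack('<Q', struct.pack('<d', a))
        bi, = struct.unpack('<Q', struct.pack('<d', b))
        return ai ^ bi

    def leading_zero_bits(x):
        # Leading zeros in a 64-bit value; 64 means the values were identical.
        return 64 - x.bit_length()

    # Nearby values share the sign, exponent, and high mantissa bits, so the
    # XOR has a long run of leading zeros that does not need to be stored.
    print(leading_zero_bits(xor_bits(1234.5678, 1234.5679)))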

Are there standard ways to do this? What could be the problems with my approaches above? What other solutions have you seen or used yourself?

+4
3 answers

Rarely are all the bits of a double-precision value significant.

If you have billions of values that are the result of some measurement, find the calibration and error of your measuring device. Quantize the values so that you only work with the significant bits.

Often you will find that 16 bits capture the actual dynamic range. In that case you can compress everything into arrays of short that preserve all of the original input.

Use a simple "Z-score" technique, where each value is really a signed fraction of the standard deviation.

A sequence of samples with mean m and standard deviation s is thus converted into a bunch of Z-scores. Normal Z-score conversions use doubles, but you should use a fixed-point version: s/1000 or s/16384 or whatever keeps only the actual precision of your data, not the noise bits at the end.

    for u in samples: z = int(16384 * (u - m) / s)          # encode: fixed-point Z-score
    for z in scaled_samples: u = s * (z / 16384.0) + m      # decode: approximate original
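If one standard deviation maps to 16,384 counts, the scores fit into signed 16-bit shorts; a runnable sketch (the sample data and the clamping to the int16 range are just for illustration):

    import array
    import statistics

    samples = [10.0, 10.2, 9.9, 10.1, 10.05]      # stand-in data
    m = statistics.mean(samples)
    s = statistics.stdev(samples)

    # Pack fixed-point Z-scores into signed 16-bit shorts: 2 bytes per value
    # instead of 8, clamped to the representable range.
    scale = 16384
    zs = array.array('h', (max(-32768, min(32767, int(scale * (u - m) / s)))
                           for u in samples))

    # Approximate reconstruction of the original samples (m and s are kept
    # alongside the array).
    restored = [s * (z / scale) + m for z in zs]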

Your Z-scores retain a pleasant, easy-to-work-with statistical relationship to the original samples.


Say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 standard deviations.

If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024 standard deviations.

I seriously doubt you have measuring devices that accurate. Further, any arithmetic performed as part of filtering, calibration, or noise reduction may reduce the effective range because of noise bits introduced by the arithmetic. A badly thought-out division operator could turn many of your decimal places into nothing more than noise.

+9

Whatever compression scheme you choose, you can decouple it from the problem of supporting arbitrary seeks by compressing into blocks of a fixed size and prepending to each block a header containing everything needed to decompress it. For a delta-coding scheme, for example, the block would contain the deltas, encoded in some way that exploits their small size to make them take less space (fewer bits for the exponent/mantissa, conversion to a fixed-point value, Huffman coding, etc.), plus one uncompressed starting value in the header. A seek then becomes the cheap operation of picking the appropriate block and decompressing it.
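A minimal sketch of that layout, using the fixed-point option for the deltas (block size, scale, and names are placeholders, and the fixed-point step is lossy unless the deltas happen to fit the chosen grid):

    import struct

    BLOCK = 1024          # values per block
    SCALE = 1 << 20       # fixed-point scale for the deltas

    def compress_block(values):
        # Header: one uncompressed base value; body: fixed-point 32-bit deltas.
        base = values[0]
        deltas = [int((v - base) * SCALE) for v in values[1:]]
        return struct.pack('<d%di' % len(deltas), base, *deltas)

    def read_value(blocks, index):
        # Seek = pick the block, unpack it, index into it.
        blk = blocks[index // BLOCK]
        n = (len(blk) - 8) // 4
        base, *deltas = struct.unpack('<d%di' % n, blk)
        i = index % BLOCK
        return base if i == 0 else base + deltas[i - 1] / SCALE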

If the compression ratio varies so much that a lot of space would be wasted padding the compressed data out to fixed-size blocks, you can instead build a directory of offsets into the compressed data and record in it the state needed for decompression.
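The directory can be as simple as a list of byte offsets into one big compressed buffer; a small sketch (the block decompressor itself is whatever scheme you chose above):

    def block_bytes(buf, offsets, block_index):
        # offsets[i] is the byte position where block i starts inside buf.
        start = offsets[block_index]
        end = offsets[block_index + 1] if block_index + 1 < len(offsets) else len(buf)
        return buf[start:end]     # hand this slice to the block decompressor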

+4

If you know that a group of doubles all share the same exponent, you can store the exponent once and store only the mantissa of each value.
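A sketch of how the exponent and mantissa can be split out of the IEEE 754 bit pattern and recombined (assuming the group really does share one exponent; the sign bit is kept with the stored part here):

    import struct

    def split_double(x):
        bits, = struct.unpack('<Q', struct.pack('<d', x))
        sign     = bits >> 63
        exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
        mantissa = bits & ((1 << 52) - 1)        # 52-bit mantissa
        return sign, exponent, mantissa

    def join_double(sign, exponent, mantissa):
        bits = (sign << 63) | (exponent << 52) | mantissa
        return struct.unpack('<d', struct.pack('<Q', bits))[0]

    # Store the shared exponent once per group and 53 bits (sign + mantissa)
    # per value instead of 64.
    values = [1.5, 1.75, 1.9921875]              # all in [1, 2): same exponent
    shared_exp = split_double(values[0])[1]
    stored = [(s, m) for s, _, m in map(split_double, values)]
    restored = [join_double(s, shared_exp, m) for s, m in stored]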

+3
