Compressing Unix timestamps with microsecond accuracy

I have a file that consists of a sequence of Unix timestamps with microsecond accuracy in non-decreasing order, i.e. timestamps can never decrease. All timestamps that need to be encoded and decoded fall on the same day. A sample record in the file looks like 1364281200.078739, which corresponds to 1364281200078739 microseconds since the epoch. The data is unevenly spaced and bounded.

I need to get the compression down to about 10 bits per timestamp. Currently I achieve an average of 31 bits per timestamp by storing the difference between consecutive timestamps. How can I improve on this?

Edit:

I calculate the compression rate as (encoded file size in bytes) / (number of timestamps) * 8. I split each timestamp into two parts, before and after the '.'. The integer part is nearly constant, and the maximum difference between two consecutive integer parts is 32, so I encoded it using 0-8 bits. The fractional part is fairly random, so I dropped the leading zero bits and wrote it to the file using 0-21 bits (its maximum is 999999). However, my encoded file is 4007674 bytes, which gives a compression rate of 71.05 bits per timestamp. I also write a '.' and a space between two timestamps so that the file can be decoded later. How can I reduce the size of my encoded file?

Here is a link to a partial dataset: http://pastebin.com/QBs9Bqv0

Here is a link to the timestamp values in microseconds: http://pastebin.com/3QJk1NDV The maximum difference between consecutive timestamps is 32594136 microseconds.

+4
3 answers

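If you take the interval between each timestamp and the previous one and express it in microseconds (i.e. as an integer), the distribution of values per bit depth in your example file is:

[Figure: bit-depth distribution of the intervals]

As you can see, 52.285% of the values are either 0 or 1, there are only a few other values below 64 (2~6 bits), 27.59% of the values are 7~12 bits, there is a fairly even distribution of around 2.1% per bit depth up to 20 bits, and only 3% of the values are above 20 bits, with a maximum of 25 bits. Looking at the data, it is also clear that there are many runs of up to 6 consecutive zeros.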
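
A distribution like this can be reproduced along the following lines. This is only a minimal sketch: the names bit_depth and depth_histogram are made up for this example, and the timestamps are assumed to be already parsed into microseconds and non-decreasing.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to store v: 0 for 0, 1 for 1, 2 for 2..3, 3 for 4..7, ...
int bit_depth(uint64_t v) {
    int depth = 0;
    while (v) { ++depth; v >>= 1; }
    return depth;
}

// Counts how many intervals between consecutive timestamps fall into each
// bit depth. Assumption: usecs holds non-decreasing timestamps in microseconds.
std::array<std::size_t, 65> depth_histogram(const std::vector<uint64_t>& usecs) {
    std::array<std::size_t, 65> counts{};
    for (std::size_t i = 1; i < usecs.size(); ++i) {
        ++counts[bit_depth(usecs[i] - usecs[i - 1])];
    }
    return counts;
}
```

Dividing each count by the number of intervals gives the percentage per bit depth; depths 0 and 1 together correspond to the intervals of 0 and 1 microseconds.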

These observations suggest using a variable number of bits per value, something like this:

00 0xxxxx                             0 (xxxxx is the number of consecutive zeros)
00 1xxxxx                             1 (xxxxx is the number of consecutive ones)
01 xxxxxx xxxxxxxx                    2-14 bit values
10 xxxxxx xxxxxxxx xxxxxxxx           15-22 bit values
11 xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  23-30 bit values

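A quick test showed that this gives a compression rate of 13.78 bits per timestamp, which is not quite the 10 bits you were aiming for, but not a bad start for such a simple scheme.

After analysing the example data some more, I noticed that there are many short sequences of consecutive 0's and 1's, like 0 1 0, so I replaced the 1-byte scheme with this one: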

00xxxxxx    00     = identifies a one-byte value
            xxxxxx = index in the sequence table

The table of sequences:

index ~ seq    index ~ seq    index ~ seq    index ~ seq    index ~ seq     index ~ seq
  0      0       2     00       6     000     14    0000     30   00000     62   000000
  1      1       3     01       7     001     15    0001     31   00001     63   000001
                 4     10       8     010     16    0010     32   00010    
                 5     11         ...            ...           ...
                               11     101     27    1101     59   11101    
                               12     110     28    1110     60   11110    
                               13     111     29    1111     61   11111    
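
The table follows a simple rule: sequences of length len start at index 2^len - 2 and are ordered by their value read as a binary number. A small sketch of the mapping in both directions (sequence_index and sequence_at are names chosen for this illustration only):

```cpp
#include <cstdint>
#include <string>

// Table index of a 0/1 sequence, where seq holds the sequence as a binary
// number. Assumes 1 <= len <= 6; of the 6-bit sequences only 000000 and
// 000001 fit in the table (index 62 and 63).
uint8_t sequence_index(uint8_t seq, uint8_t len) {
    return static_cast<uint8_t>((1u << len) - 2 + seq);
}

// Inverse lookup for index 0..63: returns the sequence as a string of '0' and '1'.
std::string sequence_at(uint8_t index) {
    uint8_t len = 1;
    while (index >= (1u << (len + 1)) - 2) ++len;    // find the length group
    uint8_t seq = static_cast<uint8_t>(index - ((1u << len) - 2));
    std::string bits;
    for (int i = len - 1; i >= 0; --i) bits += ((seq >> i) & 1) ? '1' : '0';
    return bits;
}
```

For instance, sequence_index(5, 3) gives 11 for the sequence 101, and sequence_at(27) gives "1101", matching the table.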

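For the example file with 451,210 timestamps, this brings the encoded file size down to 676,418 bytes, or 11.99 bits per timestamp.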

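Testing the above method revealed that there were 98,578 single zeros and 31,271 single ones between larger intervals. So I used 1 bit of each larger interval to record a single zero next to it, which reduced the total size to 592,315 bytes. Using 2 bits instead, to record a neighbouring 0, 1 or 00 (the most common sequences), brought the total size down to 564,034 bytes, or 10.0004 bits per timestamp.
Finally, I stored the single 0's and 1's with the large interval that follows them (which also keeps the code simple), and this resulted in a file size of 563,884 bytes, or 9.997722 bits per timestamp!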

So the complete method is:

Store the first timestamp (8 bytes), then store the intervals as either: 

00 iiiiii                             sequences of up to 5 (or 6) zeros or ones
01 XXxxxx xxxxxxxx                     2-12 bit values (2 ~ 4,095)
10 XXxxxx xxxxxxxx xxxxxxxx           13-20 bit values (4,096 ~ 1,048,575)
11 XXxxxx xxxxxxxx xxxxxxxx xxxxxxxx  21-28 bit values (1,048,576 ~ 268,435,455)

iiiiii = index in sequence table (see above)
XX     = preceded by a zero (if XX=1), a one (if XX=2) or two zeros (if XX=3)
xxx... = 12, 20 or 28 bit value
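
To make the byte layout of the larger intervals concrete, here is a stand-alone sketch of packing and unpacking a single interval. The function names pack_interval and unpack_interval are invented for this illustration; the actual encoder and decoder below do the same work inline.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs one interval of up to 28 bits into 2, 3 or 4 bytes.
// xx is the 2-bit flag: 0 = nothing, 1 = preceded by a zero,
// 2 = preceded by a one, 3 = preceded by two zeros.
// Returns an empty vector if the interval needs more than 28 bits.
std::vector<uint8_t> pack_interval(uint32_t interval, uint8_t xx) {
    if (interval >> 28) return {};
    uint8_t extra = 1;                        // bytes after the first one
    while (extra < 3 && (interval >> (extra * 8 + 4))) ++extra;
    std::vector<uint8_t> bytes(extra + 1);
    for (uint8_t i = 0; i <= extra; ++i) {    // big-endian, top 4 bits in byte 0
        bytes[i] = static_cast<uint8_t>(interval >> ((extra - i) * 8));
    }
    bytes[0] |= static_cast<uint8_t>((extra << 6) | ((xx & 3) << 4));
    return bytes;
}

// Reverses pack_interval; xx receives the 2-bit flag.
uint32_t unpack_interval(const std::vector<uint8_t>& bytes, uint8_t& xx) {
    xx = (bytes[0] >> 4) & 3;
    uint32_t interval = bytes[0] & 15;
    for (std::size_t i = 1; i < bytes.size(); ++i) interval = (interval << 8) | bytes[i];
    return interval;
}
```

For example, an interval of 4096 preceded by two zeros (XX=3) packs into the three bytes 0xB0 0x10 0x00: the top two bits 10 select the 3-byte form, the next two bits are XX, and the remaining 20 bits hold the value.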

Encoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

void write_timestamp(ofstream& ofile, uint64_t timestamp) {    // big-endian
    uint8_t bytes[8];
    for (int i = 7; i >= 0; i--, timestamp >>= 8) bytes[i] = timestamp;
    ofile.write((char*) bytes, 8);
}

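int main() {
    ifstream ifile ("timestamps.txt");
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.bin", ios::trunc | ios::binary);
    if (! ofile.is_open()) return 2;

    long double seconds;
    uint64_t timestamp = 0;

    // the first timestamp is stored in full; add 0.5 to round to the nearest
    // microsecond, because the decimal input is not exactly representable
    if (ifile >> seconds) {
        timestamp = (uint64_t) (seconds * 1000000 + 0.5);
        write_timestamp(ofile, timestamp);
    }

    while (ifile) {    // stops at end of file and on unparsable input
        uint8_t bytesize = 0, len = 0, seq = 0, bytes[4];
        uint32_t interval;

        // collect intervals of 0 or 1 until the sequence is full or a larger interval arrives
        while (bytesize == 0 && ifile >> seconds) {
            uint64_t next = (uint64_t) (seconds * 1000000 + 0.5);
            // the format holds at most 28 bits; this also rejects decreasing timestamps
            if ((next - timestamp) >> 28) return 3;
            interval = (uint32_t) (next - timestamp);
            timestamp = next;

            if (interval < 2) {
                seq <<= 1; seq |= interval;
                // complete at 5 intervals, or at 6 if the first 5 were all zeros
                if ((++len == 5 && seq > 0) || len == 6) bytesize = 1;
            } else {
                // bytesize = number of bytes after the first one (1, 2 or 3)
                while (interval >> (++bytesize * 8 + 4));
                for (uint8_t i = 0; i <= bytesize; i++) {
                    bytes[i] = interval >> (bytesize - i) * 8;
                }
                bytes[0] |= (bytesize++ << 6);    // 01, 10 or 11 in the top two bits
            }
        }
        if (len) {
            if (bytesize > 1 && (len == 1 || (len == 2 && seq == 0))) {
                // fold a preceding 0, 1 or 00 into the XX bits of the interval
                bytes[0] |= (2 * len + seq - 1) << 4;
            } else {
                // write the pending sequence as a table index (top two bits are 00)
                seq += (1 << len) - 2;
                ofile.write((char*) &seq, 1);
            }
        }
        if (bytesize > 1) ofile.write((char*) bytes, bytesize);
    }
    ifile.close();
    ofile.close();
    return 0;
}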

Decoder:

#include <stdint.h>
#include <iostream>
#include <fstream>
using namespace std;

uint64_t read_timestamp(ifstream& ifile) {    // big-endian
    uint64_t timestamp = 0;
    uint8_t byte;
    for (uint8_t i = 0; i < 8; i++) {
        ifile.read((char*) &byte, 1);
        if (ifile.fail()) return 0;
        timestamp <<= 8; timestamp |= byte;
    }
    return timestamp;
}

uint8_t read_interval(ifstream& ifile, uint8_t *bytes) {
    uint8_t bytesize = 1;
    ifile.read((char*) bytes, 1);
    if (ifile.fail()) return 0;
    bytesize += bytes[0] >> 6;
    for (uint8_t i = 1; i < bytesize; i++) {
        ifile.read((char*) bytes + i, 1);
        if (ifile.fail()) return 0;
    }
    return bytesize;
}

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << seconds << "\n";
}

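// Writes the timestamps for one sequence of 0/1 intervals, given its table
// index, and returns the total time covered by the sequence.
uint8_t write_sequence(ofstream& ofile, uint8_t seq, uint64_t timestamp) {
    uint8_t interval = 0, len = 1, offset = 1;
    // sequences of length len start at index 2^len - 2; strip that offset
    while (seq >= (offset <<= 1)) {
        seq -= offset;
        ++len;
    }
    // replay the bits from most to least significant, one timestamp per bit
    while (len--) {
        interval += (seq >> len) & 1;
        write_seconds(ofile, timestamp + interval);
    }
    return interval;
}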

int main() {
    // note: the encoder above writes "output.bin"; rename that file or adjust this name
    ifstream ifile ("timestamps.bin", ios::binary);
    if (! ifile.is_open()) return 1;
    ofstream ofile ("output.txt", ios::trunc);
    if (! ofile.is_open()) return 2;
    ofile.precision(6); ofile << std::fixed;

    uint64_t timestamp = read_timestamp(ifile);
    if (timestamp) write_seconds(ofile, timestamp);

    while (ifile) {    // read_interval returns 0 once the input is exhausted
        uint8_t bytes[4], seq = 0, bytesize = read_interval(ifile, bytes);
        uint32_t interval;

        if (bytesize == 1) {
            // one byte: table index of a sequence of 0/1 intervals
            timestamp += write_sequence(ofile, bytes[0], timestamp);
        }
        else if (bytesize > 1) {
            // XX bits: a 0, 1 or 00 that precedes the large interval
            seq = (bytes[0] >> 4) & 3;
            if (seq) timestamp += write_sequence(ofile, seq - 1, timestamp);
            // the low 4 bits of the first byte are the top bits of the value
            interval = bytes[0] & 15;
            for (uint8_t i = 1; i < bytesize; i++) {
                interval <<= 8; interval += bytes[i];
            }
            timestamp += interval;
            write_seconds(ofile, timestamp);
        }
    }
    ifile.close();
    ofile.close();
    return 0;
}

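Because of a long double output bug in the MinGW/gcc 4.8.1 compiler that I use, I had to use this workaround (which should not be necessary with other compilers):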

void write_seconds(ofstream& ofile, uint64_t timestamp) {
    long double seconds = (long double) timestamp / 1000000;
    ofile << "1" << (double) (seconds - 1000000000) << "\n";
}
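
An alternative that sidesteps floating-point output altogether is to print the seconds and the microseconds as two integers. A minimal sketch; format_usec is a name made up for this example and is not part of the decoder above:

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Formats a microsecond timestamp as "seconds.microseconds" using only
// integer arithmetic, so no floating-point rounding or printing is involved.
std::string format_usec(uint64_t timestamp) {
    std::ostringstream out;
    out << timestamp / 1000000 << '.'
        << std::setw(6) << std::setfill('0') << timestamp % 1000000;
    return out.str();
}
```

In the decoder, write_seconds could then be reduced to writing format_usec(timestamp) followed by a newline, assuming the output only needs the six-decimal format shown in the question.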

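Note to future readers: this method is based on the analysis of one example data file; if your data is different, it will not give the same compression rate.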

+4

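If you need lossless compression at microsecond accuracy, note that 10 bits only lets you count to 1024.

So if the timing of your events is random and you really need 1 microsecond accuracy, the intervals between timestamps cannot span much more than about 1 millisecond without exceeding a budget of 10 bits/timestamp.

Judging by your data, you probably cannot quite reach 10 bits/timestamp. Taking differences is the right first step, though, and you can do better than 31 bits by gathering statistics on a sample dataset and choosing a prefix code that reflects that distribution.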


+1

It is hard to say without seeing the data, but a Rice code may be worth trying.
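
For reference, a Rice code with parameter k writes the quotient n >> k in unary, followed by the k low bits of n. The sketch below produces the code as a string of '0' and '1' characters for readability; rice_encode and rice_decode are illustrative names, k is assumed to be below 32, and a real encoder would pack the bits into bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Rice code with parameter k: the quotient n >> k in unary (that many '1'
// bits followed by a '0'), then the k low bits of n in binary.
std::string rice_encode(uint32_t n, unsigned k) {
    std::string bits(n >> k, '1');
    bits += '0';
    for (int i = static_cast<int>(k) - 1; i >= 0; --i) bits += ((n >> i) & 1) ? '1' : '0';
    return bits;
}

// Decodes one value from the front of bits, as produced by rice_encode.
uint32_t rice_decode(const std::string& bits, unsigned k) {
    std::size_t pos = 0;
    uint32_t q = 0;
    while (bits[pos++] == '1') ++q;           // unary quotient
    uint32_t r = 0;
    for (unsigned i = 0; i < k; ++i) r = (r << 1) | (bits[pos++] == '1');
    return (q << k) | r;
}
```

A value n costs (n >> k) + 1 + k bits, so a large interval with a small k produces a very long unary run; k would have to be chosen from the actual distribution of the intervals.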

0
