Appending to a Compressed Stream

I need a solution that lets me create compressed data files (gzip, zip, tar, etc.; any format will work) and then freely append data to them without loading the entire file into memory and re-compressing it (being able to seek while unpacking would also be great). Does anyone have a suggestion for .NET?

+4
3 answers

The reason you basically cannot do this is that all modern compression algorithms are dictionary-based: the dictionary is maintained (entries added and dropped) as the compressor moves along the input, and it is maintained again on the other side as the decompressor generates output.

To append to a compressed stream (that is, resume compression), you would need the dictionary in exactly the state it had when compression was paused. Compression formats do not store the dictionary, because that would be a waste of space: it is not needed for decompression, since it is rebuilt from the compressed input during the decompression phase.

I would probably split the output into pieces that are compressed separately.
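A minimal sketch of that idea in .NET (the method name and file path are made up for illustration): the gzip format allows several compressed members to simply be concatenated in one file, so each "append" can open the file in append mode and write a new, independently compressed member whose dictionary starts from scratch.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class CompressedAppender
{
    // Appends 'text' to 'path' as a new, independently compressed gzip member.
    // Each call produces a self-contained member, so the dictionary state of
    // the earlier members is never needed.
    public static void AppendChunk(string path, string text)
    {
        using (var file = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            gzip.Write(bytes, 0, bytes.Length);
        }
    }
}
```

Command-line gzip tools decompress all concatenated members in one pass; whether a single GZipStream read returns more than the first member depends on the framework version, so verify the read side on your target runtime.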

+2

I may have a few suggestions for you.

First of all, why are you looking to implement a solution yourself?

You can simply split the large log file into pieces, say one per hour or even per minute, and collect them into a separate directory per day (so as not to clutter the file system with a huge number of files in one place). Instead of one very large file that you have to process and search through, you will have many small files that can be accessed quickly by a file name built according to simple rules. Keeping one large file is a bad idea (unless you have some kind of index), because you have to scan through it to find the right information (for example, by date), and that search will take much longer.

The situation gets even worse with compression, since you have to unpack the data in order to search it, or build some kind of index. And there is no need to do this yourself: you can enable folder compression in the OS and get all the benefits of compression transparently, without writing any code.

So, I would suggest not reinventing the wheel (unless you really need to; see below):

  • Split the log data on a regular basis, e.g. one file per hour, to keep the cost of compressing and searching each piece small (a naming sketch follows below)
  • Enable OS folder compression

Overall, you will reduce the storage space used.
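As a small illustration of the naming rule mentioned in the first bullet (the base directory and the exact scheme are hypothetical): one directory per day, one log file per hour, so a record can be located directly from its timestamp without any index.

```csharp
using System;
using System.IO;

static class LogPaths
{
    // Builds a path like <baseDir>/2023-05-14/13.log for the given timestamp:
    // one directory per day, one file per hour.
    public static string ForTimestamp(string baseDir, DateTime timestamp)
    {
        string dayDir = Path.Combine(baseDir, timestamp.ToString("yyyy-MM-dd"));
        Directory.CreateDirectory(dayDir); // no-op if the directory already exists
        return Path.Combine(dayDir, timestamp.ToString("HH") + ".log");
    }
}
```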


Rolling your own (in case you really want to). You can do much the same thing: split the data into pieces, compress each one, and save it in your own storage. To implement something like this, I would consider the following:

  • keep one file with raw (uncompressed) data, into which you write new log records;
  • keep and update an index file, e.g. with the date range stored per piece, so you can quickly find a position in the compressed data by date;
  • keep one file for the compressed data, where each chunk stores its size followed by the compressed bytes (for example, produced with GZipStream);

So you write information to the uncompressed part until it reaches some size, then compress it, append it to the tail of the compressed part, and update the index file. Keeping the index as a separate file allows a quick update without rewriting the huge compressed part.
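A minimal sketch of that write path, under the assumptions above (the file names, the 8-byte length prefix, and the tab-separated index record are all invented for illustration): each chunk is appended to the data file as a length header followed by GZip-compressed bytes, and the index file gets one line recording the chunk's offset and date range.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class ChunkStore
{
    // Compresses 'rawData' and appends it to 'dataPath' as
    // [8-byte length][compressed bytes], then records the chunk's
    // start offset and date range in 'indexPath'.
    public static void AppendChunk(string dataPath, string indexPath,
                                   byte[] rawData, DateTime from, DateTime to)
    {
        byte[] compressed;
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
                gzip.Write(rawData, 0, rawData.Length);
            compressed = buffer.ToArray();
        }

        long offset;
        using (var data = new FileStream(dataPath, FileMode.Append, FileAccess.Write))
        {
            offset = data.Position;                                   // start of this chunk
            data.Write(BitConverter.GetBytes((long)compressed.Length), 0, 8);
            data.Write(compressed, 0, compressed.Length);
        }

        // One index line per chunk: offset, start date, end date.
        File.AppendAllText(indexPath, $"{offset}\t{from:o}\t{to:o}{Environment.NewLine}");
    }
}
```

To look something up by date, scan the small index file for the matching range, seek to the recorded offset in the data file, read the length, and decompress only that one chunk.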


I would also suggest thinking about why you have such large log files in the first place; perhaps you can optimize the storage format. For instance, if your logs are text files, you could switch to a binary format: build a dictionary from the source lines and store only message identifiers instead of the full text, i.e.:

update area 1;
update area 2;
data compression;

save as:

x1 1
x1 2
x2

The lines above are just an example; you can expand them back at run time, if needed, using the mapping data. Switching to a binary format can save quite a bit of space, maybe enough that you can forget about compression altogether.
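A toy illustration of that idea (the class name, template strings, and IDs are invented): keep a dictionary that maps each distinct message template to a short ID, and write only the ID plus the variable part to the log.

```csharp
using System.Collections.Generic;

class MessageDictionary
{
    private readonly Dictionary<string, int> _ids = new Dictionary<string, int>();
    private int _next = 1;

    // Returns a stable short ID for the template, assigning a new one on first use.
    public int IdFor(string template)
    {
        if (!_ids.TryGetValue(template, out int id))
        {
            id = _next++;
            _ids[template] = id;   // e.g. "update area" -> 1, "data compression" -> 2
        }
        return id;
    }
}

// Usage: instead of writing "update area 1;" and "update area 2;" in full,
// write "1 1" and "1 2"; "data compression;" becomes just "2".
```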

I have no ready-made implementation or algorithm. Others may offer something better, but I hope these thoughts are somewhat useful.

+1

Have you looked at the GZipStream class? You can use it like any other stream.
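For reference, basic usage looks like this (the file names are hypothetical); note that this compresses a whole file in one pass and does not by itself solve the append problem from the question.

```csharp
using System.IO;
using System.IO.Compression;

class GZipExample
{
    // Compresses source.log into source.log.gz by copying one stream into the other.
    static void Main()
    {
        using (var input = File.OpenRead("source.log"))
        using (var output = File.Create("source.log.gz"))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }
    }
}
```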

0
