I may have some suggestions for you.
First of all, why do you want to implement a software solution yourself?
You can simply split the large log file into pieces, say one per hour or even per minute, and collect them in a separate directory per day (so as not to clutter the file system with a huge number of files in one directory). Instead of one big file that you have to scan and search, you will have many small files that can be reached quickly by a file name built according to simple rules. Keeping everything in one large file is a bad idea (unless you also maintain some kind of index), because you have to search through it to find the right information (for example, by date), and that search will take much longer.
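For illustration, here is a minimal C# sketch of such a file-naming rule (the `LogPaths` name and the `C:\Logs` root are just placeholders for the example):

```csharp
using System;
using System.IO;

static class LogPaths
{
    // One directory per day, one file per hour, e.g. C:\Logs\2024-05-17\14.log
    public static string ForTimestamp(string rootDir, DateTime timestamp)
    {
        string dayDir = Path.Combine(rootDir, timestamp.ToString("yyyy-MM-dd"));
        Directory.CreateDirectory(dayDir); // no-op if it already exists
        return Path.Combine(dayDir, timestamp.ToString("HH") + ".log");
    }
}

// Usage:
// File.AppendAllText(LogPaths.ForTimestamp(@"C:\Logs", DateTime.UtcNow),
//                    line + Environment.NewLine);
```

With a rule like this, finding the data for a given date and hour is just a matter of computing the path, no searching required.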
The situation gets even worse once compression comes into play, since you have to unpack the data before you can search it or build some kind of index. There is no need to do this yourself: you can enable folder compression in the OS and get all the benefits of compression transparently, without writing any code.
So, I would suggest not reinventing the wheel (unless you really need to, see below):
- Split the log data on a regular basis, e.g. per hour, so that each file stays small and cheap to search and compress
- Enable OS folder compression
Overall, this will reduce the storage space used.
Rolling your own compression (in case you really want it): you can do the same thing yourself, splitting the data into pieces, compressing each one and saving it in your own storage. To implement something like this, I would consider the following:
- keep one file with raw (uncompressed) data, where new entries are appended;
- keep and update an index file, e.g. with the date range stored for each piece, so that the position of a piece in the compressed data can be found quickly by date;
- keep a file for the compressed data, where each fragment stores its size followed by the compressed bytes (produced, for example, with GZipStream);
So, you write entries to the uncompressed part until it reaches some threshold, then compress it, append it to the compressed part and update the index file. Keeping the index as a separate file allows quick updates without rewriting the huge compressed part.
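A minimal C# sketch of that roll-over step could look like this (the file names, the 4-byte length prefix and the semicolon-separated index line are assumptions for illustration, not a fixed format):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class ChunkedLogStore
{
    // Compress the current uncompressed tail, append it to the compressed
    // store and record its position in the index file.
    public static void SealChunk(string currentLog, string dataFile, string indexFile,
                                 DateTime firstEntry, DateTime lastEntry)
    {
        byte[] raw = File.ReadAllBytes(currentLog);

        // Compress the raw chunk into memory with GZipStream.
        byte[] compressed;
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
                gz.Write(raw, 0, raw.Length);
            compressed = ms.ToArray();
        }

        // Append the chunk to the compressed store: 4-byte length, then the data.
        long offset;
        using (var data = new FileStream(dataFile, FileMode.OpenOrCreate, FileAccess.Write))
        {
            offset = data.Seek(0, SeekOrigin.End);
            data.Write(BitConverter.GetBytes(compressed.Length), 0, 4);
            data.Write(compressed, 0, compressed.Length);
        }

        // Append one index line (first date; last date; offset) so the chunk
        // can later be found by date without touching the big compressed file.
        File.AppendAllText(indexFile,
            $"{firstEntry:o};{lastEntry:o};{offset}{Environment.NewLine}");

        // Start a fresh uncompressed tail.
        File.WriteAllText(currentLog, string.Empty);
    }
}
```

To read entries for a given date you would scan the small index file for a matching range, seek to the stored offset in the data file, read the 4-byte length and decompress only that one chunk.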
I would also suggest thinking about why your log files are so large in the first place. Perhaps you can optimize the storage format. For instance, if your logs are text files, you can switch to a binary format: build a dictionary of the source lines and store only message identifiers instead of the full text, i.e. turn:
    update area 1;
    update area 2;
    data compression;

save as:

    x1 1
    x1 2
    x2
The lines above are just an example; you can expand them back at run time, if necessary, using the mapping dictionary. Switching to a binary format can save quite a bit of space, maybe even enough to forget about compression.
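As a rough C# sketch of that dictionary idea (the template table and the "x1 1"-style output are made up for the example):

```csharp
using System;
using System.Collections.Generic;

static class MessageDictionary
{
    // Known message templates and their short identifiers:
    // x1 = "update area", x2 = "data compression"
    private static readonly Dictionary<string, int> Templates = new Dictionary<string, int>
    {
        { "update area", 1 },
        { "data compression", 2 },
    };

    // "update area 7;" -> "x1 7", "data compression;" -> "x2"
    public static string Encode(string message)
    {
        message = message.TrimEnd(';');
        foreach (var pair in Templates)
        {
            if (message.StartsWith(pair.Key))
                return ("x" + pair.Value + message.Substring(pair.Key.Length)).TrimEnd();
        }
        return message; // unknown messages stay unchanged
    }
}
```

Decoding would be the reverse lookup against the same table, so the dictionary itself has to be stored (or versioned) alongside the logs.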
I don't have a ready-made implementation or algorithm, and others may offer better solutions, but I hope these thoughts are somewhat useful.