Compression of strings with common parts

I have an application that manages a large number of strings. The strings are path-like and share many common parts, but without a clear rule. They are not file-system paths, but can be thought of as such. I clearly need to reduce memory consumption, but without a big sacrifice in performance.

I am considering 2 options:
- implement a compressed_string class that stores its data zipped, but I would need a fixed dictionary and I can't find a library for that right now. I don't want Huffman coding over bytes, I want it over words.
- implement some kind of flyweight pattern on string parts.

The problem looks like a common one, and I wonder what the best approach is, or whether someone knows of a library that targets this problem.

thanks

+4
2 answers

The strings are path-like and share many common parts, but without a clear rule.

In the sense that they are locators in a hierarchy of the form name(delimiter, name)*? If so, you can use interning: store the name parts as char const * elements that point into a string pool. That way, a name that is used n times effectively costs only n * sizeof(char const *) + strlen(name) bytes. The full path then becomes a sequence of interned names, e.g. a std::vector<char const *>.

sizeof(char const *) may seem large on 64-bit hardware, but you also save some of the allocation overhead. Or, if for some reason you know you will never need more than, say, 65,536 distinct names, you can store them as

    class interned_name {
        uint16_t tab_idx;   // index into the global NAME_TABLE
    public:
        char const *c_str() const { return NAME_TABLE.at(tab_idx); }
    };

where NAME_TABLE is a static std::unordered_map<uint16_t, char const *>.
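
To make the interning idea concrete, here is a minimal sketch of a string pool plus a helper that turns a path into a sequence of interned parts. The names intern_pool and intern_path and the '/' delimiter are illustrative assumptions, not a specific library API:

    #include <cstddef>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Minimal string pool: intern() returns a stable char const * for each
    // distinct name, so every further occurrence costs only one pointer.
    class intern_pool {
        std::unordered_set<std::string> pool_;   // owns the name storage
    public:
        char const *intern(const std::string &name) {
            return pool_.insert(name).first->c_str();
        }
    };

    // A path becomes a sequence of interned name parts.
    using interned_path = std::vector<char const *>;

    interned_path intern_path(intern_pool &pool, const std::string &path,
                              char delimiter = '/') {
        interned_path result;
        std::size_t start = 0;
        while (start <= path.size()) {
            std::size_t end = path.find(delimiter, start);
            if (end == std::string::npos) end = path.size();
            result.push_back(pool.intern(path.substr(start, end - start)));
            start = end + 1;
        }
        return result;
    }

Each distinct part is stored once in the pool (pointers into a std::unordered_set<std::string> stay valid across rehashing), so a path of k parts costs k pointers plus the vector overhead, no matter how often its parts repeat across paths.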

0

While it may be tempting to design an algorithm tailored to your specific problem, doing so is likely to take an unreasonable amount of time and effort, while standard compression methods will immediately give you a big boost toward solving the memory consumption problem.

The “standard” way to deal with this problem is to cut the source data into small blocks (for example, 256 KB) and compress them individually. When accessing data inside a block, you must decode that block first. The optimal block size therefore really depends on your application: the more streaming your access pattern, the bigger the blocks can be; the more random the access pattern, the smaller the block size should be.

If you are concerned about compression / decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 gives you about 1 GB/s of decoding performance per core, which gives you an idea of how many blocks per second you can decode.
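
As a concrete illustration of the block approach with LZ4, here is a sketch that compresses fixed-size blocks individually and decodes one block on access. It assumes the standard liblz4 C API (LZ4_compressBound, LZ4_compress_default, LZ4_decompress_safe); the 256 KB block size and the struct and function names are placeholders to tune for your application:

    #include <lz4.h>

    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    constexpr std::size_t kBlockSize = 256 * 1024;   // 256 KB, tune per application

    struct compressed_block {
        std::vector<char> data;    // LZ4-compressed bytes
        int original_size = 0;     // uncompressed size, needed for decoding
    };

    // Cut the source data into blocks and compress each one independently.
    std::vector<compressed_block> compress_blocks(const std::string &input) {
        std::vector<compressed_block> blocks;
        for (std::size_t off = 0; off < input.size(); off += kBlockSize) {
            int src_size = static_cast<int>(std::min(kBlockSize, input.size() - off));
            compressed_block blk;
            blk.original_size = src_size;
            blk.data.resize(static_cast<std::size_t>(LZ4_compressBound(src_size)));
            int n = LZ4_compress_default(input.data() + off, blk.data.data(),
                                         src_size, static_cast<int>(blk.data.size()));
            if (n <= 0) throw std::runtime_error("LZ4 compression failed");
            blk.data.resize(static_cast<std::size_t>(n));
            blocks.push_back(std::move(blk));
        }
        return blocks;
    }

    // Decompress a single block on access.
    std::string decompress_block(const compressed_block &blk) {
        std::string out(static_cast<std::size_t>(blk.original_size), '\0');
        int n = LZ4_decompress_safe(blk.data.data(), &out[0],
                                    static_cast<int>(blk.data.size()), blk.original_size);
        if (n < 0) throw std::runtime_error("LZ4 decompression failed");
        return out;
    }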

If only decompression speed matters to you (rather than compression speed), you can use the high-compression variant, LZ4-HC, which further improves the compression ratio by about 30% and also improves decompression speed.
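
If you go with LZ4-HC, only the compression call changes; decompression still uses LZ4_decompress_safe. A sketch assuming a reasonably recent liblz4 that ships lz4hc.h and LZ4HC_CLEVEL_DEFAULT (the compress_block_hc name is arbitrary):

    #include <lz4.h>
    #include <lz4hc.h>

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    // High-compression variant: same buffer handling as LZ4_compress_default,
    // only the compression function differs; the level trades compression time
    // for ratio.
    std::vector<char> compress_block_hc(const char *src, int src_size,
                                        int level = LZ4HC_CLEVEL_DEFAULT) {
        std::vector<char> dst(static_cast<std::size_t>(LZ4_compressBound(src_size)));
        int n = LZ4_compress_HC(src, dst.data(), src_size,
                                static_cast<int>(dst.size()), level);
        if (n <= 0) throw std::runtime_error("LZ4-HC compression failed");
        dst.resize(static_cast<std::size_t>(n));
        return dst;
    }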

+1
