I do not know of a "ready-made" compression library that is explicitly optimized for HTML content.
However, HTML text should compress quite well with common algorithms (read on, at the bottom of this answer, for better approaches). In general, all variations on Lempel-Ziv perform well on HTML-like languages, owing to the high repetitiveness of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I believe).
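As a tiny illustration (a sketch, not a benchmark), Python's zlib exposes the same DEFLATE combination of LZ77 and Huffman coding that gzip wraps, and a repetitive HTML fragment shrinks considerably:

```python
import zlib

# Illustrative only: a small, highly repetitive HTML fragment.
html = ("<ul>"
        + "".join(f"<li class='item'>Item {i}</li>" for i in range(50))
        + "</ul>").encode("utf-8")

# DEFLATE = LZ77 + Huffman coding, the same scheme gzip uses internally.
compressed = zlib.compress(html, 9)
print(f"{len(html)} -> {len(compressed)} bytes")
```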
An idea that could improve on these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we would reduce the compressed size by using citations from the very first instance of such a pattern. This gain would be particularly noticeable on smaller HTML documents.
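Here is a minimal sketch of that idea using zlib's preset-dictionary feature (the `zdict` parameter in Python's zlib module). The `PRESET` bytes below are an invented stand-in; in practice you would build the dictionary from a corpus of representative pages, and both ends must agree on it out of band:

```python
import zlib

# Hypothetical preset dictionary: tag and attribute fragments one might expect
# in the target documents. A real dictionary would be derived from a corpus.
PRESET = (b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title>'
          b'</head><body><div class=""><span><p><a href="http://"></a></p>'
          b'</span></div></body></html>')

def compress(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=PRESET)
    return c.compress(data) + c.flush()

def decompress(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=PRESET)
    return d.decompress(blob) + d.flush()

doc = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
print("without dict:", len(zlib.compress(doc, 9)), "bytes,",
      "with dict:", len(compress(doc)), "bytes")
assert decompress(compress(doc)) == doc
```

The effect is largest on small documents, exactly because the first occurrence of each common tag no longer has to be spelled out literally.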
An additional, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the model for the second stage of an LZ-x algorithm (say, the Huffman tree in the case of LZH), using statistics representative of typical HTML, taking care to exclude from the [statistically weighted] character counts those characters that are encoded by citation. Such a filtered character distribution would likely be closer to that of plain English (or of the target website's language) than that of the full HTML text.
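A minimal sketch of that second idea, assuming both sides ship the same built-in frequency table: each end derives the identical Huffman code locally, so the tree never has to travel with the data. The frequencies below are invented purely for illustration:

```python
import heapq

# Invented, illustrative frequencies for a few characters common in HTML.
# A real table would be measured over a large HTML corpus (ideally with
# citation-encoded characters excluded from the counts, as discussed above).
HTML_FREQ = {" ": 120, "e": 90, "t": 75, "a": 70, "<": 60, ">": 60,
             "i": 55, "/": 45, '"': 40, "=": 35, "d": 30, "v": 25}

def build_code(freq):
    """Build a Huffman code (symbol -> bitstring) from a frequency table.

    The construction is deterministic, so compressor and decompressor that
    share the same table derive exactly the same code without exchanging it.
    """
    heap = [[weight, i, [sym, ""]]
            for i, (sym, weight) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]   # left branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]   # right branch
        heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
        next_id += 1
    return {sym: bits for sym, bits in heap[0][2:]}

# Both ends run build_code(HTML_FREQ); the decoder simply inverts the mapping.
CODE = build_code(HTML_FREQ)
encoded = "".join(CODE[c] for c in "<a>")
print(encoded)
```

The same trick is what makes "static" Huffman stages cheap: the price is that the built-in statistics may be a poor fit for any particular document, which is why the filtering of citation-encoded characters mentioned above matters.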
Independently of the above [educated, I hope] guesses, I started searching the Internet for information on this topic.
I found this 2008 scholarly paper (pdf format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.
Otherwise, I may be looking in the wrong places; there seems to be relatively little interest in this. It could be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Internet-enabled cell phones (when bandwidth was at a premium...).