Are compression algorithms specially optimized for HTML content?

Are there any lossy or lossless compression algorithms that have been specifically adapted to work with real-world (messy and invalid) HTML content?

If not, what HTML features can we use to create such an algorithm? What are the potential performance gains?

Also, I am not asking about serving such content (through Apache or any other server), interesting as that is, but about storing and analyzing it.

Update: I do not mean GZIP (that much is obvious), but rather an algorithm specifically designed to exploit the characteristics of HTML content, for example its predictable tag set and tree structure.

+6
html algorithm compression
11 answers

I do not know of a "ready-made" compression library that is explicitly optimized for HTML content.

However, HTML text should compress quite well with general-purpose algorithms (read the bottom of this answer for pointers to better algorithms). In general, all variations on Lempel-Ziv perform well on HTML-like languages because of the high repetitiveness of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, as part of DEFLATE).

An idea that could improve on these general algorithms would be to pre-fill an LZ-type sliding window with the most common HTML tags and patterns. In that way, the compressed size shrinks because even the very first occurrence of such a pattern can be encoded as a back-reference. The gain would be especially noticeable for smaller HTML documents.
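
For what it is worth, zlib already exposes exactly this hook as a "preset dictionary". Below is a minimal sketch, assuming Python 3; the PRESET contents are illustrative, not a tuned set, and the decompressor must be primed with the same bytes.

```python
# Minimal sketch: seed DEFLATE's LZ77 window with common HTML boilerplate
# via zlib's preset-dictionary hook. PRESET is illustrative, not a tuned set.
import zlib

PRESET = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title></head>'
    b'<body><div class=""></div><span></span><a href=""></a>'
    b'<script src=""></script><link rel="stylesheet" href=""></body></html>'
)

def deflate(data, zdict=None):
    # zlib only accepts zdict as a bytes-like object, so branch on None.
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

html = (b'<!DOCTYPE html><html><head><title>Hi</title></head>'
        b'<body><div class="x">hello</div></body></html>')

print(len(deflate(html)), "bytes without a preset dictionary")
print(len(deflate(html, PRESET)), "bytes with a preset dictionary")
# Decompression must be primed the same way:
# zlib.decompressobj(zdict=PRESET).decompress(...)
```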

An additional, related idea is for the compressor and decompressor to imply (i.e. not transmit) the side information normally carried alongside an LZ-x stream (say, the Huffman tree in the case of LZH, etc.), using statistics typical of HTML, taking care to exclude from the character counts the [statistically weighted] occurrences of characters that end up encoded as back-references. Such a filtered character distribution is likely to be closer to that of plain English (or of the target website) than that of the full HTML text.


Separately from the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 paper (PDF) by Przemysław Skibiński of the University of Wrocław. The abstract reports roughly a 15% improvement over GZIP, at comparable compression speed.

Otherwise, I may be looking in the wrong places; there seems to be little interest in this. Possibly the extra gain over a plain or moderately tuned general-purpose algorithm was never large enough to justify the effort, even in the early days of web-enabled cell phones (when bandwidth was at a premium).

+3

Is gzip compression insufficient for your needs? It gives you roughly a 10:1 compression ratio, not only on HTML content but also on JavaScript, CSS, and so on, and it is readily available on most servers and reverse proxies (e.g. Apache's mod_deflate, Nginx's NginxHttpGzipModule) and in all modern browsers (you can tell both Apache and Nginx to skip compression for specific browsers based on the User-Agent).

You would be surprised at how well gzip compresses. Some people have suggested minifying your files. However, unless your files contain a large number of comments that a minifier can discard entirely (which is probably what you mean by "lossy"), remember that minification gets most of its gain from a technique similar to, but even more limited than, DEFLATE. Stripping comments is also something you may not want to do with HTML anyway: make sure none of your <script> or <style> content is wrapped in HTML comments <!-- --> to accommodate antediluvian browsers. So expect a minified (but not gzipped) file to be as large as, or much larger than, the gzipped original (this is especially true of HTML, where you are stuck with the W3C's tags and attributes and only gzip can help you there), and expect that gzipping the minified file will give you only minimal gains over gzipping the original (again, unless the original contains a lot of comments that a minifier can safely discard).
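
To check that claim on your own content, a quick sketch along these lines (Python, with a deliberately naive comment stripper and a placeholder file name) compares the gzipped sizes before and after discarding comments:

```python
# Sketch: how much does stripping HTML comments buy you once gzip is applied?
# "page.html" is a placeholder path; the regex is naive and would also remove
# conditional comments, so treat it as an illustration only.
import gzip
import re

raw = open("page.html", "rb").read()
stripped = re.sub(rb"<!--.*?-->", b"", raw, flags=re.DOTALL)

print("raw:            ", len(raw), "bytes")
print("gzip(raw):      ", len(gzip.compress(raw, compresslevel=9)), "bytes")
print("stripped:       ", len(stripped), "bytes")
print("gzip(stripped): ", len(gzip.compress(stripped, compresslevel=9)), "bytes")
```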

+2

About the only "loss" I am willing to accept in HTML content, malformed or not, is whitespace used for alignment. Removing it is a typical post-publishing step that high-traffic sites apply to their content, also called flattening.

You can also flatten large JavaScript libraries with the YUI Compressor, which renames all JavaScript variables to short names, removes whitespace, and so on. This matters a great deal for large applications built on toolkits such as ExtJS, Dojo, etc.

+1

If I understand your question correctly, what you need is gzip compression, which is readily available with Apache.

0

No, there are no HTML-specific compression algorithms, because general-purpose ones have turned out to be adequate.

The potential gains would come from knowing the likely elements of an HTML page in advance: you could start with a predefined dictionary that would not need to be part of the compressed stream. But this would not bring significant benefits, because compression algorithms are exceptionally good at picking up common subexpressions on the fly.

0

Usually you would use a general-purpose algorithm such as gzip, which most browsers support through the HTTP protocol. The Apache documentation shows how to enable mod_deflate without breaking browser support for your website.

In addition, you can minify your static HTML files (or do it on the fly).

0

Use S-expressions instead, saving a few characters per tag :)

0

You could treat each unique grouping (i.e. each tag-plus-attributes combination) as a symbol, determine the minimum number of bits per symbol from the Shannon entropy of the distribution, and encode accordingly; this would produce one large block of bytes with near-maximal compression. That said, it may not fare much better than gzip.
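
As a rough illustration of that bound, here is a toy sketch (Python; the tokenizer and sample markup are made up) that treats whole tags and text runs as symbols and computes the order-0 Shannon entropy of their distribution:

```python
# Toy sketch: treat whole tags and text runs as symbols and compute the
# order-0 Shannon entropy, i.e. a lower bound in bits per symbol for an
# entropy coder that ignores context. The tokenizer is deliberately crude.
import math
import re
from collections import Counter

html = '<div class="a"><span>hi</span><span>there</span><span>hi</span></div>'

tokens = re.findall(r"<[^>]+>|[^<]+", html)   # whole tags or text between tags
counts = Counter(tokens)
total = len(tokens)

entropy = -sum(n / total * math.log2(n / total) for n in counts.values())
print(f"{total} tokens, {len(counts)} distinct symbols")
print(f"{entropy:.2f} bits/symbol -> ~{entropy * total / 8:.1f} bytes (order-0 bound)")
```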

0

Brotli is a compression algorithm whose built-in dictionary is specialized for HTML and English content.

Source: https://en.wikipedia.org/wiki/Brotli

Unlike most general-purpose compression algorithms, Brotli uses a predefined 120-kilobyte dictionary. The dictionary contains 13,000 common words, phrases, and other substrings derived from a large corpus of text and HTML documents. [6] [7] A predefined dictionary can increase compression density for short data files.
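
If you want to try it, a minimal sketch (assuming Python 3, the brotli bindings from PyPI, and a placeholder file name) could look like this:

```python
# Minimal sketch: compare gzip and Brotli on one HTML file.
# Requires "pip install brotli"; "page.html" is a placeholder path.
import gzip
import brotli

html = open("page.html", "rb").read()

print("gzip -9:    ", len(gzip.compress(html, compresslevel=9)), "bytes")
print("brotli -q11:", len(brotli.compress(html, quality=11)), "bytes")
```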

0

Gzip is used to compress web pages (e.g. HTML), but some versions of IE do not support it.

Wikipedia article

-1

Run your code through an HTML minifier/obfuscator that removes as much superfluous markup as possible, and then let your web server compress it with gzip.

-1
