Need help on how to encode words using Huffman coding

How do you encode a word like NEED using Huffman coding?


Huffman coding uses variable-length bit strings to represent tokens (usually characters, with a few exceptions). The more often a token occurs, the shorter its bit string, and the mapping is (usually) updated dynamically as the stream is processed.
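To see what "variable-length bit strings" buys you, here is a minimal Python sketch using a hand-picked prefix-free table for the word NEED (these particular codes are illustrative, not produced by any specific algorithm):

```python
# Hand-picked prefix-free codes for the characters of "NEED".
# E occurs most often, so it gets the shortest code.
codes = {"E": "0", "N": "10", "D": "11"}

encoded = "".join(codes[ch] for ch in "NEED")
print(encoded)  # "100011": 6 bits instead of the 32 bits of 8-bit ASCII
```

Because no code is a prefix of another, the bit stream can be decoded unambiguously without delimiters.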

Usually there are two special tokens, ESCAPE and END-STREAM.

The encoder maintains a dictionary, which is essentially a mapping between tokens and bit sequences. Initially, it contains only the two special tokens.

The initial bit sequences for ESCAPE and END-STREAM can be 0 and 1 (which is which does not really matter at the start). When the encoder encounters a character that is not in the dictionary, it outputs ESCAPE followed by the full-length (e.g. 8-bit) character, then adds the character to the dictionary and assigns new bit sequences to all tokens based on their frequencies.

Thus, your "N" may result in the output stream:

 0 xxxxxxxx
 | +------- token code for N
 +--------- ESCAPE

and a new dictionary:

 ESCAPE:00 END-STREAM:01 N:1 

Then your "E" may result in the output stream:

 0 xxxxxxxx 00 yyyyyyyy
            |  +------- token code for E
            +---------- ESCAPE (now 00)

and a new dictionary:

 ESCAPE:00 END-STREAM:01 N:10 E:11 

Your next E will not trigger an ESCAPE, since it is already in the dictionary, so you just output its code (11). It may still change the dictionary, since E now has a higher count. That makes no difference in this situation, since all the character codes are two bits long, but in a more complex example the bit length of the E token would shrink.

When D arrives, the output stream becomes:

 0 xxxxxxxx 00 yyyyyyyy 11 00 zzzzzzzz
                        |     +------- token code for D
                        +------------- second E

and a new dictionary:

 ESCAPE:00 END-STREAM:011 N:010 E:11 D:10 

Thus you can see that, as more characters arrive, frequent tokens get shorter bit sequences and rare tokens get longer ones, which is what gives you compression. In the dictionary above, N ends up with a three-bit code, while E keeps a two-bit code because it occurs more often.

The beauty is that you do not need to store the dictionary with the file: the ESCAPE sequences let the decoder rebuild it dynamically during decompression.
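A toy round-trip sketch in Python of the scheme the walkthrough describes (this is not a real adaptive-Huffman algorithm such as FGK or Vitter; it naively rebuilds the whole code table after every token). Because the encoder and decoder update their frequency counts in lockstep, no dictionary is ever transmitted:

```python
import heapq

ESC, EOS = "<ESC>", "<EOS>"  # the two special tokens

def build_codes(freq):
    """Rebuild a prefix-free code table from the current frequency counts.
    Sorting the symbols first makes tie-breaking deterministic, so the
    encoder and the decoder always derive the same table."""
    nodes = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(nodes)
    codes = {s: "" for s in freq}
    nxt = len(nodes)
    while len(nodes) > 1:
        f1, _, s1 = heapq.heappop(nodes)   # merge the two rarest subtrees
        f2, _, s2 = heapq.heappop(nodes)
        for s in s1:
            codes[s] = "0" + codes[s]
        for s in s2:
            codes[s] = "1" + codes[s]
        heapq.heappush(nodes, (f1 + f2, nxt, s1 + s2))
        nxt += 1
    return codes

def encode(text):
    freq = {ESC: 1, EOS: 1}          # dictionary starts with the two specials
    out = []
    for ch in text:
        codes = build_codes(freq)
        if ch in freq:
            out.append(codes[ch])    # known token: emit its current code
        else:
            out.append(codes[ESC])   # new token: ESCAPE + full 8-bit literal
            out.append(format(ord(ch), "08b"))
        freq[ch] = freq.get(ch, 0) + 1
    out.append(build_codes(freq)[EOS])
    return "".join(out)

def decode(bits):
    freq = {ESC: 1, EOS: 1}
    text, i = [], 0
    while True:
        inv = {v: k for k, v in build_codes(freq).items()}
        j = i + 1
        while bits[i:j] not in inv:  # prefix-free: extend until a code matches
            j += 1
        sym, i = inv[bits[i:j]], j
        if sym == EOS:
            return "".join(text)
        if sym == ESC:               # read the escaped 8-bit literal
            sym = chr(int(bits[i:i + 8], 2))
            i += 8
        freq[sym] = freq.get(sym, 0) + 1
        text.append(sym)

print(decode(encode("NEED")))  # round-trips without transmitting a dictionary
```

Rebuilding the table from scratch per token is quadratic and only suitable as a demonstration; real adaptive coders update the tree incrementally.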

In addition, because the END-STREAM token is never output until the very end, its frequency never rises and its bit length keeps growing. ESCAPE behaves similarly: while many new characters are still arriving its bit length stays short, but once new characters stop appearing it suffers the same fate as END-STREAM.

The best case for an input stream (8-bit ASCII) is a file containing millions of copies of the same character. That costs 9 bits for the first character (ESCAPE plus the literal), then 1 bit for each additional character, then 2 bits for the end-of-stream marker. This rapidly approaches a compression ratio of 8:1 (an 87.5% reduction).

The worst case is exactly one occurrence of each character, which costs 9 bits per character plus the end-of-stream marker; this actually expands the stream by about 12.5%.
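The best-case arithmetic can be checked directly (a sketch using the 9/1/2-bit costs quoted above):

```python
n = 1_000_000                      # a million copies of the same character
original = 8 * n                   # 8-bit ASCII input
compressed = 9 + (n - 1) * 1 + 2   # first char escaped, 1 bit each after, then END-STREAM
print(compressed / original)       # ~0.125, i.e. close to 8:1 compression
```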


I think you mean Huffman coding, an algorithm for lossless data compression. Simply put, you replace the most frequently repeated sequences of bits with the shortest possible representations (this is how most compression works). For example, an HTML page could assign the very common token <DIV the binary code 01, shrinking each 32-bit <DIV down to two bits.

That is the main idea. The other trick is how to choose the codes so that you don't need a fixed size or a delimiter; this is done with a Huffman tree, which yields prefix-free codes. Read more in the Wikipedia article.
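The tree construction can be sketched with Python's heapq (a minimal static Huffman builder, assuming the character frequencies are known up front):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code table by repeatedly merging the two rarest nodes."""
    freq = Counter(text)
    # (frequency, tie-breaker, symbols in this subtree); sorting makes it deterministic
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    codes = {s: "" for s in freq}
    if len(heap) == 1:                      # degenerate one-symbol input
        return {text[0]: "0"}
    nxt = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1:
            codes[s] = "0" + codes[s]       # left branch
        for s in s2:
            codes[s] = "1" + codes[s]       # right branch
        heapq.heappush(heap, (f1 + f2, nxt, s1 + s2))
        nxt += 1
    return codes

codes = huffman_codes("NEED")
print(codes)                                # E gets a 1-bit code; N and D get 2-bit codes
print("".join(codes[c] for c in "NEED"))    # 6 bits total
```

The prefix-free property comes from the tree itself: every symbol sits at a leaf, so no symbol's path is a prefix of another's.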


Take a look at Huffman Coding with F#, a blog post that walks through a Huffman encoder/decoder written in F#. It is short and clear.


Take a look at my Huffman project in C #: https://github.com/kad0xf/HuffmanSharp

