Huffman Coding - Lossless Data Compression

Very Early Data Compression:

The Morse Code and the Telegraph: was developed in the 1830s and 1840s and used electric pulses sent down a wire to control a "receiver" electromagnet. There were three basic signals, a short pulse or dot, a long pulse or dash and "pause" for spacing. Different length pauses represented different separators. A characteristic of Morse Code is that, to some extent, common letters use shorter sequences of dots and dashes.

A major milestone in global communication was the transatlantic telegraph cable project began in 1857. It was not effectively completed and put into service until July 28, 1866. however On August 16, 1858 President James Buchanan and Queen Victoria did exchange messages across the Atlantic.

Presidents Lincoln and Obama in their respective Situation Rooms monotoring developments, using modern technology.

The image of President Lincoln is from the new movie, "Lincoln."

(See pages: 183-185 , Reza: An Introduction to Information Theory)

_________________________________________________________________________________________________________________________________

Class Note Regarding Information Redundancy in English

R*pl*c* v*w*ls w*th g*n*r*c symb*ls.

_________________________________________________________________________________________________________________________________

Huffman Coding

MacKay's Lecture 4 "How to Compress a Redundant File"

Assumptions:

The data can be transmitted without error.
We are going to use strings of 1's and 0's to encode symbols to transmit code.
We can use variable length strings to encode symbols but we want the code to be Prefix free in the sense that no code string in the prefix of another. In the literature this is just called a Prefix Code.
Given an "alphabet" of symbols and their probabilities of being transmitted, We want the code to be efficient in the sense that the expected length of transmitted strings will be the shortest.

David Huffman gave a "best possible" algorithm for doing this. The algorithm for generating a Huffman tree is very simple and for the moment we will just present one, hopefully sufficiently general, example.

The Process:

We begin with a table of symbols and their probabilities. The given probabilities are just suggestive.

Symbol	Probability
A	5 / 21
B	2 / 21
C	1 / 21
D	2 / 21
E	7 / 21
F	1 / 21
G	2 / 21
H	1 / 21

Next, dropping the "redundant ' /21'" and replacing "Probability" with "Frequency," we rearrange the list so that the Symbol's with the lowest Frequency are at the bottom. Of course, this step may not have a unique result.

Symbol	Frequency
E	7
A	5
B	2
D	2
G	2
C	1
F	1
H	1

Next, starting at the bottom, with two symbols of least frequency we combine them (considering the result a single character) and, if necessary rearrange, the table.

Symbol	Frequency
E	7
A	5
B	2
D	2
G	2
[F,H]	2
C	1

Repeating this process until we have a single row we get.

Symbol	Frequency
E	7
A	5
[[F,H],[C]]	3
B	2
D	2
G	2

--->

Symbol	Frequency
E	7
A	5
[D,G]	4
[[F,H],C]	3
B	2

--->

Symbol	Frequency
E	7
A	5
[[[F,H],C],B]	5
[D,G]	4

--->

Symbol	Frequency
[[[[F,H],C],B],[D,G]]	9
E	7
A	5

--->

And finally for the purposes of this example we get (If we were being very formal, we would go a step further).

Symbol	Frequency
[E,A]	12
[[[[F,H],C],B],[D,G]]	9

The next step, somewhat reversing the previous process, is to use the expressions in Symbol to construct a binary tree code.

Symbol	Code
[E,A]	0
[[[[F,H],C],B],[D,G]]	1

Symbol	Code
E	00
A	01
[[[F,H],C],B]	10
[D,G]	11

Symbol	Code
E	00
A	01
[[F,H],C]	100
B	101
D	110
G	111

Symbol	Code
E	00
A	01
[F,H]	1000
C	1001
B	101
D	110
G	111

The binary code being:

Symbol	Code
E	00
A	01
F	10000
H	10001
C	1001
B	101
D	110
G	111

Comments:

The code is balanced in terms of 1s and 0 s .
Note that: p_F ≤ p_H = p_C ≤ p_B = p_D ≤ p_G ≤ p_A ≤ p_E
and l _F ≥ l _H ≥ l _C ≥ l_B ≥ l _D ≥ l _G ≥ l _A ≥ l _E (length)
There are two codes of longest length, in this case 5. This is not a coincidence thought there may be more than two of longest length.
Since we are only dealing with 8 symbols, we could encode them with binary strings of fixed length 3. However, E and A occur with total frequency 12 but C, F, and H occur with total frequency 3.
B, D, G are encoded with binary strings of length 3 in either case. The Huffman code is optimal in the sense that the expected length of messages are shortest.
In particular: E(transmission length) = 2*(7/21)+2*(5/21)+5*(1/21)+5*(1/21)+4*(1/21)+3*(2/21)+3*(2/21)+3*(2/21)=2.67

Exercise: Apply the Huffman algorithm to the following symbol table. Note the lengths of the code words and the exponents of the various probabilities .

Symbol	Probability
A	1 /2³
B	1 / 2⁴
C	1 / 2⁴
D	1 / 2³
E	1 / 2¹
F	1 / 2⁴
G	1 / 2⁴