Huffman's Algorithm and Other Topics

Reviewing, we considered a set of $n$ plaintext-probability pairs $P=\{(a_{1},p_{1}),\ldots,(a_{n},p_{n})\}$; that is, the probability that the symbol $a_{i}$ occurs in a message is $p_{i}$. We also considered an instantaneous binary code $C=\{d_{1},\ldots,d_{n}\}$ for $P$. Finally we set $H(P)=-\sum_{i=1}^{n}p_{i}\log_{2}(p_{i})$.

We were then able to show that

Theorem

: If $|d_{i}|$ denotes the length of the codeword $d_{i}$, then $$\sum_{i=1}^{n}p_{i}\,|d_{i}|\;\geq\;H(P).$$

That is, the expected length of a coded symbol is bounded below by $H(P)$. If I had a message $m$ symbols of $P$ long and I encoded it using $C$, then $mH(P)$ would be a lower bound for the expected length of the coded message.

The key point in the proof was the observation that instantaneous codes are exactly those whose code strings occur at the leaves of the binary tree representation of the code. Hence we have the formula (Kraft's inequality):

$$\sum_{i=1}^{n}2^{-|d_{i}|}\;\leq\;1.$$
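As a quick sanity check of both the bound and this formula, here is a small Python sketch; the probabilities and codeword lengths are illustrative choices of mine, not taken from the notes.

from math import log2

# Illustrative probabilities and codeword lengths for the code {0, 10, 11}.
p = [0.5, 0.25, 0.25]
lengths = [1, 2, 2]

H = -sum(q * log2(q) for q in p)                       # H(P)
expected_len = sum(q * l for q, l in zip(p, lengths))  # expected bits per symbol
kraft = sum(2 ** (-l) for l in lengths)                # sum of 2^(-|d_i|)

print(H, expected_len, kraft)   # 1.5, 1.5, 1.0: the bound holds, with equality here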


One wants to know how to achieve the lower bound. In fact this is not a "hard" problem like factoring large integers. Brute force is even a viable method (the number of interior nodes of the trees we need to search can't get too big relative to the number of symbols, why?). However there is a rather simple algorithm to generate a "best possible" code.

At the root of Huffman's algorithm is a simple idea: give the longest binary string to the least probable symbol, because doing so reduces the expected length of a transmission.

That is:

if $p_{i}>p_{j}$ and $|d_{i}|>|d_{j}|$ then $p_{i}|d_{j}|+p_{j}|d_{i}|<p_{i}|d_{i}|+p_{j}|d_{j}|$.

Hence $\sum_{k}p_{k}|d_{k}^{\prime}|<\sum_{k}p_{k}|d_{k}|$, where the first summation is the result of interchanging $d_{i}$ and $d_{j}$ in the second.
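For instance, with made-up numbers $p_{i}=.4$, $p_{j}=.1$, $|d_{i}|=4$ and $|d_{j}|=1$, interchanging the two codewords changes the expected length by
$$\bigl(p_{i}|d_{j}|+p_{j}|d_{i}|\bigr)-\bigl(p_{i}|d_{i}|+p_{j}|d_{j}|\bigr)=-(p_{i}-p_{j})(|d_{i}|-|d_{j}|)=-(.3)(3)=-0.9,$$
a saving of $0.9$ bits per symbol.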

An Example: The idea of the algorithm is to build a tree from the leaves back towards the root. At each stage we "combine" the two least probable symbols, in effect adding a new node to the tree.

Stage 1:  a .4   b .3   c .1   d .1   e .1
Stage 2:  a .4   b .3   de .2   c .1
Stage 3:  a .4   b .3   cde .3
Stage 4:  bcde .6   a .4

Figure: the code tree built from these stages; reading the codewords off its leaves gives

a 0
b 10
c 110
d 1110
e 1111
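Here is a minimal Python sketch of this bottom-up procedure, run on the five symbols above; the function and variable names are mine, not from the notes.

import heapq

def huffman_code(probs):
    # Repeatedly merge the two least probable entries, as in the stages above.
    # Heap entries are (probability, tie-breaker, tree); a tree is either a
    # symbol or a pair (left subtree, right subtree).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)          # least probable
        p2, _, t2 = heapq.heappop(heap)          # next least probable
        heapq.heappush(heap, (p1 + p2, counter, (t2, t1)))
        counter += 1
    _, _, tree = heap[0]
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):              # interior node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                    # leaf: a symbol
            code[node] = prefix or "0"
    walk(tree, "")
    return code

print(huffman_code({"a": .4, "b": .3, "c": .1, "d": .1, "e": .1}))

The exact codewords depend on how ties among equal probabilities are broken, but the lengths come out $1,2,3,4,4$, so the expected length is $.4(1)+.3(2)+.1(3)+.1(4)+.1(4)=2.1$ bits per symbol, the same as for the code in the table above.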

______________________________________________________________________________________________________

Secrecy

Suppose we have a cryptosystem with plaintexts $\mathcal{P}$, keys $\mathcal{K}$ and ciphertexts $\mathcal{C}$, and we are given an encrypted message $c$; the big question is whether we can find the original plaintext message $p$. Information Theory provides a general framework to study this and related questions. Unfortunately the practical applications of this framework are somewhat limited, at least for our purposes. On the other hand, it is worthwhile to briefly review this area.

The first thing that must be done is to define "Conditional Entropy." That is, given events $X$ and $Y$ we want to consider $H(X\mid Y)$, the amount of uncertainty in $X$ after $Y$ is revealed. The formula defining $H(X\mid Y)$ is a somewhat obvious extension of $H(X)$; written out in the standard way, $$H(X\mid Y)=-\sum_{y}\Pr[Y=y]\sum_{x}\Pr[X=x\mid Y=y]\log_{2}\Pr[X=x\mid Y=y].$$ From the Information Theoretic point of view the formulation involves the three sets $\mathcal{P}$, $\mathcal{K}$ and $\mathcal{C}$. For example, the following definition:

Definition

: We say that a cryptosystem has perfect secrecy if $H(\mathcal{P}\mid\mathcal{C})=H(\mathcal{P})$.

In simple terms, the amount of uncertainty about a message is the same whether or not we know the encrypted message. In order for this definition to make sense we need to remember that for a message $p$ to be the plaintext of an encrypted message $c$ there needs to be some key $k$ with

$$e_{k}(p)=c.$$


Examples:

MATH

MATH
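To make the definition concrete, here is a small Python sketch of a one-bit, one-time-pad style cipher; this is my own illustration and not necessarily one of the examples above. The key is a uniformly random bit used once and encryption is XOR; computing $H(\mathcal{P})$ and $H(\mathcal{P}\mid\mathcal{C})$ from the joint distribution shows that they agree.

from math import log2

P = {0: 0.7, 1: 0.3}            # an arbitrary plaintext distribution on one bit
K = {0: 0.5, 1: 0.5}            # a uniformly random one-bit key

# Joint distribution Pr[p, c] with c = p XOR k and the key independent of p.
joint = {}
for p, pp in P.items():
    for k, pk in K.items():
        c = p ^ k
        joint[(p, c)] = joint.get((p, c), 0.0) + pp * pk

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# H(P | C) = sum over c of Pr[c] * H(P | C = c)
pr_c = {}
for (p, c), q in joint.items():
    pr_c[c] = pr_c.get(c, 0.0) + q
H_cond = sum(qc * entropy({p: joint[(p, c)] / qc for p in P if (p, c) in joint})
             for c, qc in pr_c.items())

print(entropy(P), H_cond)       # both about 0.881: knowing c reveals nothing about p

If the key bit is biased rather than uniform, the two values differ for this plaintext distribution, so the perfect secrecy really does depend on the key being chosen uniformly at random.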

Questions:

  1. What about SHIFT and SUBSTITUTION Cyphers?

  2. What about RSA?

  3. Consider $H(\mathcal{P}\mid\mathcal{C},\mathcal{K})$; remember that if I have an encrypted message and the key that was used then I can decode the message. Is there a relationship between $H(\mathcal{P}\mid\mathcal{C})$ and $H(\mathcal{K}\mid\mathcal{C})$?

  4. In all the above we are more or less assuming that the distribution of plaintext messages is equiprobable. What role does the fact that we are dealing with natural languages play in this?