Priority Queues Applications

Data Compression: Huffman Code

Text Section 13.6.3

Task: encoding strings ("messages") of symbols (digits)

Suppose we want to obtain an optimal set of codes for messages. Each code is a binary string that will be used for transmission of the corresponding message.

At the receiving end the code will be decoded using a decode tree.

A decode tree is a binary tree in which the external nodes represent messages. The binary bits in the code word for a message determine the branching needed at each level of the decode tree to reach the external node.

This represents a decoded tree with

so if I had a message 1011001 this would represent

1011001
M4M3M4M2

To decode efficiently, the goal is to have the lowest possible average message length
     (minimum weighted external path length)

Let P(i) = probability of ith message
      L(i) = length of ith message

Therefore what we want to minimize is P(i) L(i)

Intuitively, the most common message should be the shortest

Notice how no message begins with a 1 except M4

This is the "trick"

What would message M4M4 be?

Suppose we had messages of yes, no, maybe and repeat and consider the following codes:

MessageP(i)codeP(i)L(i)
yes.50001.00
no.4001.80
maybe.0910.18
repeat.0111.02

The P(i) L(i) = 2.00 this is not very low, but the messages are easy to parse

YNYYMRN would be ?

Messages are delimited in a fixed manner.

Now consider:

MessageP(i)codeP(i)L(i)
yes.5001.00
no.4010.80
maybe.09110.27
repeat.01111.03

The P(i) L(i) = 1.60 smaller and still easy to parse.

Consider the same message: YNYYMRN

What is it? Pictorially:

The idea here is to have instantaneous codes. Messages are delimited in a variable manner however no good digit sequence can be the prefix of a longer good one.

For example

Y 0
N 01
M 011
R 010

is not good. Why?

Technique to determine code: (binary version so d = 0,1)

Notice: the two least probable codes will be the longest and they will differ in their final digit

Conclude: the least probable will be of the form ddd...d1 and the second least will be of form ddd...d0

The probability of either of the last two is the probability of their sum, i.e., P(n-1 U n) equals P(n) + P(n-1)

Now for the important part:
If we combine these last two and repeat this process we can build the code.

Specifically, at this time we have

yesP =.50
noP = .40
maybe/repeatP = .10 (.09 +.01)

Do it again (i.e., repeat until one message keeping track of P)

yes	.50		yes	.50		yes   .50		y/n/m/r    1.0
no	.40		no	.40 		n/m/r .50
maybe	.09		m/r 	.10
repeat	.01		

So

yes	.50			yes			yes	0	
no	.40			no	--0 		n/m/r 	1
maybe	.09 ---0		m/r 	--1
repeat	.01 ---1		


is

yes 0
no 10
maybe 110
repeat 111

so what is the message 1000111110?

Probabilities with the same value can go either way - they will result in codes with the same average length.

The most compression (obviously) occurs when probabilities are lopsided.

Consider a file with predominately numeric data and some variables:

_  .30 	        _  .30		_  .30		_   .30		_       .30	023    .40
0  .20		0  .20		0  .20		0   .20		1abc  	.30	 -     .30	
1  .15		1  .15		1  .15		23  .20	    	0 	.20	1abc   .30
2  .10		2  .10		abc.15		1   .15		23 	.20		
3  .10		3  .10		2  .10		abc .15 				
a  .08		a  .08		3  .10
b  .04		bc .07
c  .03
Finally
_1abc      .60      0
023         .40      1

The text notes are on page 534 and discuss the diagram Figure 13.11 for Constructing a Huffman tree.