(Ch. 4, page 238)
Task: encoding strings ("messages") of symbols (digits)
Suppose we want to obtain an optimal set of codes for messages. Each code is a binary string that will be used for transmission of the corresponding message.
At the receiving end the code will be decoded using a decode tree.
A decode tree is a binary tree in which the external nodes represent messages. The binary bits in the code word for a message determine the branching needed at each level of the decode tree to reach the external node.
See Figure 4.14
This represents a decoded tree with
so if I had a message 1011001 this would represent
| 1 | 01 | 1 | 001 |
| M4 | M3 | M4 | M2 |
To decode efficiently, the goal is to have the lowest possible average message length
(minimum weighted external path length)
Let P(i) = probability of ith message
L(i) = length of ith message
Therefore what we want to minimize is
P(i) L(i)
Intuitively, the most common message should be the shortest
Notice how no message begins with a 1 except M4
This is the "trick"
What would message M4M4 be?
Suppose we had messages of yes, no, maybe and repeat and consider the following codes:
| Message | P(i) | code | P(i)L(i) |
| yes | .50 | 00 | 1.00 |
| no | .40 | 01 | .80 |
| maybe | .09 | 10 | .18 |
| repeat | .01 | 11 | .02 |
The
P(i) L(i) = 2.00 this is not very low, but the messages are easy to
parse
YNYYMRN would be ?
Messages are delimited in a fixed manner.
Now consider:
| Message | P(i) | code | P(i)L(i) |
| yes | .50 | 0 | 1.00 |
| no | .40 | 10 | .80 |
| maybe | .09 | 110 | .27 |
| repeat | .01 | 111 | .03 |
The S P(i) L(i) = 1.60 smaller and still easy to parse.
Consider the same message: YNYYMRN
What is it? Pictorally:
The idea here is to have instantaneous codes. Messages are delimited in a variable manner however no good digit sequence can be the prefix of a longer good one.
For example
Y 0
N 01
M 011
R 010
is not good. Why?
Technique to determine code: (binary version so d = 0,1)
Notice: the two least probable codes will be the longest and they will differ in their final digit
Conclude: the least probable will be of the form ddd...d1 and the second least will be of form ddd...d0
The probability of either of the last two is the probability of their sum, i.e., P(n-1 U n) equals P(n) + P(n-1)
Now for the important part:
If we combine these last two and repeat this process wecan build the code.
Specifically, at this time we have
| yes | P =.50 |
| no | P = .40 |
| maybe/repeat | P = .10 (.09 +.01) |
Do it again (i.e., repeat until one message keeping track of P)
yes .50 yes .50 yes .50 y/n/m/r 1.0
no .40 no .40 n/m/r .50
maybe .09 m/r .10
repeat .01
So
yes .50 yes yes 0
no .40 no --0 n/m/r 1
maybe .09 ---0 m/r --1
repeat .01 ---1
is
yes 0
no 10
maybe 110
repeat 111
so what is the message 1000111110?
Probabilities with the same value can go either way - they will result in codes with the same average length.
The most compression (obviously) occurs when probabilities are lopsided.
Consider a file with predominately numeric data and some variables:
_ .30 _ .30 _ .30 _ .30 _ .30 023 .40
0 .20 0 .20 0 .20 0 .20 1abc .30 - .30
1 .15 1 .15 1 .15 23 .20 0 .20 1abc .30
2 .10 2 .10 abc .15 1 .15 23 .20
3 .10 3 .10 2 .10 abc .15
a .08 a .08 3 .10
b .04 bc .07
c .03
-1abc .60 and 023 .40