In the problem on Variable Length Code we used some predefined code-table without explaining where it comes from.
Now it is the time to learn how such a table could be created.
Of course we do not want some random distribution of variable-length codes. Instead it is interesting to create the table in which more frequent bytes are encoded with shorter bit sequences.
It is not optimal encoding, but nevertheless it is sometimes used (because it is close to optimal anyway). For example one of the algorithms uzed by ZIP archiver (and some its derivatives) utilizes Shannon-Fano coding. Anyway later you may write the program for more popular Huffman Coding.
Let us see how it works. For example, let the source text consist of the single word
ABRACADABRA. Start with writing
down the count of different characters (or bytes) used:
char byte count A 65 5 B 66 2 C 67 1 D 68 1 R 82 2
Now let us sort them in order of decreasing the counts:
A B R C D 5 2 2 1 1
And split the resulting array into two parts so that total counts in each part are as close to each other, as possible:
A | B R C D 5 | 2 2 1 1
Here total on the left is
5 and total on the right is
Now we say that all characters in the left part should start with a bit
0 while all in the right part should start
with a bit
Since in the left part there remains only one character at all, it will therefore be encoded with a single
the right part contains more characters and we should repeat the operation of splitting it into parts:
B | R C D 2 | 2 1 1
Here the left part has total of
2 and right of
4 - obviously not quite equal, but we could not do better with a
sorted array. So the letters in the left part will have the second bit
0 and ones in the right part will have the
We proceed with splitting the right part twice times more so at last we can construct the Binary Tree looking like this:
| +---0----+----1----+ | | A +---0---+---1---+ | | B +---0---+---1---+ | | R +--0--+--1--+ | | C D
And the corresponding code-table is built by this tree simply traversing its branches and writing down the bits:
char byte code A 65 0 B 66 10 R 82 110 C 67 1110 D 68 1111
Note that tree for word
ABRACADABRA has all left parts consisting of a single letter, but it is only due to fact of
quite unbalanced letter distribution. Another thing is that in the previous problem of "Variable Length Code" we
1 for left branches and
0 for right, but this choice does not make difference.
There are two things which may lead to different code tables being produced for the same source.
Let us agree that if splitting of some sub-array could be done in two ways giving equally close results, like in case
B-R-C-D which could be split either as shown above or like this:
B R | C D 2 2 | 1 1
With the totals
2 instead of
4 - in such cases we will choose the "leftmost" split of two.
Also let us agree that in the initial sorting if two letters have the same count (like
they will take the place in our array in their lexical order (i.e. by increasing of their ASCII codes).
You will be given a fragment of text consisting of letters, spaces, punctuations and probably digits. You are to create the code-table by the rules above.
Input data will contain the text in a single line.
Answer should contain the pairs of ASCII-values and corresponding bit-strings of Shannon-Fano coding. Please output ASCII as decimals while bit-strings using letters
I instead of digits
1 to help us
determine possible mistakes easier. Also output the results in the same order as the letters were sorted during
input data: ABRACADABRA answer: 65 O 66 IO 82 IIO 67 IIIO 68 IIII
Homework: since now you can construct proper Shannon-Fano code table for any arbitrary file (either text or binary) would you like to try to write a Shannon-Fano Compressor which creates such a table and then writes the file using it. Of course resulting file also should contain either the code-table or initial counts of bytes so that decompressor could use the same codes or at least reconstruct them.