Move-To-Front Transform

Problem #230

Tags: data-compression strings popular-algorithm c-1 c-0

So we have seen that data compression may work better if the data are "prepared" in some way. In particular, the Burrows-Wheeler Transform preserves the entropy but regroups characters so that LZ and RLE algorithms can work better.

This time let's look at a different "transformation" - it also keeps the data size the same, but changes the entropy, so the data become better suited for Huffman or Arithmetic Coding.

The algorithm is simpler and works by substituting the symbols of the original message using an alphabet which changes after every step. Suppose our alphabet consists of the lowercase Latin letters and the dot (in place of spaces, for visibility). We shall also use a brief sample phrase to encode. Supposedly it is some tongue-twister in Welsh.

alphabet:   abcdefghijklmnopqrstuvwxyz.
message:    llawr.ar.lawr.y.cawr.enfawr.mae.llawer.o.lewod.lliwiog.yn.llyfu.llaw.gwawr.nawr

Encoding is done as follows: for each symbol of the message we output its index in the current alphabet, and then move that symbol to the front of the alphabet.

For example, we first find the letter "l" at index 11 of the alphabet; after moving it to the front, the alphabet becomes:

labcdefghijkmnopqrstuvwxyz.

The next letter is again "l", and it is found at index 0 (in this case the alphabet obviously doesn't change). So we continue:

letter      position        updated alphabet
a           1               albcdefghijkmnopqrstuvwxyz.
w           22              walbcdefghijkmnopqrstuvxyz.
r           18              rwalbcdefghijkmnopqstuvxyz.
.           26              .rwalbcdefghijkmnopqstuvxyz
a           3               a.rwlbcdefghijkmnopqstuvxyz
r           2               ra.wlbcdefghijkmnopqstuvxyz
.           2               .rawlbcdefghijkmnopqstuvxyz

And so on. We see that letters which occur more often tend to stay near the beginning of the alphabet, so the output message contains more small numbers (though the same number may correspond to different letters at different moments). The result is:

11 0 1 22 18 26 3 2 2 4 3 4 4 4 25 1 7 5 5 5 4 9 17 11
6 6 6 6 17 4 7 3 10 0 3 6 4 6 5 18 1 6 4 5 4 13 5 5 0
16 5 1 5 15 5 14 13 2 7 0 3 13 23 4 4 0 13 10 3 8 2 3 1 13 4 9 4 4 4
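The encoding loop described above can be sketched in Python (the function name is mine, not part of the problem):

```python
def mtf_encode(message, alphabet="abcdefghijklmnopqrstuvwxyz."):
    table = list(alphabet)               # working alphabet, mutated at every step
    output = []
    for ch in message:
        idx = table.index(ch)            # position of the symbol in the current alphabet
        output.append(idx)
        table.insert(0, table.pop(idx))  # move that symbol to the front
    return output

msg = ("llawr.ar.lawr.y.cawr.enfawr.mae.llawer."
       "o.lewod.lliwiog.yn.llyfu.llaw.gwawr.nawr")
print(mtf_encode(msg)[:8])  # [11, 0, 1, 22, 18, 26, 3, 2]
```

The linear `table.index` search is fine for a 27-symbol alphabet; for large alphabets one would keep a separate symbol-to-position map.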

However, it is a bit unfair that we convert letters to numbers (it doesn't really matter, as long as both numbers and letters fit in a byte) - so let's replace these numbers with letters again, according to the normal alphabetic order (0 becomes "a", 1 becomes "b", and so on, with 26 becoming the dot):

labws.dccedeeezbhfffejrlggggrehdkadgegfsbgefenffaqfbfpfonchadnxeeankdicdbnejeee

That way the transform maps symbols into the same range of the character space. Note, though, that this last conversion is not a specific part of the MTF algorithm: any one-to-one substitution like this is irrelevant to the compression algorithms that follow.
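This final index-to-letter step is a plain table lookup, e.g.:

```python
# Map each output index back to a letter in plain alphabetic order
# (0 -> 'a', ..., 25 -> 'z', 26 -> '.').
alphabet = "abcdefghijklmnopqrstuvwxyz."
codes = [11, 0, 1, 22, 18, 26, 3, 2, 2]   # first few numbers from the result above
print("".join(alphabet[i] for i in codes))  # labws.dcc
```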

Note on Entropy

If we apply the entropy function to the original message and to the result, we find with surprise that entropy usually increases after the transform. In our case it is 3.45 for the original and 3.92 for the result.
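The entropy figure here is the usual per-symbol Shannon entropy, which can be computed like this (a sketch; the site's own entropy function may differ in rounding):

```python
from collections import Counter
from math import log2

def entropy(s):
    """Shannon entropy in bits per symbol: H = -sum(p_i * log2(p_i))."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

msg = ("llawr.ar.lawr.y.cawr.enfawr.mae.llawer."
       "o.lewod.lliwiog.yn.llyfu.llaw.gwawr.nawr")
print(round(entropy(msg), 2))  # 3.45
```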

There are a few things to note about this phenomenon. First, simple MTF is not necessarily well-suited for plain text (this may even depend on the language), as it profits from "local similarities" - but it may work well on the result of BWT, for example. Second, there are variants of MTF in which symbols do not immediately jump all the way to the front (where a single rare symbol disrupts everything), but rather move forward bit by bit. Third, our entropy formula is really not the only way to measure entropy (in particular, it doesn't take into account dependencies between symbols).
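As an illustration of the second remark, here is one hypothetical "gradual" variant where a symbol only swaps with its left neighbour instead of jumping to the front (this is my own sketch; published variants such as MTF-1 differ in details):

```python
def mtf_swap_encode(message, alphabet="abcdefghijklmnopqrstuvwxyz."):
    table = list(alphabet)
    output = []
    for ch in message:
        idx = table.index(ch)
        output.append(idx)
        if idx > 0:
            # move the symbol one position forward instead of all the way to the front
            table[idx - 1], table[idx] = table[idx], table[idx - 1]
    return output
```

A rare symbol now needs several occurrences to reach the front, so a single outlier no longer pushes all the frequent symbols back.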

Problem Statement

That's simple: write a decoder for the MTF process described above.

The input contains the character sequence to be decoded.

The answer should give the decoded (original) message.

Example:

input:
tig.iffdhdddudddrrcdjjdrhf

answer:
the.fat.cat.sat.on.the.mat
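Decoding mirrors the encoder: convert each letter back to an index via the fixed alphabetic order, pick the symbol at that position in the working alphabet, and apply the same move-to-front update. A sketch (consider solving it yourself before reading):

```python
def mtf_decode(encoded, alphabet="abcdefghijklmnopqrstuvwxyz."):
    table = list(alphabet)
    out = []
    for ch in encoded:
        idx = alphabet.index(ch)         # letter back to index (a=0, ..., .=26)
        out.append(table[idx])           # symbol currently at that position
        table.insert(0, table.pop(idx))  # same move-to-front update as the encoder
    return "".join(out)

print(mtf_decode("tig.iffdhdddudddrrcdjjdrhf"))  # the.fat.cat.sat.on.the.mat
```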