Burrows-Wheeler Transform

Problem #229

Tags: data-compression popular-algorithm c-1 c-0

Who solved this?

No translations... yet

Among techniques used in data compression, there are interesting algorithms which do not compress data themselves, but transform them so they become better compressible.

One of such algorithms was proposed in 1994 (quite after the popular compression algorithms like Lempel-Ziv or Huffman) and named, as usually, after inventors - Michael Burrows and David Wheeler.

This algorithm rearranges bytes of the input message in such a way, that result has many chains of repeated bytes. At the same time this transformation could be undone (without keeping any additional tables etc).

Let's look at the example (skiping the transformation description itself for now):

original:       tiukka.paikka.peikko.paukku.putki.poikki

transformed:    aoaiukkpppkaoetkkktkuiiiukkkp.....|ukiapi

Here we use dots instead of spaces for better visibility. Also before transformation it is necessary to add some end-of-message character - we use "|" (pipe) for convenience. Original phrase is a "tongue-twister" in Finnish, I believe.

Thus rearranged message has the same "byte content" as original - so it is no help for compression by algorithms like Shannon-Fano or Huffman, which profit from unequal character frequency (e.g. character entropy - it obviously remains the same).

At the same time, algorithms which utilize repeating subsequences in text (e.g. LZ-based or Run-Length encoding) may now work better - even at first glance we see groups of similar letters like kkk and iii.

Algorithm explanation

It is very simple to describe, though could be tough to implement efficiently:

  1. Add "end-of-message" character, distinct from others (it is needed for proper backward transformation later).
  2. Write down all "cyclic shifts" of the message. E.g. for message mamma| it would be amma|m, mma|ma and so on (including original).
  3. Now, sort all those cyclic variations lexicographically (so amma|m becomes first, a|mamm next etc).
  4. Collect last letter of every such variation (in the order we got after sorting). These letters will make transformation result.

Let's complete mamma example:

variations:         after sort:
mamma|              amma|m
amma|m              a|mamm
mma|ma              mamma|
ma|mam              ma|mam
a|mamm              mma|ma
|mamma              |mamma

The last column gives mm|maa - that is the result we want.

Back transformation is also easily described, though perhaps even tougher to implement efficiently. We skip it explanation (feel free to check wikipedia article for details).

Note about Entropy

While the overall character entropy (as we defined it before) remains the same for the whole message (not regarding the end-of-message symbol), it could be found, that parts of the message have smaller entropy now.

For example, if we use our entropy formula on the tongue-twister given above, we get 2.91 for original message and 2.99 after BWT (slight increase because of added end-of-message symbol). But split original message in two (at position 20 - about the middle - for example). Two halves of original have entropies of 2.87 and 2.79, while after splitting transformation result at the same point gives us two substrings with entropies of 2.66 and 2.58.

In practice this means that even algorithms like Huffman can profit from BWT, if they encode two halves (sufficiently long) separately, or if "adaptive" version of algorithm is used (which adapts to distribution changing over time).

Problem Statement

We'll use fragment of the book as an input. It is converted to lowercase for simplisity, with all non-letter characters removed and spaces replaced with dots for visibility. For convenience input is split into several lines, but please concatenate them together before processing. All line feeds are inserted immediately after dots, so it is easy to detect the last line, if needed - it has no dot at the end.

Run BWT over the input, as described above.

Then calculate number of repeating characters in sub-sequences made of the same character. In other words, count every character which is the same as preceding one. E.g. fragment abbbbc contains 3 repeating characters (the first b is not considered "repeating").

As answer, please, print (separated with spaces):

The first value, of course, is the length of input plus 1, so it is just for cross-check.

Example

Input:
tiukka.paikka.peikko.paukku.putki.poikki

Answer:
41 13 6

Explanation: in the result there are six repetitive groups kk, ppp, kkk, iii, kkk, ......

You need to login to get test data and submit solution.