Problem Solutions: Huffman coding problem 138

colin_mcnicholl 2019-06-20 13:12:44

I submitted a solution to this problem. My output was compression ratio:

1.7511167836630503

The input was:

"It's quite too funny. I am sure you could never guess how I "How many? I don't know." "This is a very unexpected turn of affairs," said I; "and what nothing. Presently he emerged, looking even more flurried than steaming horses were in front of the door when I arrived. I paid dress was rich with a richness which would, in England, be looked

The expected answer was reported as:

1.80289093298

I am not sure how to trouble shoot this one. I think it might be associated with the input having double and single quotes within the text which was not something in the simple example. My code does count the frequency of punctuation marks etc. hich I believe is correct.

I enclosed the input in triple single quotes ''' ''' to avoid syntax errors.

I also tried saving input as a txt file and read it as input but that, as expected really, did not change the compression ratio.

Any pointers ?

Rodion (admin) 2019-06-20 16:51:31

Colin, Hi!

I am not sure how to trouble shoot this one

That is really a bit tricky as the problem effectively asks only for length of result. Perhaps it was not the most clever check :(

However, if you try to add decoding to your program it may help checking if the compressed data are not broken. It was the case with mine own solution when I tried writing this algorithm for the first time long ago.

If this still won't help, please, dump the "dictionary" or the tree created by your encoding part - I shall try to do the same with checker and we can compare.

colin_mcnicholl 2019-06-21 08:36:50

OK. Thanks for prompt response. I will have a go at writing the decoding code a bit later.

Some printed output from my encrypt code for the test text on which expected value was 1.80289093298

symbfreq: [(' ', 63), ('e', 30), ('n', 23), ('o', 21), ('r', 19), ('i', 17), ('s', 17), ('a', 15), ('d', 14), ('h', 14) , ('t', 14), ('u', 10), ('w', 10), ('I', 7), ('f', 7), ('l', 7), ('"', 6), ('g', 6), ('c', 5), ('m', 5), ('y', 5), (',', 4), ('.', 4), ('v', 4), ('k', 3), ("'", 2), ('p', 2), (';', 1), ('?', 1), ('E', 1), ('H', 1), ('P', 1), ('T', 1), ('b', 1), ('q', 1), ('x', 1)] codes: {'x': '110110100', 'q': '110110101', 'b': '11000010', 'T': '11000011', 'P': '11000000', 'H': '11000001', 'E': '11 0010110', '?': '110010111', ';': '11001010', 'p': '11011011', "'": '1100100', 'k': '1101100', 'v': '110001', '.': '10110 ', ',': '10111', 'y': '110011', 'm': '111000', 'c': '111001', 'g': '110100', '"': '110101', 'l': '110111', 'f': '111110' , 'I': '111111', 'w': '11101', 'u': '11110', 't': '0000', 'h': '0001', 'd': '0010', 'a': '0011', 's': '0100', 'i': '0101 ', 'r': '0110', 'o': '0111', 'n': '1000', 'e': '1001', ' ': '1010'} compressedsize: 1567; original_size: 2744; length text: 343 1.7511167836630503

Rodion (admin) 2019-06-22 06:43:09

Colin, thanks for dump - I think it helps to nail down the issue!

For me the frequencies array is printed the same, so reading / counting is correct and your code operates with proper data.

Bit-codes seem to be non-overlapping and so there should be no problem with decoding.

However bit-strings themselves seem to be suspicious. Just see, spaces, which are the most frequent, take about 1/5 of all characters. This means the should be ideally encoded with about log2(5)=2.3 bits - i.e. either 2 or 3 bit string. Not 4. On the other hand t is about 4 times more rare, so it shouldn't have the code of the same length. :(

So I propose to check your tree-building part.

colin_mcnicholl 2019-06-22 10:17:47

Hi Rodion,

Thanks for pointers and encouragement. I am sure you are right.

I have found this problem quite difficult. One of difficulties has been when experimenting with in-built python data structures such as heapq or priority queue using the example text 'DAVIDAHUFFMAN' with equal counts for several of the letters when I put these in heapq or PriorityQueue and then came to remove them the node with the lower ascii character value would get removed first as opposed to in the problem example were the end of the queue would be the letter with highest ascii-value. I guess in the end this would not be critical given that it is the length of the code 'bit-string' that matters.

colin_mcnicholl 2019-06-24 08:17:15

Hi Rodion,

Thanks for your help.

Problem solved.

I managed to stay offline, focus on my own code, not look at others and with a little help from the right pages in Cormen, Leiserson, Rivest and Stein Introduction to Algorithms found what I had been getting wrong.

Rodion (admin) 2019-06-24 10:03:04

Colin, Hello!

Glad to see these good news :)

Sorry for I was somewhat distracted by technical issues... So it was about some point of algorithm where we combine and reinsert tree nodes?

I remember when I was implementing this many years ago, I was writing it as a compressor and decompressor utility in C language... And I spent 2 days on a nasty bug - some files were decompressed correctly and some not. As it appeared, I did mistake (just wrong multiplication) in calculating the size for bit sequence arrays and longer sequences were overwriting following ones... I was quite enraged at myself, but nevertheless very pleased that I had overcome this bug.

colin_mcnicholl 2019-06-25 10:32:25

Yes that is correct.

After the event I often end up wondering why I got stuck in the first place because the code in the end often seems fairly simple.