Hashtable Implementation

Problem #227

Tags: scheme data-structures c-1 c-0 special

Who solved this?

No translations... yet

This problem is to be solved in Scheme (tinyscheme interpreter) - so please check the first problem of this genre (if you haven't yet) for more details.

Suppose the stream of words come to us - e.g. as if they are read from some file or web resource - and we want to count occurrences of every word. This is used, for example, to figure out "most important" words of the document (or collection of texts).

We usually want to use Hashtable data structure for this. The Matching Words problem was similar (though simpler). So it is dict in Python and HashMap in Java, and std::unordered_map in C++; in PHP and JavaScript it is built-in into normal "arrays".

The main idea is that Hashtable allows quick (almost immediate) look-up feature - so programmer doesn't need to browse all the collected words to find one of them (which is time-consuming).

Now, some languages historically had not this advanced data structure from beginning. Among them Pascal, plain old C and Scheme.

This gives us a wonderful opportunity: to try implementing Hashtable ourselves!

Problem Statement

To simplify, we shall work with large integers rather than words. It doesn't matter, since hashtable could store anything - one only need to create suitable hash-function. The function should have properties:

E.g. for string it could be the code of the first (or last?) character. For float we can take, for example, sum of two digits around decimal point. For integer the hash could be this value itself.

Now consider the function was-seen? which "remembers" argument on every call (in a variable *seen-list*) and tells if such value have been already passed before:

(define *seen-list* '())

(define (was-seen? x)
    (if (memv x *seen-list*)
        #t
        (begin
            (set! *seen-list* (cons x *seen-list*))
            #f)))

Try to run the definition in emulator and then call it few times, like this:

(was-seen? 13)          ; says #f
(was-seen? 17)          ; says #f
(was-seen? 119)         ; says #f
(was-seen? 13)          ; says #t

To reset the remembered list one may run (set! *seen-list* '()) of course.

Let's improve this function in two ways! Firstly, make it return not true / false but how many times the given value was seen.

Besides this, improve its performance by creating hashtable. One of possible implementations (call it "array of lists") is below:

Read more in Wikipedia, if this sounds too vague (Hash table). Note there is also slightly different approach, which doesn't need lists (rather puts values directly into outer vector) - called Open Addressing - provided that its size is large enough.

For this problem it is not important how exactly you implement hashtable - but it should work fast!

You also can solve this problem building some kind of Tree instead, but supposedly it won't be much easier. Hash-table itself is a kind of tree too, with branches (lists) growing out of common trunk (main vector). I believe here is even somewhat cheating approach, but hopefully no one will be interested to try.

Input / Output

Your code should have improved version of was-seen? function as requested above. It is going to be tested with about 10000 calls and it shouldn't work too long.

Specifically, you may want to check your function in the following ways:

; manually

(was-seen? 13)          ; says 0
(was-seen? 17)          ; says 0
(was-seen? 13)          ; says 1
(was-seen? 119)         ; says 0
(was-seen? 13)          ; says 2

; automatically

(let loop ((n 0))
    (if (> (was-seen? (modulo (* n 17) 10000)) 0)
        (begin
            (for-each display (list "Repeats after " n " steps"))
            (newline))
        (loop (+ 1 n))))

(display (eval-count)) (newline)

Note the last line: it returns the current value of internal operation counter. Interpreter on server-side will break execution when this counter reaches 3 mln. So the function is guaranteed not to pass with simple single list.

You need to login to get test data and submit solution.