hash-tables-are-easy.md 8.2 KB

title: Hash tables are easy (in Guile) date: 2015-11-09 18:01 author: Christine Lemmer-Webber tags: hash tables, code, foss, guile, scheme, algorithms, any free license

slug: hash-tables-are-easy

As a programmer, I use hash tables of varying kinds pretty much all day, every day. But one of the odd and embarrassing parts of being a community-trained programmer is that I've never actually implemented one. Eek! Well, today I pulled an algorithms book off the shelf and decided to see how long it would take me to implement their simplest example in Guile. It turns out that it takes less than 25 lines of code to implement a basic hash table with O(1) best time, O(1) average time, and O(n) worst case time. The worst case won't be too common depending on how we size things so this isn't so bad, but we'll get into that as we go along.

Here's the code:

;;; Simple hash table implementation -- (C) 2015 Chris Lemmer-Webber
;;; Released under the "Any Free License 2015-11-05", whose terms are the following:
;;;   This code is released under any of the free software licenses listed on
;;;     https://www.gnu.org/licenses/license-list.html
;;;   which for archival purposes is
;;;     https://web.archive.org/web/20151105070140/http://www.gnu.org/licenses/license-list.html

(use-modules (srfi srfi-1))

(define (make-dumbhash size)
  "Make a dumb hash table: an array of buckets"
  (make-array '() size))

(define* (dumbhash-ref dumbhash key #:optional (default #f))
  "Pull a value out of a dumbhash"
  (let* ((hashed-key (hash key (array-length dumbhash)))
         (bucket (array-ref dumbhash hashed-key)))
    (or (find (lambda (x) (equal? (car x) key))
              bucket)
        default)))

(define (dumbhash-set! dumbhash key val)
  "Set a value in a dumbhash"
  (let* ((hashed-key (hash key (array-length dumbhash)))
         (bucket (array-ref dumbhash hashed-key)))
    ;; Only act if it's not already a member
    (if (not (find (lambda (x) (equal? (car x) key))
                   bucket))
        (array-set! dumbhash
                    ;; extend the bucket with the key-val pair
                    (cons (cons key val) bucket)
                    hashed-key))))

You might even notice that some of these lines are shared between [dumbhash-ref]{.title-ref} and [dumbhash-set!]{.title-ref}, so this could be even shorter. As-is, sans comments and docstrings, it's a mere 17 lines. That's nothing.

We also cheated a little: we're using [hash]{.title-ref} and [equal?]{.title-ref} to generate a hash and to test for equality, which are arguably the hard parts of the job. But these are provided by Guile, and it's one less thing to worry about. Here's a brief demonstration though:

(equal? 'a 'a)               ;; => #t, or true
(equal? 'a 'b)               ;; => #f, or false
(equal? "same" "same")       ;; => #t
(equal? "same" "different")  ;; => #f
(hash "foo" 10)              ;; => 6
(hash 'bar 10)               ;; => 5

[equal?]{.title-ref} is self-explanatory. The important thing to know about [hash]{.title-ref} is that it'll pick a hash value for a [key]{.title-ref} (the first parameter) for a hash table of some [size]{.title-ref} (the second parameter).

So let's jump into an example. [make-dumbhash]{.title-ref} is pretty simple. It just creates an array of whatever [size]{.title-ref} we pass into it. Let's make a simple hash now:

scheme@(guile-user)> (define our-hash (make-dumbhash 8))
scheme@(guile-user)> our-hash
$39 = #(() () () () () () () ())

This literally made an array of 8 items which easy start out with the empty list as its value (that's [nil]{.title-ref} for you common lispers). (You can ignore the [\$39]{.title-ref} part, which may be different when you try this; Guile's REPL lets you refer to previous results at your prompt by number for fast & experimental hacking.)

So our implementation of hash tables is of fixed size, which doesn't limit the number of items we put into it, since buckets can contain multiple values in case of collision (and collisions tend to happen a lot in hash tables, and we come prepared for that), but this does mean we have an existing guess of about how many buckets we need for efficiency. (Resizing hash tables is left as an exercise for the reader.) Our hash table also uses simple linked lists for its buckets, which isn't too uncommon as it turns out.

Let's put something in the hash table. Animal noises are fun, so:

scheme@(guile-user)> (dumbhash-set! our-hash 'monkey 'ooh-ooh)
scheme@(guile-user)> our-hash
$40 = #(() () () ((monkey . ooh-ooh)) () () () ())

The monkey was appended to the third bucket. This makes sense, because the hash of [monkey]{.title-ref} for size 8 is 3:

scheme@(guile-user)> (hash 'monkey 8)
$41 = 3

We can get back the monkey:

scheme@(guile-user)> (dumbhash-ref our-hash 'monkey)
$42 = (monkey . ooh-ooh)

We've set this up so that it returns a pair when we get a result, but if we try to access something that's not there, we get #f instead of a pair, unless we set a default value:

scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon)
$43 = #f
scheme@(guile-user)> (dumbhash-ref our-hash 'chameleon 'not-here-yo)
$44 = not-here-yo

So let's try adding some more things to `our-hash`:

scheme@(guile-user)> (dumbhash-set! our-hash 'cat 'meow)
scheme@(guile-user)> (dumbhash-set! our-hash 'dog 'woof)
scheme@(guile-user)> (dumbhash-set! our-hash 'rat 'squeak)
scheme@(guile-user)> (dumbhash-set! our-hash 'horse 'neigh)
scheme@(guile-user)> ,pp our-hash
$45 = #(()
        ((horse . neigh))
        ()
        ((rat . squeak) (monkey . ooh-ooh))
        ((cat . meow))
        ()
        ((dog . woof))
        ())

([,pp]{.title-ref} is a shortcut to [pretty-print]{.title-ref} something at the REPL, and I've taken the liberty of doing some extra alignment of its output for clarity.)

So we can see we have a collision in here, but it's no problem. Both [rat]{.title-ref} and [monkey]{.title-ref} are in the same bucket, but when we do a lookup of a hashtable in our implementation, we get a list back, and we search to see if that's in there.

We can figure out why this is O(1) average / best time, but O(n) worst time. Assume we made a hash table of the same size as the number of items we put in... assuming our [hash]{.title-ref} procedure gives pretty good distribution, most of these things will end up in an empty bucket, and if they end up colliding with another item (as the rat and monkey did), no big deal, they're in a list. Even though linked lists are of O(n) complexity to traverse, assuming a properly sized hash table, most buckets don't contain any or many items. There's no guarantee of this though... it's entirely possible that we could have a table where all the entries end up in the same bucket. Luckily, given a reasonably sized hash table, this is unlikely. Of course, if we ended up making a hash table that started out with 8 buckets, and then we added 88 entries... collisions are guaranteed in that case. But I already said resizing hash tables is an exercise for the reader. :)

If you're familiar enough with any Scheme (or probably any other Lisp), reading [dumbhash-ref]{.title-ref} and [dumbhash-set!]{.title-ref} should be pretty self-explanatory. If not, go read an introductory Scheme tutorial, and come back! (Relatedly, I think there aren't many good introductory Guile tutorials... I have some ideas though!

What lessons are there to be learned from this post? One might be that Guile is a pretty dang nice hacking environment, which is true! Another might be that it's amazing how far I've gotten in my career without ever writing a hash table, which is also true! But the lesson I'd actually like to convey is: most of these topics are not as unapproachable as they seem. I had a long-time fear that I would never understand such code until I took the time to actually sit down and attempt to write it.

As an additional exercise for the reader, here's a puzzle: is the [Any Free License]{.title-ref} this code released under actually a free license? And what commentary, if any, might the author be making? :)