Huffman Encoding

October 10, 2020

In chapter two of Structure and Interpretation of Computer Programs, an extended set of exercises deals with Huffman encoding, which I’d never encountered before. It’s fascinating, and merits a closer look.

The simplest encodings represent characters as fixed-length sequences of binary data (0s and 1s). Extended ASCII, for example, represents characters as eight-bit sequences: a is 01100001, A is 01000001, b is 01100010, and so on. The advantage of fixed-length sequences is that, as you traverse a longer sequence like 011000010100000101100010, you know exactly where to pause and map a subset of bits to a character. The disadvantage is inefficiency: ideally, the most frequently used characters would correspond to very short sequences like 0, resulting in a smaller sequence overall. This alternative approach is called variable-length encoding.

The core problem of variable-length encoding is: how do we decide where to separate long sequences (messages) into short ones (characters)? How do we know if that 0 represents a character in its own right, or just part of another character? One approach is to use a special separator symbol between characters: for Morse code, it’s the pause. But this, too, represents some amount of inefficiency.

Huffman encoding approaches the problem in an elegant way. The crux of the approach is a binary tree whose leaves are the characters to be encoded, each of which has a weight that corresponds to the character’s relative frequency. Each non-leaf node has a weight (the sum of all weights beneath it) and a character list (the set of all characters beneath it). A small, arbitrary tree (optimized for messages where a appears with relative frequency 4, b 1, c 2, and d 2) might look like this:

{a b c d} 9
    /\
   /  \
  /    \
a 4  {b c d} 5
        /\
       /  \
      /    \
    c 2  {b d} 3
           /\
          /  \
         /    \
        b 1   d 2

Now, the cool part: given a character to encode, we simply start at the root of the tree and traverse down it until we hit the leaf node representing that character. As we go, every time we turn left we add 0 to the binary sequence; every time we turn right, 1. Here, we would encode a by turning left once (0); b, by turning right / right / left (110); c, right and then left (10). a gets the shortest encoding (desirable since its weight means it’s the most frequent), and you can see how decreasingly frequent characters end up with longer and longer encodings.
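
To put numbers on the efficiency gain: a nine-character message with those frequencies costs 4×1 + 1×3 + 2×2 + 2×3 = 17 bits under this encoding, versus 18 bits for the best fixed-length code (two bits per character suffice for a four-character alphabet), and a full 72 bits in eight-bit ASCII.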

Decoding, too, is straightforward. Given a list of bits and a tree, we start at the root and follow the bits (left for 0, right for 1) until we hit a leaf node. Every time we hit a non-leaf node, we recursively process the remaining bits against the subtree rooted at that node; every time we hit a leaf node, we append its character to the message and start processing the remaining bits from (once again) the root of the tree. So we solve the problem of finding character boundaries implicitly, by following the tree: each time we visit a leaf node and restart from the root, we’ve hit a division between characters.

Note that this makes the tree itself, not the algorithms around encoding and decoding, the key to accessing information. You can’t just start from a string or a binary sequence and generically “use Huffman encoding”; you need a specific instance of a Huffman tree, reflecting a specific alphabet and weighting. (A complete alphabet, too: if a character doesn’t appear in the tree, like e in the example above, it can’t be encoded.)

Now, let’s look at some code!

Leaf nodes are defined (in SICP’s example) as lists tagged with the symbol leaf. The book emphasizes the role of data abstraction, often via constructors and selectors, and this is a great example. What really characterizes a leaf node is the make-leaf constructor and the leaf?/char-leaf/weight-leaf selectors, which collectively “know about” the underlying list representation. The rest of the application shouldn’t: if we chose to represent leaves as something other than a list, only these four functions would need to change.

(define (make-leaf char weight)
    (list 'leaf char weight))

(define (leaf? object) (eq? (car object) 'leaf))
(define (char-leaf object) (cadr object))
(define (weight-leaf object) (caddr object))
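
To make the list representation concrete, here’s the a leaf from our example:

(define a-leaf (make-leaf 'a 4))
a-leaf                ; => (leaf a 4)
(leaf? a-leaf)        ; => #t
(char-leaf a-leaf)    ; => a
(weight-leaf a-leaf)  ; => 4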

Trees, too, we can represent as lists. (We omit a tree tag, under the assumption that any node that is not a leaf is a tree.) As before, we have a constructor and some selectors.

(define (make-tree left right)
    (list left
          right
          (append (chars left) (chars right))
          (+ (weight left) (weight right))))

(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))
(define (chars-tree tree) (caddr tree))
(define (weight-tree tree) (cadddr tree))

We also need two generic selectors, chars and weight, which the tree code above has already referenced. They operate on either leaf or tree nodes and dispatch to the appropriate specific selectors.

(define (chars node)
    (if (leaf? node)
        (list (char-leaf node))
        (chars-tree node)))

(define (weight node)
    (if (leaf? node)
        (weight-leaf node)
        (weight-tree node)))
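
With the generic selectors in place, we can see what a tree node actually looks like. Here’s the {b d} subtree from the diagram above:

(define bd-subtree (make-tree (make-leaf 'b 1) (make-leaf 'd 2)))
bd-subtree            ; => ((leaf b 1) (leaf d 2) (b d) 3)
(chars bd-subtree)    ; => (b d)
(weight bd-subtree)   ; => 3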

Let’s jump straight to decoding. choose-branch reads a single bit and decides whether to go left or right, and recur loops through the bits one by one. Note how, when the (leaf? next-branch) condition is met, we cons (char-leaf next-branch) onto our decoded value and restart recur from the root of the original tree; otherwise, we call recur with the chosen branch as the next tree, walking down toward our character.

(define (decode bits tree)
    (define (choose-branch bit tree)
        (if (= bit 0)
            (left-branch tree)
            (right-branch tree)))
    (define (recur bits subtree)
        (if (null? bits)
            '()
            (let ((next-branch (choose-branch (car bits) subtree)))
                (if (leaf? next-branch)
                    (cons (char-leaf next-branch)
                          (recur (cdr bits) tree))
                    (recur (cdr bits) next-branch)))))
    (recur bits tree))
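
To see the restart-at-root mechanic in action, trace the bits (1 0 0) through the example tree: 1 takes us right to {b c d}, 0 takes us left to the leaf c (emit c, restart at the root), and the final 0 takes us left to the leaf a, decoding to (c a).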

Encoding, too, is simple. encode itself is almost trivial (loop through a list of characters, building up the list of binary digits), with encode-char owning the logic around encoding any one character. We start at the tree root and walk down to the leaf, consing a 0 or 1 onto our result as we go; at each node, memq (the standard Scheme membership test, which compares with eq?) tells us which branch’s character list contains our character.

(define (encode message tree)
    (if (null? message)
        '()
        (append (encode-char (car message) tree)
                (encode (cdr message) tree))))

(define (encode-char char tree)
    (if (leaf? tree)
        '()
        (let ((left-subtree (left-branch tree))
              (right-subtree (right-branch tree)))
            (cond ((memq char (chars left-subtree))
                    (cons 0 (encode-char char left-subtree)))
                  ((memq char (chars right-subtree))
                    (cons 1 (encode-char char right-subtree)))
                  (else
                    (error "missing character:" char))))))

We’ve covered encoding and decoding. The last piece of the puzzle is generating the tree itself. Our minimal example above, using the constructors we’ve already defined, would look something like this:

(make-tree (make-leaf 'a 4)
           (make-tree (make-leaf 'c 2)
                      (make-tree (make-leaf 'b 1)
                                 (make-leaf 'd 2))))

The result reads like a Lisp-y linked list, except it ends not with an empty cell but with a pair of leaves. The important part is how the character weights are arranged: the lowest weights sit lowest in the tree. We’ll build that structure up from raw data representing character frequencies, like this: '((a 4) (b 1) (c 2) (d 2)).

(define (generate-huffman-tree pairs)
    (successive-merge (make-leaf-set pairs)))

; This is kind of tricky: we merge the two lightest items (whether leaves or
; trees) over and over, until all that's left is a single tree.
; Note that we assume our nodes are ordered, so the lightest and second-lightest
; nodes are the first and second in the list. In the recursive call, we call
; `make-tree` with the lightest and second-lightest nodes, and then call
; `adjoin-set` to add the resulting tree to the ordered set of nodes.
(define (successive-merge ordered-nodes)
    (let ((lightest-node (car ordered-nodes))
          (heavier-nodes (cdr ordered-nodes)))
        (if (null? heavier-nodes)
            lightest-node ; we're done, this is the final tree!
            (successive-merge (adjoin-set (make-tree lightest-node
                                                     (car heavier-nodes))
                                          (cdr heavier-nodes))))))
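
Tracing the example alphabet through successive-merge makes the process easier to see. Starting from the ordered leaf set that make-leaf-set (defined next) produces, each line below shows the set at the top of a call:

; (b:1 d:2 c:2 a:4)    -> merge b and d
; (c:2 {b d}:3 a:4)    -> merge c and {b d}
; (a:4 {c b d}:5)      -> merge a and {c b d}
; ({a c b d}:9)        -> one node left: the finished tree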

; `successive-merge` assumes that its inputs are ordered and that all nodes are
; either leaves or trees. To satisfy that initial condition, `make-leaf-set`
; formats each pair as a proper leaf and calls `adjoin-set` to order the set.
(define (make-leaf-set pairs)
    (if (null? pairs)
        '()
        (let ((pair (car pairs)))
            (let ((char (car pair))
                  (weight (cadr pair)))
                (adjoin-set (make-leaf char weight)
                            (make-leaf-set (cdr pairs)))))))
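
For our example pairs, that produces:

(make-leaf-set '((a 4) (b 1) (c 2) (d 2)))
; => ((leaf b 1) (leaf d 2) (leaf c 2) (leaf a 4))

(d sits ahead of c because adjoin-set’s comparison is strict: an incoming node never jumps ahead of an equal-weight node already in the set.)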

; Finally, `adjoin-set` adds `node` to `set` in the correct order. Ordering is
; an important part of the process here: each call to `successive-merge` requires
; us to access the lightest and second-lightest nodes, so by keeping the set in
; order, we save work.
(define (adjoin-set node set)
    (cond ((null? set)
            (list node))
          ; Because we know `set` is (recursively) ordered, we only have to
          ; compare `node` against the first item:
          ((< (weight node) (weight (car set)))
            (cons node set))
          (else (cons (car set)
                      (adjoin-set node (cdr set))))))
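
For example, slotting a new node into an existing ordered set:

(adjoin-set (make-leaf 'c 2)
            (list (make-leaf 'b 1) (make-leaf 'a 4)))
; => ((leaf b 1) (leaf c 2) (leaf a 4))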

And with that, we’re ready to go!

(define weighted-alphabet '((a 4) (b 1) (c 2) (d 2)))
(define tree (generate-huffman-tree weighted-alphabet))
(encode '(c a a d) tree) ; => (1 0 0 0 1 1 1)
(decode '(1 0 0 0 1 1 1) tree) ; => (c a a d)

A classic SICP exercise: well grounded in the content at hand (lists, sets, and trees), but surprisingly complex and almost unnecessarily cool.