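Huffman Encoding Trees
SICP Exercises 2.69 - 2.70
Representing the Encoding Tree
Leaves
An encoding tree needs leaf nodes, which hold the symbols being encoded. The path from the root to a leaf is the code for that leaf's symbol.
A leaf can be represented as (leaf <symbol> <weight>):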
; leaf
(define (make-leaf symbol weight)
  (list 'leaf symbol weight))
(define (leaf? object)
  (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))
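Tree Roots
Encoding a symbol amounts to starting at the root and finding the one and only path to that symbol's leaf. Since we intend to write this recursively, the root of each subtree should be able to decide whether to go left or right, which requires it to hold the symbol sets of its left and right subtrees.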
; tree
(define (make-code-tree left right)
  (list left
        right
        ; append is the standard procedure for concatenating lists
        (append (symbols left) (symbols right))
        (+ (weight left) (weight right))))
(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))
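Constructing the encoding tree means repeatedly merging the subtrees with the smallest weights, so each root should also hold the total weight of its subtree. Because leaves and roots are represented differently, it is convenient to write generic ("polymorphic") procedures that return the symbol set and the total weight of either kind of node.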
(define (symbols tree)
  (if (leaf? tree)
      (list (symbol-leaf tree))
      (caddr tree)))
(define (weight tree)
  (if (leaf? tree)
      (weight-leaf tree)
      (cadddr tree)))
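To make the two representations concrete, here is a small sketch that builds one leaf and one two-leaf tree and inspects them. The definitions are repeated so the snippet runs on its own, with the standard `append` joining the symbol lists; the names `a-leaf` and `small-tree` are made up for this illustration, and the printed forms assume a case-sensitive Scheme.

```scheme
; Definitions repeated so the snippet is self-contained.
(define (make-leaf symbol weight) (list 'leaf symbol weight))
(define (leaf? object) (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))
(define (symbols tree)
  (if (leaf? tree) (list (symbol-leaf tree)) (caddr tree)))
(define (weight tree)
  (if (leaf? tree) (weight-leaf tree) (cadddr tree)))
(define (make-code-tree left right)
  (list left
        right
        (append (symbols left) (symbols right))
        (+ (weight left) (weight right))))

; Hypothetical example values.
(define a-leaf (make-leaf 'A 4))
(define small-tree (make-code-tree a-leaf (make-leaf 'B 2)))

a-leaf                 ; (leaf A 4)
small-tree             ; ((leaf A 4) (leaf B 2) (A B) 6)
(symbols a-leaf)       ; (A)
(symbols small-tree)   ; (A B)
(weight small-tree)    ; 6
```

A root is simply a four-element list (left, right, symbol set, weight), while a leaf is a three-element list tagged with `leaf`; `symbols` and `weight` hide that difference from callers.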
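Encoding needs to look at a subtree's symbol set, while constructing the tree needs to look at a subtree's weight.
Decoding
Suppose we already have a Huffman tree: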
(define sample-tree
  (make-code-tree (make-leaf 'A 4)
                  (make-code-tree
                   (make-leaf 'B 2)
                   (make-code-tree (make-leaf 'D 1)
                                   (make-leaf 'C 1)))))
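Decoding turns a string of 0s and 1s back into a sequence of symbols. Each bit selects a branch: if the branch reached is a leaf, its symbol is added to the result and decoding restarts from the root; otherwise decoding continues from that branch.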
(define (decode bits tree)
  (define (decode-1 bits current-branch)
    (if (null? bits)
        '()
        (let ((next-branch
               (choose-branch (car bits) current-branch)))
          (if (leaf? next-branch)
              (cons (symbol-leaf next-branch)
                    (decode-1 (cdr bits) tree))
              (decode-1 (cdr bits) next-branch)))))
  (decode-1 bits tree))
(define (choose-branch bit branch)
  (cond ((= bit 0) (left-branch branch))
        ((= bit 1) (right-branch branch))
        (else (error "bad bit -- CHOOSE-BRANCH" bit))))
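As a quick check of the decoder, the sketch below writes `sample-tree` out as the literal list that the constructors produce and decodes the bit sequence from SICP exercise 2.67. Reading the tree by hand gives the codes A = 0, B = 10, D = 110, C = 111. The name `sample-message` follows SICP; only the accessors that `decode` needs are repeated here so the snippet runs on its own.

```scheme
; Accessors repeated so the snippet is self-contained.
(define (leaf? object) (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))

(define (choose-branch bit branch)
  (cond ((= bit 0) (left-branch branch))
        ((= bit 1) (right-branch branch))
        (else (error "bad bit -- CHOOSE-BRANCH" bit))))

(define (decode bits tree)
  (define (decode-1 bits current-branch)
    (if (null? bits)
        '()
        (let ((next-branch
               (choose-branch (car bits) current-branch)))
          (if (leaf? next-branch)
              ; Reached a leaf: emit its symbol and restart at the root.
              (cons (symbol-leaf next-branch)
                    (decode-1 (cdr bits) tree))
              (decode-1 (cdr bits) next-branch)))))
  (decode-1 bits tree))

; sample-tree as plain list data: every internal node is
; (left right symbols weight).
(define sample-tree
  '((leaf A 4)
    ((leaf B 2)
     ((leaf D 1) (leaf C 1) (D C) 2)
     (B D C) 4)
    (A B D C) 8))

(define sample-message '(0 1 1 0 0 1 0 1 0 1 1 1 0))

(decode sample-message sample-tree)  ; (A D A B B C A)
```

The bits split as 0 | 110 | 0 | 10 | 10 | 111 | 0. No separators are needed because no code is a prefix of another.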
Constructing the Encoding Tree
Construction starts from a list of (<symbol> <frequency>) pairs.
(define pairs '((A 4) (C 1) (D 1) (B 2)))
First turn the pairs into a list of leaves ordered by increasing weight:
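; adjoin-set as in SICP section 2.3.4: inserts x keeping the set ordered by increasing weight
(define (adjoin-set x set)
  (cond ((null? set) (list x))
        ((< (weight x) (weight (car set))) (cons x set))
        (else (cons (car set)
                    (adjoin-set x (cdr set))))))
(define (make-leaf-set pairs)
  (if (null? pairs)
      '()
      (let ((pair (car pairs)))
        (adjoin-set (make-leaf (car pair)
                               (cadr pair))
                    (make-leaf-set (cdr pairs))))))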
Then repeatedly merge the two lightest trees until a single tree remains:
(define (generate-huffman-tree pairs)
  (successive-merge (make-leaf-set pairs)))
(define (successive-merge leaf-set)
  (define (merge rest)
    (cond ((null? rest) '())
          ((null? (cdr rest)) (car rest))
          (else (let ((tree1 (car rest))
                      (tree2 (cadr rest)))
                  (merge
                   (adjoin-set
                    (make-code-tree tree1 tree2)
                    (cddr rest)))))))
  (merge leaf-set))
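To see the merge order on the four pairs above, here is a self-contained sketch. It repeats the needed definitions, with `adjoin-set` taken from SICP section 2.3.4 and `append` joining the symbol lists. The leaves sort to D, C, B, A; D and C merge first (weight 2), then B joins them (weight 4), and finally A joins (weight 8), which reproduces `sample-tree` exactly.

```scheme
; Representation, repeated so the snippet is self-contained.
(define (make-leaf symbol weight) (list 'leaf symbol weight))
(define (leaf? object) (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))
(define (symbols tree)
  (if (leaf? tree) (list (symbol-leaf tree)) (caddr tree)))
(define (weight tree)
  (if (leaf? tree) (weight-leaf tree) (cadddr tree)))
(define (make-code-tree left right)
  (list left
        right
        (append (symbols left) (symbols right))
        (+ (weight left) (weight right))))

; Ordered insertion by weight (SICP section 2.3.4).
(define (adjoin-set x set)
  (cond ((null? set) (list x))
        ((< (weight x) (weight (car set))) (cons x set))
        (else (cons (car set) (adjoin-set x (cdr set))))))

(define (make-leaf-set pairs)
  (if (null? pairs)
      '()
      (let ((pair (car pairs)))
        (adjoin-set (make-leaf (car pair) (cadr pair))
                    (make-leaf-set (cdr pairs))))))

(define (successive-merge leaf-set)
  (define (merge rest)
    (cond ((null? rest) '())
          ((null? (cdr rest)) (car rest))
          (else (merge (adjoin-set
                        (make-code-tree (car rest) (cadr rest))
                        (cddr rest))))))
  (merge leaf-set))

(define (generate-huffman-tree pairs)
  (successive-merge (make-leaf-set pairs)))

(define pairs '((A 4) (C 1) (D 1) (B 2)))

(make-leaf-set pairs)
; ((leaf D 1) (leaf C 1) (leaf B 2) (leaf A 4))

(generate-huffman-tree pairs)
; ((leaf A 4)
;  ((leaf B 2) ((leaf D 1) (leaf C 1) (D C) 2) (B D C) 4)
;  (A B D C) 8)
```

Because `adjoin-set` places a new tree after existing trees of equal weight, ties are broken in a fixed way; a different tie-breaking rule could yield a different tree with the same total encoded length.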
Encoding
Encode the message one symbol at a time and concatenate the resulting bit lists:
(define (encode message tree)
  (if (null? message)
      '()
      (append (encode-symbol (car message) tree)
              (encode (cdr message) tree))))
(define (encode-symbol symbol tree)
  ; result is a continuation that prepends the bits chosen so far
  (define (iter tree result)
    (cond ((leaf? tree) (result '()))
          ((memq symbol (symbols (left-branch tree)))
           (iter (left-branch tree) (lambda (x) (result (cons 0 x)))))
          ((memq symbol (symbols (right-branch tree)))
           (iter (right-branch tree) (lambda (x) (result (cons 1 x)))))
          (else (error "bad symbol -- ENCODE-SYMBOL" symbol))))
  (iter tree (lambda (x) x)))
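In `encode-symbol`, the `result` argument is a continuation: each step wraps it so that, once a leaf is reached, the accumulated 0s and 1s come out in root-to-leaf order. For comparison, here is a direct recursive sketch of the same search. `encode-symbol-direct` is a name invented for this illustration, it uses the standard `memq` and `append`, and the tree is the `sample-tree` from the decoding section written as literal data.

```scheme
; Accessors repeated so the snippet is self-contained.
(define (leaf? object) (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))
(define (symbols tree)
  (if (leaf? tree) (list (symbol-leaf tree)) (caddr tree)))

; Hypothetical direct-recursive variant: prepend the bit for this
; step to the code found in the chosen subtree.
(define (encode-symbol-direct symbol tree)
  (cond ((leaf? tree) '())
        ((memq symbol (symbols (left-branch tree)))
         (cons 0 (encode-symbol-direct symbol (left-branch tree))))
        ((memq symbol (symbols (right-branch tree)))
         (cons 1 (encode-symbol-direct symbol (right-branch tree))))
        (else (error "bad symbol -- ENCODE-SYMBOL" symbol))))

(define (encode message tree)
  (if (null? message)
      '()
      (append (encode-symbol-direct (car message) tree)
              (encode (cdr message) tree))))

(define sample-tree
  '((leaf A 4)
    ((leaf B 2)
     ((leaf D 1) (leaf C 1) (D C) 2)
     (B D C) 4)
    (A B D C) 8))

(encode-symbol-direct 'D sample-tree)   ; (1 1 0)
(encode '(A D A B B C A) sample-tree)   ; (0 1 1 0 0 1 0 1 0 1 1 1 0)
```

The continuation-passing version makes every call to `iter` a tail call, whereas this one builds the list on the way back up; both should yield the same codes.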
Testing
(define sample-song
  '(GET A JOB SHA NA NA NA NA NA NA NA NA
    GET A JOB SHA NA NA NA NA NA NA NA NA
    WAH YIP YIP YIP YIP YIP YIP YIP YIP YIP
    SHA BOOM))
(define rock-pairs
  '((A 2) (NA 16) (BOOM 1) (SHA 3) (GET 2) (YIP 9) (JOB 2) (WAH 1)))
(define ht
  (generate-huffman-tree rock-pairs))
> (length (encode sample-song ht))
84
> (* 3 (length sample-song))
108
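With 8 distinct symbols, a fixed-length code needs 3 bits per symbol, or 108 bits for the 36-symbol song, while the Huffman code needs only 84. So when the frequency estimates are reasonable, Huffman coding can be shorter than a fixed-length code.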