Data Structures and Algorithms 16: Huffman Tree, Huffman Coding, Data Compression

Huffman Tree

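Given n weights as n leaf nodes, construct a binary tree. If the tree's weighted path length (WPL) is the minimum possible, the tree is called an optimal binary tree, also known as a Huffman tree.

A Huffman tree is the binary tree with the shortest weighted path length; nodes with larger weights sit closer to the root.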


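  • If the root is defined to be at level 1, the path length from the root to a node at level L is L-1

  • Weighted path length of a node: the path length from the root to that node multiplied by the node's weight

  • Weighted path length of a tree (WPL): the sum of the weighted path lengths of all leaf nodes; the tree with the minimum WPL is the Huffman tree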
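
As a small illustration of these definitions, the sketch below computes the WPL of two hand-built trees over the same leaf weights 1, 3 and 6. The class `WplSketch`, its nested `TreeNode` type and the two example shapes are hypothetical and exist only for this example. The root is passed in with path length 0, which matches "a node at level L has path length L-1".

```java
final class WplSketch {
    /** Minimal tree node for this example only (hypothetical, not the article's Node class). */
    static final class TreeNode {
        final int weight;
        final TreeNode left;
        final TreeNode right;

        TreeNode(int weight) {
            this(weight, null, null);
        }

        TreeNode(int weight, TreeNode left, TreeNode right) {
            this.weight = weight;
            this.left = left;
            this.right = right;
        }
    }

    /** Sums weight * pathLength over the leaves; pathLength is the number of edges from the root. */
    static int wpl(TreeNode node, int pathLength) {
        if (node == null) {
            return 0;
        }
        if (node.left == null && node.right == null) {
            return node.weight * pathLength; // leaf: contributes its weighted path length
        }
        // internal node: only the leaves below it contribute
        return wpl(node.left, pathLength + 1) + wpl(node.right, pathLength + 1);
    }

    /** The heaviest leaf (6) hangs directly under the root. */
    static TreeNode heavyNearRoot() {
        return new TreeNode(10, new TreeNode(4, new TreeNode(1), new TreeNode(3)), new TreeNode(6));
    }

    /** The lightest leaf (1) hangs directly under the root. */
    static TreeNode lightNearRoot() {
        return new TreeNode(10, new TreeNode(9, new TreeNode(6), new TreeNode(3)), new TreeNode(1));
    }

    public static void main(String[] args) {
        System.out.println(wpl(heavyNearRoot(), 0)); // 1*2 + 3*2 + 6*1 = 14
        System.out.println(wpl(lightNearRoot(), 0)); // 6*2 + 3*2 + 1*1 = 19
    }
}
```

Placing the heaviest leaf nearest the root gives the smaller total, which is the property a Huffman tree optimizes.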


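Construction approach

  1. Sort the data in ascending order. Each value is a node, and each node can be viewed as the simplest possible binary tree
  2. Take out the two binary trees whose roots have the smallest weights
  3. Combine them into a new binary tree whose root weight is the sum of the two roots' weights
  4. Put the new binary tree back into the sequence, re-sort by root weight, and repeat steps 1-2-3-4 until every value in the sequence has been processed; the result is a Huffman tree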
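
The steps above can be sketched compactly with a `java.util.PriorityQueue`, which hands back the smallest remaining weight without re-sorting the whole list on every pass. This is a swapped-in variation for illustration, not the list-and-sort approach used in the article's own code, and the class and method names are made up for the example. It tracks only root weights and adds up every merged weight; that sum equals the WPL of the resulting tree, because each leaf weight is counted once for every level it sits below the root.

```java
import java.util.PriorityQueue;

final class HuffmanWplSketch {
    /** Repeatedly merges the two smallest weights and returns the WPL of the resulting Huffman tree. */
    static int huffmanWpl(int[] weights) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(); // min-heap: poll() returns the smallest
        for (int w : weights) {
            queue.add(w);
        }
        int wpl = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll(); // steps 2 and 3: two smallest roots, new root weight
            wpl += merged;                            // each merge pushes its leaves one level deeper
            queue.add(merged);                        // step 4: put the new tree back
        }
        return wpl;
    }

    public static void main(String[] args) {
        // Merges for this input: 1+3=4, 4+6=10, 7+8=15, 10+13=23, 15+23=38, 29+38=67
        System.out.println(huffmanWpl(new int[]{13, 7, 8, 3, 29, 6, 1})); // 157
    }
}
```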

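Code

  1. Traverse the elements of the array arr and store each one as a Node in an ArrayList (nodes)

  2. Loop while nodes.size()>1

  3. Inside the while loop:

    • Sort nodes
    • Left node: the smallest, nodes.get(0)
    • Right node: the second smallest, nodes.get(1)
    • Parent node: its weight is the sum of the left and right nodes' weights
    • Link the nodes into a tree according to these relationships
    • Remove the used left and right nodes from nodes
    • Add the parent node to nodes
  4. When the loop exits, nodes holds a single node, the root of the Huffman tree, which is returned as nodes.get(0)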

package tree.huffman;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HuffmanTree {
    public static void main(String[] args) {
        int[] arr = {13,7,8,3,29,6,1};
        Node root = createHuffmanTree(arr);
        preOrder(root); // null-safe wrapper defined below
    }

    // preorder
    public static void preOrder(Node root){
        if(root!=null){
            root.preOrder();
        }else{
            System.out.println("empty");
        }

    }

    public static Node createHuffmanTree(int[] arr){
        // 1. traverse arr
        // 2. construct each element of arr into a node
        // 3. put node into ArrayList
        List<Node> nodes = new ArrayList<Node>();
        for(int value:arr){
            nodes.add(new Node(value));
        }

        // cyclic processing
        // there is only one element left in nodes when out of loop
        while(nodes.size()>1){
            // sort ascending
            // sorting works here because the elements of nodes are Node objects and Node implements Comparable
            Collections.sort(nodes);
            // System.out.println(nodes);

            // take out the two binary tree nodes of root with the smallest weight
            // 1. take out the node with smallest weight
            Node leftNode = nodes.get(0);
            // 2. take out the node with the second smallest weight
            Node rightNode = nodes.get(1);
            // 3. build a new binary tree
            Node parent = new Node(leftNode.value+rightNode.value);
            parent.left = leftNode;
            parent.right = rightNode;

            // delete binary tree already processed from ArrayList
            nodes.remove(leftNode);
            nodes.remove(rightNode);

            // add parent to nodes
            nodes.add(parent);
        }
        return nodes.get(0); // return the root of Huffman tree
    }
}

// node class
// to let Collections.sort order Node objects,
// Node implements the Comparable interface

class Node implements Comparable<Node>{
    int value; // weighted value of node
    Node left; // point to left node
    Node right; // point to right node
    public Node(int value){
        this.value = value;
    }

    // preorder traversal
    public void preOrder(){
        System.out.println(this);
        if(this.left!=null){
            this.left.preOrder();
        }
        if(this.right!=null){
            this.right.preOrder();
        }
    }

    @Override
    public String toString() {
        return "Node{" +
                "value=" + value +
                '}';
    }

    @Override
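    public int compareTo(Node o) {
        // ascending order; Integer.compare avoids the overflow that subtraction can hit
        return Integer.compare(this.value, o.value);
        // descending order:
        // return Integer.compare(o.value, this.value);
    }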
}

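Huffman Coding

Used for compressing data files; the compression ratio is typically between 20% and 90%

  • Variable-length coding: count how often each character occurs and give more frequent characters shorter binary codes. A naive assignment is ambiguous: in 1010010, if 1 stands for a and 10 stands for b, the decoder cannot tell whether the leading 1 is an a or the start of a b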
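
To make the ambiguity concrete, the sketch below counts how many different ways a bit string can be split into code words. The table a=1, b=10 comes from the example above; the extra code 0 for a third symbol is an assumption added here so that the whole string can be decoded at all. The class and method names are hypothetical, and the method assumes the code words are distinct and non-empty.

```java
import java.util.Arrays;
import java.util.List;

final class AmbiguitySketch {
    /** Counts how many ways `bits` can be split into a sequence of code words. */
    static long countDecodings(String bits, List<String> codes) {
        // ways[end] = number of ways to decode the first `end` bits
        long[] ways = new long[bits.length() + 1];
        ways[0] = 1; // the empty prefix has exactly one decoding
        for (int end = 1; end <= bits.length(); end++) {
            for (String code : codes) {
                int start = end - code.length();
                // does this code word occupy exactly bits[start, end)?
                if (start >= 0 && bits.startsWith(code, start)) {
                    ways[end] += ways[start];
                }
            }
        }
        return ways[bits.length()];
    }

    public static void main(String[] args) {
        // "1" is a prefix of "10", so the same bits can be read in several ways
        System.out.println(countDecodings("1010010", Arrays.asList("1", "10", "0"))); // 8
        // no code word is a prefix of another, so there is a single reading
        System.out.println(countDecodings("1010010", Arrays.asList("0", "10", "11"))); // 1
    }
}
```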

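How it works

  1. Take the string

    i like like like java do you like a java
    Binary encoding length: 359 (this matches 40 eight-bit codes written with 39 separating spaces; the data itself is 40 × 8 = 320 bits)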
    
  2. Count how many times each character occurs

    d:1 y:1 u:1 j:2 v:2 o:2 l:4 k:4 e:4 i:5 a:5  :9 
    
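  3. Build a Huffman tree from the character counts, using each count as the weight

  4. From the Huffman tree, assign a code to each character (a prefix code: no code is a prefix of any other code)

    • A step to the left is 0
    • A step to the right is 1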
    o:1000	u:10010		d:100110	y:100111	i:101
    a:110		k:1110		e:1111		j:0000		v:0001
    l:001		 :01
    
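  5. Under this Huffman code, the string is encoded as (lossless compression)

    101010011011110111101001101111011110100110111101111....
    
  6. Length: 133 bits, versus 359 for the original binary encoding, so the compression ratio is (359-133)/359 = 62.9% (measured against the raw 40 × 8 = 320 bits it is (320-133)/320 = 58.4%). Because this is a prefix code, decoding is never ambiguous

  7. Note: the Huffman tree can come out differently depending on how equal weights are ordered during sorting, so the resulting Huffman codes may differ as well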
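
The figures above can be checked with a short sketch that encodes the sample string using the code table from step 4 and tests the prefix property. The table is copied by hand from that step; the class and method names are hypothetical, and `encode` assumes every character of the input has an entry in the table.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

final class CodeTableSketch {
    /** The code table listed in step 4, copied by hand. */
    static Map<Character, String> articleTable() {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('o', "1000");
        codes.put('u', "10010");
        codes.put('d', "100110");
        codes.put('y', "100111");
        codes.put('i', "101");
        codes.put('a', "110");
        codes.put('k', "1110");
        codes.put('e', "1111");
        codes.put('j', "0000");
        codes.put('v', "0001");
        codes.put('l', "001");
        codes.put(' ', "01");
        return codes;
    }

    /** Concatenates the code of each character; assumes every character is in the table. */
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }

    /** True if no code word is a prefix of a different code word. */
    static boolean isPrefixFree(Collection<String> codes) {
        for (String a : codes) {
            for (String b : codes) {
                if (!a.equals(b) && b.startsWith(a)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String bits = encode("i like like like java do you like a java", articleTable());
        System.out.println(bits.length());                          // 133
        System.out.println(isPrefixFree(articleTable().values()));  // true
    }
}
```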


Data Compression

Converting the given string into a Huffman tree

  1. Build the Huffman tree for the data to compress, "i like like like java do you like a java"

    • Node{data (stores the data), weight, left, right}

    • Get the byte[] array corresponding to "i like like like java do you like a java"

    • Write a method getNode(byte[] bytes) that puts the Nodes used to build the Huffman tree into a List, of the form [Node[data=97 (ascii), weight=5], Node[data=32, weight=9]]…, reflecting d:1 y:1 u:1 j:2 v:2 o:2 l:4 k:4 e:4 i:5 a:5 (space):9

    • Build the Huffman tree from the List
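
The counting step can be sketched on its own as follows. This is a minimal illustration using `Map.merge`, not the article's `getNode` method; the class name `ByteCountSketch` and the choice of a `TreeMap` (so that printed output is ordered by byte value) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

final class ByteCountSketch {
    /** Counts how often each byte value occurs in the input. */
    static Map<Byte, Integer> countBytes(byte[] bytes) {
        Map<Byte, Integer> counts = new TreeMap<>();
        for (byte b : bytes) {
            // insert 1 for a new key, otherwise add 1 to the existing count
            counts.merge(b, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] bytes = "i like like like java do you like a java".getBytes(StandardCharsets.US_ASCII);
        Map<Byte, Integer> counts = countBytes(bytes);
        System.out.println(counts.size());           // 12 distinct bytes
        System.out.println(counts.get((byte) ' '));  // 9
        System.out.println(counts.get((byte) 'a'));  // 5
    }
}
```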


Generating the Huffman codes

  1. The Huffman codes corresponding to the Huffman tree

    o:1000	u:10010		d:100110	y:100111	i:101
    a:110		k:1110	...
    
  2. Use the code table to produce the Huffman-encoded data

    101010011011110111101001101111011110100110111101111....
    

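  • Map<Byte,String> huffmanCodes stores the Huffman codes

  • StringBuilder stringBuilder stores the path to each leaf node, e.g. "1100"

  • private static void getCodes(Node node,String code,StringBuilder stringBuilder){}

    • stringBuilder2 holds the path so far with code ("0" / "1") appended

    • If the node is not null and node.data is null (a non-leaf node), recurse into its left and right children with getCodes();

    • If node.data is not null, a leaf node has been reached; store its code in huffmanCodes: huffmanCodes.put(node.data,stringBuilder2.toString());

  • private static Map<Byte,String> getCodes(Node root){} is an overload that simplifies the parameters: it first checks whether the root is null and, if not, calls getCodes(Node node,String code,StringBuilder stringBuilder) recursively on the root's left and right subtrees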
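
The path-collecting recursion can be tried on a tiny hand-built tree. The sketch below is a variation, not the article's method: it reuses a single StringBuilder and undoes its last step on the way back up, instead of copying the builder at every call. The `TreeNode` type, the factory methods and the three-leaf tree are hypothetical. Note that in this sketch a tree consisting of a single leaf would get the empty string as its code.

```java
import java.util.HashMap;
import java.util.Map;

final class CodePathSketch {
    /** Minimal node for this example: data is null for internal nodes, non-null for leaves. */
    static final class TreeNode {
        final Byte data;
        final TreeNode left;
        final TreeNode right;

        TreeNode(Byte data, TreeNode left, TreeNode right) {
            this.data = data;
            this.left = left;
            this.right = right;
        }

        static TreeNode leaf(char c) {
            return new TreeNode((byte) c, null, null);
        }

        static TreeNode internal(TreeNode left, TreeNode right) {
            return new TreeNode(null, left, right);
        }
    }

    /** Returns the root-to-leaf path of every leaf, written with 0 for left and 1 for right. */
    static Map<Byte, String> codesOf(TreeNode root) {
        Map<Byte, String> codes = new HashMap<>();
        collect(root, new StringBuilder(), codes);
        return codes;
    }

    private static void collect(TreeNode node, StringBuilder path, Map<Byte, String> codes) {
        if (node == null) {
            return;
        }
        if (node.data != null) {              // leaf: the current path is its code
            codes.put(node.data, path.toString());
            return;
        }
        path.append('0');                     // go left
        collect(node.left, path, codes);
        path.setCharAt(path.length() - 1, '1'); // swap the last step to go right
        collect(node.right, path, codes);
        path.deleteCharAt(path.length() - 1); // undo the step before returning to the parent
    }

    public static void main(String[] args) {
        TreeNode root = TreeNode.internal(
                TreeNode.leaf('x'),
                TreeNode.internal(TreeNode.leaf('y'), TreeNode.leaf('z')));
        Map<Byte, String> codes = codesOf(root);
        System.out.println(codes.get((byte) 'x')); // 0
        System.out.println(codes.get((byte) 'y')); // 10
        System.out.println(codes.get((byte) 'z')); // 11
    }
}
```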


Generating the encoded result

private static byte[] zip(byte[] bytes,Map<Byte,String> huffmanCodes){}
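
The packing step inside a method like this, turning a string of 0s and 1s into bytes, can be sketched in isolation. The class `BitPackingSketch` and its `pack` method are hypothetical names for this example. It shows why the bits 10101000 end up as -88: parsed as base 2 they are 168, and narrowing 168 to a signed byte wraps to -88.

```java
import java.util.Arrays;

final class BitPackingSketch {
    /** Packs a string of '0'/'1' characters into bytes, 8 bits per byte, last byte possibly shorter. */
    static byte[] pack(String bits) {
        int len = (bits.length() + 7) / 8;      // round up to whole bytes
        byte[] packed = new byte[len];
        for (int i = 0; i < len; i++) {
            int start = i * 8;
            int end = Math.min(start + 8, bits.length());
            // parse the chunk as an unsigned base-2 number, then narrow to a signed byte
            packed[i] = (byte) Integer.parseInt(bits.substring(start, end), 2);
        }
        return packed;
    }

    public static void main(String[] args) {
        // "10101000" -> 168 -> (byte) -88; the trailing "101" -> 5
        System.out.println(Arrays.toString(pack("10101000101"))); // [-88, 5]
    }
}
```

The final chunk is stored as a plain number, so any leading zeros in it are not recorded; a decoder would need the original bit count from somewhere else to restore them.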

Code

package tree.huffmancode;

import java.util.*;

public class HuffmanCodeDemo {
    public static void main(String[] args) {
        String content = "i like like like java do you like a java";
        byte[] contentBytes = content.getBytes();
        byte[] huffmanCodesBytes = huffmanZip(contentBytes);
        System.out.println("after compression:"+Arrays.toString(huffmanCodesBytes));
        /*
        List<Node> nodes = getNode(contentBytes);
        System.out.println(nodes);
        // [Node{data=32, weight=9}, Node{data=97, weight=5}, Node{data=100, weight=1},
        Node huffmanTreeRoot = CreateHuffmanTree(nodes);
        huffmanTreeRoot.preOrder();
        System.out.println("huffman code: ");
        getCodes(huffmanTreeRoot);
        // System.out.println(huffmanCodes);
        byte[] huffmanCodeBytes =  zip(contentBytes,huffmanCodes);
        System.out.println(Arrays.toString(huffmanCodeBytes)); // length:17
         */
    }

    // wrap all the steps in a single method

    /**
     *
     * @param bytes byte array corresponding to the original string (content bytes)
     * @return byte array after processing by huffman code (compression array)
     */
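    private static byte[] huffmanZip(byte[] bytes){
        // count byte frequencies and wrap each (byte, count) pair in a Node
        List<Node> nodes = getNode(bytes);
        // build the Huffman tree from the nodes
        Node huffmanTreeRoot = CreateHuffmanTree(nodes);
        // derive the code table from the tree
        Map<Byte, String> huffmanCodes = getCodes(huffmanTreeRoot);
        // encode the input with the code table and pack the bits into bytes
        return zip(bytes, huffmanCodes);
    }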

    // byte[] of string ---> huffman code table ---> huffman code(byte[])

    /**
     *
     * @param bytes The byte array corresponding to the original string
     * @param huffmanCodes Huffman codes map
     * @return byte[] processed by huffman code, byte[] corresponding to "10101000..."
     * 8 bits --> 1 byte, stored in huffmanCodeBytes
     * e.g. the bits 10101000 are read as two's complement: 10101000 - 1 => 10100111 (ones' complement)
     * => keep the sign bit and invert the other bits => 11011000 (sign-magnitude) = -88
     * so huffmanCodeBytes[0] = -88
     */
    private static byte[] zip(byte[] bytes,Map<Byte,String> huffmanCodes){
        // 1. transfer bytes to string corresponding to huffmanCode using huffmanCodes
        StringBuilder stringBuilder = new StringBuilder();
        // traverse bytes
        for(byte b:bytes){
            stringBuilder.append(huffmanCodes.get(b));
        }

        // stringBuilder -> byte[]
        // count length of byte[] huffmanCodeBytes
        int len;
        if(stringBuilder.length() % 8 == 0){
            len = stringBuilder.length()/8;
        } else{
            len = stringBuilder.length() / 8 + 1;
        }
        // one line: int len = (stringBuilder.length()+7)/8;

        // create byte[] store data after compression
        byte[] huffmanCodeBytes = new byte[len];
        int index = 0; // record number of byte
        for(int i = 0;i < stringBuilder.length();i+=8){ // 8 bit->1 byte
            String strByte;
            if(i+8 > stringBuilder.length()){ // not enough 8 bit
                strByte = stringBuilder.substring(i);
            }else{
                strByte = stringBuilder.substring(i,i+8);
                // strByte --> byte array --> put into huffmanCodeBytes
            }
            huffmanCodeBytes[index] = (byte)Integer.parseInt(strByte,2); // parse the bit string as base 2, then narrow to byte
            index++;
        }
        return huffmanCodeBytes;
    }



    // huffman coding based on huffman tree
    // 1. huffman coding is stored in Map<Byte,String>, eg: o:1000 u:10010
    static Map<Byte,String> huffmanCodes = new HashMap<Byte,String>();
    // 2. define a StringBuilder storing path of leaf node when creating huffman code
    static StringBuilder stringBuilder = new StringBuilder();

    // for convenience, overload getCodes()
    private static Map<Byte,String> getCodes(Node root){
        if(root==null){
            return null;
        }
        getCodes(root.left,"0",stringBuilder);
        getCodes(root.right,"1",stringBuilder);
        return huffmanCodes;
    }

    /**
     * get all leaf nodes' huffman code of given node, putting into huffmanCodes collection
     * @param node given node, root
     * @param code path: left node:0; right node:1;
     * @param stringBuilder splice path
     */
    private static void getCodes(Node node,String code,StringBuilder stringBuilder){
        StringBuilder stringBuilder2 = new StringBuilder(stringBuilder);
        // add code into stringBuilder2
        stringBuilder2.append(code);
        if(node!=null){ // if node==null, pass
            // determine current node is leaf node or non-leaf node
            if(node.data==null){
                // recursion
                // left
                getCodes(node.left,"0",stringBuilder2);
                // right
                getCodes(node.right,"1",stringBuilder2);
            } else{ // leaf node
                huffmanCodes.put(node.data,stringBuilder2.toString());
            }
        }
    }

    private static void preOrder(Node root){
        if(root!=null){
            root.preOrder();
        }else{
            System.out.println("empty");
        }
    }


    /**
     *
     * @param bytes receive byte[]
     * @return List like: [Node[data=97 (ascii) ,weight = 5], Node[data = 32, weight = 9]]
     */
    private static List<Node> getNode(byte[] bytes){
        // 1. create an arraylist
        ArrayList<Node> nodes = new ArrayList<>();
        // 2. traverse bytes, count each byte frequency --> map[key,value]
        HashMap<Byte, Integer> counts = new HashMap<>();
        for(byte b:bytes){
            Integer count = counts.get(b);
            if(count==null){ // no data in map
                counts.put(b,1);
            }else{
                counts.put(b,count+1);
            }
        }
        // 3. transfer each key-value pair to a Node object, add to nodes collection
        for(Map.Entry<Byte,Integer> entry:counts.entrySet()){
            nodes.add(new Node(entry.getKey(),entry.getValue()));
        }
        return nodes;
    }

    // build huffman tree
    private static Node CreateHuffmanTree(List<Node> nodes){
        while(nodes.size()>1){
            Collections.sort(nodes);
            Node leftNode = nodes.get(0);
            Node rightNode = nodes.get(1);
            Node parent = new Node(null,leftNode.weight+rightNode.weight);
            parent.left = leftNode;
            parent.right = rightNode;
            nodes.remove(leftNode);
            nodes.remove(rightNode);
            nodes.add(parent);
        }
        return nodes.get(0);
    }
}

// create Node class, with data and weight
class Node implements Comparable<Node>{
    Byte data; // store data, eg: 'a' --> 97 ' ' --> 32
    int weight; // weight: number of characters|frequency
    Node left;
    Node right;

    public Node(Byte data, int weight) {
        this.data = data;
        this.weight = weight;
    }

    @Override
    public int compareTo(Node o) {
        return this.weight-o.weight; // ascending
    }

    @Override
    public String toString() {
        return "Node{" +
                "data=" + data +
                ", weight=" + weight +
                '}';
    }

    // preOrder
    public void preOrder(){
        System.out.println(this);
        if(this.left!=null){
            this.left.preOrder();
        }
        if(this.right!=null){
            this.right.preOrder();
        }
    }
}