Text Compression and Decompression Software Based on Huffman Trees

苏陌白195

Published 2022-08-21 18:17:36

Tags: algorithms, data structures, Java

Article link: https://blog.csdn.net/weixin_62261676/article/details/126452891

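I. Huffman Trees

1. How a Huffman tree is built

A Huffman tree is a kind of binary tree. Given n weights as n leaf nodes, construct a binary tree over them; if its weighted path length is the smallest among all such trees, it is a Huffman tree, also called an optimal binary tree.

Path: in a tree, the sequence of branches leading from a node down to one of its descendants is called a path. The number of branches on it is the path length.

Weighted path length: when a node is assigned a number that carries some meaning, that number is the node's weight. The product of a node's weight and the path length from the root to that node is the node's weighted path length, and the weighted path length of the tree is the sum of this product over all leaves. For example, a leaf with weight 6 whose path from the root has length 2 has a weighted path length of 2 × 6 = 12.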

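The Huffman algorithm generates a Huffman tree (optimal binary tree) as follows:

1. Initially there are n nodes whose weights are the n given numbers; treat them as n trees that each consist of only a root.

2. Merge the two trees whose roots have the smallest weights: create a parent node whose weight is the sum of the two roots' weights, remove the two trees from the set, and add the new parent in their place.

3. Repeat step 2 until only one tree is left. That tree is the Huffman tree.

For example, for the weights 2, 3, 4, 4, 5, 7, first take out the nodes 2 and 3; their parent has weight 5, leaving 4, 4, 5, 5, 7. Continuing in the same way (4 + 4 = 8, 5 + 5 = 10, 7 + 8 = 15, 10 + 15 = 25) builds the complete tree, whose root has weight 25.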
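The merge steps of this example can be checked with a short sketch. It is not the article's implementation: it swaps in a `PriorityQueue` for the repeated sort, and the class and method names are made up for illustration. It relies on the fact that the weighted path length of the finished tree equals the sum of the weights of all parent nodes created along the way.

```java
import java.util.PriorityQueue;

class WplSketch {
    // Returns the weighted path length of a Huffman tree built over the given weights.
    static int weightedPathLength(int[] weights) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int w : weights) {
            heap.add(w);
        }
        int total = 0;
        // Repeatedly merge the two lightest roots until one tree remains.
        while (heap.size() > 1) {
            int merged = heap.poll() + heap.poll();
            total += merged;   // each merge pushes both subtrees one level deeper
            heap.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        // Merges: 2+3=5, 4+4=8, 5+5=10, 7+8=15, 10+15=25
        System.out.println(weightedPathLength(new int[]{2, 3, 4, 4, 5, 7})); // prints 63
    }
}
```

The same value comes out of the definition: the leaves 2, 3, 4, 4 sit at depth 3 and the leaves 5, 7 at depth 2, so (2 + 3 + 4 + 4) × 3 + (5 + 7) × 2 = 63.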

In the text compression software, the source code for building the tree and for sorting is as follows:

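public TreeNode createTree(ArrayList<TreeNode> nodeList) {
        // Nothing to encode: avoid dereferencing a root that was never set
        if (nodeList.isEmpty()) {
            return null;
        }
        // With a single node the loop below never runs and that node is the root
        root = nodeList.get(0);
        // Repeat the following steps until only one tree is left:
        while (nodeList.size() > 1) {
            // 1. Sort the nodes by weight, ascending
            quikSout(nodeList, 0, nodeList.size() - 1);
            // 2. Take the two lightest nodes out of nodeList;
            //    the left branch is labelled 0 and the right branch 1
            TreeNode left = nodeList.remove(0);
            left.code = "0";
            TreeNode right = nodeList.remove(0);
            right.code = "1";

            // 3. Create the parent (its data is the sum of both children's data)
            // 4. The constructor also links the children to the parent
            TreeNode newNode = new TreeNode(left, right);
            root = newNode;
            root.up = null;
            // 5. Put the parent back into nodeList
            nodeList.add(newNode);
        }
        // The root lies on no branch, so it carries no code
        root.code = "";
        return root;
    }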
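public void quikSout(ArrayList<TreeNode> nodeList, int begin, int end) {
        // An empty range needs no sorting
        if (begin > end) {
            return;
        }
        // Use the first element of the range as the pivot
        TreeNode tmp = nodeList.get(begin);
        int i = begin;
        int j = end;
        while (i != j) {
            // From the right, look for a node lighter than the pivot
            while (nodeList.get(j).data >= tmp.data && j > i) {
                j--;
            }
            // From the left, look for a node heavier than the pivot
            while (nodeList.get(i).data <= tmp.data && j > i) {
                i++;
            }
            // Swap the out-of-place pair
            if (j > i) {
                TreeNode t = nodeList.get(i);
                nodeList.set(i, nodeList.get(j));
                nodeList.set(j, t);
            }
        }
        // Move the pivot to its final position, then sort both halves
        nodeList.set(begin, nodeList.get(i));
        nodeList.set(i, tmp);
        quikSout(nodeList, begin, i - 1);
        quikSout(nodeList, i + 1, end);
    }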

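2. Huffman coding

For any binary tree, label every branch: 0 for each left branch and 1 for each right branch. The code of a leaf is the sequence of labels on the path from the root down to that leaf.

In the example tree, 2 is encoded as 000 and 3 as 001. Huffman coding is what the text compression software is built on.

The code that assigns the codes to the Huffman tree in the software is as follows: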

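// Left = 0, right = 1
public void setCode(TreeNode root) {
        if (root != null) {
            // Visit the children first, so every node is finalized while its
            // ancestors still hold their single-bit branch labels
            setCode(root.right);
            setCode(root.left);
            TreeNode curr = root;
            // Climb to the root, collecting one branch label per level
            while (root.up != null) {
                str.append(root.code);
                root = root.up;
            }
            // The labels were collected bottom-up; reverse them to get the full code
            str.reverse();
            curr.code = str.toString();
            str.delete(0, str.length());
            // Only leaves carry a character; record their codes in the table
            if (curr.cdata != null) {
                codeList.put(curr.cdata, curr.code);
            }
        }
    }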
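For comparison, the left-0/right-1 rule can also be written top-down, passing the prefix collected so far into each recursive call. The sketch below is an alternative under stated assumptions, not the article's method: `Node`, `assign` and `demo` are hypothetical names, and the tiny tree is hand-built rather than produced by `createTree`.

```java
import java.util.Map;
import java.util.TreeMap;

class CodeSketch {
    // Hypothetical minimal node: leaves hold a symbol, inner nodes hold two children.
    static class Node {
        String symbol;
        Node left, right;
        Node(String symbol) { this.symbol = symbol; }
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Appends 0 when going left and 1 when going right; stores the prefix at each leaf.
    static void assign(Node node, String prefix, Map<String, String> table) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            table.put(node.symbol, prefix);
            return;
        }
        assign(node.left, prefix + "0", table);
        assign(node.right, prefix + "1", table);
    }

    static Map<String, String> demo() {
        // Tree shape: root -> (inner -> (a, b), c)
        Node root = new Node(new Node(new Node("a"), new Node("b")), new Node("c"));
        Map<String, String> table = new TreeMap<>();
        assign(root, "", table);
        return table;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a=00, b=01, c=1}
    }
}
```

Because symbols sit only at leaves, no code is a prefix of another, which is what lets the decoder split the bit string without separators.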

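II. Text Compression

// TreeNode class
class TreeNode {
    public String cdata;   // the character (leaves only; null for inner nodes)
    public String code;    // branch label, later the node's full Huffman code
    public int data;       // weight: how often the character occurs
    public TreeNode left;
    public TreeNode right;
    public TreeNode up;    // parent node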

    public TreeNode() {

    }

    public TreeNode(String cdata, int data) {
        this.data = data;
        this.cdata = cdata;
    }

    public TreeNode(TreeNode left, TreeNode right) {
        this.left = left;
        this.right = right;
        this.data = this.left.data + this.right.data;
        left.up = this;
        right.up = this;
    }
}

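1. Reading the text file

Reading a file requires I/O streams, which come in two kinds: byte streams and character streams.

Byte streams: can read files of any format

Input stream (InputStream): reads a file

Output stream (OutputStream): writes a file

Character streams: only suitable for text files

Input stream (Reader): reads a file

Output stream (Writer): writes a file

Byte streams mainly handle raw binary data; character streams decode bytes into characters using a character set.

A Java char takes 16 bits (two bytes) in memory, while a character stored in a file may take one or more bytes depending on the encoding, so for both convenience and efficiency we read the text through a character stream.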

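public String readFile(File file) throws Exception {
        StringBuilder msg = new StringBuilder();
        // Buffered character input stream over a file reader;
        // try-with-resources closes it once reading is done
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            // Read character by character instead of using readLine(),
            // which strips line terminators and would lose the file's line breaks
            int c;
            while ((c = br.read()) != -1) {
                msg.append((char) c);
            }
        }
        return msg.toString();
    }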

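StringBuilder's append method is used here instead of the string + operator for speed: concatenating with + inside a loop creates a new String object on every iteration, which costs time.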
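The difference between counting bytes and counting characters can be seen in a few lines. This is only an illustration; the class name and the choice of UTF-8 are assumptions, since the article does not say which encoding its input files use.

```java
import java.nio.charset.StandardCharsets;

class CharVsByteSketch {
    // Number of bytes the text occupies when stored as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "a\u4e2d";                 // 'a' followed by one CJK character
        System.out.println(s.length());       // prints 2 (two chars)
        System.out.println(utf8Bytes(s));     // prints 4 (1 byte + 3 bytes)
    }
}
```

A byte stream would hand the program four separate values here, whereas a Reader delivers the two characters that the frequency count actually needs.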

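2. Counting, encoding, and building the tree

To count how often each character occurs in the string we need to store a character together with its frequency, which is exactly a key-value pair, so we store them in a hash table. The characters are taken out of the string one by one with charAt, and each distinct character is also recorded in an ArrayList (name), which is used later when the hash table is written to the file.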

//Count how often each character occurs
    public ArrayList<TreeNode> stringBuild(String s) {
        ArrayList<TreeNode> nodeList = new ArrayList<>();

        HashMap<String, Integer> hm = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            String key = String.valueOf(s.charAt(i));
            if (!hm.containsKey(key)) {
                // First occurrence: remember the character in the name list
                name.add(key);
            }
            // Insert 1 for a new key, otherwise add 1 to the stored count
            hm.merge(key, 1, Integer::sum);
        }
        // Create one leaf node per distinct character, weighted by its count
        for (String key : name) {
            nodeList.add(new TreeNode(key, hm.get(key)));
        }
        return nodeList;
    }
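As a side note, the separate name list exists only to remember the order in which characters were first seen. A `LinkedHashMap` keeps that order by itself, so a variant like the following sketch could do without it. The class and method names are hypothetical and the rest of the article's code is not adapted to it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FrequencySketch {
    // Counts characters; iteration order of the result is first-seen order.
    static Map<Character, Integer> count(String s) {
        Map<Character, Integer> freq = new LinkedHashMap<>();
        for (char c : s.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("abacab")); // prints {a=3, b=2, c=1}
    }
}
```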

    // Tree building (createTree), sorting (quikSout) and code assignment (setCode,
    // left = 0, right = 1) are the methods already listed in the Huffman tree part
    // above. setCode relies on this shared buffer, declared as a field:
    StringBuilder str = new StringBuilder();

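3. Converting the text into a 01 string

This is where the codes come into play: using the hash table, every character of the text file is replaced by its code, and the codes are joined into one 01 string for the next step.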

public void changeCode(String s) {
        int i = 0;
        StringBuilder str = new StringBuilder();
        while (i < s.length()) {
            String key = s.charAt(i) + "";
            String value = codeList.get(key);
            str.append(value);
            i++;
        }
        code = str.toString();
    }
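A small worked example may make the lookup concrete. The code table below is made up for illustration (it is a valid prefix code, not one produced by the article's tree), and unlike `changeCode` the sketch fails loudly when a character has no code instead of appending the text "null".

```java
import java.util.Map;

class EncodeSketch {
    // Replaces every character by its code and joins the codes into one bit string.
    static String encode(String text, Map<String, String> table) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String code = table.get(String.valueOf(text.charAt(i)));
            if (code == null) {
                throw new IllegalArgumentException("no code for character at index " + i);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table: a -> 0, b -> 10, c -> 11
        Map<String, String> table = Map.of("a", "0", "b", "10", "c", "11");
        System.out.println(encode("abac", table)); // prints 010011
    }
}
```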

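4. Converting the 01 string into bytes, eight bits at a time

When the 01 string is split into groups of eight, the last group is sometimes shorter than eight bits. It then has to be padded to full length (here with 0s), and one more byte is appended at the very end to record how many padding bits were added.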

public void bitToByte() {
        // Pad the code with 0s so that its length is a multiple of 8
        int pad = (8 - code.length() % 8) % 8;
        StringBuilder str = new StringBuilder(code);
        for (int k = 0; k < pad; k++) {
            str.append("0");
        }
        code = str.toString();
        // Split the code into groups of eight and read each group as a base-2 number (0..255)
        for (int i = 0; i < code.length(); i += 8) {
            String sbyte = code.substring(i, i + 8);
            intList.add(Integer.parseInt(sbyte, 2));
        }
        // Always append the padding count, even when it is 0: the decompressor
        // treats the last byte of the file as this count
        intList.add(pad);
    }
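The same packing can be done with bit operations instead of string parsing. The following is a sketch with hypothetical names; it returns the values in an int array whose last element is the padding count, mirroring the layout described above.

```java
import java.util.Arrays;

class PackSketch {
    // Packs a string of '0'/'1' characters into byte values (0..255).
    // The final element is the number of 0 bits appended as padding.
    static int[] pack(String bits) {
        int pad = (8 - bits.length() % 8) % 8;
        int groups = (bits.length() + pad) / 8;
        int[] out = new int[groups + 1];
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                // Bit 0 of a group is the most significant bit of its byte
                out[i / 8] |= 1 << (7 - i % 8);
            }
        }
        out[groups] = pad;
        return out;
    }

    public static void main(String[] args) {
        // "010011" is padded to "01001100" = 64 + 8 + 4 = 76, with 2 padding bits
        System.out.println(Arrays.toString(pack("010011"))); // prints [76, 2]
    }
}
```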

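5. Writing the file

During compression we have to write not only the 01 string but also the code table into the file. Since decompression maps 01 strings back to characters, we first swap each key and value and then write the table with an object output stream (ObjectOutputStream). After that we write the bytes converted from the 01 string; with the later decompression in mind, these are written through a byte output stream (FileOutputStream).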

public void changeHash() {
        for (int i = 0; i < name.size(); i++) {
            codeReverse.put(codeList.get(name.get(i)),name.get(i));
        }
    }

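 public void writeFile(String path) throws Exception {
        File file = new File(path);
        // Only one stream is opened on the file; try-with-resources closes it
        try (FileOutputStream fo = new FileOutputStream(file)) {
            ObjectOutputStream oo = new ObjectOutputStream(fo);
            changeHash();
            // Write the reversed table (code -> character) first
            oo.writeObject(codeReverse);
            // Flush so the table reaches the file before the raw bytes
            oo.flush();
            // Then write the packed bytes, followed by the padding count
            for (int i = 0; i < intList.size(); i++) {
                fo.write(intList.get(i));
            }
        }
    }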
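One way to avoid mixing two streams on the same file is to send the packed bytes through the object stream as well, prefixed by their length. The sketch below shows that alternative layout in memory; it is an assumption of this example, not the file format used by the article, and all names in it are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.HashMap;

class ContainerSketch {
    // Serializes the table, then the payload length, then the payload itself.
    static byte[] write(HashMap<String, String> table, byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oo = new ObjectOutputStream(buf)) {
            oo.writeObject(table);
            oo.writeInt(payload.length);
            oo.write(payload);
        }
        return buf.toByteArray();
    }

    // Writes a small archive, reads it back and reports whether both parts survived.
    static boolean roundTripOk() {
        try {
            HashMap<String, String> table = new HashMap<>();
            table.put("0", "a");
            table.put("10", "b");
            byte[] payload = {76, 2};
            byte[] data = write(table, payload);
            try (ObjectInputStream oi = new ObjectInputStream(new ByteArrayInputStream(data))) {
                Object readTable = oi.readObject();
                byte[] readPayload = new byte[oi.readInt()];
                oi.readFully(readPayload);
                return table.equals(readTable) && Arrays.equals(payload, readPayload);
            }
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripOk()); // prints true
    }
}
```

With the length stored explicitly, the reader knows exactly how many bytes belong to the payload and does not depend on reaching the end of the file.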

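III. Text Decompression

1. Reading the file

Following the rule that data is read back the same way it was written, we first use an object input stream (ObjectInputStream) to read the code hash table whose keys and values were swapped, obtaining it through a cast, and then read the byte values that the 01 string was converted into.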

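public void readFile(File file) throws IOException, ClassNotFoundException {
        StringBuilder str = new StringBuilder();
        // try-with-resources closes the file once reading is done
        try (FileInputStream fi = new FileInputStream(file)) {
            ObjectInputStream oi = new ObjectInputStream(fi);
            // The reversed table (code -> character) was written first
            codeList = (HashMap<String, String>) oi.readObject();
            // Every remaining byte is expanded into eight '0'/'1' characters
            int read = fi.read();
            int last = 0; // ends up holding the final byte, i.e. the padding count
            while (read != -1) {
                last = read;
                str.append(intToByte(read));
                read = fi.read();
            }
        }
        // Drop the 8 bits of the count byte plus the padding bits before it
        str.delete(str.length() - 8 - last, str.length());
        code = str.toString();
    }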

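2. Converting a decimal number to binary

1) Divide the decimal number by 2 and record the remainder, then divide the quotient by 2 and record the remainder again, repeating until the quotient is 0.

2) Reverse the string of recorded remainders to get the result.

For example, converting 10 to binary:

10/2 = 5 remainder 0, 5/2 = 2 remainder 1, 2/2 = 1 remainder 0, 1/2 = 0 remainder 1

The remainders, in the order produced, form 0101. Padded with 0s to eight digits this becomes 01010000, and reversing it gives 00001010, which is 10 in binary.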

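public String intToByte(int read) {
        StringBuilder ibyte = new StringBuilder();
        // Collect the remainders of repeated division by 2 (lowest bit first)
        while (read >= 1) {
            ibyte.append(read % 2);
            read /= 2;
        }
        // Pad with 0s up to eight digits; after the reversal they become leading zeros
        if (ibyte.length() < 8) {
            for (int i = ibyte.length(); i < 8; i++) {
                ibyte.append("0");
            }
        }
        // Reverse so that the highest bit comes first
        ibyte.reverse();
        return ibyte.toString();
    }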
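The standard library can do the same conversion. A minimal sketch, assuming the input is a byte value between 0 and 255 (the class and method names are made up here):

```java
class BinarySketch {
    // Returns the eight-character binary form of a byte value, with leading zeros.
    static String toBits(int value) {
        String raw = Integer.toBinaryString(value & 0xFF);
        // Left-pad with spaces to width 8, then turn the spaces into zeros
        return String.format("%8s", raw).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(toBits(10));  // prints 00001010
        System.out.println(toBits(255)); // prints 11111111
    }
}
```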

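3. Converting the 01 string back into the real content

After reading the 01 string we must delete its last eight bits (the byte that records the padding count) and then, working backward from the end, delete as many bits as that byte indicates. Only then do we have the original 01 string. The file-reading method has already done this, so the text can now be restored from the hash table.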

public String byteToReal() {
        StringBuilder str1 = new StringBuilder(code);
        StringBuilder str2 = new StringBuilder(); // bits collected for the current code
        StringBuilder real = new StringBuilder(); // restored text
        // Walk over every bit, including the last one
        for (int i = 0; i < str1.length(); i++) {
            str2.append(str1.charAt(i));
            // As soon as the collected bits match a code, emit its character and start over
            if (codeList.get(str2.toString()) != null) {
                real.append(codeList.get(str2.toString()));
                str2.delete(0, str2.length());
            }
        }
        return real.toString();
    }
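The matching loop can be tried on its own with a hand-written reversed table. The table below is hypothetical (code to character for a = 0, b = 10, c = 11) and the names are made up for this sketch; the bit string is the encoding of "abac" under that table.

```java
import java.util.Map;

class DecodeSketch {
    // Collects bits until they match a code, emits the character, and starts again.
    static String decode(String bits, Map<String, String> reverse) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            String symbol = reverse.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> reverse = Map.of("0", "a", "10", "b", "11", "c");
        System.out.println(decode("010011", reverse)); // prints abac
    }
}
```

Matching greedily on the shortest prefix is safe only because no Huffman code is a prefix of another.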