Text Compression with Huffman Trees

Compression software is everywhere in day-to-day development; WinRAR, for instance, can compress files of any kind. This article builds a text compression tool for compressing text files.

The approach: build a Huffman tree and use it to compress. For large files, single-threaded compression is slow, so multiple threads compress chunks concurrently; and since frequently creating and destroying threads drags the system down, a thread pool serves as a further optimization.

Huffman Trees in Brief

A Huffman tree is an optimal binary tree.

Definition: given n weights as n leaf nodes, construct a binary tree; if the tree's weighted path length (WPL) reaches the minimum, the tree is called a Huffman tree.

The weighted path length is computed as shown below.

(figure: a Huffman tree built from the four leaf weights 7, 11, 18, and 20)

In the Huffman tree built from these weights, leaf 20 is at path length 1 from the root, leaf 18 at path length 2, and leaves 7 and 11 at path length 3, so the tree's weighted path length is 7 * 3 + 11 * 3 + 18 * 2 + 20 * 1 = 110.

This is an optimal binary tree: the largest weights sit closest to the root with the shortest paths, which guarantees the weighted sum is minimal. Huffman coding exploits exactly this principle: nodes with large weights (characters that occur often) are placed closest to the root, so the encoded output uses the fewest bits, achieving compression.
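As a quick check of the arithmetic, the minimum WPL can be computed without building the tree at all: each time the two smallest weights are merged, the merged sum is added to the total. A minimal sketch (class and method names are mine, not part of the project):

```java
import java.util.PriorityQueue;

// Sketch: the minimum WPL equals the sum of all internal-node weights created
// while merging, because each merge pushes its leaves one level deeper,
// charging every contained weight once more.
public class WplDemo {

    public static int minWpl(int[] weights) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int w : weights) queue.add(w);
        int wpl = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll(); // two smallest weights
            wpl += merged;                            // cost of this merge
            queue.add(merged);
        }
        return wpl;
    }

    public static void main(String[] args) {
        // merges: 7+11=18, 18+18=36, 36+20=56, so WPL = 18+36+56 = 110
        System.out.println(minWpl(new int[]{7, 11, 18, 20}));
    }
}
```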

Basic Compression

The implementation follows.

The HufNode class represents a node of the Huffman tree; besides the left and right children, it records the node's character.

public class HufNode {

    char value;
    HufNode left;
    HufNode right;

    public HufNode(char value) {
        this.value = value;
    }

    public HufNode(char value, HufNode left, HufNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

The TempNode class wraps a HufNode together with its occurrence count for building the tree. The process works as follows.

Count the occurrences of every character, build a HufNode for each, and wrap the count and the HufNode in a TempNode. Put all TempNodes into a priority queue, then repeatedly remove the two smallest elements, combine them under a new node, and put the result back, until only one element remains. Because the priority queue is a min-heap ordered by the TempNode count field, each removal is guaranteed to yield the two least frequent elements.

public class TempNode {

    int count;
    HufNode node; 

    public TempNode(int count, HufNode node) {
        this.count = count;
        this.node = node;
    }

}

The Huffman tree implementation follows.

The tool supports remote decompression: the compressed file can be sent to someone else and unpacked on another machine. The code table therefore has to travel inside the compressed file, stored in a dedicated field. When reading the file back, the encoded payload and the code table must be told apart, so the non-ASCII character "李" delimits them, and decoding splits on "李" as well.
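The record layout described above can be sketched in isolation. Everything below except the '李' sentinel convention is hypothetical (the method name and sample record are mine):

```java
// Sketch of the record layout: encoded payload + '李' + spare-bit count + '\n'
// + code table. The payload contains only 7-bit packed characters, so the
// first '李' reliably marks where it ends.
public class SentinelSplitDemo {

    // returns {payload, trailer} for a record laid out as payload + '李' + trailer
    public static String[] split(String record) {
        int sep = record.indexOf('李'); // first occurrence only: the trailer may contain more text
        return new String[]{record.substring(0, sep), record.substring(sep + 1)};
    }

    public static void main(String[] args) {
        String[] parts = split("PACKED李3\n我0\n喜100\n");
        System.out.println(parts[0]); // the packed payload
        System.out.println(parts[1]); // spare-bit count plus code table
    }
}
```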

public class HuffmanTree {

    // original string
    public char[] origin;
    // encoded string
    public String transfer;
    // root of the Huffman tree
    public HufNode root;
    // character-to-code mapping
    public Map<Character, StringBuilder> relation = new HashMap<>();

    public HuffmanTree(Map<Character, StringBuilder> relation) {
        buildTree(relation);
    }

    public HuffmanTree(String s) {
        Map<Character, Integer> map = new HashMap<>();
        char[] arr = s.toCharArray();
        origin = arr;

        // count character occurrences
        for (int i = 0; i < arr.length; i++) {
            map.put(arr[i], map.getOrDefault(arr[i], 0) + 1);
        }

        // Handle the empty and single-character inputs separately
        if (map.isEmpty()) {
            root = null;
            // All characters live in leaf nodes; even a single character must
            // not sit in the root itself, so it is placed in the left subtree
        } else if (map.size() == 1) {
            root = new HufNode('+');
            root.left = new HufNode(arr[0]);
        } else {
            PriorityQueue<TempNode> queue = new PriorityQueue<>(new Comparator<TempNode>() {
                @Override 
                public int compare(TempNode o1, TempNode o2) {
                    return o1.count - o2.count;
                }
            });
            // Repeatedly take the two nodes with the smallest counts as left and
            // right children, build their parent, and put it back into the queue
            for (Character c : map.keySet()) {
                HufNode node = new HufNode(c);
                queue.add(new TempNode(map.get(c), node));
            }
            while (queue.size() > 1) {
                TempNode left = queue.poll(), right = queue.poll();
                queue.add(new TempNode(left.count + right.count, new HufNode(' ', left.node, right.node)));
            }
            root = queue.poll().node;
        }
    }

    public Map<Character, StringBuilder> encode() {
        encode(root, new StringBuilder(""));
        return relation;
    }

    // If node is null, return immediately; the path string is then maintained
    // on return without extra checks. When adding to the map, copy the string:
    // the recursion keeps mutating the shared StringBuilder
    private void encode(HufNode node, StringBuilder res) {
        if (node == null) return;
        if (node.left == null && node.right == null) relation.put(node.value, new StringBuilder(res));
        encode(node.left, res.append('0'));
        res.delete(res.length() - 1, res.length());
        encode(node.right, res.append('1'));
        res.delete(res.length() - 1, res.length());
    }

    // First encode the given string, then store every 7 bits as one ASCII
    // character. The non-ASCII character 李 separates the payload from the rest;
    // the value after it records how many bits the last packed character uses
    public StringBuilder transfer() {
        StringBuilder temp = new StringBuilder("");
        for (int i = 0; i < origin.length; i++) {
            temp.append(relation.get(origin[i]));
        }
        StringBuilder res = new StringBuilder("");
        int count = 0, iter = 0, times = 1;
        while (iter + 6 < temp.length()) {
            count = 0; times = 1;
            for (int i = 0; i < 7; i++) {
                count += times * (temp.charAt(iter) - '0');
                iter++; times *= 2;
            }
            res.append((char) count);
        }
        count = 0; times = 1; int spare = temp.length() - iter;
        // Note: the bits are scanned forward while times grows 1, 2, ..., 64, which
        // amounts to converting the reversed bit string, so times simply starts at 1
        while (iter < temp.length()) {
            count += times * (temp.charAt(iter) - '0');
            iter++; times *= 2;
        }
        res.append((char) count);
        res.append("李");
        res.append(spare);
        this.transfer = res.toString();
        return res;
    }

    // restore the original string from the encoded one
    public StringBuilder decode() {
        char[] arr = transfer.toCharArray();
        StringBuilder temp = new StringBuilder("");
        int iter = 0;

        // Convert the characters before the delimiter back to bits.
        // Packing effectively reversed each 7-bit group, so reverse each one here
        while (arr[iter] != '李') {
            temp.append(char2Bin(arr[iter]).reverse());
            iter++;
        }

        // drop the padding bits of the last packed character
        iter++; temp.delete(temp.length() - 7 + transfer.charAt(iter) - '0',
                temp.length());

        // walk the Huffman tree to recover the original string
        arr = temp.toString().toCharArray();
        StringBuilder res = new StringBuilder("");
        iter = 0;
        while (iter < arr.length) {
            HufNode node = root;
            while (!(node.left == null && node.right == null)) {
                if (arr[iter] == '0') node = node.left;
                else node = node.right;
                iter++;
            }
            // reached a leaf: append its character
            res.append(node.value);
        }
        return res;
    }

    // During recovery, convert each character to its 7-bit binary string before walking the tree
    private StringBuilder char2Bin(int num) {
        StringBuilder res = new StringBuilder("0000000");
        int iter = 6;
        while (num > 0) {
            res.setCharAt(iter, (char) (num % 2 + '0'));
            num = num / 2; iter--;
        }
        return res;
    }

    // Rebuild the Huffman tree from the code table; used to initialize after receiving a compressed file
    public void buildTree(Map<Character, StringBuilder> relation) {
        root = new HufNode('+');
        for (Character c : relation.keySet()) {
            StringBuilder path = relation.get(c);
            HufNode iter = root;
            for (int i = 0; i < path.length(); i++) {
                if (path.charAt(i) == '0') {
                    if (iter.left == null) iter.left = new HufNode('+');
                    iter = iter.left;
                } else {
                    if (iter.right == null) iter.right = new HufNode('+');
                    iter = iter.right;
                }
            }
            iter.value = c;
        }
    }
}

At this point text files can be compressed, both English and Chinese. But because the code table must be stored, and Chinese characters repeat relatively rarely, a compressed Chinese file can come out larger than the original. I have not found a clean way around this, so consider it a known limitation.

As an example, compressing "我喜欢中国" produces file content like the following (schematic): besides the encoded form of "我喜欢中国", the code table is stored as well, so the file takes more space than the original.

xxxxxxxxxxxxx李
我0
喜100
欢101
中110
国111

For English files, however, the effect is clear: the compression ratio reaches about 60%.
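The 7-bits-per-character packing that transfer() and decode() implement can be exercised on its own. This is a simplified sketch with my own method names; unlike transfer(), it does not emit an extra character when the bit count is an exact multiple of 7:

```java
public class BitPackDemo {

    // Pack a '0'/'1' string, 7 bits per output char, least-significant bit
    // first, mirroring transfer(). The leftover-bit count ("spare") must be
    // stored alongside the packed text so unpacking can trim the padding.
    public static String pack(String bits) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bits.length(); i += 7) {
            int value = 0;
            // walk the group backwards so the first bit lands in the low position
            for (int j = Math.min(i + 7, bits.length()) - 1; j >= i; j--) {
                value = value * 2 + (bits.charAt(j) - '0');
            }
            out.append((char) value);
        }
        return out.toString();
    }

    // Unpack, mirroring decode(): expand each char to 7 bits, low bit first,
    // then trim the padding bits of the final character.
    public static String unpack(String packed, int spare) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < packed.length(); i++) {
            int value = packed.charAt(i);
            for (int j = 0; j < 7; j++) {
                out.append((char) ('0' + (value & 1)));
                value >>= 1;
            }
        }
        if (spare != 0) out.delete(out.length() - (7 - spare), out.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = "010011101";     // 9 bits -> one full char + 2 spare bits
        int spare = bits.length() % 7; // 2
        System.out.println(unpack(pack(bits), spare).equals(bits)); // true
    }
}
```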

Multithreaded Compression

Single-threaded compression is slow: for a large file, just reading it takes a while, and traversing the content to build the tree takes even longer. Multithreading helps.

Java offers three ways to create threads; since each worker must return its encoded string, the threads here implement the Callable interface.
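The Callable + FutureTask pattern used below can be sketched minimally (the class, method, and result string here are mine, for illustration only):

```java
import java.util.concurrent.FutureTask;

// Minimal sketch of the Callable + FutureTask pattern: unlike Runnable, a
// Callable returns a value, which the caller collects (blocking if necessary)
// through Future.get().
public class CallableDemo {

    public static String runOnce() {
        try {
            FutureTask<String> task = new FutureTask<>(() -> "compressed-chunk");
            new Thread(task).start();
            return task.get(); // blocks until the worker thread finishes
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```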

As before, several threads compress chunks concurrently and the results are written to one file. To tell the chunks apart, the non-ASCII character "竺" separates them, and decompression splits on "竺" as well.

The code follows.

package huffman;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;

public class ImproveZip {

    class ZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private RandomAccessFile in;
        private int size;

        // The length must be given explicitly; otherwise the unused tail of the
        // read buffer would end up in the output file as padding
        public ZipThread(RandomAccessFile in, int size) {
            this.in = in;
            this.size = size;
        }

        @Override
        public StringBuilder call() {
            StringBuilder temp = null;
            try {
                // read the chunk, encode it with the Huffman tree, and return the result
                byte[] filecontent = new byte[size];
                in.read(filecontent, 0, size);
                String content = new String(filecontent, encoding);
                // build the Huffman tree and produce the encoded string
                tree = new HuffmanTree(content);
                temp = tree.transfer().append('\n');
                Map<Character, StringBuilder> relation = tree.relation;
                for (Character c : relation.keySet()) {
                    // A code's leading zeros would be lost in decimal form, so a '1' is prefixed for disambiguation
                    temp.append("" + c + bin2dec(relation.get(c).insert(0, '1'))).append('\n');
                }
                // there may be several chunks; 竺 separates them
                temp.append("竺");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return temp;
        }
    }

    class UnZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private String content;

        public UnZipThread(String content) {
            this.content = content;
        }

        @Override
        public StringBuilder call() {
            // The encoded payload ends at the first "李"; later ones may appear in
            // the table, so split at the first occurrence. The digit count of the
            // spare field is unknown, so the first newline terminates it
            int sep = content.indexOf("李"), end = content.indexOf("\n", sep + 1);
            // character array holding the code table
            char[] arr = content.substring(end + 1).toCharArray();

            // rebuild the mapping and construct the Huffman tree
            Map<Character, StringBuilder> relation = new HashMap<>();
            char c; int iter = 0, num = 0;
            while (iter < arr.length - 1) {
                c = arr[iter]; iter++; num = 0;
                while (arr[iter] != '\n') {
                    num = num * 10 + arr[iter] - '0';
                    iter++;
                }
                iter++;
                relation.put(c, dec2bin(num));
            }
            tree = new HuffmanTree(relation);

            // set the encoded string and recover the original
            tree.transfer = content.substring(0, end);
            return tree.decode();
        }
    }

    private static final int MAX_SIZE = 1000000;
    private static final String encoding = "UTF-8";

    public void zip(String path) {
        RandomAccessFile in = null;
        BufferedWriter out = null; FileWriter writer = null;
        File file = new File(path);

        // read the file
        Long filelength = file.length();
        // Split the work by chunk size; results go into a map and are concatenated into the output file
        int len = (int) (filelength / MAX_SIZE);
        Map<Integer, StringBuilder> map = new HashMap<>(len + 1);
        try {
            in = new RandomAccessFile(path, "r");
            // Read chunks and build trees in worker tasks; task.get() inside the
            // loop waits for each chunk, which keeps reads on the shared
            // RandomAccessFile sequential
            for (int i = 0; i < len + 1; i++) {
                FutureTask<StringBuilder> task = new FutureTask<>(new ZipThread(in, (int) Math.min(MAX_SIZE, filelength)));
                filelength -= MAX_SIZE;
                Thread thread = new Thread(task);
                thread.start();
                map.put(i, task.get());
            }

            // The File class describes a directory or file; it cannot read or
            // modify file contents. RandomAccessFile reads and writes contents
            File outputFile = new File(path.replace(".txt", "-zip.txt"));
            outputFile.createNewFile();
            writer = new FileWriter(outputFile);
            out = new BufferedWriter(writer);
            for (int i = 0; i < len + 1; i++) {
                String s = map.get(i).toString();
                if (s != null && s.length() != 0) out.write(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (out != null) out.close();
                if (in != null) in.close();
            } catch (IOException e) {
                System.out.println("failed to close file streams");
            }
        }
    }

    public void unzip(String path) {
        RandomAccessFile in = null;
        BufferedWriter out = null; FileWriter writer = null;
        File file = new File(path);
        String encoding = "UTF-8";
        // Read the whole file, split it on "竺", decode the chunks in worker threads, then concatenate
        Long filelength = file.length();
        byte[] filecontent = new byte[filelength.intValue()];
        try {
            in = new RandomAccessFile(path, "r");

            // read the entire file
            in.read(filecontent);
            String content = new String(filecontent, encoding);
            String[] split = content.split("竺");

            // Note: the trailing "竺" does not yield an extra empty string, so the array length equals the chunk count
            Map<Integer, StringBuilder> map = new HashMap<>(split.length);
            for (int i = 0; i < split.length; i++) {
                FutureTask<StringBuilder> task = new FutureTask<>(new UnZipThread(split[i]));
                Thread thread = new Thread(task);
                thread.start();
                map.put(i, task.get());
            }

            // concatenate the chunks and write them out
            File outputFile = new File(path.replace("-zip.txt", "-recover.txt"));
            outputFile.createNewFile();
            writer = new FileWriter(outputFile);
            out = new BufferedWriter(writer);
            // Write sequentially to preserve chunk order; the threads cannot all write the file at once
            for (int i = 0; i < split.length; i++) {
                out.write(map.get(i).toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (out != null) out.close();
                if (in != null) in.close();
            } catch (IOException e) {
                System.out.println("文件流关闭异常");
            }
        }
    }

    // store a binary code as a decimal number
    private int bin2dec(StringBuilder bin) {
        int res = 0, times = 1;
        for (int i = bin.length() - 1; i > -1; i--, times *= 2) {
            res += (bin.charAt(i) - '0') * times;
        }
        return res;
    }

    // convert a decimal number back to its binary code
    private StringBuilder dec2bin(int num) {
        StringBuilder res = new StringBuilder();
        while (num > 1) {
            res.append(num % 2);
            num = num / 2;
        }
        res.append(num);
        // The stored form carries an extra leading 1; it is the last character
        // appended here, so delete it first and then reverse the string
        return res.delete(res.length() - 1, res.length()).reverse();
    }
}
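The leading-'1' trick used by bin2dec/dec2bin can be verified in isolation. The method bodies below mirror the ones in the class above (restructured slightly); the demo code string is mine:

```java
public class CodeTableDemo {

    // Mirror of bin2dec: parse a '0'/'1' string as a binary number, MSB first.
    public static int bin2dec(String bin) {
        int res = 0;
        for (int i = 0; i < bin.length(); i++) {
            res = res * 2 + (bin.charAt(i) - '0');
        }
        return res;
    }

    // Mirror of dec2bin: the binary digits of num, minus the leading sentinel '1'.
    public static String dec2bin(int num) {
        StringBuilder res = new StringBuilder();
        while (num > 1) {        // stop before emitting the sentinel bit
            res.append(num % 2); // low-order bits first
            num /= 2;
        }
        return res.reverse().toString();
    }

    public static void main(String[] args) {
        // A Huffman code such as "0010" would lose its leading zeros if stored
        // as the number 2, so a '1' is prefixed first: "10010" -> 18 -> "0010".
        int stored = bin2dec("1" + "0010");
        System.out.println(stored);          // 18
        System.out.println(dec2bin(stored)); // 0010
    }
}
```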

Thread Pool Optimization

Frequently creating and destroying threads slows the system down, so a thread pool is used instead. The logic is exactly the same as before; only the thread creation and teardown changes to use the pool. The code follows.

package huffman;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ImproveZip2 {

    class ZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private RandomAccessFile in;
        private int size;

        // The length must be given explicitly; otherwise the unused tail of the
        // read buffer would end up in the output file as padding
        public ZipThread(RandomAccessFile in) {
            this.in = in;
        }

        public void setSize(int size) {
            this.size = size;
        }

        @Override
        public StringBuilder call() {
            StringBuilder temp = null;
            try {
                // read the chunk, encode it with the Huffman tree, and return the result
                byte[] filecontent = new byte[size];
                in.read(filecontent, 0, size);
                String content = new String(filecontent, encoding);
                tree = new HuffmanTree(content);
                temp = tree.transfer().append('\n');
                Map<Character, StringBuilder> relation = tree.relation;
                for (Character c : relation.keySet()) {
                    // A code's leading zeros would be lost in decimal form, so a '1' is prefixed for disambiguation
                    temp.append("" + c + bin2dec(relation.get(c).insert(0, '1'))).append('\n');
                }
                // there may be several chunks; 竺 separates them
                temp.append("竺");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return temp;
        }
    }

    class UnZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private String content;

        public UnZipThread() { }

        public void setContent(String content) {
            this.content = content;
        }

        @Override
        public StringBuilder call() {
            // The encoded payload ends at the first "李"; later ones may appear in
            // the table, so split at the first occurrence. The digit count of the
            // spare field is unknown, so the first newline terminates it
            int sep = content.indexOf("李"), end = content.indexOf("\n", sep + 1);
            // character array holding the code table
            char[] arr = content.substring(end + 1).toCharArray();

            // rebuild the mapping and construct the Huffman tree
            Map<Character, StringBuilder> relation = new HashMap<>();
            char c; int iter = 0, num = 0;
            while (iter < arr.length - 1) {
                c = arr[iter]; iter++; num = 0;
                while (arr[iter] != '\n') {
                    num = num * 10 + arr[iter] - '0';
                    iter++;
                }
                iter++;
                relation.put(c, dec2bin(num));
            }
            tree = new HuffmanTree(relation);

            // set the encoded string and recover the original
            tree.transfer = content.substring(0, end);
            return tree.decode();
        }
    }

    private static final int MAX_SIZE = 10000;
    private static final String encoding = "UTF-8";

    private ThreadPoolExecutor zip = new ThreadPoolExecutor(
            20, 50, 2,
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());
    private ThreadPoolExecutor unZip = new ThreadPoolExecutor(
            20, 50, 2,
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void zip(String path) {

        RandomAccessFile in = null;
        BufferedWriter out = null; FileWriter writer = null;
        Long filelength = new File(path).length();
        int len = (int) (filelength / MAX_SIZE);
        Map<Integer, Future<StringBuilder>> map = new HashMap<>(len + 1);
        try {
            in = new RandomAccessFile(path, "r");
            for (int i = 0; i < len + 1; i++) {
                // a fresh task per chunk: reusing one instance would let setSize
                // race with tasks already queued in the pool
                ZipThread thread = new ZipThread(in);
                thread.setSize(Math.min(MAX_SIZE, filelength.intValue()));
                Future<StringBuilder> res = zip.submit(thread);
                filelength -= MAX_SIZE;
                map.put(i, res);
            }
            File outputFile = new File(path.replace(".txt", "-zip.txt"));
            outputFile.createNewFile();
            writer = new FileWriter(outputFile);
            out = new BufferedWriter(writer);
            for (int i = 0; i < len + 1; i++) {
                StringBuilder s = map.get(i).get();
                if (s != null && s.length() != 0) out.write(s.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            zip.shutdown();
            try {
                if (out != null) out.close();
                if (writer != null) writer.close();
                if (in != null) in.close();
            } catch (IOException e) {
                System.out.println("failed to close file streams");
            }
        }
    }

    public void unzip(String path) {
        RandomAccessFile in = null; BufferedWriter out = null;
        FileWriter writer = null;
        Long filelength = new File(path).length();
        byte[] filecontent = new byte[filelength.intValue()];
        try {
            in = new RandomAccessFile(path, "r");
            in.read(filecontent);
            // read the entire file
            String content = new String(filecontent, encoding);
            String[] split = content.split("竺");
            Map<Integer, Future<StringBuilder>> map = new HashMap<>(split.length);
            for (int i = 0; i < split.length; i++) {
                // a fresh task per chunk to avoid racing on setContent
                UnZipThread thread = new UnZipThread();
                thread.setContent(split[i]);
                Future<StringBuilder> res = unZip.submit(thread);
                map.put(i, res);
            }

            // Write sequentially to preserve chunk order; concurrent writes would also raise thread-safety issues
            File outputFile = new File(path.replace("-zip.txt", "-recover.txt"));
            outputFile.createNewFile();
            writer = new FileWriter(outputFile);
            out = new BufferedWriter(writer);
            for (int i = 0; i < split.length; i++) {
                out.write(map.get(i).get().toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            unZip.shutdown();
            try {
                if (out != null) out.close();
                if (writer != null) writer.close();
                if (in != null) in.close();
            } catch (IOException e) {
                System.out.println("failed to close file streams");
            }
        }
    }

    // store a binary code as a decimal number
    private int bin2dec(StringBuilder bin) {
        int res = 0, times = 1;
        for (int i = bin.length() - 1; i > -1; i--, times *= 2) {
            res += (bin.charAt(i) - '0') * times;
        }
        return res;
    }

    // convert a decimal number back to its binary code
    private StringBuilder dec2bin(int num) {
        StringBuilder res = new StringBuilder();
        while (num > 1) {
            res.append(num % 2);
            num = num / 2;
        }
        res.append(num);
        // The stored form carries an extra leading 1; it is the last character
        // appended here, so delete it first and then reverse the string
        return res.delete(res.length() - 1, res.length()).reverse();
    }
}

Code Cleanup

For the sake of quick debugging, much of the code above was copy-pasted. A later cleanup pass extracted the common pieces into a Utils class; the cleaned-up versions of all the classes follow.

The Node class

package huffman;

public class Node {
    
    char data;
    int count;
    Node prev;
    Node next;
    
    public Node() { }
    
    public Node(char data, int count) {
        this.data = data;
        this.count = count;
    }
    
    public Node(char data, int count, Node prev, Node next) {
        this.data = data;
        this.count = count;
        this.next = next;
        this.prev = prev;
    }
}

The HufNode class

package huffman;

public class HufNode {

    char value;
    HufNode left;
    HufNode right;

    public HufNode(char value) {
        this.value = value;
    }

    public HufNode(char value, HufNode left, HufNode right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

The HuffmanTree class

package huffman;

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanTree {

    // original string
    public char[] origin;
    // root of the Huffman tree
    public HufNode root;
    // character-to-code mapping
    public Map<Character, StringBuilder> relation = new HashMap<>();

    public Map<Character, StringBuilder> getRelation() {
        return relation;
    }

    // initialize the Huffman tree from a string
    public HuffmanTree(String s) {
        Map<Character, Integer> map = new HashMap<>();
        char[] arr = s.toCharArray();
        this.origin = arr;

        // count character occurrences
        for (int i = 0; i < arr.length; i++) {
            map.put(arr[i], map.getOrDefault(arr[i], 0) + 1);
        }

        if (map.isEmpty()) {
            root = null;
            // All characters live in leaf nodes; even a single character goes
            // into the left subtree, not the root itself
        } else if (map.size() == 1) {
            root = new HufNode(' ');
            root.left = new HufNode(map.keySet().iterator().next());
        } else {
            PriorityQueue<TempNode> queue = new PriorityQueue<>(new Comparator<TempNode>() {
                @Override 
                public int compare(TempNode o1, TempNode o2) {
                    return o1.count - o2.count;
                }
            });
            // Repeatedly take the two nodes with the smallest counts as left and
            // right children, build their parent, and put it back into the queue
            for (Character c : map.keySet()) {
                HufNode node = new HufNode(c);
                queue.add(new TempNode(map.get(c), node));
            }
            while (queue.size() > 1) {
                TempNode left = queue.poll(), right = queue.poll();
                queue.add(new TempNode(left.count + right.count, new HufNode(' ', left.node, right.node)));
            }
            root = queue.poll().node;
        }
        // The codes must be generated right after building the tree; doing it
        // here in the constructor makes sure it is never forgotten.
        // Note: this applies to every branch above, so it stays outside the else
        encode();
    }

    // Rebuild the Huffman tree from a code table; used to initialize after receiving a compressed file
    public HuffmanTree(Map<Character, StringBuilder> relation) {
        root = new HufNode('+');
        for (Character c : relation.keySet()) {
            StringBuilder path = relation.get(c);
            HufNode iter = root;
            for (int i = 0; i < path.length(); i++) {
                if (path.charAt(i) == '0') {
                    if (iter.left == null) iter.left = new HufNode('+');
                    iter = iter.left;
                } else {
                    if (iter.right == null) iter.right = new HufNode('+');
                    iter = iter.right;
                }
            }
            iter.value = c;
        }
    }

    // generate the code for every leaf in the tree
    public Map<Character, StringBuilder> encode() {
        encode(root, new StringBuilder(""));
        return relation;
    }

    // If node is null, return immediately; the path string is then maintained
    // on return without extra checks. When adding to the map, copy the string:
    // the recursion keeps mutating the shared StringBuilder
    private void encode(HufNode node, StringBuilder res) {
        if (node == null) return;
        if (node.left == null && node.right == null) relation.put(node.value, new StringBuilder(res));
        encode(node.left, res.append('0'));
        res.delete(res.length() - 1, res.length());
        encode(node.right, res.append('1'));
        res.delete(res.length() - 1, res.length());
    }

    // First encode the original string, then store every 7 bits as one ASCII
    // character, and finally record how many bits the last character uses.
    // The non-ASCII character 李 separates the payload from what follows
    public StringBuilder transfer() {
        StringBuilder temp = new StringBuilder("");
        for (int i = 0; i < origin.length; i++) {
            temp.append(relation.get(origin[i]));
        }
        StringBuilder res = new StringBuilder("");
        int count = 0, iter = 0, times = 1;
        // consume full 7-bit groups while at least 7 bits remain
        while (iter + 6 < temp.length()) {
            count = 0; times = 1;
            for (int i = 0; i < 7; i++) {
                count += times * (temp.charAt(iter) - '0');
                iter++; times *= 2;
            }
            res.append((char) count);
        }
        // the last group may have fewer than 7 bits; it is written separately
        count = 0; times = 1; int spare = temp.length() - iter;
        // Note: the bits are scanned forward while times grows 1, 2, ..., 64, which
        // amounts to converting the reversed bit string, so times simply starts at 1
        while (iter < temp.length()) {
            count += times * (temp.charAt(iter) - '0');
            iter++; times *= 2;
        }
        res.append((char) count);
        res.append("李").append(spare);
        return res;
    }

    // restore the original string from the encoded one
    public StringBuilder decode(String transfer) {
        char[] arr = transfer.toCharArray();
        StringBuilder temp = new StringBuilder("");
        int iter = 0;

        // Convert the characters before the delimiter back to bits.
        // Packing effectively reversed each 7-bit group, so reverse each one here
        while (arr[iter] != '李') {
            temp.append(Utils.char2Bin(arr[iter]).reverse());
            iter++;
        }

        // drop the padding bits of the last packed character
        iter++; temp.delete(temp.length() - 7 + transfer.charAt(iter) - '0',
                temp.length());

        // walk the Huffman tree to recover the original string
        arr = temp.toString().toCharArray();
        StringBuilder res = new StringBuilder("");
        iter = 0;
        while (iter < arr.length) {
            HufNode node = root;
            while (!(node.left == null && node.right == null)) {
                if (arr[iter] == '0') node = node.left;
                else node = node.right;
                iter++;
            }
            // reached a leaf: append its character
            res.append(node.value);
        }
        return res;
    }

}

The Utils utility class

package huffman;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class Utils {

    // Compression task: reads a chunk of the file, builds a Huffman tree, and
    // returns the encoded result. Where the file is split does not matter, since
    // every chunk is re-encoded from scratch. The length must be given explicitly;
    // otherwise the unused tail of the read buffer would end up in the output file
    static class ZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private String path;
        private int size;

        public ZipThread(String path, int size) {
            this.path = path;
            this.size = size;
        }

        @Override
        public StringBuilder call() {

            // read the chunk and build the Huffman tree
            String content = conReadFileWithLength(path, size);
            tree = new HuffmanTree(content);

            // concatenate the encoded string and the code table
            StringBuilder res = joinInfo(tree);

            // there may be several chunks; 竺 separates them
            return res.append("竺");
        }
    }

    // decompression task, initialized with one encoded chunk
    static class UnZipThread implements Callable<StringBuilder> {

        private HuffmanTree tree;
        private String content;

        public UnZipThread(String content) {
            this.content = content;
        }

        @Override
        public StringBuilder call() {

            // The encoded payload ends at the first "李"; later ones may appear
            // in the table, so split at the first occurrence
            int sep = content.indexOf("李"), end = content.indexOf("\n", sep + 1);

            // character array holding the code table
            char[] arr = content.substring(end + 1).toCharArray();

            // rebuild the mapping and construct the Huffman tree
            tree = new HuffmanTree(initialRelation(arr));

            // recover the original string
            return tree.decode(content.substring(0, end));
        }
    }

    // During decoding, convert each character back to its 7-bit binary string.
    // Packing stored the low bit first (later bits carry higher powers), so the
    // caller reverses the recovered bits to restore the original order
    public static StringBuilder char2Bin(int num) {
        StringBuilder res = new StringBuilder("0000000");
        int iter = 6;
        while (num > 0) {
            res.setCharAt(iter, (char) (num % 2 + '0'));
            num = num / 2; iter--;
        }
        return res;
    }

    // converts a stored decimal entry back to its binary code when rebuilding the table
    public static StringBuilder dec2bin(int num) {
        StringBuilder res = new StringBuilder();
        while (num > 1) {
            res.append(num % 2);
            num = num / 2;
        }
        res.append(num);
        // The stored form carries an extra leading 1; it is the last character
        // appended here, so delete it first and then reverse the string
        return res.delete(res.length() - 1, res.length()).reverse();
    }

    // stores a binary code as a decimal number when writing the table
    public static int bin2dec(StringBuilder bin) {
        int res = 0, times = 1;
        for (int i = bin.length() - 1; i > -1; i--, times *= 2) {
            res += (bin.charAt(i) - '0') * times;
        }
        return res;
    }

    // Rebuild the mapping: parse each decimal entry, convert it back to a binary code, and fill the map
    public static Map<Character, StringBuilder> initialRelation(char[] arr) {
        Map<Character, StringBuilder> relation = new HashMap<>();
        char c; int iter = 0, num = 0;
        while (iter < arr.length - 1) {
            c = arr[iter]; iter++; num = 0;
            while (arr[iter] != '\n') {
                num = num * 10 + arr[iter] - '0';
                iter++;
            }
            iter++;
            relation.put(c, dec2bin(num));
        }
        return relation;
    }

    // Encoded string + code table. To save space, each code is converted to its decimal value before storage
    public static StringBuilder joinInfo(HuffmanTree tree) {
        StringBuilder temp = tree.transfer().append('\n');
        Map<Character, StringBuilder> relation = tree.getRelation();
        for (Character c : relation.keySet()) {
            // A code's leading zeros would be lost in decimal form, so a '1' is prefixed for disambiguation
            temp.append("" + c + Utils.bin2dec(relation.get(c).insert(0, '1')) + "\n");
        }
        return temp;
    }

    // read the entire content of a file
    public static String readAllFile(String path) {
        String encoding = "UTF-8";
        Long filelength = new File(path).length();
        byte[] filecontent = new byte[filelength.intValue()];
        RandomAccessFile in = null; String content = null;
        try {
            in = new RandomAccessFile(path, "r");
            in.readFully(filecontent); // readFully guarantees the whole buffer is filled; read may return early
            content = new String(filecontent, encoding);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (in != null) in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return content;
    }

    // Read a given number of bytes from the file (used by the concurrent compressors)
    public static String conReadFileWithLength(String path, int size) {
        String encoding = "UTF-8";
        RandomAccessFile in = null; String content = null;
        try {
            in = new RandomAccessFile(path, "r");
            byte[] filecontent = new byte[size];
            in.readFully(filecontent, 0, size); // readFully fills the buffer completely; read may return early
            content = new String(filecontent, encoding);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (in != null) in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return content;
    }

    // Write the given content to the file at the given path
    // content may be a String (written directly) or a HashMap of Futures (iterated in key order and written)
    public static void writeToFile(String path, Object content) {
        BufferedWriter out = null; FileWriter writer = null;
        try {
            File outputFile = new File(path);
            outputFile.createNewFile();
            writer = new FileWriter(outputFile);
            out = new BufferedWriter(writer);
            if (content instanceof String) {
                out.write((String) content);
            } else if (content instanceof HashMap) {
                HashMap map = (HashMap) content;
                for (int i = 0; i < map.size(); i++) {
                    Future f = (Future) map.get(i);
                    String s = f.get().toString();
                    if (s != null && s.length() != 0) out.write(s);
                }
            }
        } catch (IOException | ExecutionException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            try {
                if (out != null) out.close();
                if (writer != null) writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
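The serialization helpers above can be exercised in isolation. The sketch below is a standalone reimplementation (not the `Utils` methods themselves): it stores one hypothetical table entry with the guard bit, then parses it back, confirming that leading zeros in a code such as `001` survive the decimal round trip.

```java
// Standalone sketch of the joinInfo/initialRelation round trip:
// the code "001" is stored as the decimal of "1001" (guard bit + code) = 9,
// and recovered by converting back to binary and dropping the guard bit.
public class CodecSketch {

    // binary string -> decimal (same logic as bin2dec above)
    static int bin2dec(String bin) {
        int res = 0;
        for (int i = 0; i < bin.length(); i++) res = res * 2 + (bin.charAt(i) - '0');
        return res;
    }

    // decimal -> binary string, dropping the guard '1' (same logic as dec2bin above)
    static String dec2bin(int num) {
        StringBuilder res = new StringBuilder();
        while (num > 0) { res.append(num % 2); num /= 2; }
        // digits were appended LSB-first, so the guard bit is the last char appended
        return res.deleteCharAt(res.length() - 1).reverse().toString();
    }

    public static void main(String[] args) {
        String code = "001";                      // a Huffman code with leading zeros
        String entry = "a" + bin2dec("1" + code); // one table line as joinInfo writes it: "a9"
        // parse it back the way initialRelation does: the char, then the decimal
        char c = entry.charAt(0);
        int num = Integer.parseInt(entry.substring(1));
        System.out.println(c + " -> " + dec2bin(num)); // prints "a -> 001"
    }
}
```

Without the guard bit, `001` and `1` would both be stored as the decimal 1 and become indistinguishable on decode.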

The SingleThreadZip class

package huffman;

public class SingleThreadZip {

    private HuffmanTree huffmanTree;

    public void zip(String path) {

        // Read the file and build the Huffman tree
        String content = Utils.readAllFile(path);
        huffmanTree = new HuffmanTree(content);

        // Encode and write out
        Utils.writeToFile(path.replace(".txt", "-zip.txt"), Utils.joinInfo(huffmanTree).toString());
    }

    public void unzip(String path) {

        // Read the compressed file
        String content = Utils.readAllFile(path);

        // The encoded text ends with "李"; "李" may occur again later, so split at its first occurrence
        int split = content.indexOf("李"), end = content.indexOf('\n', split + 1);

        // Restore the mapping table and rebuild the Huffman tree
        char[] arr = content.substring(end + 1).toCharArray();
        huffmanTree = new HuffmanTree(Utils.initialRelation(arr));

        // Decode and write to the output file
        StringBuilder res = huffmanTree.decode(content.substring(0, end));
        Utils.writeToFile(path.replace("-zip.txt", "-recover.txt"), res.toString());
    }

}

The MultiThreadZip class

package huffman;

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class MultiThreadZip {

    // Tuning parameter: chunk size handled per thread
    private static final int MAX_SIZE = 1000000;

    public void zip(String path) {

        // Split the work by chunk size; store each thread's result in a map, then concatenate and write to the file
        // A StringBuilder result could be stored directly, but Futures are used for consistency with the thread-pool version
        Long filelength = new File(path).length();
        int len = (int) (filelength / MAX_SIZE);
        Map<Integer, Future> map = new HashMap<>(len + 1);

        // Read the file and build trees in multiple threads
        for (int i = 0; i < len + 1; i++) {
            FutureTask<StringBuilder> task = new FutureTask<>(new Utils.ZipThread(path, (int) Math.min(MAX_SIZE, filelength)));
            filelength -= MAX_SIZE;
            Thread thread = new Thread(task);
            thread.start();
            map.put(i, task);
        }

        // Write out to the file
        Utils.writeToFile(path.replace(".txt", "-zip.txt"), map);
    }

    public void unzip(String path) {

        // Read the whole file; chunks are separated by "竺"
        String content = Utils.readAllFile(path);
        String[] split = content.split("竺");

        // Note: String.split drops the trailing empty string after the final "竺", so the array length equals the chunk count
        Map<Integer, Future> map = new HashMap<>(split.length);

        // Decode in multiple threads
        for (int i = 0; i < split.length; i++) {
            FutureTask<StringBuilder> task = new FutureTask<>(new Utils.UnZipThread(split[i]));
            Thread thread = new Thread(task);
            thread.start();
            map.put(i, task);
        }

        // Concatenate and write out to the file
        Utils.writeToFile(path.replace("-zip.txt", "-recover.txt"), map);
    }

}
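The loop in `zip` above hands each task a size of `min(MAX_SIZE, remaining)`, creating `filelength / MAX_SIZE + 1` chunks. The per-task sizes can be reproduced in isolation; note the edge case when the file length divides evenly, which the original loop also produces:

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces the chunk sizes MultiThreadZip.zip hands to each task:
// len + 1 chunks, each min(MAX_SIZE, remaining bytes).
public class ChunkSketch {
    static List<Integer> chunkSizes(long fileLength, int maxSize) {
        List<Integer> sizes = new ArrayList<>();
        int len = (int) (fileLength / maxSize);
        long remaining = fileLength;
        for (int i = 0; i < len + 1; i++) {
            sizes.add((int) Math.min(maxSize, remaining));
            remaining -= maxSize;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(chunkSizes(2_500_000, 1_000_000)); // [1000000, 1000000, 500000]
        // exact multiple: the tail chunk is empty, matching the original loop's behavior
        System.out.println(chunkSizes(2_000_000, 1_000_000)); // [1000000, 1000000, 0]
    }
}
```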

The ZipWithThreadPool class

package huffman;

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ZipWithThreadPool {

    // Tuning parameters: chunk size and pool configuration
    private static final int MAX_SIZE = 1000000;
    private ThreadPoolExecutor zip = new ThreadPoolExecutor(
            20, 50, 2,
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());
    private ThreadPoolExecutor unZip = new ThreadPoolExecutor(
            20, 50, 2,
            TimeUnit.SECONDS, new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void zip(String path) {

        // Split the work by chunk size; store the results in a map, then concatenate and write to the file
        Long filelength = new File(path).length();
        int len = (int) (filelength / MAX_SIZE);
        Map<Integer, Future<StringBuilder>> map = new HashMap<>(len + 1);

        // Read the file and build trees via the pool
        for (int i = 0; i < len + 1; i++) {
            Utils.ZipThread thread = new Utils.ZipThread(path, (int) Math.min(MAX_SIZE, filelength));
            Future<StringBuilder> res = zip.submit(thread);
            filelength -= MAX_SIZE;
            map.put(i, res);
        }

        // Write out to the file
        Utils.writeToFile(path.replace(".txt", "-zip.txt"), map);
        zip.shutdown();
    }

    public void unzip(String path) {

        // Read the whole file and split on "竺" into encoded sub-sequences
        String content = Utils.readAllFile(path);
        String[] split = content.split("竺");

        // Note: String.split drops the trailing empty string after the final "竺", so the array length equals the chunk count
        // Decode in the pool and keep the Futures in a map so results can be written out in order
        Map<Integer, Future<StringBuilder>> map = new HashMap<>(split.length);
        for (int i = 0; i < split.length; i++) {
            Utils.UnZipThread thread = new Utils.UnZipThread(split[i]);
            Future<StringBuilder> res = unZip.submit(thread);
            map.put(i, res);
        }

        // Write everything out in one place to preserve chunk order; concurrent writes would also raise thread-safety issues
        Utils.writeToFile(path.replace("-zip.txt", "-recover.txt"), map);
        unZip.shutdown();
    }

}
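Two properties of the pool version are worth isolating: `CallerRunsPolicy` makes a full queue run the overflow task on the submitting thread (throttling submission rather than rejecting work), and collecting Futures in submission order restores chunk order no matter when each task finishes, which is what the `Map<Integer, Future<StringBuilder>>` relies on. A minimal sketch with a deliberately tiny pool and queue (the `"chunk-" + i` tasks are stand-ins for `ZipThread`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Even if tasks finish out of order, iterating the Futures in submission
// order reassembles the chunks correctly; get() blocks per chunk as needed.
public class PoolSketch {
    static String joinChunks(int chunks) {
        // tiny pool + tiny queue so CallerRunsPolicy actually triggers
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 1, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.CallerRunsPolicy());

        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            final int chunk = i;
            results.add(pool.submit(() -> "chunk-" + chunk)); // stand-in for ZipThread
        }

        StringBuilder joined = new StringBuilder();
        try {
            for (Future<String> f : results) { // get() blocks until that chunk is done
                joined.append(f.get()).append(' ');
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return joined.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(joinChunks(6)); // chunk-0 chunk-1 chunk-2 chunk-3 chunk-4 chunk-5
    }
}
```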

Test code

import java.io.File;

public class Test {

    public static void main(String[] args) {
        long start = 0, end = 0;
        String path = "D:\\Files\\test.txt";

        File outputFile = new File(path.replace(".txt", "-zip.txt"));
        if (outputFile.exists()) outputFile.delete();

        outputFile = new File(path.replace(".txt", "-recover.txt"));
        if (outputFile.exists()) outputFile.delete();

        // Plain single-threaded compression
        start = System.currentTimeMillis();
        SingleThreadZip util = new SingleThreadZip();
        util.zip(path);
        util.unzip(path.replace(".txt", "-zip.txt"));
        end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) + "ms");

        // Delete the compressed and recovered files so later runs start fresh; otherwise output would be appended to the existing files
        outputFile = new File(path.replace(".txt", "-zip.txt"));
        if (outputFile.exists()) outputFile.delete();

        outputFile = new File(path.replace(".txt", "-recover.txt"));
        if (outputFile.exists()) outputFile.delete();

        // Multithreaded compression
        start = System.currentTimeMillis();
        MultiThreadZip util2 = new MultiThreadZip();
        util2.zip(path);
        util2.unzip(path.replace(".txt", "-zip.txt"));
        end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) + "ms");

        outputFile = new File(path.replace(".txt", "-zip.txt"));
        if (outputFile.exists()) outputFile.delete();

        outputFile = new File(path.replace(".txt", "-recover.txt"));
        if (outputFile.exists()) outputFile.delete();

        // Thread-pool version
        start = System.currentTimeMillis();
        ZipWithThreadPool util3 = new ZipWithThreadPool();
        util3.zip(path);
        util3.unzip(path.replace(".txt", "-zip.txt"));
        end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) + "ms");

    }
}

For a 92 MB text file, the results of three runs are shown below.

(screenshots of the timing output for the three runs)

As the figures show, the plain compression is the slowest; multithreading speeds it up considerably, and the thread pool improves it further, though only slightly.
