Java Data Structures and Algorithms - File Compression

此心光明-超然

Article link: https://blog.csdn.net/weixin_43364172/article/details/84869187


prefix codes
Huffman's algorithm
Implementation

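Suppose you have a file that contains only the following characters: a, e, i, s, t, space (sp), and newline (nl). The file has 10 a's, 15 e's, 12 i's, 3 s's, 4 t's, 13 spaces, and 1 newline. As the figure below shows, the file can be represented in 174 bits: 58 characters at 3 bits each.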
[Figure: standard coding scheme]
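
With 7 distinct characters, 3 bits per character are enough, and 58 characters at 3 bits each come to 174 bits. The following is a minimal sketch of that arithmetic; the class and method names are illustrative and are not part of the program developed later in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper; not part of the Hzip program.
final class FixedLengthCost {

    // Smallest number of bits that can tell `symbols` different characters apart.
    static int bitsPerSymbol(int symbols) {
        int bits = 0;
        while ((1 << bits) < symbols)
            bits++;
        return bits;
    }

    // File size in bits when every character gets a code of the same length.
    static int totalBits(Map<String, Integer> freq) {
        int chars = 0;
        for (int f : freq.values())
            chars += f;
        return chars * bitsPerSymbol(freq.size());
    }

    // Character counts of the example file.
    static Map<String, Integer> exampleFrequencies() {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("a", 10);
        freq.put("e", 15);
        freq.put("i", 12);
        freq.put("s", 3);
        freq.put("t", 4);
        freq.put("sp", 13);
        freq.put("nl", 1);
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = exampleFrequencies();
        System.out.println(bitsPerSymbol(freq.size()) + " bits per character, "
                + totalBits(freq) + " bits in total");
    }
}
```

A fixed-length code ignores the frequencies entirely, which is exactly the slack that a variable-length code can exploit.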

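Real files can be very large. In many large files there is a big gap between the most frequently used characters and the least used ones. For example, many large data files contain lots of digits, spaces, and newlines, but few q's and x's.
In many situations we want to reduce the size of a file. Reducing the number of bits needed to represent data is called compression, and it really consists of two phases: encoding (compressing) and decoding (uncompressing). The approach discussed below can save about 25% of the space on some large files, and as much as 50%-60% on some large data files.
The general strategy is to let the code length vary from character to character, so that frequently used characters get short codes. If all characters occur with roughly the same frequency, there is little to be saved.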

prefix codes

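The binary code above can be represented by the binary tree below. Characters are stored only at the leaves, and any leaf can be reached by following a path from the root. If a left branch means 0 and a right branch means 1, the code for s is 011. If character c_i is at depth d_i and occurs f_i times, the cost of the code is ∑ d_i f_i.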
[Figure: the original code represented by a tree]

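Because nl is an only child, the tree above admits a better code. Replacing nl's parent with the nl node itself gives the tree below. The new cost is 173, which still leaves plenty of room for improvement.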
[Figure: a slightly better tree]

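The tree above is a full tree: every node is either a leaf or has two children. An optimal code always has this property, because a node with only one child could move up a level. If the characters are placed only at the leaves, any sequence of bits can be decoded unambiguously.
For example, suppose the encoded string is 0100111100010110001000111. The figure above shows that 0 and 01 are not character codes, but 010 represents i, so the first character is i. Then 011 is s, and 11 is nl. The rest of the string decodes to a, sp, t, i, e, and nl.
Character codes may have different lengths, as long as no character code is a prefix of another character code; such a code is called a prefix code. Conversely, if a character is stored at a nonleaf node, unambiguous decoding is no longer guaranteed.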
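
The decoding walk-through can be replayed in a few lines. This is a sketch under one assumption: the code table is the one implied by the example (010 = i, 011 = s, 11 = nl, with the codes for a, e, t, and sp recovered from the decoded remainder). The class name and the map-based lookup are illustrative; the real program walks a tree instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper; not part of the Hzip program.
final class PrefixDecodeDemo {

    // Code table assumed for the "slightly better" tree.
    static Map<String, String> exampleCode() {
        Map<String, String> codeToSymbol = new HashMap<>();
        codeToSymbol.put("000", "a");
        codeToSymbol.put("001", "e");
        codeToSymbol.put("010", "i");
        codeToSymbol.put("011", "s");
        codeToSymbol.put("100", "t");
        codeToSymbol.put("101", "sp");
        codeToSymbol.put("11", "nl");
        return codeToSymbol;
    }

    // Reads bits left to right and emits a symbol as soon as the bits
    // collected so far form a complete code. Because no code is a prefix
    // of another, the first match is the only possible one.
    static List<String> decode(String bits, Map<String, String> codeToSymbol) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            String symbol = codeToSymbol.get(current.toString());
            if (symbol != null) {
                out.add(symbol);
                current.setLength(0);
            }
        }
        if (current.length() != 0)
            throw new IllegalArgumentException("Trailing bits do not form a code: " + current);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(decode("0100111100010110001000111", exampleCode()));
    }
}
```

Running main prints the nine symbols i, s, nl, a, sp, t, i, e, nl, matching the text.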
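So our basic problem is to find the full binary tree of minimum cost in which all characters are stored at the leaves. The figure below shows an optimal tree; its code needs only 146 bits. Swapping children in the tree yields many other optimal codes.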
[Figure: optimal prefix code tree]
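
The three costs quoted so far (174, 173, and 146) can be checked directly against the formula ∑ d_i f_i. In the sketch below the depth arrays are assumptions read off the trees: every leaf at depth 3 in the original tree, nl raised to depth 2 in the better tree, and the depths produced by the merge order of Huffman's algorithm for the optimal tree. The class name is illustrative.

```java
// Illustrative helper; not part of the Hzip program.
final class CodeCost {

    // Order of all arrays: a, e, i, s, t, sp, nl.
    static final int[] FREQ           = {10, 15, 12, 3, 4, 13, 1};
    static final int[] FIXED_DEPTHS   = {3, 3, 3, 3, 3, 3, 3};
    static final int[] BETTER_DEPTHS  = {3, 3, 3, 3, 3, 3, 2};
    static final int[] OPTIMAL_DEPTHS = {3, 2, 2, 5, 4, 2, 5};

    // Cost of a code: sum over all characters of depth times frequency.
    static int cost(int[] freq, int[] depth) {
        if (freq.length != depth.length)
            throw new IllegalArgumentException("Arrays must have the same length");
        int total = 0;
        for (int i = 0; i < freq.length; i++)
            total += depth[i] * freq[i];
        return total;
    }

    public static void main(String[] args) {
        System.out.println("original tree: " + cost(FREQ, FIXED_DEPTHS));
        System.out.println("better tree:   " + cost(FREQ, BETTER_DEPTHS));
        System.out.println("optimal tree:  " + cost(FREQ, OPTIMAL_DEPTHS));
    }
}
```

The saving comes almost entirely from moving e, i, and sp, the three most frequent characters, up to depth 2.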

Huffman's algorithm

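The algorithm for this coding scheme was published by Huffman in 1952. It constructs an optimal prefix code by repeatedly merging trees until a single tree remains.
Let C be the number of characters. Huffman's algorithm maintains a forest of trees. The weight of a tree is the sum of the frequencies of its leaves. C-1 times, select the two trees of smallest weight, T1 and T2, breaking ties arbitrarily, and form a new tree with T1 and T2 as subtrees. At the start of the algorithm there are C single-node trees; at the end there is one tree, which is an optimal Huffman tree.
The figure below shows the initial forest; the weight of each tree is shown at the upper left of its root.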
[Figure: initial stage of Huffman's algorithm]

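Next, merge the two trees of smallest weight, s and nl. Call the new root T1. Which tree becomes the left child is arbitrary. The weight of the new tree is the sum of the weights of the old trees, here 4.

Now there are six trees, and we again select the two of smallest weight, T1 and t. Merging them gives a new tree T2 with weight 8.

The third step merges T2 and a, creating T3 with weight 18.

Now the two trees of smallest weight are both single nodes, i and sp. Merging them creates T4, with weight 25.

Then merge e and T3, creating a tree of weight 33.

In the last step the two remaining trees are merged into the final tree, whose weight is 58, the total number of characters.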
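
The merge sequence can be reproduced with a priority queue that holds only the weights. This is a simplified sketch, not the tree-building code of the implementation section: it tracks no tree structure, and the class and method names are illustrative. It also shows a useful identity: each merge adds one bit to the code of every leaf under the new root, so the sum of all merged weights equals the total cost of the code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative helper; not part of the Hzip program.
final class HuffmanMergeDemo {

    // Returns the weight of each merged tree, in the order the merges happen.
    static List<Integer> mergeWeights(int[] freq) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int f : freq)
            pq.add(f);

        List<Integer> merged = new ArrayList<>();
        while (pq.size() > 1) {
            // Take the two lightest trees and replace them with their union.
            int weight = pq.remove() + pq.remove();
            merged.add(weight);
            pq.add(weight);
        }
        return merged;
    }

    // Total number of bits used by the resulting Huffman code.
    static int totalCost(int[] freq) {
        int cost = 0;
        for (int w : mergeWeights(freq))
            cost += w;
        return cost;
    }

    public static void main(String[] args) {
        int[] freq = {10, 15, 12, 3, 4, 13, 1};   // a, e, i, s, t, sp, nl
        System.out.println(mergeWeights(freq));
        System.out.println(totalCost(freq));
    }
}
```

For the example frequencies the merged weights are 4, 8, 18, 25, 33, and 58, and they add up to 146 bits.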

Implementation

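We now implement the Huffman coding algorithm, without any significant optimizations; we just want a working program.

First, define the constants we will use. We will also maintain a priority queue of tree nodes.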

interface BitUtils {
    public static final int BITS_PER_BYTES = 8;
    public static final int DIFF_BYTES = 256;
    public static final int EOF = 256;
}

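Besides the standard I/O classes, our program consists of several other classes. Because we need bit-at-a-time I/O, we implement wrapper classes for bit input and bit output streams. We also write classes that maintain the character counts and that create and return information about the Huffman coding tree. Finally, we write wrapper classes for the compression and decompression streams. In all, we write these classes:

BitInputStream: wraps an InputStream and provides bit-at-a-time input
BitOutputStream: wraps an OutputStream and provides bit-at-a-time output
CharCounter: maintains the character counts
HuffmanTree: manipulates the Huffman coding tree
HZIPInputStream: wrapper class for decompression
HZIPOutputStream: wrapper class for compression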

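Bit input and bit output stream classes

BitInputStream and BitOutputStream are similar: each wraps a stream, and a reference to that stream is stored as a private data member.
In BitInputStream, every eighth readBit reads one byte from the underlying stream. The byte is stored in buffer, and bufferPos is the index of the next unused bit, so it equals BITS_PER_BYTES once the whole buffer has been consumed.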

import java.io.IOException;
import java.io.InputStream;

public class BitInputStream {
    private InputStream in;
    private int buffer;
    private int bufferPos;

    public BitInputStream(InputStream is) {
        in = is;
        bufferPos = BitUtils.BITS_PER_BYTES;
    }

    //Read one bit as a 0 or 1
    public int readBit() throws IOException {
        //check whether the bits in the buffer have already been used
        if (bufferPos == BitUtils.BITS_PER_BYTES) {
            //get 8 more bits
            buffer = in.read();
            if (buffer == -1)
                return -1;
            //reset the position indicator
            bufferPos = 0;
        }

        return getBit(buffer, bufferPos++);
    }

    //Close underlying stream
    public void close() throws IOException {
        in.close();
    }

    private static int getBit(int pack, int pos) {
        return (pack & (1 << pos)) != 0 ? 1 : 0;
    }

}

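In BitOutputStream, every eighth writeBit writes one byte to the underlying stream. The class provides a flush method because bits may still be sitting in buffer after a sequence of writeBit calls.
flush is called when a writeBit call fills the buffer, and also when close is called.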

import java.io.IOException;
import java.io.OutputStream;

public class BitOutputStream {
    private OutputStream out;
    private int buffer;
    private int bufferPos;

    public BitOutputStream(OutputStream os) {
        bufferPos = 0;
        buffer = 0;
        out = os;
    }

    //Write one bit (0 or 1)
    public void writeBit(int val) throws IOException {
        buffer = setBit(buffer, bufferPos++, val);
        if (bufferPos == BitUtils.BITS_PER_BYTES)
            flush();
    }

    //Write array of bits
    public void writeBits(int[] val) throws IOException {
        for (int v : val)
            writeBit(v);
    }

    //Flush buffered bits
    public void flush() throws IOException {
        if (bufferPos == 0)
            return;

        out.write(buffer);
        bufferPos = 0;
        buffer = 0;
    }

    //Close underlying stream
    public void close() throws IOException {
        flush();
        out.close();
    }

    private int setBit(int pack, int pos, int val) {
        if (val == 1)
            pack |= (val << pos);
        return pack;
    }
}
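
Both stream classes address bits from the least significant end of each byte: the first bit written goes to position 0. The sketch below isolates that packing rule on plain arrays so the byte values can be inspected; it is a stand-alone illustration with hypothetical names, not a replacement for the two classes above.

```java
// Illustrative helper; not part of the Hzip program.
final class BitPackingDemo {

    // Packs bits (each 0 or 1) into bytes, first bit in the lowest position,
    // the same order that setBit and getBit use above.
    static byte[] pack(int[] bits) {
        byte[] out = new byte[(bits.length + 7) / 8];
        for (int i = 0; i < bits.length; i++)
            if (bits[i] == 1)
                out[i / 8] |= (byte) (1 << (i % 8));
        return out;
    }

    // Recovers the first `count` bits from packed data.
    static int[] unpack(byte[] data, int count) {
        int[] bits = new int[count];
        for (int i = 0; i < count; i++)
            bits[i] = (data[i / 8] >> (i % 8)) & 1;
        return bits;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new int[]{1, 0, 1, 1});
        // Bits 0, 2 and 3 are set: 1 + 4 + 8.
        System.out.println(packed[0]);
    }
}
```

The last byte is padded with zero bits when the bit count is not a multiple of eight. The padding is indistinguishable from real data, which is why the compressed stream needs an explicit end-of-file code.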

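The character-counting class

CharCounter obtains the character counts of an input stream. Alternatively, the counts can be set manually and retrieved later. A character is taken to be 8 bits.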

import java.io.IOException;
import java.io.InputStream;

public class CharCounter {

    private int[] theCounts = new int[BitUtils.DIFF_BYTES + 1];

    public CharCounter() {
    }

    public CharCounter(InputStream input) throws IOException {
        int ch;
        while ((ch = input.read()) != -1)
            theCounts[ch]++;
    }

    //Return # occurrences of ch
    public int getCount(int ch) {
        return theCounts[ch & 0xff];
    }

    //Set # occurrences of ch
    public void setCount(int ch, int count) {
        theCounts[ch & 0xff] = count;
    }

}

The Huffman tree class

A tree is a collection of nodes. Each node has links to its left child, its right child, and its parent.

public class HuffNode implements Comparable<HuffNode> {
    public int value;
    public int weight;

    public int compareTo(HuffNode rhs) {
        return weight - rhs.weight;
    }

    HuffNode left;
    HuffNode right;
    HuffNode parent;

    HuffNode(int v, int w, HuffNode lt, HuffNode rt, HuffNode pt) {
        value = v;
        weight = w;
        left = lt;
        right = rt;
        parent = pt;
    }
}

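A HuffmanTree object can be created from a CharCounter object, in which case the tree is built immediately. It can also be created without a CharCounter; in that case the character counts are read, and the tree is built, when readEncodingTable is called.
The HuffmanTree class provides the writeEncodingTable method, which writes the information needed to rebuild the tree (the character counts) to an output stream.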

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.PriorityQueue;

public class HuffmanTree {
    //can be used to initialize the tree nodes
    private CharCounter theCounts;
    //maps each character to the tree node that contains it
    private HuffNode[] theNodes = new HuffNode[BitUtils.DIFF_BYTES + 1];
    //the root node of the tree
    private HuffNode root;

    public static final int ERROR = -3;
    public static final int INCOMPLETE_CODE = -2;
    public static final int END = BitUtils.DIFF_BYTES;

    public HuffmanTree() {
        theCounts = new CharCounter();
        root = null;
    }

    public HuffmanTree(CharCounter cc) {
        theCounts = cc;
        root = null;
        createTree();
    }

    /**
     * Return the code corresponding to character ch.
     * (The parameter is an int to accommodate EOF).
     * If code is not found, return null.
     */
    public int[] getCode(int ch) {
        HuffNode current = theNodes[ch];
        if (current == null)
            return null;

        String v = "";
        HuffNode par = current.parent;

        while (par != null) {
            if (par.left == current)
                v = "0" + v;
            else
                v = "1" + v;
            current = current.parent;
            par = current.parent;
        }

        //Codes are represented by an int[]
        //each element is either a 0 or 1
        int[] result = new int[v.length()];
        for (int i = 0; i < result.length; i++)
            result[i] = v.charAt(i) == '0' ? 0 : 1;

        return result;
    }

    /**
     * Get the character corresponding to code.
     */
    public int getChar(String code) {
        HuffNode p = root;
        for (int i = 0; p != null && i < code.length(); i++)
            if (code.charAt(i) == '0')
                p = p.left;
            else
                p = p.right;

        if (p == null)
            return ERROR;

        return p.value;
    }

    /**
     * Writes an encoding table to an output stream.
     * Format is character, count (as bytes).
     * A zero count terminates the encoding table.
     */
    public void writeEncodingTable(DataOutputStream out) throws IOException {
        for (int i = 0; i < BitUtils.DIFF_BYTES; i++) {
            if (theCounts.getCount(i) > 0) {
                out.writeByte(i);
                out.writeInt(theCounts.getCount(i));
            }
        }
        out.writeByte(0);
        out.writeInt(0);
    }

    /**
     * Read the encoding table from an input stream in format
     * given above and then construct the Huffman tree.
     * Stream will then be positioned to read compressed data.
     */
    public void readEncodingTable(DataInputStream in) throws IOException {
        for (int i = 0; i < BitUtils.DIFF_BYTES; i++)
            theCounts.setCount(i, 0);

        int ch;
        int num;

        for (; ; ) {
            ch = in.readByte();
            num = in.readInt();
            if (num == 0)
                break;
            theCounts.setCount(ch, num);
        }

        createTree();
    }

    /**
     * Construct the Huffman coding tree.
     */
    private void createTree() {
        PriorityQueue<HuffNode> pq = new PriorityQueue<HuffNode>();

        for (int i = 0; i < BitUtils.DIFF_BYTES; i++) {
            //at least once
            if (theCounts.getCount(i) > 0) {
                //create a new tree node
                HuffNode newNode = new HuffNode(i, theCounts.getCount(i), null, null, null);
                theNodes[i] = newNode;
                pq.add(newNode);
            }
        }

        //end-of-file symbol
        theNodes[END] = new HuffNode(END, 1, null, null, null);
        pq.add(theNodes[END]);

        while (pq.size() > 1) {
            HuffNode n1 = pq.remove();
            HuffNode n2 = pq.remove();
            HuffNode result = new HuffNode(INCOMPLETE_CODE, n1.weight + n2.weight, n1, n2, null);
            n1.parent = n2.parent = result;
            pq.add(result);
        }

        root = pq.element();
    }
    
}

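In getCode, we first use the theNodes array to find the node that stores the character. If there is none, we return a null reference. Otherwise, a loop climbs from the node through its parents up to the root. Each step contributes a 0 or a 1, prepended to the code so far; at the end the string is converted to an int array and returned.
In getChar, we start at the root and follow branches downward as directed by the code. We return ERROR if we fall off the tree; otherwise we return the value stored in the node we reach, which is INCOMPLETE_CODE for an internal node.
For the methods that read and write the encoding table, the format is simple and not necessarily the most space-efficient. For each character that has a code, we write the character (one byte) and then its count (four bytes). The table ends with a '\0' character and a count of zero; the zero count is the special terminating signal. When reading the table, we update the counts as they are read and then call createTree to build the tree.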
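
The table format can be tried out in isolation on an in-memory byte array. The sketch below follows the same layout (one byte for the character, four bytes for the count, and a zero count as terminator), but it is a stand-alone illustration: the class name and the sampleCounts helper are hypothetical, and it reads the character with readUnsignedByte rather than readByte plus a mask.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative helper; not part of the Hzip program.
final class EncodingTableDemo {

    // Serializes every nonzero count as (byte character, int count),
    // followed by a terminating entry whose count is zero.
    static byte[] write(int[] counts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int ch = 0; ch < counts.length; ch++) {
                if (counts[ch] > 0) {
                    out.writeByte(ch);
                    out.writeInt(counts[ch]);
                }
            }
            out.writeByte(0);
            out.writeInt(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads entries until the zero count is seen.
    static int[] read(byte[] table) {
        int[] counts = new int[256];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(table))) {
            while (true) {
                int ch = in.readUnsignedByte();
                int num = in.readInt();
                if (num == 0)
                    break;
                counts[ch] = num;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Three characters with nonzero counts, one of them above 127.
    static int[] sampleCounts() {
        int[] counts = new int[256];
        counts['a'] = 10;
        counts['e'] = 15;
        counts[200] = 3;
        return counts;
    }
}
```

Each entry costs five bytes, so a table with three characters occupies 20 bytes including the terminator. For small inputs this header can easily outweigh the savings from the shorter codes.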
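Because nodes implement the Comparable interface (based on node weight), createTree can keep the tree nodes in a priority queue. We first look for the characters that occur at least once and add a node for each, plus one node for the end-of-file symbol. The while loop at the end of createTree is a line-by-line translation of the tree-construction algorithm: while two or more trees remain, we extract two trees from the priority queue, merge them, and put the result back. When the loop ends, only one tree is left in the priority queue, and we set root to it.
The tree produced by createTree depends on how the priority queue breaks ties. This means that if the program runs on two machines whose priority queue implementations differ, a file compressed on one machine may fail to decompress on the other. Avoiding this problem takes some additional work.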
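
One possible way to make the tree independent of the priority queue's internal behavior is to give every node a unique sequence number and use it as a secondary sort key, so that no two nodes ever compare as equal. The sketch below is an assumption about how that "additional work" could look, not the approach taken by the code above; the Node type, its order field, and the comparator name are all hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative helper; not part of the Hzip program.
final class TieBreakDemo {

    // Hypothetical node type: `order` is a unique sequence number.
    // For a leaf it doubles as the index of the symbol.
    static final class Node {
        final int weight;
        final int order;
        final Node left;
        final Node right;

        Node(int weight, int order, Node left, Node right) {
            this.weight = weight;
            this.order = order;
            this.left = left;
            this.right = right;
        }
    }

    // Lighter trees first; among equal weights, the node created earlier wins.
    static final Comparator<Node> BY_WEIGHT_THEN_ORDER =
            Comparator.comparingInt((Node n) -> n.weight)
                      .thenComparingInt(n -> n.order);

    // Builds the tree for freq[] and returns the code length of each symbol.
    static int[] codeLengths(int[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(BY_WEIGHT_THEN_ORDER);
        int next = 0;
        for (int f : freq)
            pq.add(new Node(f, next++, null, null));

        while (pq.size() > 1) {
            Node first = pq.remove();
            Node second = pq.remove();
            pq.add(new Node(first.weight + second.weight, next++, first, second));
        }

        int[] depth = new int[freq.length];
        if (!pq.isEmpty())
            fillDepths(pq.remove(), 0, depth);
        return depth;
    }

    // Records the depth of every leaf below n.
    private static void fillDepths(Node n, int d, int[] depth) {
        if (n.left == null) {
            depth[n.order] = d;
            return;
        }
        fillDepths(n.left, d + 1, depth);
        fillDepths(n.right, d + 1, depth);
    }
}
```

Because the ordering is total, any correct priority queue must remove the nodes in the same sequence, so compressor and decompressor build identical trees as long as both use the same comparator and create the leaves in the same order.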

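The compression classes

First, the HZIPOutputStream class. Each call to write appends to a ByteArrayOutputStream. The actual compression work is done when close is called.
In close, the loop masks each byte with 0xff before passing it to getCode. If we used the byte directly, its high bit would be treated as a sign bit, so the int passed to getCode could be negative and might be confused with EOF.
When the loop exits we have reached the end of the input, so we write the end-of-file code. BitOutputStream's close method flushes any remaining bits to the file, so no explicit flush call is needed.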

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class HZIPOutputStream extends OutputStream {

    private ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
    private DataOutputStream dout;

    public HZIPOutputStream(OutputStream out) throws IOException {
        dout = new DataOutputStream(out);
    }

    public void write(int ch) throws IOException {
        byteOut.write(ch);
    }

    public void close() throws IOException {
        byte[] theInput = byteOut.toByteArray();
        ByteArrayInputStream byteIn = new ByteArrayInputStream(theInput);

        CharCounter countObj = new CharCounter(byteIn);
        byteIn.close();

        HuffmanTree codeTree = new HuffmanTree(countObj);
        codeTree.writeEncodingTable(dout);

        BitOutputStream bout = new BitOutputStream(dout);

        //repeatedly gets a character and writes its code
        for (int i = 0; i < theInput.length; i++)
            bout.writeBits(codeTree.getCode(theInput[i] & (0xff)));
        //end-of-file code
        bout.writeBits(codeTree.getCode(BitUtils.EOF));

        bout.close();
        byteOut.close();
    }

}
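
The effect of the 0xff mask in close is easy to see on a single byte. Java's byte type is signed, so any byte whose high bit is set widens to a negative int unless it is masked. A minimal sketch with an illustrative class name:

```java
// Illustrative helper; not part of the Hzip program.
final class ByteMaskDemo {

    // Interprets the 8 bits of b as a value in the range 0..255.
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE4;                 // bit pattern 1110 0100
        System.out.println(b);                // widened with sign extension
        System.out.println(toUnsigned(b));    // masked to the low 8 bits
    }
}
```

The first line printed is -28 and the second is 228. Without the mask, the negative value would be used as an index into theNodes inside getCode, which fails with an ArrayIndexOutOfBoundsException. The standard library method Byte.toUnsignedInt performs the same conversion.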

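Next is HZIPInputStream. Its constructor reads the encoding table and builds the tree; read then collects bits one at a time until they form a complete code.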

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class HZIPInputStream extends InputStream {

    private BitInputStream bin;
    private HuffmanTree codeTree;

    public HZIPInputStream(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);

        codeTree = new HuffmanTree();
        codeTree.readEncodingTable(din);

        bin = new BitInputStream(in);
    }

    public int read() throws IOException {
        //the (Huffman) code that we are currently examining
        String bits = "";
        int bit;
        int decode;

        while (true) {
            bit = bin.readBit();
            if (bit == -1)
                throw new IOException("Unexpected EOF");

            //add the bit to the end of the Huffman code
            bits += bit;

            decode = codeTree.getChar(bits);
            if (decode == HuffmanTree.INCOMPLETE_CODE)
                continue;
            else if (decode == HuffmanTree.ERROR)
                //an illegal Huffman code
                throw new IOException("Decoding error");
            else if (decode == HuffmanTree.END)
                //reach the end-of-file code
                return -1;
            else
                //return the character that matches the Huffman code
                return decode;
        }
    }

    public void close() throws IOException {
        bin.close();
    }
}

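Main program

The -c option compresses and the -u option decompresses. Compressed files get the suffix ".huf", and decompressed files get the suffix ".uc".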

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
public class Hzip {
    public static void compress(String inFile) throws IOException {
        String compressedFile = inFile + ".huf";
        InputStream in = new BufferedInputStream(
                new FileInputStream(inFile));
        OutputStream fout = new BufferedOutputStream(
                new FileOutputStream(compressedFile));
        HZIPOutputStream hzout = new HZIPOutputStream(fout);
        int ch;
        while ((ch = in.read()) != -1)
            hzout.write(ch);
        in.close();
        hzout.close();
    }

    public static void uncompress(String compressedFile) throws IOException {
        //reject names without the ".huf" suffix; this also covers names shorter than four characters
        if (!compressedFile.endsWith(".huf")) {
            System.out.println("Not a compressed file!");
            return;
        }

        //strip the ".huf" suffix
        String inFile = compressedFile.substring(0, compressedFile.length() - 4);

        inFile += ".uc";    // for debugging, so as to not clobber original
        InputStream fin = new BufferedInputStream(
                new FileInputStream(compressedFile));
        DataInputStream in = new DataInputStream(fin);
        HZIPInputStream hzin = new HZIPInputStream(in);

        OutputStream fout = new BufferedOutputStream(
                new FileOutputStream(inFile));
        int ch;
        while ((ch = hzin.read()) != -1)
            fout.write(ch);

        hzin.close();
        fout.close();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: java Hzip -[cu] files");
            return;
        }

        String option = args[0];
        for (int i = 1; i < args.length; i++) {
            String nextFile = args[i];
            if (option.equals("-c"))
                compress(nextFile);
            else if (option.equals("-u"))
                uncompress(nextFile);
            else {
                System.out.println("Usage: java Hzip -[cu] files");
                return;
            }
        }
    }
}