Common Java Compression Algorithms: Underlying Principles and Performance Comparison

Author: 我咋这么优秀呢

Article link: https://blog.csdn.net/zj57356498318/article/details/108248602

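Compression Algorithms

Commonly used compression algorithms today:
- GZIP: a slower algorithm with a high compression ratio; the compressed data is well suited to long-term storage. In the JDK it is implemented by java.util.zip.GZIPInputStream / GZIPOutputStream.
- deflate: the algorithm used by zip files. Unlike the JDK's GZIP streams, it lets you specify the compression level, so you can trade compression time against output size. The levels are 0 (no compression) and 1 (fastest) through 9 (slowest). It is implemented by java.util.zip.DeflaterOutputStream / InflaterInputStream.
- LZO: a real-time compression and decompression library written in ANSI C. Decompression needs no extra memory, and even data that was compressed slowly at a very high ratio still decompresses very quickly.
- LZ4: the fastest compressor of the group; compared with the fastest deflate setting, its compressed output is slightly larger.
- Snappy: a very popular algorithm developed by Google that aims for a good balance of speed and compression ratio.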
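The level trade-off described for deflate can be tried directly with the JDK. The following is only a minimal sketch: the class name `DeflateLevelSketch` and the repetitive sample data are made up for illustration, and the sizes it prints depend on the input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

class DeflateLevelSketch {
    /** Compresses input with the given level (0 = store only, 1 = fastest, 9 = slowest). */
    static byte[] deflate(byte[] input, int level) throws IOException {
        Deflater deflater = new Deflater(level);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(input);
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by close()
        }
        return out.toByteArray();
    }

    /** Hypothetical, highly repetitive sample data (an assumption for this sketch). */
    static byte[] sampleInput() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("request.field.").append(i % 50).append("=some repeated value\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] input = sampleInput();
        System.out.println("original: " + input.length + " bytes");
        for (int level : new int[] {0, 1, 6, 9}) {
            System.out.println("level " + level + ": " + deflate(input, level).length + " bytes");
        }
    }
}
```

Level 0 only stores the data, so its output is slightly larger than the input; the higher levels spend more time searching for matches.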

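The GZIP Algorithm

Underlying principle

GZIP compresses with the deflate algorithm. The input is first compressed with a variant of LZ77, and the result is then Huffman-coded (gzip chooses static or dynamic Huffman coding depending on the data). Understanding GZIP therefore requires understanding how LZ77 and Huffman coding compress data.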

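LZ77

If two blocks in a file have the same content, then knowing the position and size of the first block is enough to reconstruct the second. We can therefore replace the second block with a (distance between the two blocks, length of the shared content) pair. Because this pair is smaller than the content it replaces, the file gets smaller.

Suppose a file contains:
http://jiurl.yeah.net http://jiurl.nease.net

Some parts have already appeared earlier in the file. The repeated parts are enclosed in () below.
http://jiurl.yeah.net (http://jiurl.)nease(.net)

Replacing each repeated block with a (distance, length) pair gives:
http://jiurl.yeah.net (22,13)nease(23,4)

In (22,13), 22 is the distance between the earlier block and the current position, and 13 is the length of the shared content.
In (23,4), 23 is the distance between the earlier block and the current position, and 4 is the length of the shared content.
Each pair takes less space than the text it replaces, so the file is compressed.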

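LZ77 uses a sliding window to find the repeated parts of a file, that is, the matching strings.
Compression: process the file from start to end, one byte at a time. Compare the string starting at the current byte with every string in the sliding window and look for the longest match. If a match is found, first output a flag bit indicating that a (distance, match length) pair follows, then output the pair, and continue with the byte just past the matched string. If there is no match, first output a flag bit indicating that an unmodified byte follows, then output the current byte unchanged, and continue with the next byte.
Decompression: process the file from start to end. Each time, read one flag bit to determine whether what follows is a (distance, match length) pair or an unmodified byte. For a pair, read the fixed number of bits that make up the pair and copy the matched string it describes to the current position. For an unmodified byte, read one byte and output it.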
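The compression and decompression steps can be sketched as a toy encoder that works on characters and keeps tokens in a list instead of writing flag bits. This is not how java.util.zip implements it; the class name `Lz77Sketch`, the window size, and the minimum match length of 3 are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class Lz77Sketch {
    static final int WINDOW = 32 * 1024; // assumed sliding-window size
    static final int MIN_MATCH = 3;      // assumed: shorter matches are emitted as literals

    /** A token is either a literal character or a (distance, length) pair. */
    static final class Token {
        final boolean isMatch;
        final char literal;
        final int distance;
        final int length;

        Token(boolean isMatch, char literal, int distance, int length) {
            this.isMatch = isMatch;
            this.literal = literal;
            this.distance = distance;
            this.length = length;
        }

        @Override
        public String toString() {
            return isMatch ? "(" + distance + "," + length + ")" : String.valueOf(literal);
        }
    }

    static List<Token> encode(String s) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int bestLen = 0;
            int bestDist = 0;
            // Brute-force search of the window for the longest match starting at i.
            for (int j = Math.max(0, i - WINDOW); j < i; j++) {
                int len = 0;
                while (i + len < s.length() && s.charAt(j + len) == s.charAt(i + len)) {
                    len++;
                }
                if (len > bestLen) {
                    bestLen = len;
                    bestDist = i - j;
                }
            }
            if (bestLen >= MIN_MATCH) {
                out.add(new Token(true, '\0', bestDist, bestLen));
                i += bestLen; // continue after the matched string
            } else {
                out.add(new Token(false, s.charAt(i), 0, 0));
                i++;          // continue with the next character
            }
        }
        return out;
    }

    static String decode(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (t.isMatch) {
                int start = sb.length() - t.distance;
                for (int k = 0; k < t.length; k++) {
                    sb.append(sb.charAt(start + k)); // copy from already decoded output
                }
            } else {
                sb.append(t.literal);
            }
        }
        return sb.toString();
    }

    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = encode("http://jiurl.yeah.net http://jiurl.nease.net");
        System.out.println(render(tokens));
        System.out.println(decode(tokens));
    }
}
```

For the sample string from this section, `render` produces `http://jiurl.yeah.net (22,13)nease(23,4)`, and `decode` restores the original text.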

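Huffman Coding

To Huffman-code a file, first read the whole file and count how often each symbol occurs (the 256 possible byte values are treated as 256 symbols). Then build a Huffman tree from the counts and derive a new code for each symbol from the tree.
Treat every symbol as a node whose value is its occurrence count, and treat each of these nodes as a tree with a single node.
Repeatedly pick the two trees with the smallest values and create a parent node for them; the two trees and their parent form a new tree whose value is the sum of the values of its two subtrees. Continue until all the trees have been merged into one. The result is a Huffman tree.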
Example
- File content: abbbbccccddde. Counting each symbol gives a: 1, b: 4, c: 4, d: 3, e: 1.
[Figures: ./image/huffman-1.png, ./image/huffman-2.png, ./image/huffman-3.png (images not available)]
One resulting code assignment is a=110, b=00, c=01, d=10, e=111; listed in that order, the codes are 110 00 01 10 111, i.e. 110000110111. With these codes the 13-character string takes 28 bits instead of 104.
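Compression: read the file and count the occurrences of each symbol. Build the Huffman tree from the counts and derive each symbol's Huffman code. Store the occurrence counts in the compressed file, then replace every symbol in the file with its Huffman code and write it out.
Decompression: read the occurrence counts stored in the compressed file. Build the Huffman tree from the counts and derive each symbol's Huffman code. Replace every Huffman code in the compressed file with its symbol and write it out.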
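The tree construction can be sketched with a priority queue. This is a minimal illustration that only computes the code length of each symbol (its depth in the tree), not the bit patterns; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanSketch {
    static final class Node {
        final long weight;
        final Character symbol; // null for internal nodes
        final Node left;
        final Node right;

        Node(long weight, Character symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
    }

    /** Returns the Huffman code length of every symbol that occurs in text. */
    static Map<Character, Integer> codeLengths(String text) {
        // Step 1: count the occurrences of each symbol.
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        // Step 2: every symbol starts as a single-node tree.
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        // Step 3: merge the two lightest trees until only one remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, null, a, b));
        }
        Map<Character, Integer> lengths = new TreeMap<>();
        assign(queue.poll(), 0, lengths);
        return lengths;
    }

    private static void assign(Node node, int depth, Map<Character, Integer> lengths) {
        if (node == null) {
            return;
        }
        if (node.symbol != null) {
            lengths.put(node.symbol, Math.max(depth, 1)); // a lone symbol still needs 1 bit
            return;
        }
        assign(node.left, depth + 1, lengths);
        assign(node.right, depth + 1, lengths);
    }

    /** Total number of bits needed to encode text with the computed code lengths. */
    static int encodedBits(String text) {
        Map<Character, Integer> lengths = codeLengths(text);
        int bits = 0;
        for (char c : text.toCharArray()) {
            bits += lengths.get(c);
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(codeLengths("abbbbccccddde"));
        System.out.println(encodedBits("abbbbccccddde"));
    }
}
```

For the example string abbbbccccddde, the rare symbols a and e get 3-bit codes and b, c, d get 2-bit codes, for a total of 28 bits.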

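Implementation

    // Imports for this snippet (place at the top of the file)
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.Base64;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // The methods below are static members of the ZipUtil class used in the performance test.
    private static final String GZIP_ENCODE_UTF_8 = "UTF-8";

    // GZIP decompression: Base64-decode the input, then gunzip it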
    public static String gzipUnCompress(String inputString){
        byte[] decode = Base64.getDecoder().decode(inputString);
        return unCompress(decode, GZIP_ENCODE_UTF_8);
    }

    public static String unCompress(byte[] bytes, String encoding){
        if(bytes == null || bytes.length == 0){
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        try{
            GZIPInputStream ungzip = new GZIPInputStream(in);
            byte[] buffer = new byte[256];
            int n;
            while((n = ungzip.read(buffer)) >= 0){
                out.write(buffer, 0, n);
            }
            return out.toString(encoding);
        }catch (Exception e){
            throw new RuntimeException("GzipUnCompressError", e);
        }
    }

    // GZIP compression: gzip the string, then Base64-encode the result
    public static String gzipCompress(String original){
        return Base64.getEncoder().encodeToString(compress(original, GZIP_ENCODE_UTF_8));
    }

    public static byte[] compress(String str, String encoding){
        if(str == null || str.length() == 0){
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gzip ;
        try{
            gzip = new GZIPOutputStream(out);
            gzip.write(str.getBytes(encoding));
            gzip.close();
        }catch (Exception e){
            throw new RuntimeException("GzipCompressError", e);
        }
        return out.toByteArray();
    }

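The Deflate Algorithm

deflate is the default algorithm of zip archives, but it is not limited to zip: gzip, zlib, and PNG also use it, and 7z supports it as an optional method (xz, by contrast, uses LZMA2). deflate is really just an algorithm for compressing a data stream, so it can be used anywhere streaming compression is needed.
deflate is built on LZ77 and Huffman coding.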

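Implementation

    // Additional imports for this snippet (place at the top of the file)
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    // deflate decompression: Base64-decode the input, then inflate it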
    public static String deflateUnCompress(String inputString){
        byte[] bytes = Base64.getDecoder().decode(inputString);
        if(bytes == null || bytes.length == 0){
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        try{
            InflaterInputStream inflater = new InflaterInputStream(in);
            byte[] buffer = new byte[256];
            int n;
            while((n = inflater.read(buffer)) >= 0){
                out.write(buffer, 0, n);
            }
            return out.toString("utf-8");
        }catch (Exception e){
            throw new RuntimeException("DeflaterUnCompressError", e);
        }
    }

    // deflate compression: deflate the string, then Base64-encode the result
    public static String deflateCompress(String original){
        if(original == null || original.length() == 0){
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DeflaterOutputStream deflater ;
        try{
            deflater = new DeflaterOutputStream(out);
            deflater.write(original.getBytes(StandardCharsets.UTF_8));
            deflater.close();
            return Base64.getEncoder().encodeToString(out.toByteArray());
        }catch (Exception e){
            throw new RuntimeException("DeflaterCompressError", e);
        }
    }

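Difference between DeflaterOutputStream and GZIPOutputStream in Java

DeflaterOutputStream writes deflate-compressed data. With the default Deflater the output is in zlib format (a 2-byte header and an Adler-32 checksum); only a Deflater created with nowrap=true produces a raw deflate stream. GZIPOutputStream adds the GZIP framing around the deflate data: the GZIP magic number and header, plus a trailer containing a CRC-32 checksum and the uncompressed length.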
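The framing difference can be observed by compressing the same bytes three ways and looking at the first bytes and the sizes. This is a small sketch using only java.util.zip; the class name `FramingSketch` and the sample text are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

class FramingSketch {
    /** GZIP format: magic number 0x1f 0x8b, header, deflate data, CRC-32 trailer. */
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        }
        return out.toByteArray();
    }

    /** Default DeflaterOutputStream: zlib wrapper around the deflate data. */
    static byte[] zlib(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(input);
        }
        return out.toByteArray();
    }

    /** nowrap = true: raw deflate data without any header or checksum. */
    static byte[] rawDeflate(byte[] input) throws IOException {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(input);
        } finally {
            deflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "deflate framing demo, deflate framing demo, deflate framing demo"
                .getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(input);
        byte[] zl = zlib(input);
        byte[] raw = rawDeflate(input);
        System.out.printf("gzip: %d bytes, starts with %02x %02x%n", gz.length, gz[0], gz[1]);
        System.out.printf("zlib: %d bytes, starts with %02x%n", zl.length, zl[0]);
        System.out.printf("raw : %d bytes%n", raw.length);
    }
}
```

The raw stream is the smallest, the zlib-wrapped stream is slightly larger, and the GZIP stream is the largest because of its header and trailer.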

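The LZO Algorithm

Principle

LZO is a block compression algorithm: it compresses and decompresses one block of data at a time, and the block size must be the same for compression and decompression.
LZO compresses a block into matches (found through a sliding dictionary) and runs of non-matching literals. It has special handling for long matches and long literal runs, so it works well on highly redundant data and also copes reasonably well with incompressible data.
Under the hood it is also based on LZ77.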

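Implementation

    // Additional imports for this snippet (place at the top of the file)
    import org.anarres.lzo.LzoAlgorithm;
    import org.anarres.lzo.LzoCompressor;
    import org.anarres.lzo.LzoDecompressor;
    import org.anarres.lzo.LzoInputStream;
    import org.anarres.lzo.LzoLibrary;
    import org.anarres.lzo.LzoOutputStream;

    // LZO decompression: Base64-decode the input, then decompress it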
    public static String lzoUnCompress(String str){
        LzoDecompressor decompressor = LzoLibrary.getInstance()
                .newDecompressor(LzoAlgorithm.LZO1X, null);
        try{
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ByteArrayInputStream is = new ByteArrayInputStream(Base64.getDecoder().decode(str.getBytes(StandardCharsets.UTF_8)));
            LzoInputStream lis = new LzoInputStream(is, decompressor);
            int count;
            byte[] buffer = new byte[256];
            while((count = lis.read(buffer)) != -1){
                os.write(buffer, 0, count);
            }
            return os.toString("UTF-8");
        }catch (Exception e){
            throw new RuntimeException("lzoUnCompressError", e);
        }
    }

    public static String lzoCompress(String str){
        LzoCompressor compressor = LzoLibrary.getInstance().newCompressor(
                LzoAlgorithm.LZO1X, null);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        LzoOutputStream louts = new LzoOutputStream(os, compressor);
        try{
            louts.write(str.getBytes(StandardCharsets.UTF_8));
            louts.close();
            return Base64.getEncoder().encodeToString(os.toByteArray());
        }catch (Exception e){
            throw new RuntimeException("LzoCompressError", e);
        }
    }

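The LZ4 Algorithm

In the test below, which uses a 2048-byte block size, LZ4 has the poorest compression ratio of the five algorithms.

Implementation

    // Additional imports for this snippet (place at the top of the file)
    import net.jpountz.lz4.LZ4BlockInputStream;
    import net.jpountz.lz4.LZ4BlockOutputStream;
    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;

    // LZ4 decompression: Base64-decode the input, then decompress it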
    public static String lz4UnCompress(String str){
        byte[] decode = Base64.getDecoder().decode(str.getBytes());
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try{
            LZ4BlockInputStream lzis = new LZ4BlockInputStream(
                    new ByteArrayInputStream(decode));
            int count;
            byte[] buffer = new byte[2048];
            while ((count = lzis.read(buffer)) != -1) {
                baos.write(buffer, 0, count);
            }
            lzis.close();
            return baos.toString("utf-8");
        }catch (Exception e){
            throw new RuntimeException("lz4UnCompressError", e);
        }
    }

    // LZ4 compression: compress the string, then Base64-encode the result
    public static String lz4Compress(String str){
        LZ4Factory factory = LZ4Factory.fastestInstance();
        ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
        LZ4Compressor compressor = factory.fastCompressor();
        try{
            LZ4BlockOutputStream compressedOutput = new LZ4BlockOutputStream(
                    byteOutput, 2048, compressor);
            compressedOutput.write(str.getBytes(StandardCharsets.UTF_8));
            compressedOutput.close();
            return Base64.getEncoder().encodeToString(byteOutput.toByteArray());
        }catch (Exception e){
            throw new RuntimeException("lz4CompressError", e);
        }
    }

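The Snappy Algorithm

Based on the performance test below, a preliminary conclusion is that Snappy's compression and decompression speed is excellent and its compression ratio is also fairly good.
Snappy is a fast compression and decompression library written by Google along the lines of LZ77. Its goal is not maximum compression or compatibility with other compression libraries, but very high speed with a reasonable compression ratio.

Implementation

    // Additional import for this snippet (place at the top of the file)
    import org.xerial.snappy.Snappy;

    // Snappy decompression: Base64-decode the input, then decompress it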
    public static String snappyUnCompress(String str){
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try{
            byte[] decode = Base64.getDecoder().decode(str.getBytes());
            baos.write(Snappy.uncompress(decode));
            return baos.toString("UTF-8");
        }catch (Exception e){
            throw new RuntimeException("snappyUnCompressError", e);
        }
    }

    // Snappy compression: compress the string, then Base64-encode the result
    public static String snappyCompress(String str){
        try{
            byte[] compress = Snappy.compress(str);
            return Base64.getEncoder().encodeToString(compress);
        }catch (Exception e){
            throw new RuntimeException("snappyCompressError", e);
        }
    }

All of the implementations above were verified to decompress back to the original input.

Performance Test

JDK 1.8.0_144
Maven dependencies:

		<dependency>
            <groupId>org.xerial.snappy</groupId>
            <artifactId>snappy-java</artifactId>
            <version>1.1.2.6</version>
        </dependency>
        <dependency>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
            <version>1.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.anarres.lzo</groupId>
            <artifactId>lzo-core</artifactId>
            <version>1.0.5</version>
            <exclusions>
                <exclusion>
                    <artifactId>commons-logging</artifactId>
                    <groupId>commons-logging</groupId>
                </exclusion>
            </exclusions>
        </dependency>

The test code:

	File ctqRequest = new File("D:\\Users\\j_zhangn\\Desktop\\ctqIsdRequest.properties");

        StringBuffer sb = new StringBuffer();
        try {
            FileInputStream fin = new FileInputStream(ctqRequest);
            InputStreamReader reader = new InputStreamReader(fin, "UTF-8");
            while (reader.ready()) {
                sb.append((char) reader.read());
            }
            fin.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
        }
        // Result: originalLength = 1029265 characters
        System.out.println("originalLength:" + sb.toString().length());

        // Result: gzipLength = 194292 (Base64 characters), average 48.23 ms per compress + decompress
        long start = System.currentTimeMillis();
        for(int i = 0;i < 100;i++){
            String gzipCompress = ZipUtil.gzipCompress(sb.toString());
            ZipUtil.gzipUnCompress(gzipCompress);
            System.out.println("gzipLength:" + gzipCompress.length());
        }
        System.out.println("gzipCostAverageTime:" + (System.currentTimeMillis() - start) / 100.0);

        // Result: deflateLength = 194276 (Base64 characters), average 31.46 ms per compress + decompress
        start = System.currentTimeMillis();
        for(int i = 0;i < 100;i++) {
            String deflateCompress = ZipUtil.deflateCompress(sb.toString());
            ZipUtil.deflateUnCompress(deflateCompress);
            System.out.println("deflateLength:" + deflateCompress.length());
        }
        System.out.println("deflateCostAverageTime:" + (System.currentTimeMillis() - start) / 100.0);

        // Result: lzoLength = 391508 (Base64 characters), average 20.0 ms per compress + decompress
        start = System.currentTimeMillis();
        for(int i = 0;i < 100;i++) {
            String lzoCompress = ZipUtil.lzoCompress(sb.toString());
            ZipUtil.lzoUnCompress(lzoCompress);
            System.out.println("lzoLength:" + lzoCompress.length());
        }
        System.out.println("lzoCostAverageTime:" + (System.currentTimeMillis() - start) / 100.0);

        // Result: lz4Length = 977604 (Base64 characters), average 20.01 ms per compress + decompress
        start = System.currentTimeMillis();
        for(int i = 0;i < 100;i++) {
            String lz4Compress = ZipUtil.lz4Compress(sb.toString());
            ZipUtil.lz4UnCompress(lz4Compress);
            System.out.println("lz4Length:" + lz4Compress.length());
        }
        System.out.println("lz4CostAverageTime:" + (System.currentTimeMillis() - start) / 100.0);

        // Result: snappyLength = 360388 (Base64 characters), average 10.74 ms per compress + decompress
        start = System.currentTimeMillis();
        for(int i = 0;i < 100;i++) {
            String snappyCompress = ZipUtil.snappyCompress(sb.toString());
            ZipUtil.snappyUnCompress(snappyCompress);
            System.out.println("snappyLength:" + snappyCompress.length());
        }
        System.out.println("snappyCostAverageTime:" + (System.currentTimeMillis() - start) / 100.0);
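The timing loops above measure compression, decompression, Base64 conversion, and console output together. As a hedged alternative, the sketch below times one operation at a time after a few warm-up runs, using System.nanoTime and the JDK's GZIP streams only. The `BenchSketch` class, the `Codec` interface, and the sample data are assumptions for illustration; a benchmarking tool such as JMH would give more reliable numbers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BenchSketch {
    /** Hypothetical functional interface for the operation being timed. */
    interface Codec {
        byte[] apply(byte[] input) throws IOException;
    }

    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(input))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) >= 0) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Runs the codec warmup times untimed, then returns the average time per run in ms. */
    static double averageMillis(Codec codec, byte[] input, int warmup, int runs)
            throws IOException {
        for (int i = 0; i < warmup; i++) {
            codec.apply(input);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            codec.apply(input);
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("key").append(i % 100).append("=value").append(i % 7).append('\n');
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(input);

        System.out.println("ratio: " + (double) packed.length / input.length);
        System.out.println("gzip   avg ms: " + averageMillis(BenchSketch::gzip, input, 20, 100));
        System.out.println("gunzip avg ms: " + averageMillis(BenchSketch::gunzip, packed, 20, 100));
    }
}
```

Measuring the raw byte arrays also keeps the Base64 overhead (about one third more characters than bytes) out of the size comparison.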
