Common: Compression

罗鹏_1022

Category: Hadoop. Tags: hadoop, massive data compression

Article link: https://blog.csdn.net/luopeng123456789/article/details/49020103

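All data stored on a computer contains some redundancy, and data items, especially adjacent ones, are correlated. Data can therefore be stored using a special encoding, different from the original one, that occupies less storage space. This process is compression.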

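In Hadoop, compression is applied to file storage and to the data exchanged between the Map and Reduce phases, among other scenarios. The main considerations are compression speed and whether the compressed file is splittable.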

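Every compression algorithm faces a tradeoff between time and space. In general, the faster the compression, the lower the compression ratio and the less space saved; the slower the compression, the higher the ratio and the more space saved. gzip and zip are general-purpose compression tools that strike a reasonable balance between time and space. bzip2 compresses more effectively than gzip and zip but is slower, and it decompresses faster than it compresses.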
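The tradeoff can be observed with the JDK's java.util.zip.Deflater, which accepts a compression level between BEST_SPEED and BEST_COMPRESSION. The following is a minimal sketch and not Hadoop code: the class name LevelDemo and the sample input are assumptions made for illustration, and the actual sizes and timings depend on the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class LevelDemo {
    /** Compresses data at the given Deflater level and returns the compressed size in bytes. */
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();                       // no more input will follow
            byte[] buf = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {           // drain all compressed output
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end();                          // release native resources
        }
    }

    /** Builds a redundant sample input (hypothetical log-like text). */
    static byte[] sampleInput() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("record ").append(i % 50).append(" status=OK\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleInput();
        System.out.println("original:         " + data.length);
        System.out.println("BEST_SPEED:       " + compressedSize(data, Deflater.BEST_SPEED));
        System.out.println("BEST_COMPRESSION: " + compressedSize(data, Deflater.BEST_COMPRESSION));
    }
}
```

Running main prints the three sizes side by side; wrapping each call with System.nanoTime() would show the time side of the tradeoff.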

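When MapReduce processes compressed files, splittability must be considered. gzip cannot begin decompressing at an arbitrary point in the data stream, so a gzip file cannot be split. A bzip2 file, by contrast, has a 48-bit synchronization marker between blocks, so bzip2 supports splitting.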
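To illustrate why a synchronization marker makes splitting possible, the sketch below scans a byte array bit by bit for a 48-bit pattern, so a reader starting in the middle of a file could skip ahead to the next block boundary. The marker value 0x314159265359 is taken from the bzip2 format and is not stated in the text above; the class and method names are hypothetical, and this is not how Hadoop's own bzip2 codec is implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncMarkerScan {
    /** 48-bit block-start marker (assumed here to be the bzip2 block magic). */
    static final long BLOCK_MARKER = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offsets at which the marker starts; blocks need not be byte-aligned. */
    static List<Long> findMarkers(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long window = 0;      // sliding window holding the last 48 bits seen
        long bitsSeen = 0;
        for (byte b : data) {
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MARKER) {
                    offsets.add(bitsSeen - 48);
                }
            }
        }
        return offsets;
    }

    /** Demo helper: writes the marker into data starting at the given bit offset. */
    static void putMarker(byte[] data, long bitOffset) {
        for (int i = 0; i < 48; i++) {
            long pos = bitOffset + i;
            int bit = (int) ((BLOCK_MARKER >> (47 - i)) & 1);
            int idx = (int) (pos / 8);
            int shift = 7 - (int) (pos % 8);
            data[idx] = (byte) ((data[idx] & ~(1 << shift)) | (bit << shift));
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        putMarker(data, 3);     // markers at arbitrary, non-byte-aligned positions
        putMarker(data, 101);
        System.out.println(findMarkers(data));
    }
}
```

A gzip stream has no comparable marker, which is why a reader cannot find a safe restart point inside it.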

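To support multiple compression/decompression algorithms, Hadoop implements a compression framework made up of three kinds of components: codecs and their factory, compressors/decompressors, and compression/decompression streams. Together they meet users' compression needs.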

The codecs use the Abstract Factory pattern.
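As a rough illustration of the pattern, the sketch below defines a codec interface whose implementations each create a matching family of streams, plus a small registry that selects a codec by file-name suffix. MiniCodec, GzipMiniCodec, DeflateMiniCodec, and MiniCodecFactory are hypothetical names built on the JDK's java.util.zip streams; they only loosely mirror the shape of Hadoop's codec classes and are not its API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class CodecFactoryDemo {
    /** Hypothetical abstract factory: each codec creates a matching family of streams. */
    interface MiniCodec {
        OutputStream createOutputStream(OutputStream out) throws IOException;
        InputStream createInputStream(InputStream in) throws IOException;
        String getDefaultExtension();
    }

    /** Concrete factory backed by the JDK gzip streams. */
    static class GzipMiniCodec implements MiniCodec {
        public OutputStream createOutputStream(OutputStream out) throws IOException {
            return new GZIPOutputStream(out);
        }
        public InputStream createInputStream(InputStream in) throws IOException {
            return new GZIPInputStream(in);
        }
        public String getDefaultExtension() { return ".gz"; }
    }

    /** Concrete factory backed by the JDK zlib/deflate streams. */
    static class DeflateMiniCodec implements MiniCodec {
        public OutputStream createOutputStream(OutputStream out) {
            return new DeflaterOutputStream(out);
        }
        public InputStream createInputStream(InputStream in) {
            return new InflaterInputStream(in);
        }
        public String getDefaultExtension() { return ".deflate"; }
    }

    /** Hypothetical registry that picks a codec from a file-name suffix. */
    static class MiniCodecFactory {
        private final Map<String, MiniCodec> bySuffix = new LinkedHashMap<>();

        MiniCodecFactory(MiniCodec... codecs) {
            for (MiniCodec codec : codecs) {
                bySuffix.put(codec.getDefaultExtension(), codec);
            }
        }

        /** Returns the codec whose extension matches, or null if none does. */
        MiniCodec getCodec(String fileName) {
            for (Map.Entry<String, MiniCodec> e : bySuffix.entrySet()) {
                if (fileName.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
            return null;
        }
    }

    /** Compresses and then decompresses text through the codec's stream pair. */
    static String roundTrip(MiniCodec codec, String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = codec.createOutputStream(sink)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (InputStream in = codec.createInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                byte[] buf = new byte[256];
                for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                    plain.write(buf, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniCodecFactory factory = new MiniCodecFactory(new GzipMiniCodec(), new DeflateMiniCodec());
        MiniCodec codec = factory.getCodec("part-00000.gz");
        System.out.println(codec.getClass().getSimpleName());
        System.out.println(roundTrip(codec, "hello hadoop"));
    }
}
```

Client code depends only on the MiniCodec interface, so supporting another algorithm means adding one more implementation and registering it.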

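①. Compressor
A Compressor can be plugged into a compression output stream implementation to provide the actual compression. Conversely, a Decompressor provides the actual decompression and is plugged into a CompressionInputStream. A Compressor accepts data into an internal buffer; when the buffer is full, compress() is called to fetch the compressed data and free the buffer space. The source code of Compressor is as follows: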

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

/**
 * Specification of a stream-based 'compressor' which can be  
 * plugged into a {@link CompressionOutputStream} to compress data.
 */
public interface Compressor {
  /**
   * Sets input data for compression. 
   * This should be called whenever #needsInput() returns 
   * true indicating that more input data is required.
   * 
   * @param b Input data
   * @param off Start offset
   * @param len Length
   */
  public void setInput(byte[] b, int off, int len);
  
  /**
   * Returns true if the input data buffer is empty and 
   * #setInput() should be called to provide more input. 
   */
  public boolean needsInput();
  
  /**
   * Return number of uncompressed bytes input so far.
   */
  public long getBytesRead();

  /**
   * Return number of compressed bytes output so far.
   */
  public long getBytesWritten();

  /**
   * When called, indicates that compression should end
   * with the current contents of the input buffer.
   */
  public void finish();
  
  /**
   * Returns true if the end of the compressed 
   * data output stream has been reached.
   */
  public boolean finished();
  
  /**
   * Fills specified buffer with compressed data. Returns actual number
   * of bytes of compressed data. A return value of 0 indicates that
   * needsInput() should be called in order to determine if more input
   * data is required.
   * 
   * @param b Buffer for the compressed data
   * @param off Start offset of the data
   * @param len Size of the buffer
   * @return The actual number of bytes of compressed data.
   */
  public int compress(byte[] b, int off, int len) throws IOException;
  
  /**
   * Resets compressor so that a new set of input data can be processed.
   */
  public void reset();
  
  /**
   * Closes the compressor and discards any unprocessed input.
   */
  public void end();

  /**
   * Prepare the compressor to be used in a new stream with settings defined in
   * the given Configuration
   */
  public void reinit(Configuration conf);
}

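Code walkthrough:

A Compressor accepts data into its internal buffer through setInput(). If needsInput() returns false, the buffer is full, and the compressed data must be fetched with compress() to free buffer space. For efficiency, the compressor does not necessarily start working on every call to setInput(), so finish() must be called to tell it that all data has been written. finished() reports whether the end of the compressed output has been reached, that is, whether any compressed data remains to be read. During compression, getBytesRead() and getBytesWritten() return the number of uncompressed bytes input so far and the number of compressed bytes output so far. reset() resets the compressor so that it can process a new set of input data, and reinit() goes a step further by allowing the compressor to be reset with settings from a Hadoop Configuration. (The code of Decompressor is similar.)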
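The call sequence described above can be tried without Hadoop, because the JDK's java.util.zip.Deflater exposes methods with closely corresponding names (setInput, needsInput, finish, finished, reset, end), with deflate() playing the role of compress(). The sketch below is an illustration under that assumption, not a Hadoop Compressor implementation; the class name CompressorProtocolDemo and the small buffer sizes are chosen only for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressorProtocolDemo {
    /** Feeds input in chunks and drains compressed output, following the call sequence above. */
    static byte[] compress(byte[] input, int chunkSize) {
        Deflater deflater = new Deflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        try {
            for (int off = 0; off < input.length; off += chunkSize) {
                int len = Math.min(chunkSize, input.length - off);
                deflater.setInput(input, off, len);        // hand over the next chunk
                while (!deflater.needsInput()) {           // drain until more input is wanted
                    int n = deflater.deflate(buf, 0, buf.length);
                    out.write(buf, 0, n);
                }
            }
            deflater.finish();                             // all data has been written
            while (!deflater.finished()) {                 // fetch the remaining compressed data
                int n = deflater.deflate(buf, 0, buf.length);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();                                // release native resources
        }
    }

    /** Inverse operation, used here only to check the result. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        try {
            inflater.setInput(compressed);
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] input = "aaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbb".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(input, 7);
        System.out.println(input.length + " -> " + packed.length + " bytes");
        System.out.println(new String(decompress(packed), StandardCharsets.UTF_8));
    }
}
```

Note that the output collected before finish() may be empty: the compressor is free to hold data internally until it is told that the input has ended.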

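②. Compression stream (CompressionOutputStream)
The original Java I/O system is stream-based: a stream abstracts any source or sink that can produce or receive data.
The abstract class CompressionOutputStream extends OutputStream. Three of its methods remain abstract and must be implemented by subclasses: write(), which outputs data; finish(), which ends the compression process and writes the remaining data to the underlying stream; and resetState(), which resets the compression state. CompressionOutputStream defines the external interface of a compression stream, and Hadoop also provides a general-purpose implementation built on a compressor: the CompressorStream class.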

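public class CompressorStream extends CompressionOutputStream {

  // Excerpt: the fields compressor, buffer, and out are declared
  // outside the lines shown here.

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    // ...... (elided in the original excerpt)
    // Hand the data to the compressor, then drain compressed output
    // until the compressor is ready for more input.
    compressor.setInput(b, off, len);
    while (!compressor.needsInput()) {
      compress();
    }
  }

  // Pulls one buffer of compressed data from the compressor and
  // writes it to the underlying stream.
  protected void compress() throws IOException {
    int len = compressor.compress(buffer, 0, buffer.length);
    if (len > 0) {
      out.write(buffer, 0, len);
    }
  }

  // Signals the end of input and writes all remaining compressed data.
  @Override
  public void finish() throws IOException {
    if (!compressor.finished()) {
      compressor.finish();
      while (!compressor.finished()) {
        compress();
      }
    }
  }

  // Resets the compressor so that a new set of data can be compressed.
  @Override
  public void resetState() throws IOException {
    compressor.reset();
  }
}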
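The write/finish/resetState structure shown above can be reproduced outside Hadoop as a small runnable class. The sketch below is a hypothetical MiniCompressorStream that wraps the JDK's java.util.zip.Deflater instead of a Hadoop Compressor; the class name, buffer size, and helper methods are assumptions for illustration and do not come from the Hadoop source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.InflaterInputStream;

public class MiniCompressorStream extends OutputStream {
    private final OutputStream out;
    private final Deflater compressor = new Deflater();
    private final byte[] buffer = new byte[512];
    private boolean closed;

    public MiniCompressorStream(OutputStream out) {
        this.out = out;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (compressor.finished()) {
            throw new IOException("write beyond end of stream");
        }
        compressor.setInput(b, off, len);
        while (!compressor.needsInput()) {   // drain until more input is wanted
            compress();
        }
    }

    private void compress() throws IOException {
        int n = compressor.deflate(buffer, 0, buffer.length);
        if (n > 0) {
            out.write(buffer, 0, n);
        }
    }

    /** Ends the current compressed segment without closing the underlying stream. */
    public void finish() throws IOException {
        if (!compressor.finished()) {
            compressor.finish();
            while (!compressor.finished()) {
                compress();
            }
        }
    }

    /** Prepares the stream for a new, independent compressed segment. */
    public void resetState() {
        compressor.reset();
    }

    @Override
    public void close() throws IOException {
        if (!closed) {
            closed = true;
            finish();
            compressor.end();
            out.close();
        }
    }

    /** Decodes one zlib segment produced by this stream. */
    static String decode(byte[] segment) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(segment))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                plain.write(buf, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes two texts as two independent segments and decodes each one. */
    static String[] twoSegments(String first, String second) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            MiniCompressorStream stream = new MiniCompressorStream(sink);
            stream.write(first.getBytes(StandardCharsets.UTF_8));
            stream.finish();
            int cut = sink.size();           // boundary between the two segments
            stream.resetState();
            stream.write(second.getBytes(StandardCharsets.UTF_8));
            stream.close();
            byte[] all = sink.toByteArray();
            return new String[] {decode(Arrays.copyOfRange(all, 0, cut)),
                                 decode(Arrays.copyOfRange(all, cut, all.length))};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(twoSegments("map output", "reduce input")));
    }
}
```

The JDK already ships DeflaterOutputStream for everyday use; the point of spelling the class out is to show how finish() followed by resetState() lets one stream emit several independently decodable segments.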
