The zstd Compression Algorithm

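1.    Introduction to Zstandard

Zstandard (or Zstd) is a lossless data compression algorithm developed by Yann Collet at Facebook. It is designed to reach a compression ratio comparable to DEFLATE (.zip, gzip) while compressing and decompressing much faster. In the benchmarks published on its project page (https://github.com/facebook/zstd), Zstandard achieves a higher compression ratio than snappy and lzo at comparable speeds.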

Compressor name        Ratio    Compression    Decompression
zstd 1.4.5 -1          2.884    500 MB/s       1660 MB/s
zlib 1.2.11 -1         2.743    90 MB/s        400 MB/s
brotli 1.0.7 -0        2.703    400 MB/s       450 MB/s
zstd 1.4.5 --fast=1    2.434    570 MB/s       2200 MB/s
zstd 1.4.5 --fast=3    2.312    640 MB/s       2300 MB/s
quicklz 1.5.0 -1       2.238    560 MB/s       710 MB/s
zstd 1.4.5 --fast=5    2.178    700 MB/s       2420 MB/s
lzo1x 2.10 -1          2.106    690 MB/s       820 MB/s
lz4 1.9.2              2.101    740 MB/s       4530 MB/s
zstd 1.4.5 --fast=7    2.096    750 MB/s       2480 MB/s
lzf 3.6 -1             2.077    410 MB/s       860 MB/s
snappy 1.1.8           2.073    560 MB/s       1790 MB/s

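Zstd's --fast parameter trades compression ratio for speed: the higher the speed, the lower the compression ratio. In Hive 3.1.1, ORC uses zlib as its default compression algorithm (set by the orc.compress parameter in the OrcConf class), while the Parquet format is uncompressed by default. In the table above, zstd at level 1 (the highest-ratio zstd setting listed) compresses 5.56 times as fast as zlib (500 MB/s vs. 90 MB/s) and decompresses 4.15 times as fast (1660 MB/s vs. 400 MB/s). So if Hive's ORC and Parquet formats used zstd by default, the map phase would spend far less time decompressing the data it reads and the reduce phase less time compressing its output, which should improve Hive's overall performance.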
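The two speedup figures can be rechecked directly from the table. The sketch below is only a worked calculation; the helper name ratio is made up for illustration, and the numbers are copied from the zstd -1 and zlib -1 rows.

```shell
# ratio A B  ->  prints A / B rounded to two decimals
# (hypothetical helper, not part of zstd or Hive)
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Throughput figures in MB/s, copied from the benchmark table.
zstd_comp=500
zlib_comp=90
zstd_decomp=1660
zlib_decomp=400

echo "compression speedup:   $(ratio "$zstd_comp" "$zlib_comp")x"
echo "decompression speedup: $(ratio "$zstd_decomp" "$zlib_decomp")x"
```

Swapping in the figures of any other two rows gives the corresponding comparison, for example zstd --fast=1 against snappy.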

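2.    Enabling Zstd compression in Hadoop

HADOOP-13578 (https://issues.apache.org/jira/browse/HADOOP-13578) added a native Zstd compression library to Hadoop 3; it depends on Facebook's Zstd library. The steps for enabling the Zstd native library when building Hadoop are as follows:

1.     Download, build, and install the Zstd library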

wget https://github.com/facebook/zstd/releases/download/v1.4.4/zstd-1.4.4.tar.gz

tar -xzf zstd-1.4.4.tar.gz

cd zstd-1.4.4

make && make install
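Before moving on to the Hadoop build, it can be worth confirming that the installed zstd binary works. The following is a minimal sketch, assuming the zstd command ended up on PATH after make install; the function name zstd_roundtrip and the sample data are made up for illustration.

```shell
# Round-trip a small generated file through the zstd CLI and compare the result.
# Skips quietly when zstd is not on PATH.
zstd_roundtrip() {
    command -v zstd >/dev/null 2>&1 || { echo "zstd not found on PATH"; return 0; }
    workdir=$(mktemp -d) || return 1
    # Hypothetical sample data: 10,000 identical lines.
    awk 'BEGIN { for (i = 0; i < 10000; i++) print "zstd round-trip sample line" }' \
        > "$workdir/sample.txt"
    if zstd -q -1 "$workdir/sample.txt" -o "$workdir/sample.txt.zst" &&
       zstd -q -d "$workdir/sample.txt.zst" -o "$workdir/restored.txt" &&
       cmp -s "$workdir/sample.txt" "$workdir/restored.txt"; then
        echo "round trip OK"
        status=0
    else
        echo "round trip FAILED"
        status=1
    fi
    rm -rf "$workdir"
    return "$status"
}

zstd --version 2>/dev/null || true
zstd_roundtrip
```

If the binary cannot find its shared library on Linux, refreshing the linker cache (ldconfig) after installation is a common remedy.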

2.     Zstd support is not enabled by default when building Hadoop 3; it has to be turned on through Maven parameters.

mvn clean package -Dzstd.lib=/usr/local/lib   -Dbundle.zstd=true

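The zstd.lib parameter points to the directory that holds the local zstd library, and bundle.zstd copies that library into the build output. bundle.zstd requires zstd.lib to be set; if the library cannot be located, the build fails. Hadoop's native code is compiled only when the native profile (-Pnative) is active, so add it if your build does not already use it.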
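Once the build is deployed, the hadoop checknative command reports which native libraries were loaded. The sketch below wraps that check; it assumes the hadoop launcher is on PATH and that the report contains a line of the form "zstd : true <path>", which is an assumption about the output format rather than something stated in this article. The function name zstd_native_loaded is made up for illustration.

```shell
# Returns 0 when `hadoop checknative` output on stdin reports zstd as loaded.
# The "zstd : true <path>" line format is an assumption about that output.
zstd_native_loaded() {
    grep -Eiq '^[[:space:]]*zstd[[:space:]]*:[[:space:]]*true'
}

# Run the check only where the hadoop launcher is available.
if command -v hadoop >/dev/null 2>&1; then
    if hadoop checknative -a 2>/dev/null | zstd_native_loaded; then
        echo "zstd native codec available"
    else
        echo "zstd native codec NOT available"
    fi
fi
```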

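3.    Setting ZSTD as the default compression algorithm for Hive ORC

ORC-363 (https://jira.apache.org/jira/browse/ORC-363) added the zStandard compression algorithm, with 1.6 as the affected version. Hive 3.1.1 uses orc-1.5.1, so ORC needs to be upgraded to orc-1.6.3 (upstream Hive does not currently support orc-1.6).

There are two ways to set the compression algorithm for the ORC format in Hive: 1. add the property "orc.compress"="ZSTD" to TBLPROPERTIES when creating the table; 2. set the Hive parameter hive.exec.orc.default.compress=ZSTD. The first has to be applied to every table individually, whereas the second applies to Hive globally and is more convenient. Adding the following configuration to hive-site.xml therefore enables ZSTD compression for ORC.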

<property>
    <name>hive.exec.orc.default.compress</name>
    <value>ZSTD</value>
    <description>Values available in orc-1.6.0: NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD</description>
</property>
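For the first, per-table approach, the property goes into the table definition itself. The sketch below writes such a definition to a script file and runs it when the Hive CLI is present; the table name demo_orc_zstd and its columns are hypothetical, and it assumes an ORC version with ZSTD support as described above.

```shell
# Write a HiveQL script; the table and column names are hypothetical.
ddl_file=$(mktemp)
cat > "$ddl_file" <<'EOF'
CREATE TABLE IF NOT EXISTS demo_orc_zstd (
    id   BIGINT,
    name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");
EOF

# Run it only where the Hive CLI is available.
if command -v hive >/dev/null 2>&1; then
    hive -f "$ddl_file"
fi
```

A per-table property like this is useful when only a few tables should switch to ZSTD while the global default stays unchanged.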
 

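4.    Setting ZSTD as the default compression algorithm for Hive Parquet

Hive Parquet applies no compression by default. There are two ways to change the compression algorithm:

1. Set the property "parquet.compression"="zstd" in TBLPROPERTIES;

2. Set the Hadoop parameters below to specify the Parquet compression algorithm: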

<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
</property>
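The table-level option from item 1 can be sketched in the same way. The table name demo_parquet_zstd and its columns are hypothetical; the statement is executed only when the Hive CLI is on PATH and is otherwise just printed.

```shell
# Table and column names are hypothetical.
parquet_ddl='CREATE TABLE IF NOT EXISTS demo_parquet_zstd (id BIGINT, name STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="zstd");'

if command -v hive >/dev/null 2>&1; then
    hive -e "$parquet_ddl"
else
    printf '%s\n' "$parquet_ddl"
fi
```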
