Hive表 Parquet压缩 , Gzip,Snappy,uncompressed 效果对比

 

创建两张表,通过一种是parquet , 一种使用parquet snappy压缩

创建表

使用snappy
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY');

使用gzip
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='GZIP');

使用uncompressed
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='UNCOMPRESSED');

使用默认
CREATE EXTERNAL TABLE IF NOT EXISTS tableName(xxx string)
partitioned by
(pt_xvc string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS PARQUET;

也可以在执行语句前执行 set parquet.compression=SNAPPY; 会对之后跑的数据进行压缩,之前已经存在的不会进行snappy压缩
通过 desc formatted tableName 查看表结构

使用parquet snappy

Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        numFiles                25                  
        numPartitions           1                   
        numRows                 0                   
        parquet.compression     SNAPPY              
        rawDataSize             0                   
        totalSize               4570350557          
        transient_lastDdlTime   1552269085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \u0001              
        serialization.format    \u0001              

使用parquet默认

Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        numFiles                25                  
        numPartitions           1                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               4570650197          
        transient_lastDdlTime   1552269039          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \u0001              
        serialization.format    \u0001      

测试数据量:20208432 

UNCOMPRESSED    :4570325699
PARQUET 默认    :4570650197
parquet gzip    :4570314033
parquet snappy  :4570350557
textfile        :10356207038

 

通过对比发现,当数据量较少时parquet各压缩方式差别不大,但相比TEXTFILE压缩减少了1倍以上,后续再做一下性能对比测试一下。

评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值