How to Use LZO Compression in CDH

在这条路上一直走下去

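1. Problem Description
CDH does not support LZO compression out of the box. An additional Parcel has to be downloaded before Hadoop components such as HDFS, Hive, and Spark can use the LZO codec.
For details, see: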
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_gpl_extras.html

https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html#xd_583c10bfdbd326ba-3ca24a24-13d80143249--7ec6
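First, without any extra configuration, I try to generate an LZO file and read it back. We create two tables in Hive, test_table and test_table2: test_table is a plain-text table, and test_table2 is the table that will hold LZO-compressed data. As follows: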


create external table test_table
(
  s1 string,
  s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table';

insert into test_table values ('1','a'),('2','b');

create external table test_table2
(
  s1 string,
  s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table2';

Connect to Hive through beeline and run the statements above.

Next, insert the data from test_table into test_table2, with the output files set to LZO compression:

set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

Hive fails with the following error:

Error:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

The YARN web UI on port 8088 shows that the job failed because the LZO codec could not be found:

Compression codec com.hadoop.compression.lzo.LzoCodec was not found.

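2. Solution

On the Parcel page of Cloudera Manager, configure the repository URL of the LZO (GPL Extras) Parcel:
Note: if the cluster cannot reach the public internet, download the Parcel in advance and publish it through httpd.
Download -> Distribute -> Activate

Add the LZO codecs to the HDFS compression codec configuration: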


com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec


Save the changes, deploy the client configuration, and restart the entire cluster.

Wait for the restart to complete.
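
One way to check that the new codec list reached Hive after the restart is to print the property from beeline. This is only a sketch: it assumes the Cloudera Manager codec setting maps to the Hadoop property io.compression.codecs and that the session is allowed to read it. The exact list printed depends on the cluster.

```sql
-- Print the codec list the current Hive session sees.
-- "set <key>;" with no value only displays the property, it does not change it.
set io.compression.codecs;
```

If com.hadoop.compression.lzo.LzoCodec is missing from the output, the client configuration has probably not been redeployed yet.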

Insert the data into test_table2 again, with the LZO codec set:


set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

This time the insert succeeds.
2.1 Hive Verification
First, confirm that the files under test_table2 are in LZO format.

Then run a test query from Hive's beeline.
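
A minimal sketch of beeline commands for these two checks. It assumes the session is permitted to run dfs commands through HiveServer2 (some authorization setups disable them) and that the table still lives at the /lilei/test_table2 location used above.

```sql
-- List the table directory; files written with LzoCodec carry the .lzo_deflate suffix.
dfs -ls /lilei/test_table2;

-- Read the compressed data back.
select * from test_table2;

-- An aggregate query; unless it is answered from table statistics,
-- it runs a job that has to decompress the file.
select count(*) from test_table2;
```

If everything is set up correctly, the select should return the two rows that were inserted into test_table earlier.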

Hive runs normally against the LZO-compressed file.
2.2 Spark SQL Verification


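// Run in spark-shell: read the LZO-compressed file directly from HDFS
val textFile = sc.textFile("hdfs://ip-172-31-8-141:8020/lilei/test_table2/000000_0.lzo_deflate")

// Count the records in the file
textFile.count()

// Query the same table through Spark SQL and print the rows
sqlContext.sql("select * from test_table2").show()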

Spark SQL runs normally against the LZO-compressed file.
