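This article walks through an example of configuring LZO compression on CDH5.
1 Download the parcel (choose the version that matches your system) from http://archive-primary.cloudera.com/gplextras/parcels/latest/ . Download both the .parcel file and manifest.json, then look up the parcel's hash value in manifest.json and write it to a .parcel.sha file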
[root@cent-1 lzo]# ls
HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel.sha manifest.json
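The hash lookup in step 1 can be scripted. What follows is a minimal sketch, assuming manifest.json lists each parcel as an object whose "parcelName" field appears before its "hash" field. The function name parcel_hash is hypothetical and not part of any Cloudera tool.

```shell
# parcel_hash MANIFEST PARCEL_NAME
# Prints the "hash" value recorded for PARCEL_NAME in a manifest.json file.
# Assumption: within each parcel entry, "parcelName" precedes "hash".
parcel_hash() {
  manifest="$1"
  parcel="$2"
  # Keep only the "parcelName" and "hash" key/value pairs, one per line,
  # then print the first hash that follows the matching parcel name.
  grep -oE '"(parcelName|hash)"[[:space:]]*:[[:space:]]*"[^"]*"' "$manifest" |
    awk -F'"' -v p="$parcel" '
      $2 == "parcelName" { found = ($4 == p) }
      $2 == "hash" && found { print $4; exit }
    '
}

# Example usage (file names taken from the listing above):
# parcel_hash manifest.json HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel \
#   > HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel.sha
```

If the function prints nothing, the parcel name was not found in the manifest, so the .parcel.sha file should not be created from its output.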
2 Copy the downloaded .parcel file and the generated .parcel.sha file to /opt/cloudera/parcel-repo
[root@cent-1 parcel-repo]# pwd
/opt/cloudera/parcel-repo
[root@cent-1 parcel-repo]# ls
CDH-5.6.0-1.cdh5.6.0.p0.45-el6.parcel CDH-5.6.0-1.cdh5.6.0.p0.45-el6.parcel.sha HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel.sha
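Before activating the parcel, it can help to confirm that the .parcel.sha file matches the parcel itself. The sketch below assumes the .sha file holds the SHA-1 digest of the parcel and that the sha1sum utility is installed. The function name verify_parcel_sha is hypothetical.

```shell
# verify_parcel_sha PARCEL_FILE
# Succeeds when PARCEL_FILE.sha holds the SHA-1 digest of PARCEL_FILE.
# Assumption: the .sha file contains only the hex digest (whitespace ignored).
verify_parcel_sha() {
  parcel="$1"
  expected=$(tr -d '[:space:]' < "$parcel.sha")
  actual=$(sha1sum "$parcel" | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $parcel"
  else
    echo "MISMATCH: $parcel" >&2
    return 1
  fi
}

# Example usage:
# verify_parcel_sha /opt/cloudera/parcel-repo/HADOOP_LZO-0.4.15-1.gplextras.p0.123-el6.parcel
```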
3 Open Cloudera Manager, find HADOOP_LZO under Hosts -> Parcels, then distribute and activate it
4 Modify the HDFS configuration: append com.hadoop.compression.lzo.LzopCodec to the value of io.compression.codecs
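Because io.compression.codecs is a comma-separated list, the codec should be appended to the existing value rather than replacing it. The following sketch illustrates that append on a plain string. The function name append_codec is hypothetical, and the starting value in the usage line is only an example, not necessarily your cluster's current setting.

```shell
# append_codec CURRENT_VALUE CODEC
# Prints CURRENT_VALUE with CODEC appended, unless CODEC is already listed.
append_codec() {
  current="$1"
  codec="$2"
  case ",$current," in
    *",$codec,"*)
      # Already present: leave the list unchanged.
      printf '%s\n' "$current"
      ;;
    *)
      if [ -z "$current" ]; then
        printf '%s\n' "$codec"
      else
        printf '%s\n' "$current,$codec"
      fi
      ;;
  esac
}

# Example usage (the starting value is illustrative):
# append_codec "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec" \
#   "com.hadoop.compression.lzo.LzopCodec"
```

The printed string is what would be entered as the new property value in the HDFS configuration page.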
5 Modify the YARN configuration (install YARN first if it is not present). Set mapreduce.application.classpath to:
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
and set mapreduce.admin.user.env to:
LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
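Both YARN settings point into the activated parcel, so a quick check that those directories exist on a host can catch an inactive or partially distributed parcel. This is a minimal sketch. The function name check_lzo_parcel is hypothetical, and the default root is the parcel path used in the settings above.

```shell
# check_lzo_parcel [PARCEL_ROOT]
# Succeeds when the jar directory and the native library directory
# referenced by the YARN settings both exist under PARCEL_ROOT.
check_lzo_parcel() {
  root="${1:-/opt/cloudera/parcels/HADOOP_LZO}"
  for d in "$root/lib/hadoop/lib" "$root/lib/hadoop/lib/native"; do
    if [ ! -d "$d" ]; then
      echo "missing: $d" >&2
      return 1
    fi
  done
  echo "ok: $root"
}

# Example usage on a cluster host:
# check_lzo_parcel
```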
6 Enter the Hive CLI and create a table that reads LZO-compressed data
hive> create external table lzo(id int,name string)
> row format delimited fields terminated by ','
> STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> location '/user/hive';
OK
Time taken: 0.867 seconds
7 Create a test file, compress it with the lzop command, and upload it to HDFS
[root@cent-1 ~]# cat lzo.txt
1, AAA
2, BBB
3, CCC
4, DDD
5, EEE
[root@cent-1 ~]# lzop lzo.txt
[root@cent-1 ~]# ls
lzo.txt lzo.txt.lzo
[root@cent-1 ~]# cat lzo.txt.lzo
▒LZO
0 @▒▒Xs
lzo.txtD▒▒##v▒▒1, AAA
2, BBB
3, CCC
4, DDD
5, EEE
[hdfs@cent-1 ~]$ hadoop fs -copyFromLocal lzo.txt.lzo /user/hive/
[hdfs@cent-1 ~]$ hadoop fs -ls /user/hive
Found 2 items
-rw-r--r-- 3 hdfs hive 96 2017-01-09 11:32 /user/hive/lzo.txt.lzo
drwxrwxrwt - hive hive 0 2017-01-03 11:20 /user/hive/warehouse
8 Query the table from the Hive CLI
hive> select * from lzo;
OK
1 AAA
2 BBB
3 CCC
4 DDD
5 EEE
9 Note: to insert data into the Hive LZO table with insert inside a session, run the following two SET commands before the insert
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
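To keep the two SET commands and the insert together in one session, they can be generated as a single script and passed to hive -e. The sketch below only builds the HiveQL text. The function name lzo_insert_sql and the source table name staging are hypothetical.

```shell
# lzo_insert_sql TARGET_TABLE SOURCE_TABLE
# Prints a HiveQL script: the two SET commands, then an INSERT ... SELECT.
lzo_insert_sql() {
  target="$1"
  source_table="$2"
  cat <<EOF
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT INTO TABLE $target SELECT * FROM $source_table;
EOF
}

# Example usage (staging is a hypothetical source table):
# hive -e "$(lzo_insert_sql lzo staging)"
```

Running all three statements in one hive -e call matters because SET values apply only to the session in which they are issued.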