Configuring LZO compression support for HDFS

This article is adapted from https://blog.csdn.net/weixin_40420525/article/details/84869883; I followed it in practice and noted the problems I ran into along the way.

1. Prerequisites

1. Java and Maven
2. The build dependencies (if you have compiled Hadoop before, these should already be installed):

	yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool

2. Install lzo

	[hadoop@hadoop software]$ wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
	[hadoop@hadoop software]$ tar -zxvf lzo-2.06.tar.gz -C ../app

	[hadoop@hadoop app]$ cd lzo-2.06/
	[hadoop@hadoop lzo-2.06]$ export CFLAGS=-m64
	
	# Create a directory to hold the compiled lzo
	[hadoop@hadoop lzo-2.06]$ mkdir lzo
	
	# Point the install at that directory
	[hadoop@hadoop lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/lzo/
	
	# Build and install
	[hadoop@hadoop lzo-2.06]$ make && make install
	
	# Verify the build; output like the following means it succeeded
	[hadoop@hadoop lzo-2.06]$ cd lzo/
	[hadoop@hadoop lzo]$ ll
	total 12
	drwxrwxr-x 3 hadoop hadoop 4096 Dec  6 17:08 include
	drwxrwxr-x 2 hadoop hadoop 4096 Dec  6 17:08 lib
	drwxrwxr-x 3 hadoop hadoop 4096 Dec  6 17:08 share

3. Install hadoop-lzo

3.1 Download and unpack
	[hadoop@hadoop software]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip

	# Unzip; -d specifies the target directory
	[hadoop@hadoop software]$ unzip master.zip -d ../app
	
	# If unzip is missing, install it with yum (requires root)
	[root@hadoop ~]# yum -y install unzip
3.2 Edit pom.xml in the unpacked directory
		<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
		<!-- Set this to your Hadoop version; mine is hadoop-2.6.0-cdh5.7.0 -->
		<hadoop.current.version>2.6.0</hadoop.current.version>
		<hadoop.old.version>1.0.4</hadoop.old.version>
  		</properties>
3.3 Set the build environment variables
	[hadoop@hadoop app]$ cd hadoop-lzo-master/
	[hadoop@hadoop hadoop-lzo-master]$ export CFLAGS=-m64
	[hadoop@hadoop hadoop-lzo-master]$  export CXXFLAGS=-m64
	[hadoop@hadoop hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/lzo/include/     # the include directory of the lzo built above
	[hadoop@hadoop hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib/           # the lib directory of the lzo built above
3.4 Build
	[root@hadoop hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true  

If this step fails, try running it as root; I hit that problem myself.
BUILD SUCCESS in the Maven output means it worked.

3.5 Install the build artifacts
	# List the build output
	[hadoop@hadoop hadoop-lzo-master]$ ll
	total 80
	-rw-rw-r--  1 hadoop hadoop 35147 Oct 13  2017 COPYING
	-rw-rw-r--  1 hadoop hadoop 19753 Dec  6 17:18 pom.xml
	-rw-rw-r--  1 hadoop hadoop 10170 Oct 13  2017 README.md
	drwxrwxr-x  2 hadoop hadoop  4096 Oct 13  2017 scripts
	drwxrwxr-x  4 hadoop hadoop  4096 Oct 13  2017 src
	drwxrwxr-x 10 hadoop hadoop  4096 Dec  6 17:21 target
	
	# Enter target/native/Linux-amd64-64 and run the following
	[hadoop@hadoop hadoop-lzo-master]$ cd target/native/Linux-amd64-64
	[hadoop@hadoop Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
	./
	./libgplcompression.so
	./libgplcompression.so.0
	./libgplcompression.la
	./libgplcompression.a
	./libgplcompression.so.0.0.
	[hadoop@hadoop Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
	
	
	# Important: copy hadoop-lzo-0.4.21-SNAPSHOT.jar into the Hadoop installation
	[hadoop@hadoop hadoop-lzo-master]$  cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/ 
	[hadoop@hadoop hadoop-lzo-master]$  cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
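The `tar ... | tar ...` pipe in step 3.5 copies the native library directory while preserving symlinks and permissions, which matters because `libgplcompression.so` is typically a symlink onto the real versioned `.so` file. A minimal standalone sketch of the same trick, using throwaway directories and a made-up library name rather than the real build output:

```shell
# Demonstrate the tar-pipe copy on temporary directories.
src=$(mktemp -d)
dst=$(mktemp -d)

# Fake a library layout: a real file plus a symlink to it,
# mirroring libgplcompression.so -> libgplcompression.so.0.0.0.
echo "payload" > "$src/libdemo.so.0.0.0"
ln -s libdemo.so.0.0.0 "$src/libdemo.so"

# -C changes directory before archiving/extracting, so only the
# tree's contents move; the pipe never writes a tarball to disk.
tar -cf - -C "$src" . | tar -xf - -C "$dst"

# The symlink survives as a symlink, not as a duplicated file.
test -L "$dst/libdemo.so" && echo "symlink preserved"
```

(The `-B` flag used in the original command just reblocks the stream for pipes; plain `-cf`/`-xf` is enough to show the idea.)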

4. Update the Hadoop configuration files

4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh
	export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib
4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
    	 <name>io.compression.codecs</name>
   		 <value>org.apache.hadoop.io.compress.GzipCodec,
	            org.apache.hadoop.io.compress.DefaultCodec,
	            org.apache.hadoop.io.compress.BZip2Codec,
	            com.hadoop.compression.lzo.LzoCodec,
	            com.hadoop.compression.lzo.LzopCodec
    	</value>
</property>
<property>
   		 <name>io.compression.codec.lzo.class</name>
   		 <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
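A note on the `io.compression.codecs` value above: it is a single comma-separated list, and a typo in any class name typically only surfaces later, at job time, as a class-loading error. A quick shell sketch (with the value inlined here for illustration) that splits the list and counts the entries, which should come out to the five codecs configured:

```shell
# The codec list exactly as it appears in core-site.xml, newlines and all.
conf='org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec'

# Split on commas and trim whitespace so each codec sits on its own line.
codecs=$(printf '%s\n' "$conf" | tr ',' '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
printf '%s\n' "$codecs"
printf '%s\n' "$codecs" | grep -c .   # prints 5
```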
4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
	<property>
	    <name>mapred.child.env</name>
	    <value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/lzo/lib</value>
	</property>
	<property>
	    <name>mapreduce.map.output.compress</name>
	    <value>true</value>
	</property>
	<property>
	    <name>mapreduce.map.output.compress.codec</name>
	    <value>com.hadoop.compression.lzo.LzoCodec</value>
	</property>

5. Test the configuration

	# Prepare a test file of about 600 MB
	[hadoop@hadoop001 data]$ ll -h
	-rw-r--r-- 1 hadoop hadoop 601M Apr 15 09:54 gen_logs
	
	# Compress it with lzo
	[hadoop@hadoop001 data]$ lzop gen_logs
	[hadoop@hadoop001 data]$ ll -h
	-rw-r--r-- 1 hadoop hadoop 601M Apr 15 09:54 gen_logs
	-rw-r--r-- 1 hadoop hadoop 231M Apr 15 09:54 gen_logs.lzo
	
	# Upload the compressed file to HDFS
	[hadoop@hadoop001 data]$ hadoop fs -put gen_logs.lzo /log
	[hadoop@hadoop ~]$ hadoop fs -ls /log
	Found 1 items
	-rw-r--r--   1 hadoop supergroup  241258919 2019-04-16 13:40 /log/gen_logs.lzo

	# Run a wordcount over it
	[hadoop@hadoop mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /log/gen_logs.lzo /output

The job reports number of splits:1, a single split:

	19/04/17 14:30:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
	19/04/17 14:30:07 INFO input.FileInputFormat: Total input paths to process : 1
	19/04/17 14:30:07 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
	19/04/17 14:30:07 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
	19/04/17 14:30:08 INFO mapreduce.JobSubmitter: number of splits:1

Plain gzip files are not splittable, and neither are lzo files by default; lzo, however, can be made splittable by building an index for the file, so we create one:

	[hadoop@hadoop000 hadoop]$ hadoop jar \
	share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
	com.hadoop.compression.lzo.DistributedLzoIndexer \
	/log/gen_logs.lzo

	# Check that the index file was generated
	[hadoop@hadoop hadoop]$ hadoop fs -ls /log
	Found 2 items
	-rw-r--r--   1 hadoop supergroup  241258919 2019-04-16 13:40 /log/gen_logs.lzo
	-rw-r--r--   1 hadoop supergroup      19208 2019-04-16 13:50 /log/gen_logs.lzo.index

Generating the index file alone is not enough: the job also has to be told to use LzoTextInputFormat as its input format.
Otherwise the index file is treated as just another input file, and the job still runs a single map over the whole archive.

	[hadoop@hadoop mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /log/gen_logs.lzo /output
	19/04/17 14:47:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
	19/04/17 14:47:51 INFO input.FileInputFormat: Total input paths to process : 1
	19/04/17 14:47:52 INFO mapreduce.JobSubmitter: number of splits:2
	......

The job now reports number of splits:2, so the index works.
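The split count can be sanity-checked with a little arithmetic: FileInputFormat produces roughly ceil(file size / block size) splits for a splittable input. Assuming the default 128 MB HDFS block size (for indexed lzo the actual boundaries snap to index points, so this is an estimate), the 241258919-byte gen_logs.lzo should yield 2 splits, matching the job output above:

```shell
# Estimate the split count as ceil(filesize / blocksize).
filesize=241258919                 # size of gen_logs.lzo from `hadoop fs -ls`
blocksize=$((128 * 1024 * 1024))   # assumed default HDFS block size

# Integer ceiling division: (a + b - 1) / b
splits=$(( (filesize + blocksize - 1) / blocksize ))
echo "expected splits: $splits"    # prints "expected splits: 2"
```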
