Building Hadoop 3.2.2 from Source with LZO Compression Support

Part 1: Building the Hadoop 3.2.2 source

1. Official build guide:

https://github.com/apache/hadoop/blob/trunk/BUILDING.txt

2. Build environment

Virtual machine: VM15
Linux distribution: CentOS 7
JDK: 1.8
CMake: 3.20.2
Hadoop: 3.2.2
Maven: 3.8.4
Protobuf: 2.5.0
FindBugs (optional): findbugs-3.0.1
Apache Ant (optional): apache-ant-1.10.12

Download links:

Hadoop:
https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz

cmake:
https://cmake.org/files/v3.20/cmake-3.20.2.tar.gz

Protobuf:
https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

findbugs:
http://prdownloads.sourceforge.net/findbugs/findbugs-3.0.1.tar.gz?download

apache-ant:
https://dlcdn.apache.org//ant/binaries/apache-ant-1.10.12-bin.tar.gz

2.1 Install dependencies

As the root user, run the following command to install the build dependencies:

yum install gcc gcc-c++ gcc-header make autoconf automake libtool curl lzo-devel zlib-devel openssl openssl-devel ncurses-devel snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop libXtst zlib -y

Java and Maven: already installed earlier, Java under the root user and Maven under the regular (non-root) user.
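
A quick check that both are on the PATH and report the versions listed above:

java -version
mvn -version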

Install Protobuf

tar -zxvf protobuf-2.5.0.tar.gz -C /opt/
cd /opt/protobuf-2.5.0/
./configure && make && make check && make install

ldconfig

Verify the installation: protoc --version

Install CMake

tar -zxvf cmake-3.20.2.tar.gz -C /opt/
cd /opt/cmake-3.20.2/
./configure
make && make install

ldconfig

Verify the installation: cmake --version

Install FindBugs

tar -zxvf findbugs-3.0.1.tar.gz\?download -C /opt/

Install Apache Ant

tar -zxvf apache-ant-1.10.12-bin.tar.gz -C /opt/

Configure environment variables in /etc/profile:

export PROTOBUF_HOME=/opt/protobuf-2.5.0
export ANT_HOME=/opt/apache-ant-1.10.12
export CMAKE_HOME=/opt/cmake-3.20.2
export FIND_BUGS_HOME=/opt/findbugs-3.0.1

export PATH=$PROTOBUF_HOME:$CMAKE_HOME/bin:$ANT_HOME/bin:$FIND_BUGS_HOME/bin:$PATH
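
After editing /etc/profile, reload it and confirm the tools are found:

source /etc/profile
protoc --version     # should print: libprotoc 2.5.0
cmake --version
ant -version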

3. Build hadoop-3.2.2

Switch to the regular user, unpack the source, and run the build:

tar -zxvf hadoop-3.2.2-src.tar.gz -C ~/sourcecode/
cd ~/sourcecode/hadoop-3.2.2-src/
mvn clean package -DskipTests -Pdist,native -Dtar
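
If you want the build to fail fast when a required native library is missing, the upstream BUILDING.txt documents -Drequire.* switches; a variant of the same command (these flags come from the build guide, not from the original post):

mvn clean package -DskipTests -Pdist,native -Dtar -Drequire.snappy -Drequire.openssl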

Build result: (screenshot omitted)
After a successful build, the distribution tarball hadoop-3.2.2.tar.gz appears under hadoop-dist/target/.
How to deploy hadoop-3.2.2.tar.gz is not covered here.
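
A quick sanity check that the tarball exists:

ls -lh hadoop-dist/target/hadoop-3.2.2.tar.gz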

Part 2: Adding LZO compression support

I wrote an earlier post on this topic that may be useful as a reference:

How Hadoop splits LZO-compressed files

Hadoop does not support LZO compression out of the box. You can check with hadoop checknative; lzo does not appear in the list:

[ruoze@hadoop001 hadoop]$ hadoop checknative
2022-01-27 13:41:59,618 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:41:59,620 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-01-27 13:41:59,623 WARN zstd.ZStandardCompressor: Error loading zstandard native libraries: java.lang.InternalError: Cannot load libzstd.so.1 (libzstd.so.1: cannot open shared object file: No such file or directory)!
2022-01-27 13:41:59,629 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
2022-01-27 13:41:59,684 INFO nativeio.NativeIO: The native code was built without PMDK support.
Native library checking:
hadoop:  true /home/ruoze/app/hadoop-3.2.2/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
zstd  :  false 
snappy:  true /lib64/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L:   false libhadoop was built without ISA-L support
PMDK:    false The native code was built without PMDK support.
1. Install LZO and lzop on Linux

They are already installed here (the yum command above includes lzo, lzo-devel and lzop); install them yourself if they are missing:

[ruoze@hadoop001 data]$ which lzop
/bin/lzop

Compress with LZO: lzop -v file
Decompress: lzop -dv file
2. Download and unpack the hadoop-lzo source
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
unzip master.zip  -d  ~/sourcecode/
[ruoze@hadoop001 sourcecode]$ cd hadoop-lzo-master/
[ruoze@hadoop001 hadoop-lzo-master]$ ll
total 68
-rw-rw-r--. 1 ruoze ruoze 35151 Mar  5  2021 COPYING
-rw-rw-r--. 1 ruoze ruoze 19760 Mar  5  2021 pom.xml
-rw-rw-r--. 1 ruoze ruoze 10179 Mar  5  2021 README.md
drwxrwxr-x. 2 ruoze ruoze    34 Mar  5  2021 scripts
drwxrwxr-x. 4 ruoze ruoze    28 Mar  5  2021 src
[ruoze@hadoop001 hadoop-lzo-master]$

Edit pom.xml so that the Hadoop version matches the one you use:

[ruoze@hadoop001 hadoop-lzo-master]$ vi pom.xml 

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>3.2.2</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
  </properties>

As root, install the build prerequisites:

yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
yum install -y git
3. Build hadoop-lzo

Switch back to the regular user, enter the hadoop-lzo-master directory, and run the build:

mvn clean package -Dmaven.test.skip=true

Copy hadoop-lzo-0.4.21-SNAPSHOT.jar from the target/ directory to $HADOOP_HOME/share/hadoop/common/:

cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
4. Configure Hadoop's core-site.xml and mapred-site.xml

Add the following to core-site.xml:

      <property>
         <name>io.compression.codecs</name>
         <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,             
            com.hadoop.compression.lzo.LzopCodec
         </value>
      </property>
      <property>
         <name>io.compression.codec.lzo.class</name>
         <value>com.hadoop.compression.lzo.LzoCodec</value> 
      </property>
# The key additions are the com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec codec classes.
# io.compression.codec.lzo.class must be set to LzoCodec, not LzopCodec; otherwise the compressed output will not support splitting.

Add the following to mapred-site.xml:

# Intermediate (map output) compression.
# Note: mapred.compress.map.output and mapred.map.output.compression.codec are the old property names;
# Hadoop 3 still accepts them, mapping them to mapreduce.map.output.compress and mapreduce.map.output.compress.codec.
<property>    
    <name>mapred.compress.map.output</name>    
    <value>true</value>    
</property>
<property>    
    <name>mapred.map.output.compression.codec</name>    
    <value>com.hadoop.compression.lzo.LzoCodec</value>    
</property>

# Final (job output) compression
<property>
   <name>mapreduce.output.fileoutputformat.compress</name>
   <value>true</value>
</property>

<property>
   <name>mapreduce.output.fileoutputformat.compress.codec</name>
   <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>	

If you are running a cluster, core-site.xml and mapred-site.xml (and the hadoop-lzo jar) must also be synchronized to every node before starting the cluster; one way to do this is shown in the sketch below.
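
A minimal sketch of syncing the jar and the two config files with scp; hadoop002 and hadoop003 are placeholder hostnames for your own nodes, and the layout is assumed to be identical on every machine:

for host in hadoop002 hadoop003; do
  scp $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar $host:$HADOOP_HOME/share/hadoop/common/
  scp $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml $host:$HADOOP_HOME/etc/hadoop/
done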

5. Test whether hadoop-3.2.2 supports LZO compression

I have a generated word-count dataset of 616 MB; after LZO compression it is 190 MB.

[ruoze@hadoop001 data]$ lzop -v makedatawordcount.txt 
compressing makedatawordcount.txt into makedatawordcount.txt.lzo
[ruoze@hadoop001 data]$ ll -h
-rw-r--r--. 1 ruoze ruoze 616M Jan 27 10:39 makedatawordcount.txt
-rw-r--r--. 1 ruoze ruoze 190M Jan 27 10:39 makedatawordcount.txt.lzo

[ruoze@hadoop001 data]$ hdfs dfs -put makedatawordcount.txt.lzo /data/

190 MB is larger than one HDFS block (128 MB), which makes it possible to demonstrate later how Hadoop splits an LZO-compressed file.
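
The data file itself is not included in this post. If you want to reproduce something similar, here is a rough, hypothetical sketch (the file name and word list are taken from the wordcount output shown later; the size and line count will only be approximate):

# generate ~13 million lines of repeated words, then compress with lzop
seq 1 13100000 | awk '{print "Apple China WORD beijing bigdata china shanghai word world"}' > makedatawordcount.txt
lzop -v makedatawordcount.txt

Then upload the .lzo file to HDFS with hdfs dfs -put as shown above.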
Test 1:
Before the LZO entries were added to core-site.xml and mapred-site.xml, the following command printed unreadable binary:

hdfs dfs -text /data/makedatawordcount.txt.lzo

Once the two configuration files were updated, the same command printed readable text, which shows that hadoop-3.2.2 supports LZO compression after it is configured.
Test 2:

[ruoze@hadoop001 hadoop]$ find ./ -name *example*.jar
./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-test-sources.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-sources.jar
[ruoze@hadoop001 hadoop]$ 

[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /data/makedatawordcount.txt.lzo /output

2022-01-27 13:25:58,656 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 13:25:58,973 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0001
2022-01-27 13:25:59,108 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 13:25:59,122 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:25:59,123 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:25:59,999 INFO mapreduce.JobSubmitter: number of splits:1
2022-01-27 13:26:00,528 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0001
2022-01-27 13:26:00,529 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 13:26:00,696 INFO conf.Configuration: resource-types.xml not found
2022-01-27 13:26:00,696 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 13:26:00,854 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0001
2022-01-27 13:26:00,920 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0001/
2022-01-27 13:26:00,921 INFO mapreduce.Job: Running job: job_1643252767325_0001
2022-01-27 13:26:06,005 INFO mapreduce.Job: Job job_1643252767325_0001 running in uber mode : false
2022-01-27 13:26:06,006 INFO mapreduce.Job:  map 0% reduce 0%
2022-01-27 13:26:22,142 INFO mapreduce.Job:  map 38% reduce 0%
2022-01-27 13:26:28,166 INFO mapreduce.Job:  map 56% reduce 0%
2022-01-27 13:26:33,234 INFO mapreduce.Job:  map 100% reduce 0%
2022-01-27 13:26:40,280 INFO mapreduce.Job:  map 100% reduce 100%
2022-01-27 13:26:40,289 INFO mapreduce.Job: Job job_1643252767325_0001 completed successfully
2022-01-27 13:26:40,341 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=46027310
		FILE: Number of bytes written=51351200
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=162500380
		HDFS: Number of bytes written=1639998
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=25572
		Total time spent by all reduces in occupied slots (ms)=3491
		Total time spent by all map tasks (ms)=25572
		Total time spent by all reduce tasks (ms)=3491
		Total vcore-milliseconds taken by all map tasks=25572
		Total vcore-milliseconds taken by all reduce tasks=3491
		Total megabyte-milliseconds taken by all map tasks=26185728
		Total megabyte-milliseconds taken by all reduce tasks=3574784
	Map-Reduce Framework
		Map input records=13100000
		Map output records=13100000
		Map output bytes=567665192
		Map output materialized bytes=4853171
		Input split bytes=117
		Combine input records=17818436
		Combine output records=5249877
		Reduce input groups=531441
		Reduce shuffle bytes=4853171
		Reduce input records=531441
		Reduce output records=531441
		Spilled Records=5781318
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=214
		CPU time spent (ms)=31570
		Physical memory (bytes) snapshot=763604992
		Virtual memory (bytes) snapshot=5603123200
		Total committed heap usage (bytes)=698351616
		Peak Map Physical memory (bytes)=505016320
		Peak Map Virtual memory (bytes)=2800398336
		Peak Reduce Physical memory (bytes)=258588672
		Peak Reduce Virtual memory (bytes)=2802724864
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=162500263
	File Output Format Counters 
		Bytes Written=1639998
[ruoze@hadoop001 hadoop]$ 


[ruoze@hadoop001 hadoop]$ hdfs dfs -text /output/*
2022-01-27 13:40:55,134 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:40:55,137 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:40:55,140 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:40:55,141 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
Apple	10663851
China	10664268
WORD	10674905
beijing	10667757
bigdata	10666953
china	10662630
shanghai	10666100
word	10670000
world	10663536

Test 2 shows that with an LZO-compressed word file as input, the wordcount example runs and produces results, confirming that LZO compression works once Hadoop is configured for it.

However, the job log above also shows number of splits:1, even though the LZO file is 190 MB, larger than one block, so the file is still not being split.
How do we make it splittable?
The procedure is as follows: build an index for the LZO file using the hadoop-lzo-0.4.21-SNAPSHOT.jar:

[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 1 items
-rw-r--r--   1 ruoze supergroup  198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
[ruoze@hadoop001 hadoop]$ 
[ruoze@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
> com.hadoop.compression.lzo.LzoIndexer /data/makedatawordcount.txt.lzo
2022-01-27 13:50:01,977 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:50:01,978 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:50:02,422 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /data/makedatawordcount.txt.lzo, size 0.18 GB...
2022-01-27 13:50:02,683 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.26 seconds (725.23 MB/s).  Index size is 19.23 KB.

[ruoze@hadoop001 hadoop]$ 
[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 2 items
-rw-r--r--   1 ruoze supergroup  198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
-rw-r--r--   1 ruoze supergroup      19696 2022-01-27 13:50 /data/makedatawordcount.txt.lzo.index

You can see that an index file, makedatawordcount.txt.lzo.index, has been created next to the original file.

Run wordcount again; this time there are two splits (number of splits:2):

[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
> wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
> /data/makedatawordcount.txt.lzo /output
2022-01-27 14:11:11,665 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 14:11:11,999 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0007
2022-01-27 14:11:12,529 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 14:11:13,450 INFO mapreduce.JobSubmitter: number of splits:2
2022-01-27 14:11:13,550 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0007
2022-01-27 14:11:13,551 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 14:11:13,645 INFO conf.Configuration: resource-types.xml not found
2022-01-27 14:11:13,646 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 14:11:13,687 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0007
2022-01-27 14:11:13,722 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0007/
2022-01-27 14:11:13,722 INFO mapreduce.Job: Running job: job_1643252767325_0007
2022-01-27 14:11:17,785 INFO mapreduce.Job: Job job_1643252767325_0007 running in uber mode : false
2022-01-27 14:11:17,785 INFO mapreduce.Job:  map 0% reduce 0%
2022-01-27 14:11:33,980 INFO mapreduce.Job:  map 66% reduce 0%
2022-01-27 14:11:40,066 INFO mapreduce.Job:  map 73% reduce 0%
2022-01-27 14:11:46,108 INFO mapreduce.Job:  map 81% reduce 0%
2022-01-27 14:11:48,119 INFO mapreduce.Job:  map 100% reduce 0%
2022-01-27 14:11:49,125 INFO mapreduce.Job:  map 100% reduce 100%
2022-01-27 14:11:51,144 INFO mapreduce.Job: Job job_1643252767325_0007 completed successfully
2022-01-27 14:11:51,196 INFO mapreduce.Job: Counters: 55
	File System Counters
		FILE: Number of bytes read=4992
		FILE: Number of bytes written=711952
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=198561903
		HDFS: Number of bytes written=137
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=54295
		Total time spent by all reduces in occupied slots (ms)=11774
		Total time spent by all map tasks (ms)=54295
		Total time spent by all reduce tasks (ms)=11774
		Total vcore-milliseconds taken by all map tasks=54295
		Total vcore-milliseconds taken by all reduce tasks=11774
		Total megabyte-milliseconds taken by all map tasks=55598080
		Total megabyte-milliseconds taken by all reduce tasks=12056576
	Map-Reduce Framework
		Map input records=16000000
		Map output records=96000000
		Map output bytes=1013322815
		Map output materialized bytes=260
		Input split bytes=234
		Combine input records=96000279
		Combine output records=297
		Reduce input groups=9
		Reduce shuffle bytes=260
		Reduce input records=18
		Reduce output records=9
		Spilled Records=432
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=676
		CPU time spent (ms)=51250
		Physical memory (bytes) snapshot=1249132544
		Virtual memory (bytes) snapshot=8401858560
		Total committed heap usage (bytes)=1059061760
		Peak Map Physical memory (bytes)=506130432
		Peak Map Virtual memory (bytes)=2801958912
		Peak Reduce Physical memory (bytes)=239939584
		Peak Reduce Virtual memory (bytes)=2801532928
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=198561669
	File Output Format Counters 
		Bytes Written=137
[ruoze@hadoop001 hadoop]$ 

Summary: the LZO file must be indexed with one of the following commands before it can be split:

hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
> com.hadoop.compression.lzo.DistributedLzoIndexer  /data/makedatawordcount.txt.lzo
or
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
> com.hadoop.compression.lzo.LzoIndexer  /data/makedatawordcount.txt.lzo

In addition, when running the job you must also pass -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat, for example:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
> wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
> /data/makedatawordcount.txt.lzo /output

Without this parameter the split count is still 1.

Overall summary: Hadoop's native libraries do not support LZO, and viewing an LZO file gives unreadable output. You have to build hadoop-lzo, put the resulting jar into Hadoop, and add the LZO settings to core-site.xml before LZO files are supported. Even then LZO files are not splittable by default; you must create an index for an LZO file before it can be split.
