I. Compiling the hadoop-3.2.2 source code
1. Reference (official build instructions):
https://github.com/apache/hadoop/blob/trunk/BUILDING.txt
2. Build environment
Virtual machine: VMware Workstation 15
Linux distribution: CentOS 7
JDK version: jdk1.8
cmake version: 3.20.2
Hadoop version: 3.2.2
Maven version: 3.8.4
Protobuf version: 2.5.0
findbugs version (optional): findbugs-3.0.1
apache-ant version (optional): apache-ant-1.10.12
Download links:
Hadoop:
https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz
cmake:
https://cmake.org/files/v3.20/cmake-3.20.2.tar.gz
Protobuf:
https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
findbugs:
http://prdownloads.sourceforge.net/findbugs/findbugs-3.0.1.tar.gz?download
apache-ant:
https://dlcdn.apache.org//ant/binaries/apache-ant-1.10.12-bin.tar.gz
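If you want to fetch everything up front, here is a minimal download sketch using exactly the URLs above (run it in any scratch directory; wget availability is assumed):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz
wget https://cmake.org/files/v3.20/cmake-3.20.2.tar.gz
wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
wget "http://prdownloads.sourceforge.net/findbugs/findbugs-3.0.1.tar.gz?download"
wget https://dlcdn.apache.org//ant/binaries/apache-ant-1.10.12-bin.tar.gz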
2.1 Install dependencies
Run the following command as root to install the build dependencies:
yum install -y gcc gcc-c++ make autoconf automake libtool curl lzo lzo-devel lzop zlib zlib-devel openssl openssl-devel ncurses-devel snappy snappy-devel bzip2 bzip2-devel libXtst
Java and Maven: already installed earlier; Java was installed under the root user and Maven under the regular user.
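Before building, it is worth a quick sanity check that both tools resolve for the user you will build with:
java -version    # should report 1.8.x
mvn -version     # should report 3.8.4 and the JDK it runs on
echo $JAVA_HOME  # must point at the JDK install directory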
Install protobuf
tar -zxvf protobuf-2.5.0.tar.gz -C /opt/
cd /opt/protobuf-2.5.0/
./configure
make && make check && make install
ldconfig
Verify the installation: protoc --version should print libprotoc 2.5.0.
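If protoc instead fails to load libprotoc.so, the usual fix on CentOS 7 is to register the install prefix with the dynamic linker (assuming the default /usr/local prefix used above); as root:
echo "/usr/local/lib" > /etc/ld.so.conf.d/protobuf.conf
ldconfig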
Install cmake
tar -zxvf cmake-3.20.2.tar.gz -C /opt/
cd /opt/cmake-3.20.2/
./configure
make && make install
ldconfig
Verify the installation: cmake --version
Install findbugs
tar -zxvf findbugs-3.0.1.tar.gz\?download -C /opt/
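The odd \?download suffix is just wget keeping the URL's query string in the file name; to avoid it, you can download with an explicit output name instead, for example:
wget -O findbugs-3.0.1.tar.gz "http://prdownloads.sourceforge.net/findbugs/findbugs-3.0.1.tar.gz?download"
tar -zxvf findbugs-3.0.1.tar.gz -C /opt/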
Install apache-ant
tar -zxvf apache-ant-1.10.12-bin.tar.gz -C /opt/
Configure the environment variables in /etc/profile:
export PROTOBUF_HOME=/opt/protobuf-2.5.0
export ANT_HOME=/opt/apache-ant-1.10.12
export CMAKE_HOME=/opt/cmake-3.20.2
export FIND_BUGS_HOME=/opt/findbugs-3.0.1
export PATH=$PROTOBUF_HOME:$CMAKE_HOME/bin:$ANT_HOME/bin:$FIND_BUGS_HOME/bin:$PATH
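Reload the profile and spot-check that the tools resolve before starting the build:
source /etc/profile
protoc --version   # libprotoc 2.5.0
cmake --version    # cmake version 3.20.2
ant -version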
3. Build hadoop-3.2.2
Switch to the regular user, unpack the source, enter the source directory, and run the build:
mkdir -p ~/sourcecode
tar -zxvf hadoop-3.2.2-src.tar.gz -C ~/sourcecode/
cd ~/sourcecode/hadoop-3.2.2-src/
mvn clean package -DskipTests -Pdist,native -Dtar
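The full native build is memory-hungry. If Maven fails with an OutOfMemoryError, raise its heap before retrying (the values below follow the suggestion in Hadoop's BUILDING.txt; tune them to your machine):
export MAVEN_OPTS="-Xms256m -Xmx1536m"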
Build result:
After a successful build, you will find hadoop-3.2.2.tar.gz under hadoop-dist/target/.
How to deploy hadoop-3.2.2.tar.gz is not covered here.
II. Adding LZO compression support
Hadoop does not support LZO compression out of the box. Running hadoop checknative confirms that lzo is not in the list:
[ruoze@hadoop001 hadoop]$ hadoop checknative
2022-01-27 13:41:59,618 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:41:59,620 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-01-27 13:41:59,623 WARN zstd.ZStandardCompressor: Error loading zstandard native libraries: java.lang.InternalError: Cannot load libzstd.so.1 (libzstd.so.1: cannot open shared object file: No such file or directory)!
2022-01-27 13:41:59,629 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
2022-01-27 13:41:59,684 INFO nativeio.NativeIO: The native code was built without PMDK support.
Native library checking:
hadoop: true /home/ruoze/app/hadoop-3.2.2/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
zstd : false
snappy: true /lib64/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L: false libhadoop was built without ISA-L support
PMDK: false The native code was built without PMDK support.
1. Install LZO and LZOP on Linux
Both are already installed here; install them yourself if they are missing:
[ruoze@hadoop001 data]$ which lzop
/bin/lzop
LZO compress: lzop -v file
LZO decompress: lzop -dv file
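If which lzop comes back empty, CentOS 7 usually gets lzop from EPEL (whether your repos provide it is an assumption about your environment); as root:
yum install -y epel-release
yum install -y lzo lzop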
2. Get and unpack the hadoop-lzo source
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
unzip master.zip -d ~/sourcecode/
[ruoze@hadoop001 sourcecode]$ cd hadoop-lzo-master/
[ruoze@hadoop001 hadoop-lzo-master]$ ll
total 68
-rw-rw-r--. 1 ruoze ruoze 35151 Mar 5 2021 COPYING
-rw-rw-r--. 1 ruoze ruoze 19760 Mar 5 2021 pom.xml
-rw-rw-r--. 1 ruoze ruoze 10179 Mar 5 2021 README.md
drwxrwxr-x. 2 ruoze ruoze 34 Mar 5 2021 scripts
drwxrwxr-x. 4 ruoze ruoze 28 Mar 5 2021 src
[ruoze@hadoop001 hadoop-lzo-master]$
Edit pom.xml so that the Hadoop version matches yours:
[ruoze@hadoop001 hadoop-lzo-master]$ vi pom.xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>3.2.2</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
As root, install the hadoop-lzo build dependencies:
yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
yum install -y git
3. Build hadoop-lzo
Switch back to the regular user, enter the hadoop-lzo-master directory, and run the build:
mvn clean package -Dmaven.test.skip=true
Copy hadoop-lzo-0.4.21-SNAPSHOT.jar from the target directory into $HADOOP_HOME/share/hadoop/common/:
cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
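A quick way to confirm Hadoop will actually pick the jar up (--glob expands the wildcard classpath entries so grep can see individual jars):
ls $HADOOP_HOME/share/hadoop/common/ | grep lzo
hadoop classpath --glob | tr ':' '\n' | grep -i lzo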
4. Configure Hadoop's core-site.xml and mapred-site.xml
Add the following to core-site.xml:
<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
# The key additions are the com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec codec classes.
# io.compression.codec.lzo.class must be set to LzoCodec, not LzopCodec; otherwise the compressed files will not support splitting.
Add the following to mapred-site.xml:
# Compression of the intermediate (map output) stage
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
# (the old names mapred.compress.map.output and mapred.map.output.compression.codec still work in Hadoop 3 but are deprecated)
# Compression of the final (job output) stage
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
If you run a cluster, sync core-site.xml and mapred-site.xml to every node as well, then start the cluster.
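A minimal sync sketch, assuming two more nodes named hadoop002 and hadoop003 (hypothetical hostnames) with the same $HADOOP_HOME layout; note that the hadoop-lzo jar must be shipped to every node too:
for host in hadoop002 hadoop003; do
    scp $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml $host:$HADOOP_HOME/etc/hadoop/
    scp $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar $host:$HADOOP_HOME/share/hadoop/common/
done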
5. Test whether hadoop-3.2.2 supports LZO compression
Here is a generated word dataset: 616 MB uncompressed, 190 MB after LZO compression.
[ruoze@hadoop001 data]$ lzop -v makedatawordcount.txt
compressing makedatawordcount.txt into makedatawordcount.txt.lzo
[ruoze@hadoop001 data]$ ll -h
-rw-r--r--. 1 ruoze ruoze 616M Jan 27 10:39 makedatawordcount.txt
-rw-r--r--. 1 ruoze ruoze 190M Jan 27 10:39 makedatawordcount.txt.lzo
[ruoze@hadoop001 data]$ hdfs dfs -put makedatawordcount.txt.lzo /data/
190 MB is larger than one 128 MB block, which lets us test how Hadoop splits LZO-compressed files later on.
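You can confirm that HDFS really stored the file as two blocks (exact fsck output wording varies a little between versions):
hdfs fsck /data/makedatawordcount.txt.lzo -files -blocks
# expect "Total blocks (validated): 2" with the default 128 MB block size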
Test 1:
Before the LZO settings were added to core-site.xml and mapred-site.xml, the following command printed garbage:
hdfs dfs -text /data/makedatawordcount.txt.lzo
After the two files are configured, the same command prints readable text, which shows that hadoop-3.2.2 supports LZO compression once configured.
Test 2:
[ruoze@hadoop001 hadoop]$ find ./ -name *example*.jar
./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-test-sources.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-sources.jar
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /data/makedatawordcount.txt.lzo /output
2022-01-27 13:25:58,656 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 13:25:58,973 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0001
2022-01-27 13:25:59,108 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 13:25:59,122 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:25:59,123 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:25:59,999 INFO mapreduce.JobSubmitter: number of splits:1
2022-01-27 13:26:00,528 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0001
2022-01-27 13:26:00,529 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 13:26:00,696 INFO conf.Configuration: resource-types.xml not found
2022-01-27 13:26:00,696 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 13:26:00,854 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0001
2022-01-27 13:26:00,920 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0001/
2022-01-27 13:26:00,921 INFO mapreduce.Job: Running job: job_1643252767325_0001
2022-01-27 13:26:06,005 INFO mapreduce.Job: Job job_1643252767325_0001 running in uber mode : false
2022-01-27 13:26:06,006 INFO mapreduce.Job: map 0% reduce 0%
2022-01-27 13:26:22,142 INFO mapreduce.Job: map 38% reduce 0%
2022-01-27 13:26:28,166 INFO mapreduce.Job: map 56% reduce 0%
2022-01-27 13:26:33,234 INFO mapreduce.Job: map 100% reduce 0%
2022-01-27 13:26:40,280 INFO mapreduce.Job: map 100% reduce 100%
2022-01-27 13:26:40,289 INFO mapreduce.Job: Job job_1643252767325_0001 completed successfully
2022-01-27 13:26:40,341 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=46027310
FILE: Number of bytes written=51351200
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=162500380
HDFS: Number of bytes written=1639998
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=25572
Total time spent by all reduces in occupied slots (ms)=3491
Total time spent by all map tasks (ms)=25572
Total time spent by all reduce tasks (ms)=3491
Total vcore-milliseconds taken by all map tasks=25572
Total vcore-milliseconds taken by all reduce tasks=3491
Total megabyte-milliseconds taken by all map tasks=26185728
Total megabyte-milliseconds taken by all reduce tasks=3574784
Map-Reduce Framework
Map input records=13100000
Map output records=13100000
Map output bytes=567665192
Map output materialized bytes=4853171
Input split bytes=117
Combine input records=17818436
Combine output records=5249877
Reduce input groups=531441
Reduce shuffle bytes=4853171
Reduce input records=531441
Reduce output records=531441
Spilled Records=5781318
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=214
CPU time spent (ms)=31570
Physical memory (bytes) snapshot=763604992
Virtual memory (bytes) snapshot=5603123200
Total committed heap usage (bytes)=698351616
Peak Map Physical memory (bytes)=505016320
Peak Map Virtual memory (bytes)=2800398336
Peak Reduce Physical memory (bytes)=258588672
Peak Reduce Virtual memory (bytes)=2802724864
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=162500263
File Output Format Counters
Bytes Written=1639998
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hdfs dfs -text /output/*
2022-01-27 13:40:55,134 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:40:55,137 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:40:55,140 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:40:55,141 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
Apple 10663851
China 10664268
WORD 10674905
beijing 10667757
bigdata 10666953
china 10662630
shanghai 10666100
word 10670000
world 10663536
Test 2 shows that with an LZO-compressed file of words as input, the wordcount example ran and produced a correct result, confirming that Hadoop can read LZO input once configured.
However, the run above also shows that although the LZO file is 190 MB, more than one block, there was still only one split (number of splits:1), so splitting is not supported yet.
So how do we enable splitting?
Proceed as follows:
Build an index for the LZO file with the hadoop-lzo-0.4.21-SNAPSHOT.jar:
[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 1 items
-rw-r--r-- 1 ruoze supergroup 198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
> com.hadoop.compression.lzo.LzoIndexer /data/makedatawordcount.txt.lzo
2022-01-27 13:50:01,977 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:50:01,978 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:50:02,422 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /data/makedatawordcount.txt.lzo, size 0.18 GB...
2022-01-27 13:50:02,683 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.26 seconds (725.23 MB/s). Index size is 19.23 KB.
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 2 items
-rw-r--r-- 1 ruoze supergroup 198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
-rw-r--r-- 1 ruoze supergroup 19696 2022-01-27 13:50 /data/makedatawordcount.txt.lzo.index
You can see that an index file, makedatawordcount.txt.lzo.index, was generated in the same directory.
Run wordcount again and the number of splits is now 2 (number of splits:2):
[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
> wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
> /data/makedatawordcount.txt.lzo /output
2022-01-27 14:11:11,665 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 14:11:11,999 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0007
2022-01-27 14:11:12,529 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 14:11:13,450 INFO mapreduce.JobSubmitter: number of splits:2
2022-01-27 14:11:13,550 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0007
2022-01-27 14:11:13,551 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 14:11:13,645 INFO conf.Configuration: resource-types.xml not found
2022-01-27 14:11:13,646 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 14:11:13,687 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0007
2022-01-27 14:11:13,722 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0007/
2022-01-27 14:11:13,722 INFO mapreduce.Job: Running job: job_1643252767325_0007
2022-01-27 14:11:17,785 INFO mapreduce.Job: Job job_1643252767325_0007 running in uber mode : false
2022-01-27 14:11:17,785 INFO mapreduce.Job: map 0% reduce 0%
2022-01-27 14:11:33,980 INFO mapreduce.Job: map 66% reduce 0%
2022-01-27 14:11:40,066 INFO mapreduce.Job: map 73% reduce 0%
2022-01-27 14:11:46,108 INFO mapreduce.Job: map 81% reduce 0%
2022-01-27 14:11:48,119 INFO mapreduce.Job: map 100% reduce 0%
2022-01-27 14:11:49,125 INFO mapreduce.Job: map 100% reduce 100%
2022-01-27 14:11:51,144 INFO mapreduce.Job: Job job_1643252767325_0007 completed successfully
2022-01-27 14:11:51,196 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=4992
FILE: Number of bytes written=711952
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=198561903
HDFS: Number of bytes written=137
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Killed map tasks=1
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=54295
Total time spent by all reduces in occupied slots (ms)=11774
Total time spent by all map tasks (ms)=54295
Total time spent by all reduce tasks (ms)=11774
Total vcore-milliseconds taken by all map tasks=54295
Total vcore-milliseconds taken by all reduce tasks=11774
Total megabyte-milliseconds taken by all map tasks=55598080
Total megabyte-milliseconds taken by all reduce tasks=12056576
Map-Reduce Framework
Map input records=16000000
Map output records=96000000
Map output bytes=1013322815
Map output materialized bytes=260
Input split bytes=234
Combine input records=96000279
Combine output records=297
Reduce input groups=9
Reduce shuffle bytes=260
Reduce input records=18
Reduce output records=9
Spilled Records=432
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=676
CPU time spent (ms)=51250
Physical memory (bytes) snapshot=1249132544
Virtual memory (bytes) snapshot=8401858560
Total committed heap usage (bytes)=1059061760
Peak Map Physical memory (bytes)=506130432
Peak Map Virtual memory (bytes)=2801958912
Peak Reduce Physical memory (bytes)=239939584
Peak Reduce Virtual memory (bytes)=2801532928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=198561669
File Output Format Counters
Bytes Written=137
[ruoze@hadoop001 hadoop]$
Summary: the LZO file must be indexed with one of the following commands before it can be split:
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /data/makedatawordcount.txt.lzo
or (DistributedLzoIndexer submits the indexing as a MapReduce job, while LzoIndexer indexes the file locally in a single process):
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/makedatawordcount.txt.lzo
In addition, when running a job you must pass the parameter -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat, for example:
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
    wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
    /data/makedatawordcount.txt.lzo /output
Without this parameter the number of splits stays at 1.
Summary: Hadoop's native libraries do not support LZO compression, and viewing an LZO file just shows garbage. You need to build hadoop-lzo, put the resulting jar into Hadoop, and add the LZO settings to core-site.xml before LZO files are supported. Even then, LZO files are not splittable by default; an index must be created for each LZO file before it can be split.