为何要使用压缩,压缩可以是文件的大小减小很多,节省空间;另外压缩后的文件在传输时更节省带宽。
所需软件:
1)lzo
2)hadoop-lzo
3)maven
安装编译:
1)lzo
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar zxvf lzo-2.06.tar.gz
export CFLAGS=-m64
./configure -enable-shared -prefix=/opt/compress/lzo-2.06
make && make install
2)maven(略)
3)hadoop-lzo
修改pom文件
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.3.0</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.3.0</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
export CFLAGS=-m64
export CXXFLAGS=-m64
export C_INCLUDE_PATH=/opt/modules/lzo/include
export LIBRARY_PATH=/opt/modules/lzo/lib
/opt/modules/apache-maven-3.2.5/bin/mvn clean package -Dmaven.test.skip=true
cd target/native/Linux-amd64-64
tar -cBf - -C lib . | tar -xBvf - -C ~
mv ~/libgplcompression* $HADOOP_HOME/lib/native/
cp target/hadoop-lzo-0.4.18-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
4)最终每台机器上要有【在$HADOOP_HOME/lib/native/下】
① 动态库文件
libgplcompression.a
libgplcompression.la
libgplcompression.so -> libgplcompression.so.0.0.0
libgplcompression.so.0 -> libgplcompression.so.0.0.0
libgplcompression.so.0.0.0
② 动态库文件需要头文件等,配置压缩也需要用到lib文件,故编译生成的压缩文件也需要
include
lib
share
lib中
liblzo2.a
liblzo2.la
liblzo2.so -> liblzo2.so.2.0.0
liblzo2.so.2 -> liblzo2.so.2.0.0
liblzo2.so.2.0.0
5)配置压缩
hadoop-env.sh
export LD_LIBRARY_PATH=/opt/modules/lzo/lib
core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
mapred-site.xml
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/opt/modules/lzo/lib</value>
</property>
6)hadoop压缩验证
上传压缩文件到hdfs,运行单词计数程序
15/11/06 16:53:39 INFO client.RMProxy: Connecting to ResourceManager at dev138/192.168.3.138:8032
15/11/06 16:53:40 INFO input.FileInputFormat: Total input paths to process : 1
15/11/06 16:53:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
15/11/06 16:53:40 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 123cbfa7726e887899295cd459acc6937d6f008f]
15/11/06 16:53:40 INFO mapreduce.JobSubmitter: number of splits:1
15/11/06 16:53:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446798050907_0002
15/11/06 16:53:41 INFO impl.YarnClientImpl: Submitted application application_1446798050907_0002
15/11/06 16:53:41 INFO mapreduce.Job: The url to track the job: http://dev138:8088/proxy/application_1446798050907_0002/
15/11/06 16:53:41 INFO mapreduce.Job: Running job: job_1446798050907_0002
15/11/06 16:53:48 INFO mapreduce.Job: Job job_1446798050907_0002 running in uber mode : false
15/11/06 16:53:48 INFO mapreduce.Job: map 0% reduce 0%
15/11/06 16:53:56 INFO mapreduce.Job: map 100% reduce 0%
15/11/06 16:54:05 INFO mapreduce.Job: map 100% reduce 100%
15/11/06 16:54:05 INFO mapreduce.Job: Job job_1446798050907_0002 completed successfully
15/11/06 16:54:05 INFO mapreduce.Job: Counters: 49
15/11/06 16:53:40 INFO input.FileInputFormat: Total input paths to process : 1
15/11/06 16:53:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
15/11/06 16:53:40 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 123cbfa7726e887899295cd459acc6937d6f008f]
15/11/06 16:53:40 INFO mapreduce.JobSubmitter: number of splits:1
15/11/06 16:53:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1446798050907_0002
15/11/06 16:53:41 INFO impl.YarnClientImpl: Submitted application application_1446798050907_0002
15/11/06 16:53:41 INFO mapreduce.Job: The url to track the job: http://dev138:8088/proxy/application_1446798050907_0002/
15/11/06 16:53:41 INFO mapreduce.Job: Running job: job_1446798050907_0002
15/11/06 16:53:48 INFO mapreduce.Job: Job job_1446798050907_0002 running in uber mode : false
15/11/06 16:53:48 INFO mapreduce.Job: map 0% reduce 0%
15/11/06 16:53:56 INFO mapreduce.Job: map 100% reduce 0%
15/11/06 16:54:05 INFO mapreduce.Job: map 100% reduce 100%
15/11/06 16:54:05 INFO mapreduce.Job: Job job_1446798050907_0002 completed successfully
15/11/06 16:54:05 INFO mapreduce.Job: Counters: 49