1 Hadoop Support for LZO Compression
1.1 Install Maven
wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
tar -zxvf apache-maven-3.1.1-bin.tar.gz
vim apache-maven-3.1.1/conf/settings.xml
Add the Aliyun mirror:
<!-- Aliyun mirror -->
<mirror>
<id>alimaven</id>
<mirrorOf>central</mirrorOf>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
</mirror>
<!-- Central repository 1 -->
<mirror>
<id>repo1</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo1.maven.org/maven2/</url>
</mirror>
<!-- Central repository 2 -->
<mirror>
<id>repo2</id>
<mirrorOf>central</mirrorOf>
<name>Human Readable Name for this Mirror.</name>
<url>http://repo2.maven.org/maven2/</url>
</mirror>
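After extracting the tarball, Maven is usually put on the PATH before use. A minimal sketch; the install directory /opt/apache-maven-3.1.1 is an assumption, adjust it to wherever the archive was actually extracted:

```shell
# Assumed extraction directory; change to match your layout.
export MAVEN_HOME=/opt/apache-maven-3.1.1
export PATH="$MAVEN_HOME/bin:$PATH"
# mvn -version   # should then report Apache Maven 3.1.1
```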
1.2 Build LZO
Install the LZO-related build dependencies:
[root@JD ~]# yum install -y svn ncurses-devel
[root@JD ~]# yum install -y gcc gcc-c++ make cmake
[root@JD ~]# yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
[root@JD ~]# yum install -y lzo lzo-devel lzop autoconf automake cmake
1 Download the latest LZO release from the LZO website
2 Extract it: tar -zxvf lzo-2.10.tar.gz
3 mkdir -p /usr/local/lzo-2.10
4 cd lzo-2.10
5 Configure: ./configure --enable-shared --prefix=/usr/local/lzo-2.10
6 make && sudo make install
1.3 Build hadoop-lzo
Download the zip package from Twitter's hadoop-lzo repository, upload it to the server, and unzip it.
Enter the hadoop-lzo-master directory and run the commands below in order.
1 Edit pom.xml as follows:
<repositories>
<!-- add the Cloudera repository -->
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
</repositories>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!-- because the CDH build of Hadoop is in use -->
<hadoop.current.version>2.6.0-cdh5.15.1</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
2 Run the following commands:
export CFLAGS=-m64
export CXXFLAGS=-m64
export C_INCLUDE_PATH=/usr/local/lzo-2.10/include
export LIBRARY_PATH=/usr/local/lzo-2.10/lib
mvn clean package -Dmaven.test.skip=true
cd target/native/Linux-amd64-64/
tar -cBf - -C lib . | tar -xBvf - -C ~
cp ~/libgplcompression.* ${HADOOP_HOME}/lib/native/
cp ../../hadoop-lzo-0.4.21-SNAPSHOT.jar ${HADOOP_HOME}/share/hadoop/common/
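The `tar -cBf - ... | tar -xBvf - ...` pipe above is simply a way to copy a directory tree while preserving permissions and symlinks. A self-contained sketch of the same idiom, using throwaway directories instead of the real native-library path:

```shell
# Copy the contents of src/ into dst/ via a tar pipe (same idiom as above).
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/dst"
echo "native lib placeholder" > "$workdir/src/libgplcompression.so"
tar -cBf - -C "$workdir/src" . | tar -xBf - -C "$workdir/dst"
ls "$workdir/dst"
# → libgplcompression.so
```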
Go to ${HADOOP_HOME}/share/hadoop/common/ and distribute hadoop-lzo-0.4.21-SNAPSHOT.jar to the same directory on the other cluster nodes:
xsync hadoop-lzo-0.4.21-SNAPSHOT.jar
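`xsync` is not a standard tool; in this kind of setup it is typically a small rsync wrapper that copies a file to the same absolute path on every other node. A hypothetical sketch; the host names hadoop103/hadoop104 and the echo-instead-of-execute behavior are assumptions for illustration:

```shell
# Hypothetical xsync: rsync a file to the same absolute path on each peer node.
xsync() {
  local file=$1 dir abs
  dir=$(cd -P "$(dirname "$file")" && pwd)   # resolve the absolute directory
  abs=$dir/$(basename "$file")
  for host in hadoop103 hadoop104; do
    # `echo` is for illustration only; remove it to actually copy
    echo rsync -av "$abs" "$host:$dir/"
  done
}
```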
3 Edit core-site.xml
Add the following configuration to enable LZO compression:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Sync core-site.xml to the other cluster nodes.
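The core-site.xml change above registers the codecs; intermediate map output is often LZO-compressed as well. A hedged sketch for mapred-site.xml, assuming the same LzoCodec jar and native libraries are on every node's classpath (property names are the Hadoop 2.x ones; verify against your version):

```xml
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```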
1.4 Test LZO Compression on the Cluster
Install lzop
1. Download and extract
wget http://www.lzop.org/download/lzop-1.04.tar.gz
tar -zxvf lzop-1.04.tar.gz
2. Enter the extracted directory, then build and install
cd /opt/software/lzop-1.04
export C_INCLUDE_PATH=/usr/local/lzo-2.10/include/
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install
3. Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
4. Test lzop compression
lzop nginx.log
If a file with the .lzo suffix (nginx.log.lzo) appears, LZO compression works.
1.5 Create an LZO Index
1) Create an index for the LZO file. An LZO-compressed file is splittable only through its index, so the index must be created manually; without it, the LZO file gets only a single split.
hadoop jar ${HADOOP_HOME}/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer nginx.log.lzo
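The effect of the index on splitting can be illustrated with quick arithmetic; the 1 GB file and 128 MB HDFS block size below are just example numbers:

```shell
# Without an index, a .lzo file yields a single map split;
# with an index, it splits roughly on HDFS block boundaries.
filesize=$((1024 * 1024 * 1024))     # 1 GB example file
blocksize=$((128 * 1024 * 1024))     # 128 MB HDFS block
with_index=$(( (filesize + blocksize - 1) / blocksize ))
echo "splits with index: $with_index, without index: 1"
# → splits with index: 8, without index: 1
```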
2) Test
(1) Upload nginx.log.lzo to the /test directory on the cluster
hdfs dfs -mkdir /test
hdfs dfs -put nginx.log.lzo /test
(2) Run the wordcount program
hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /test /output1
(3) Build an index for the uploaded LZO file
hadoop jar /export/servers/hadoop-2.6.0-cdh5.14.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /test/nginx.log.lzo
(4) Run the wordcount program again (note the new output directory; MapReduce fails if the output path already exists)
hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /test /output2
An alternative walkthrough for Apache Hadoop 2.7.2, building the hadoop-lzo-0.4.20.jar:
1) Download the hadoop-lzo project:
https://github.com/twitter/hadoop-lzo/archive/master.zip
2) The downloaded file, hadoop-lzo-master, is a zip archive; unzip it, then build it with Maven to produce hadoop-lzo-0.4.20.jar.
3) Put the built hadoop-lzo-0.4.20.jar into hadoop-2.7.2/share/hadoop/common/
[johnny@hadoop102 common]$ sudo yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
[johnny@hadoop102 common]$ pwd
/opt/module/hadoop-2.7.2/share/hadoop/common
[johnny@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar
4) Sync hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104
[johnny@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar
5) Add configuration to core-site.xml to enable LZO compression
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</configuration>
6) Sync core-site.xml to hadoop103 and hadoop104
[johnny@hadoop102 hadoop]$ xsync core-site.xml
2 Hadoop Support for Snappy Compression
Configure Snappy compression for Hadoop 3.1.3 + HBase 2.2.4
3 TEZ Support for the CDH Version
Disable (comment out) the modules shown commented out below, otherwise the build does not pass:
<modules>
<module>hadoop-shim</module>
<module>tez-api</module>
<module>tez-common</module>
<module>tez-runtime-library</module>
<module>tez-runtime-internals</module>
<module>tez-mapreduce</module>
<module>tez-examples</module>
<!-- <module>tez-tests</module>-->
<module>tez-dag</module>
<!--<module>tez-ext-service-tests</module>
<module>tez-ui</module>
<module>tez-ui2</module>-->
<module>tez-plugins</module>
<module>tez-tools</module>
<module>hadoop-shim-impls</module>
<module>tez-dist</module>
<module>docs</module>
</modules>