Enabling LZO compression with split support in Hadoop

I. Enabling LZO compression

  1. Install the LZO native library
[root@bigdata ~]#  yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool
[root@bigdata ~]# wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
[root@bigdata ~]#  tar -zxvf lzo-2.06.tar.gz
[root@bigdata ~]#  cd lzo-2.06
[root@bigdata lzo-2.06]#  export CFLAGS=-m64
[root@bigdata lzo-2.06]#  ./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
[root@bigdata lzo-2.06]# make && sudo make install
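If the build succeeded, the headers and shared libraries now live under the chosen prefix. A quick sanity check (the exact file list varies with the LZO version):
[root@bigdata lzo-2.06]# ls /usr/local/hadoop/lzo/
include  lib  share
[root@bigdata lzo-2.06]# ls /usr/local/hadoop/lzo/lib/
liblzo2.a  liblzo2.la  liblzo2.so  liblzo2.so.2  liblzo2.so.2.0.0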
  2. Install hadoop-lzo
[root@bigdata ~]#  wget https://github.com/twitter/hadoop-lzo/archive/master.zip
The downloaded file is named master; it is a zip archive, so unpack it:
[root@bigdata ~]# yum install unzip
[root@bigdata ~]#  unzip master
The unpacked directory is named hadoop-lzo-master.
Alternatively, if you have git installed, you can clone the repository instead:
[root@localhost ~]#  git clone https://github.com/twitter/hadoop-lzo.git


Edit the pom file, setting hadoop.current.version to your own Hadoop version:
[root@bigdata hadoop-lzo-master]# vi pom.xml 
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.6.0-cdh5.15.1</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

[root@bigdata hadoop-lzo-master]# export CFLAGS=-m64
[root@bigdata hadoop-lzo-master]# export CXXFLAGS=-m64
[root@bigdata hadoop-lzo-master]# export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
[root@bigdata hadoop-lzo-master]# export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
[root@bigdata hadoop-lzo-master]# mvn clean package -Dmaven.test.skip=true
[root@bigdata hadoop-lzo-master]# cd target/native/Linux-amd64-64
[root@bigdata Linux-amd64-64]# tar -cBf - -C lib . | tar -xBvf - -C ~
[root@bigdata Linux-amd64-64]# cp ~/libgplcompression* $HADOOP_HOME/lib/native/
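The native bindings should now sit next to Hadoop's other native libraries. A quick check (these file names are what the hadoop-lzo native build produces; your list may differ slightly):
[root@bigdata Linux-amd64-64]# ls $HADOOP_HOME/lib/native/ | grep gpl
libgplcompression.a
libgplcompression.la
libgplcompression.so
libgplcompression.so.0
libgplcompression.so.0.0.0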

# The version of the generated jar may differ from the one shown here; use the version you actually built
[root@bigdata target]# ll
total 428
drwxr-xr-x 2 root root   4096 Sep 24 16:20 antrun
drwxr-xr-x 4 root root   4096 Sep 24 16:21 apidocs
drwxr-xr-x 5 root root     73 Sep 24 16:20 classes
drwxr-xr-x 3 root root     24 Sep 24 16:20 generated-sources
-rw-r--r-- 1 root root 188771 Sep 24 16:21 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-r--r-- 1 root root 180851 Sep 24 16:21 hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
-rw-r--r-- 1 root root  52041 Sep 24 16:21 hadoop-lzo-0.4.21-SNAPSHOT-sources.jar
drwxr-xr-x 2 root root     69 Sep 24 16:21 javadoc-bundle-options
drwxr-xr-x 2 root root     27 Sep 24 16:21 maven-archiver
drwxr-xr-x 3 root root     27 Sep 24 16:20 native
drwxr-xr-x 3 root root     17 Sep 24 16:20 test-classes
[root@bigdata hadoop-lzo-master]# cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
[root@bigdata hadoop-lzo-master]# cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
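Optionally confirm the jar is visible in both places. Note that on a multi-node cluster, the jar and the native libraries must be copied to every node, not just this one:
[root@bigdata hadoop-lzo-master]# ls $HADOOP_HOME/share/hadoop/common/ | grep lzo
hadoop-lzo-0.4.21-SNAPSHOT.jar
[root@bigdata hadoop-lzo-master]# ls $HADOOP_HOME/share/hadoop/mapreduce/lib/ | grep lzo
hadoop-lzo-0.4.21-SNAPSHOT.jar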
  3. Edit the Hadoop configuration files
# hadoop-env.sh
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib

# core-site.xml
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
           org.apache.hadoop.io.compress.DefaultCodec,
           com.hadoop.compression.lzo.LzoCodec,
           com.hadoop.compression.lzo.LzopCodec,
           org.apache.hadoop.io.compress.BZip2Codec
        </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

# mapred-site.xml
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>

  4. Run a MapReduce job
# Compress a file into lzo format
[root@bigdata usr]# yum install lzop
[hadoop@bigdata hadoop]$ lzop 300M.file
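The .lzo file then has to be uploaded to HDFS before the job can read it. hadoop fs -text is also a convenient way to confirm the codec configuration works, since it picks the codec from the .lzo extension (the paths below are the ones the wordcount uses):
[hadoop@bigdata hadoop]$ hdfs dfs -mkdir -p /data
[hadoop@bigdata hadoop]$ hdfs dfs -put 300M.file.lzo /data/
# if io.compression.codecs is set up correctly, this prints decompressed text
[hadoop@bigdata hadoop]$ hadoop fs -text /data/300M.file.lzo | head -n 1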


[hadoop@bigdata mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar wordcount /data/300M.file.lzo /tmp/lzo8
19/09/24 19:35:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/24 19:35:07 INFO input.FileInputFormat: Total input paths to process : 1
19/09/24 19:35:07 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/09/24 19:35:07 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5dbdddb8cfb544e58b4e0b9664b9d1b66657faf5]
19/09/24 19:35:07 INFO mapreduce.JobSubmitter: number of splits:1
19/09/24 19:35:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569313774824_0009
19/09/24 19:35:08 INFO impl.YarnClientImpl: Submitted application application_1569313774824_0009
19/09/24 19:35:08 INFO mapreduce.Job: The url to track the job: http://bigdata:8088/proxy/application_1569313774824_0009/
19/09/24 19:35:08 INFO mapreduce.Job: Running job: job_1569313774824_0009
19/09/24 19:35:13 INFO mapreduce.Job: Job job_1569313774824_0009 running in uber mode : false
19/09/24 19:35:13 INFO mapreduce.Job:  map 0% reduce 0%

As you can see, a 300 MB input should normally be read as about three splits (one per 128 MB HDFS block), yet the job planned only one: an LZO file is not splittable on its own. We need to create an index for the file to make it splittable.

II. Creating an index so that LZO input files can be split

  1. Method one: DistributedLzoIndexer, which builds the index as a MapReduce job (a listing of the generated index file follows the job output below)
[hadoop@bigdata hadoop]$ hadoop jar share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer [filepath]
  2. Method two: LzoIndexer, which builds the same index locally, inside the client process
[hadoop@bigdata hadoop]$ hadoop jar share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer [filepath]
  3. Even now the job will not split the file, because the default TextInputFormat ignores the index. Additionally pass -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat:
[hadoop@bigdata mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /data/lzo/lzo1.lzo /data/lzoout

19/09/25 19:19:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/25 19:19:48 INFO input.FileInputFormat: Total input paths to process : 1
19/09/25 19:19:50 INFO mapreduce.JobSubmitter: number of splits:33
19/09/25 19:19:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569410206197_0001
19/09/25 19:19:51 INFO impl.YarnClientImpl: Submitted application application_1569410206197_0001
19/09/25 19:19:51 INFO mapreduce.Job: The url to track the job: http://bigdata:8088/proxy/application_1569410206197_0001/
19/09/25 19:19:51 INFO mapreduce.Job: Running job: job_1569410206197_0001
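The split count jumps from 1 to 33 because the indexer left a small companion file next to the original, which LzoTextInputFormat uses to align splits with LZO block boundaries. You can see it with a plain listing (illustrative; your path and file sizes will differ):
[hadoop@bigdata hadoop]$ hdfs dfs -ls /data/lzo
# expect both lzo1.lzo and the generated lzo1.lzo.index to be listed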

ok~~
