LZO test
The previous article covered Hadoop LZO configuration; this one tests LZO in practice.
Prepare the raw data
1.4G before compression, 213M after lzop compression:
[root@spark001 hadoop]# du -sh *
1.4G baidu.log
[root@spark001 hadoop]# lzop baidu.log
[root@spark001 hadoop]# du -sh *
1.4G baidu.log
213M baidu.log.lzo
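As a quick sanity check on the numbers above (taking 1.4G as roughly 1433.6M), the file compresses to about 15% of its original size:

```python
# Rough compression ratio from the du output above.
original_mb = 1.4 * 1024    # 1.4G baidu.log, in MB
compressed_mb = 213         # baidu.log.lzo, in MB

ratio = compressed_mb / original_mb
print(f"compressed to {ratio:.1%} of original size")  # about 15%
```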
Upload to HDFS
[root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M
[root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M/
Plain LZO is not splittable
Run the ETL test:
[root@spark001 hadoop]# hadoop jar hadoop-train-1.0.jar com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo /user/hadoop/compress/log/200M/ /user/hadoop/compress/log/etl_lzo/200/
19/04/15 17:06:34 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200/
19/04/15 17:06:34 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
19/04/15 17:06:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/15 17:06:35 INFO input.FileInputFormat: Total input paths to process : 1
19/04/15 17:06:35 INFO mapreduce.JobSubmitter: number of splits:1 (only one split, which shows that an un-indexed .lzo file is not splittable)
Making LZO splittable
An LZO index must be created first.
[root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M_index
[root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M_index/
Create the index:
hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.13.1.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hadoop/compress/log/200M_index
A file with an .index suffix is generated in the same directory:
[root@spark001 hadoop]# hdfs dfs -ls /user/hadoop/compress/log/200M_index/
Found 2 items
-rw-r--r-- 3 root supergroup 222877272 2019-04-15 16:00 /user/hadoop/compress/log/200M_index/baidu.log.lzo
-rw-r--r-- 3 root supergroup 43384 2019-04-15 16:09 /user/hadoop/compress/log/200M_index/baidu.log.lzo.index
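The .index file is tiny because it only records byte offsets of the compressed LZO blocks. As far as I know, hadoop-lzo's LzoIndex stores them as big-endian 64-bit integers; a sketch of reading such a file (this format detail is my assumption, not something stated in the article, so verify it against your hadoop-lzo version) might look like:

```python
import struct

def read_lzo_index(data: bytes) -> list:
    """Parse an .lzo.index payload: a sequence of big-endian
    64-bit byte offsets, one per compressed LZO block.
    (Format assumed from hadoop-lzo's LzoIndex.)"""
    count = len(data) // 8
    return list(struct.unpack(">%dq" % count, data))

# Tiny fabricated example: three block offsets.
fake_index = struct.pack(">3q", 0, 262144, 524288)
print(read_lzo_index(fake_index))  # [0, 262144, 524288]
```

Under that assumption, the 43384-byte index above would hold 43384 / 8 = 5423 block offsets.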
Run the ETL job:
[root@spark001 hadoop]# hadoop jar hadoop-train-1.0.jar com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo /user/hadoop/compress/log/200M_index/ /user/hadoop/compress/log/etl_lzo/200_index/
19/04/15 17:10:09 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200_index/
19/04/15 17:10:09 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
19/04/15 17:10:09 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/15 17:10:09 INFO input.FileInputFormat: Total input paths to process : 2
19/04/15 17:10:10 INFO mapreduce.JobSubmitter: number of splits:2 (two splits)
This confirms that an LZO file must be indexed before it can be split.
Summary
If the 213M file is splittable there should be 2 splits; if not, just one.
When it is splittable and larger than the block size, two map tasks process it in parallel, improving efficiency.
When it is not splittable, a single map task processes the whole file no matter how large it is, wasting time.
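The split counts above can be sketched as follows (assuming a 128M HDFS block size, which the observed 213M -> 2 splits result is consistent with; this is a simplification of FileInputFormat's actual getSplits logic):

```python
import math

def num_splits(file_mb, block_mb=128, splittable=True):
    """Number of input splits for a single file: roughly one
    split per block when the format is splittable, otherwise
    the whole file becomes one split (one map task)."""
    if not splittable:
        return 1
    return math.ceil(file_mb / block_mb)

print(num_splits(213, splittable=True))   # 2 -- indexed .lzo
print(num_splits(213, splittable=False))  # 1 -- plain .lzo
```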
So when using LZO in production, either control the size of the generated .lzo files so each stays within one block: without an index file, the entire .lzo file is handled by a single map task, and an oversized file makes that map run far too long.
Or pair the files with .lzo.index files so splits are supported. The benefit is that file size is no longer constrained, so files can be made somewhat larger, which helps reduce the total number of files. The index files take little space, but generating them still carries some overhead of its own.