Hive: Statistical Queries over LZO-Compressed Data

Before Hive can run statistical queries over LZO-compressed data, the hadoop-3.2.2 source must be compiled with LZO support. See the earlier post:
Compiling the hadoop-3.2.2 source with LZO support
That is the prerequisite for everything below.
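
Before going further, it is worth confirming that the hadoop-lzo jar produced by that build is actually in place. A quick check (the path matches the one used later in this post):

# The hadoop-lzo jar built from source should live under Hadoop's common libs
ls ~/app/hadoop/share/hadoop/common/hadoop-lzo*.jar
# Expected: hadoop-lzo-0.4.21-SNAPSHOT.jar (the exact version depends on your build)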

Data Preparation

A 379 MB test file, makedatatest.txt, was prepared and compressed with lzop into makedatatest.txt.lzo (158 MB). The HDFS block size is 128 MB, so the compressed file spans two blocks.

[ruoze@hadoop001 data]$ lzop -v makedatatest.txt 
compressing makedatatest.txt into makedatatest.txt.lzo
[ruoze@hadoop001 data]$ ll -h |grep makedatatest
-rw-r--r--. 1 ruoze ruoze 379M Jan 27 22:17 makedatatest.txt
-rw-r--r--. 1 ruoze ruoze 158M Jan 27 22:17 makedatatest.txt.lzo
[ruoze@hadoop001 data]$ 
[ruoze@hadoop001 data]$ 
[ruoze@hadoop001 data]$ tail makedatatest.txt
9999991,Role76,85,Nokia,11,2016-9-16
9999992,Role30,85,Apple,13,2016-6-15
9999993,Role70,85,Oppo,2,2016-11-27
9999994,Role16,29,Meizu,9,2016-8-12
9999995,Role50,46,Samsung,6,2016-5-15
9999996,Role17,85,Huawei,10,2016-11-30
9999997,Role19,55,Samsung,7,2016-7-24
9999998,Role47,85,Oppo,15,2016-7-2
9999999,Role9,54,Meizu,17,2016-10-1
10000000,Role12,67,Apple,0,2016-2-24
[ruoze@hadoop001 data]$ 
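
For reference, data of this shape can be generated with a small script. The following is only an assumed sketch, not the original generator; it emits rows in the same format as shown above (id, role, age, brand, clicks, date):

# Hypothetical generator for makedatatest.txt-style rows: id,RoleN,age,brand,clicks,2016-m-d
awk 'BEGIN {
  n = split("Apple,Huawei,Lenovo,MI,Meizu,Nokia,Oppo,Samsung", brands, ",");
  srand();
  for (i = 1; i <= 10000000; i++)
    printf "%d,Role%d,%d,%s,%d,2016-%d-%d\n",
      i, int(rand()*100), 20 + int(rand()*70), brands[1 + int(rand()*n)],
      int(rand()*20), 1 + int(rand()*12), 1 + int(rand()*28);
}' > makedatatest.txt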
Testing and Verification

Start Hive, create a table, load the data above into it, and then run the queries.

set hive.cli.print.current.db=true;
set hive.cli.print.header=true;

-- Create table makedatatest_lzo, specifying the LZO input format and plain-text output format
create table makedatatest_lzo
(id string,
name string,
dataage string,
phonePlus string,
clicks string,
logintime string
)row format delimited fields terminated by ','
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
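
To double-check that the metastore recorded the LZO input format, DESCRIBE FORMATTED can be run against the new table (shown here non-interactively via hive -e):

# Verify the input/output formats registered for the table
hive -e 'USE study_hive; DESCRIBE FORMATTED makedatatest_lzo;' | grep -i putformat
# Expect InputFormat:  com.hadoop.mapred.DeprecatedLzoTextInputFormat
#        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat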

-- Load the compressed file into the table
load data local inpath '/home/ruoze/data/makedatatest.txt.lzo' overwrite into table makedatatest_lzo ;
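
LOAD DATA moves the file as-is into the table's warehouse directory, so the size there should still be about 158 MB (path assumes the default warehouse location used later in this post):

# The .lzo file is moved unchanged into the table directory
hdfs dfs -du -h /user/hive/warehouse/study_hive.db/makedatatest_lzo/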

hive (study_hive)> 
                 > select * from makedatatest_lzo limit 5;
OK
1	Role5	29	MI	4	2016-7-22
2	Role35	37	Oppo	6	2016-1-2
3	Role43	40	Huawei	7	2016-4-7
4	Role16	46	Huawei	14	2016-1-28
5	Role65	33	MI	1	2016-2-26
Time taken: 0.108 seconds, Fetched: 5 row(s)
hive (study_hive)> 

Run an aggregate query:

hive (study_hive)> 
                 > select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211104155_a2e3fd88-dc35-4cb2-b544-792e3197dac0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1644546089342_0001, Tracking URL = http://hadoop001:8123/proxy/application_1644546089342_0001/
Kill Command = /home/ruoze/app/hadoop/bin/mapred job  -kill job_1644546089342_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-02-11 10:43:05,200 Stage-1 map = 0%,  reduce = 0%
2022-02-11 10:43:19,599 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.54 sec
2022-02-11 10:43:23,682 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.03 sec
MapReduce Total cumulative CPU time: 12 seconds 30 msec
Ended Job = job_1644546089342_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 12.03 sec   HDFS Read: 165057516 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 30 msec
OK
Apple	1375637
Huawei	1374779
Lenovo	1375233
MI	1375460
Meizu	1373821
Nokia	1373732
Oppo	1375440
Samsung	1375898
Time taken: 90.11 seconds, Fetched: 8 row(s)
hive (study_hive)> 

Recall that makedatatest.txt.lzo is 158 MB while the HDFS block size is 128 MB, so the file occupies two blocks. Yet the job above reports number of mappers: 1; number of reducers: 1: the split count is 1, because a plain .lzo file is not splittable. Next, build an LZO index for the table's data file makedatatest.txt.lzo:

hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.LzoIndexer \
/user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo
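
For a single 158 MB file the local indexer above is sufficient. For many or very large files, hadoop-lzo also ships a MapReduce-based indexer, com.hadoop.compression.lzo.DistributedLzoIndexer, which builds indexes in parallel:

# Alternative: index via a MapReduce job instead of a local pass
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo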

Once the indexer finishes, an index file makedatatest.txt.lzo.index appears in the same directory.
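
This can be verified with a directory listing:

# The indexer writes a .index file next to the .lzo file
hdfs dfs -ls /user/hive/warehouse/study_hive.db/makedatatest_lzo/
# Expect both makedatatest.txt.lzo and makedatatest.txt.lzo.index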

Then set hive.input.format before re-running the query. The default, org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, combines files into splits and does not honor the LZO index, so switch to HiveInputFormat:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

hive (study_hive)> 
                 > set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> 
                 > set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> 
                 > select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211105120_93f2811d-09ba-4999-a949-1eded05b0000
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1644546089342_0003, Tracking URL = http://hadoop001:8123/proxy/application_1644546089342_0003/
Kill Command = /home/ruoze/app/hadoop/bin/mapred job  -kill job_1644546089342_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2022-02-11 10:52:13,070 Stage-1 map = 0%,  reduce = 0%
2022-02-11 10:52:23,255 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.93 sec
2022-02-11 10:52:24,273 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.78 sec
2022-02-11 10:52:28,352 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 17.41 sec
MapReduce Total cumulative CPU time: 17 seconds 410 msec
Ended Job = job_1644546089342_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 17.41 sec   HDFS Read: 165172760 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 410 msec
OK
Apple	1375637
Huawei	1374779
Lenovo	1375233
MI	1375460
Meizu	1373821
Nokia	1373732
Oppo	1375440
Samsung	1375898
Time taken: 69.65 seconds, Fetched: 8 row(s)
hive (study_hive)> 
                 > 
                 > set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> 

With the index in place, the job reports number of mappers: 2; number of reducers: 1: the split count is now 2.
To recap: Hadoop's native libraries do not support LZO out of the box (an .lzo file is unreadable binary if viewed directly), so hadoop-lzo must be compiled separately, its jar placed into Hadoop's lib directory, and the LZO codec configuration added to core-site.xml. Even then, an LZO file on its own is not splittable; only after an LZO index has been created for it can it be split across multiple mappers.
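
For reference, the LZO entries in core-site.xml typically look like the following (a sketch of the standard hadoop-lzo configuration; the exact codec list in your environment may differ, see the compilation post linked at the top):

<!-- Register the LZO codecs with Hadoop (typical hadoop-lzo setup) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>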
