Hive: Statistical Queries over LZO-Compressed Data

Before Hive can run statistical queries over LZO-compressed data, the hadoop-3.2.2 source must be compiled with LZO support. See the earlier post:
Compiling the hadoop-3.2.2 source with LZO support
That is the prerequisite for everything below.
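
Before going further, it is worth confirming that the hadoop-lzo jar produced by that build is actually in place. A quick check (the path matches the one used later in this post):

# The hadoop-lzo jar built from source should live under Hadoop's common libs
ls ~/app/hadoop/share/hadoop/common/hadoop-lzo*.jar
# Expected: hadoop-lzo-0.4.21-SNAPSHOT.jar (the exact version depends on your build)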

Data Preparation

A 379 MB test file, makedatatest.txt, was prepared and compressed with lzop into makedatatest.txt.lzo (158 MB). The HDFS block size is 128 MB, so the compressed file spans two blocks.

[ruoze@hadoop001 data]$ lzop -v makedatatest.txt 
compressing makedatatest.txt into makedatatest.txt.lzo
[ruoze@hadoop001 data]$ ll -h |grep makedatatest
-rw-r--r--. 1 ruoze ruoze 379M Jan 27 22:17 makedatatest.txt
-rw-r--r--. 1 ruoze ruoze 158M Jan 27 22:17 makedatatest.txt.lzo
[ruoze@hadoop001 data]$ 
[ruoze@hadoop001 data]$ 
[ruoze@hadoop001 data]$ tail makedatatest.txt
9999991,Role76,85,Nokia,11,2016-9-16
9999992,Role30,85,Apple,13,2016-6-15
9999993,Role70,85,Oppo,2,2016-11-27
9999994,Role16,29,Meizu,9,2016-8-12
9999995,Role50,46,Samsung,6,2016-5-15
9999996,Role17,85,Huawei,10,2016-11-30
9999997,Role19,55,Samsung,7,2016-7-24
9999998,Role47,85,Oppo,15,2016-7-2
9999999,Role9,54,Meizu,17,2016-10-1
10000000,Role12,67,Apple,0,2016-2-24
[ruoze@hadoop001 data]$ 
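
For reference, data of this shape can be generated with a small script. The following is only an assumed sketch, not the original generator; it emits rows in the same format as shown above (id, role, age, brand, clicks, date):

# Hypothetical generator for makedatatest.txt-style rows: id,RoleN,age,brand,clicks,2016-m-d
awk 'BEGIN {
  n = split("Apple,Huawei,Lenovo,MI,Meizu,Nokia,Oppo,Samsung", brands, ",");
  srand();
  for (i = 1; i <= 10000000; i++)
    printf "%d,Role%d,%d,%s,%d,2016-%d-%d\n",
      i, int(rand()*100), 20 + int(rand()*70), brands[1 + int(rand()*n)],
      int(rand()*20), 1 + int(rand()*12), 1 + int(rand()*28);
}' > makedatatest.txt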
Testing and Verification

Start Hive, create a table, load the data above into it, and then run the queries.

set hive.cli.print.current.db=true;
set hive.cli.print.header=true;

-- Create table makedatatest_lzo, specifying the LZO input format and plain-text output format
create table makedatatest_lzo
(id string,
name string,
dataage string,
phonePlus string,
clicks string,
logintime string
)row format delimited fields terminated by ','
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
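
To double-check that the metastore recorded the LZO input format, DESCRIBE FORMATTED can be run against the new table (shown here non-interactively via hive -e):

# Verify the input/output formats registered for the table
hive -e 'USE study_hive; DESCRIBE FORMATTED makedatatest_lzo;' | grep -i putformat
# Expect InputFormat:  com.hadoop.mapred.DeprecatedLzoTextInputFormat
#        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat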

-- Load the compressed file into the table
load data local inpath '/home/ruoze/data/makedatatest.txt.lzo' overwrite into table makedatatest_lzo ;
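
LOAD DATA moves the file as-is into the table's warehouse directory, so the size there should still be about 158 MB (path assumes the default warehouse location used later in this post):

# The .lzo file is moved unchanged into the table directory
hdfs dfs -du -h /user/hive/warehouse/study_hive.db/makedatatest_lzo/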

hive (study_hive)> 
                 > select * from makedatatest_lzo limit 5;
OK
1	Role5	29	MI	4	2016-7-22
2	Role35	37	Oppo	6	2016-1-2
3	Role43	40	Huawei	7	2016-4-7
4	Role16	46	Huawei	14	2016-1-28
5	Role65	33	MI	1	2016-2-26
Time taken: 0.108 seconds, Fetched: 5 row(s)
hive (study_hive)> 

Run an aggregate query:

hive (study_hive)> 
                 > select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211104155_a2e3fd88-dc35-4cb2-b544-792e3197dac0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1644546089342_0001, Tracking URL = http://hadoop001:8123/proxy/application_1644546089342_0001/
Kill Command = /home/ruoze/app/hadoop/bin/mapred job  -kill job_1644546089342_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-02-11 10:43:05,200 Stage-1 map = 0%,  reduce = 0%
2022-02-11 10:43:19,599 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.54 sec
2022-02-11 10:43:23,682 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.03 sec
MapReduce Total cumulative CPU time: 12 seconds 30 msec
Ended Job = job_1644546089342_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 12.03 sec   HDFS Read: 165057516 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 30 msec
OK
Apple	1375637
Huawei	1374779
Lenovo	1375233
MI	1375460
Meizu	1373821
Nokia	1373732
Oppo	1375440
Samsung	1375898
Time taken: 90.11 seconds, Fetched: 8 row(s)
hive (study_hive)> 

Recall that makedatatest.txt.lzo is 158 MB while the HDFS block size is 128 MB, so the file occupies two blocks. Yet the job above reports number of mappers: 1; number of reducers: 1: the split count is 1, because a plain .lzo file is not splittable. Next, build an LZO index for the table's data file makedatatest.txt.lzo:

hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.LzoIndexer \
/user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo
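
For a single 158 MB file the local indexer above is sufficient. For many or very large files, hadoop-lzo also ships a MapReduce-based indexer, com.hadoop.compression.lzo.DistributedLzoIndexer, which builds indexes in parallel:

# Alternative: index via a MapReduce job instead of a local pass
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo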

Once the indexer finishes, an index file makedatatest.txt.lzo.index appears in the same directory.
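
This can be verified with a directory listing:

# The indexer writes a .index file next to the .lzo file
hdfs dfs -ls /user/hive/warehouse/study_hive.db/makedatatest_lzo/
# Expect both makedatatest.txt.lzo and makedatatest.txt.lzo.index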

Then set hive.input.format before re-running the query. The default, org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, combines files into splits and does not honor the LZO index, so switch to HiveInputFormat:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

hive (study_hive)> 
                 > set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> 
                 > set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> 
                 > select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211105120_93f2811d-09ba-4999-a949-1eded05b0000
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1644546089342_0003, Tracking URL = http://hadoop001:8123/proxy/application_1644546089342_0003/
Kill Command = /home/ruoze/app/hadoop/bin/mapred job  -kill job_1644546089342_0003
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2022-02-11 10:52:13,070 Stage-1 map = 0%,  reduce = 0%
2022-02-11 10:52:23,255 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.93 sec
2022-02-11 10:52:24,273 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.78 sec
2022-02-11 10:52:28,352 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 17.41 sec
MapReduce Total cumulative CPU time: 17 seconds 410 msec
Ended Job = job_1644546089342_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 17.41 sec   HDFS Read: 165172760 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 410 msec
OK
Apple	1375637
Huawei	1374779
Lenovo	1375233
MI	1375460
Meizu	1373821
Nokia	1373732
Oppo	1375440
Samsung	1375898
Time taken: 69.65 seconds, Fetched: 8 row(s)
hive (study_hive)> 
                 > 
                 > set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> 

With the index in place, the job reports number of mappers: 2; number of reducers: 1: the split count is now 2.
To recap: Hadoop's native libraries do not support LZO out of the box (an .lzo file is unreadable binary if viewed directly), so hadoop-lzo must be compiled separately, its jar placed into Hadoop's lib directory, and the LZO codec configuration added to core-site.xml. Even then, an LZO file on its own is not splittable; only after an LZO index has been created for it can it be split across multiple mappers.
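
For reference, the LZO entries in core-site.xml typically look like the following (a sketch of the standard hadoop-lzo configuration; the exact codec list in your environment may differ, see the compilation post linked at the top):

<!-- Register the LZO codecs with Hadoop (typical hadoop-lzo setup) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>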
