通过explode+lateral view的具体实例讲述一下具体应用.
实例程序1:通过explode+lateral view实现word count
表结构:
CREATE EXTERNAL TABLE doc
(
line
string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘hdfs://ns1012/user/mart_fro/tmp.db/doc’;
数据:
hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/doc/1.txt;
hello world1
hello world1
hello world2
hello world3
hello world4
hello world5
hive> select *
> from doc;
OK
hello world1
hello world1
hello world2
hello world3
hello world4
hello world5
Time taken: 0.361 seconds, Fetched: 6 row(s)
方法1:
select t1.column1,count(*) as cnt
from (
select column1
from doc lateral view explode(split(line, ’ ')) tmp as column1
) t1
group by t1.column1
运行结果:
column1 cnt
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1
方法2:
select column1,count(*) as cnt
from doc
lateral view explode(split(line,’ ')) tmp as column1
group by column1
运行结果:
column1 _c1
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1
方法3:窗口函数
select word,
word_cnt
from (
select word,
row_number() over w1 as word_rank,
count(1) over w1 as word_cnt
from doc lateral view explode(split(doc.line, ’ ')) tmp as word
window w1 as (distribute by word sort by word)
) t1
where t1.word_rank = 1
运行结果:
word word_cnt
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1
实例程序2:获取调度系统中每个任务关联的父任务数量. (该程序需要注意null值的处理.)
表结构:
CREATE EXTERNAL TABLE buffalo_task_info
(
task_id
int COMMENT ‘任务id’,
task_name
string COMMENT ‘任务名称’,
task_parents
array COMMENT ‘父任务id’,
task_tags
map<int,string> COMMENT ‘任务关联的标签信息’)
partitioned by(dt
string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘|’ --字段与字段之间的分隔符.
COLLECTION ITEMS TERMINATED BY ‘,’ --集合项之间的分隔符.
MAP KEYS TERMINATED BY ‘:’ --Map的Key和Value之间已什么进行分割.
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
表数据:parent_info.txt
1001|exe_cool_data_task_1001|316882#3.5,156609#3.5,156597#3.5,317201#3.5,212815#4.0,212814#4.0|255:L1标签,1225:核心流量标签
1002|exe_cool_data_task_1002|283981#3.5,155054#3.5,155053#4.0|256:L2标签,1225:核心流量标签
1003|exe_cool_data_task_1003|148035#3.5,155082#3.5|257:L3标签,1225:核心流量标签
1004|exe_cool_data_task_1004|236687#3.5,236686#3.5,236688#4.0|255:L1标签,1225:核心流量标签
1005|exe_cool_data_task_1005|218292#3.5,302538#4.0|256:L2标签,1225:核心流量标签
1006|exe_cool_data_task_1006||
查看表数据:
hive> load data local inpath ‘/home/mart_fro/zhang/parent_info.txt’ overwrite into table buffalo_task_info partition(dt=‘2019-07-01’);
Loading data to table tmp.buffalo_task_info partition (dt=2019-07-01)
Partition tmp.buffalo_task_info{dt=2019-07-01} stats: [numFiles=1, numRows=0, totalSize=537, rawDataSize=0]
OK
Time taken: 0.527 seconds
hive> select *
> from buffalo_task_info;
OK
1001 exe_cool_data_task_1001 [“316882#3.5”,“156609#3.5”,“156597#3.5”,“317201#3.5”,“212815#4.0”,“212814#4.0”] {255:“L1标签”,1225:“核心流量标签”} 2019-07-01
1002 exe_cool_data_task_1002 [“283981#3.5”,“155054#3.5”,“155053#4.0”] {256:“L2标签”,1225:“核心流量标签”} 2019-07-01
1003 exe_cool_data_task_1003 [“148035#3.5”,“155082#3.5”] {257:“L3标签”,1225:“核心流量标签”} 2019-07-01
1004 exe_cool_data_task_1004 [“236687#3.5”,“236686#3.5”,“236688#4.0”] {255:“L1标签”,1225:“核心流量标签”} 2019-07-01
1005 exe_cool_data_task_1005 [“218292#3.5”,“302538#4.0”] {256:“L2标签”,1225:“核心流量标签”} 2019-07-01
1006 exe_cool_data_task_1006 [] {} 2019-07-01
Time taken: 0.089 seconds, Fetched: 6 row(s)
实例程序:
select t2.task_id,sum(cnt) as parent_num
from (
select task_id,
case when parent_id is null then 0 else 1 end as cnt
from (
select task_id,
task_parents
from buffalo_task_info
where dt = ‘2019-07-01’
) t1 lateral view outer explode(task_parents) tmp as parent_id
) t2
group by t2.task_id
运行结果:
task_id parent_num
1001 6
1002 3
1003 2
1004 3
1005 2
1006 0
优化代码:
select task_id,count(parent_id)
from task_info
lateral view outer explode(task_parents) temp as parent_id
group by task_id
运行结果:
1001 2
1002 2
1003 2
1004 2
1005 2
1006 0
同样,我们可以用窗口函数:
select t2.task_id,
t2.task_name,
t2.parent_cnt
from (
select t1.task_id,
t1.task_name,
row_number() over w1 as tmp_rank,
count(parent_id) over w1 as parent_cnt
from (
select distinct task_id,
task_name,
parent_id
from (
select task_id,
task_name,
task_parents
from buffalo_task_info
where dt = ‘2019-07-02’
) t1 lateral view outer explode(t1.task_parents) tmp as parent_id
) t1 window w1 as (distribute by t1.task_id sort by t1.task_id)
) t2
where t2.tmp_rank = 1
运行结果:
1001 exe_cool_data_task_1001 6
1002 exe_cool_data_task_1002 3
1003 exe_cool_data_task_1003 2
1004 exe_cool_data_task_1004 3
1005 exe_cool_data_task_1005 2
1006 exe_cool_data_task_1006 0