Hive表生成函数之---explode函数应用

通过explode+lateral view的具体实例讲述一下具体应用.

实例程序1:通过explode+lateral view实现word count

表结构:

CREATE EXTERNAL TABLE doc(
line string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION
‘hdfs://ns1012/user/mart_fro/tmp.db/doc’;
数据:

hive> dfs -cat hdfs://ns1012/user/mart_fro/tmp.db/doc/1.txt;
hello world1
hello world1
hello world2
hello world3
hello world4
hello world5
hive> select *
> from doc;
OK
hello world1
hello world1
hello world2
hello world3
hello world4
hello world5
Time taken: 0.361 seconds, Fetched: 6 row(s)

方法1:

select t1.column1,count(*) as cnt
from (
select column1
from doc lateral view explode(split(line, ’ ')) tmp as column1
) t1
group by t1.column1
运行结果:

column1 cnt
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1

方法2:

select column1,count(*) as cnt
from doc
lateral view explode(split(line,’ ')) tmp as column1
group by column1
运行结果:

column1 _c1
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1

方法3:窗口函数

select word,
word_cnt
from (
select word,
row_number() over w1 as word_rank,
count(1) over w1 as word_cnt
from doc lateral view explode(split(doc.line, ’ ')) tmp as word
window w1 as (distribute by word sort by word)
) t1
where t1.word_rank = 1
运行结果:

word word_cnt
hello 6
world1 2
world2 1
world3 1
world4 1
world5 1

实例程序2:获取调度系统中每个任务关联的父任务数量. (该程序需要注意null值的处理.)

表结构:

CREATE EXTERNAL TABLE buffalo_task_info(
task_id int COMMENT ‘任务id’,
task_name string COMMENT ‘任务名称’,
task_parents array COMMENT ‘父任务id’,
task_tags map<int,string> COMMENT ‘任务关联的标签信息’)
partitioned by(dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘|’ --字段与字段之间的分隔符.
COLLECTION ITEMS TERMINATED BY ‘,’ --集合项之间的分隔符.
MAP KEYS TERMINATED BY ‘:’ --Map的Key和Value之间已什么进行分割.
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
表数据:parent_info.txt

1001|exe_cool_data_task_1001|316882#3.5,156609#3.5,156597#3.5,317201#3.5,212815#4.0,212814#4.0|255:L1标签,1225:核心流量标签
1002|exe_cool_data_task_1002|283981#3.5,155054#3.5,155053#4.0|256:L2标签,1225:核心流量标签
1003|exe_cool_data_task_1003|148035#3.5,155082#3.5|257:L3标签,1225:核心流量标签
1004|exe_cool_data_task_1004|236687#3.5,236686#3.5,236688#4.0|255:L1标签,1225:核心流量标签
1005|exe_cool_data_task_1005|218292#3.5,302538#4.0|256:L2标签,1225:核心流量标签
1006|exe_cool_data_task_1006||
查看表数据:

hive> load data local inpath ‘/home/mart_fro/zhang/parent_info.txt’ overwrite into table buffalo_task_info partition(dt=‘2019-07-01’);
Loading data to table tmp.buffalo_task_info partition (dt=2019-07-01)
Partition tmp.buffalo_task_info{dt=2019-07-01} stats: [numFiles=1, numRows=0, totalSize=537, rawDataSize=0]
OK
Time taken: 0.527 seconds
hive> select *
> from buffalo_task_info;
OK
1001 exe_cool_data_task_1001 [“316882#3.5”,“156609#3.5”,“156597#3.5”,“317201#3.5”,“212815#4.0”,“212814#4.0”] {255:“L1标签”,1225:“核心流量标签”} 2019-07-01
1002 exe_cool_data_task_1002 [“283981#3.5”,“155054#3.5”,“155053#4.0”] {256:“L2标签”,1225:“核心流量标签”} 2019-07-01
1003 exe_cool_data_task_1003 [“148035#3.5”,“155082#3.5”] {257:“L3标签”,1225:“核心流量标签”} 2019-07-01
1004 exe_cool_data_task_1004 [“236687#3.5”,“236686#3.5”,“236688#4.0”] {255:“L1标签”,1225:“核心流量标签”} 2019-07-01
1005 exe_cool_data_task_1005 [“218292#3.5”,“302538#4.0”] {256:“L2标签”,1225:“核心流量标签”} 2019-07-01
1006 exe_cool_data_task_1006 [] {} 2019-07-01
Time taken: 0.089 seconds, Fetched: 6 row(s)

实例程序:

select t2.task_id,sum(cnt) as parent_num
from (
select task_id,
case when parent_id is null then 0 else 1 end as cnt
from (
select task_id,
task_parents
from buffalo_task_info
where dt = ‘2019-07-01’
) t1 lateral view outer explode(task_parents) tmp as parent_id
) t2
group by t2.task_id
运行结果:

task_id parent_num
1001 6
1002 3
1003 2
1004 3
1005 2
1006 0

优化代码:

select task_id,count(parent_id)
from task_info
lateral view outer explode(task_parents) temp as parent_id
group by task_id
运行结果:

1001 2
1002 2
1003 2
1004 2
1005 2
1006 0

同样,我们可以用窗口函数:

select t2.task_id,
t2.task_name,
t2.parent_cnt
from (
select t1.task_id,
t1.task_name,
row_number() over w1 as tmp_rank,
count(parent_id) over w1 as parent_cnt
from (
select distinct task_id,
task_name,
parent_id
from (
select task_id,
task_name,
task_parents
from buffalo_task_info
where dt = ‘2019-07-02’
) t1 lateral view outer explode(t1.task_parents) tmp as parent_id
) t1 window w1 as (distribute by t1.task_id sort by t1.task_id)
) t2
where t2.tmp_rank = 1
运行结果:

1001 exe_cool_data_task_1001 6
1002 exe_cool_data_task_1002 3
1003 exe_cool_data_task_1003 2
1004 exe_cool_data_task_1004 3
1005 exe_cool_data_task_1005 2
1006 exe_cool_data_task_1006 0

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

一只懒得睁眼的猫

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值