lateral view (侧视图) + explode 炸裂函数
简述
- 用于把array类型的字段,拆成x行
- lateral view函数
功能:
用于和UDTF函数(explode,split)结合使用,把某一行数据拆分成多行数据,再将【多行结果】组合成一个支持别名的【虚拟表】。
主要解决select使用UDTF做查询的过程中查询只能包含单个UDTF,不能包含其它字段以及多个UDTF的情况。
语法:LATERAL VIEW udtf(expression) tableAlias AS columnAlias (‘,’ columnAlias)
columnAlias是给udtf(expression)列起的别名。
tableAlias 虚拟表的别名。
lateral view函数会将lUDTF生成的结果放到一个虚拟表中,然后这个虚拟表和【输入行】进行join来达到连接UDTF外的select字段的目的。
select waybill_no,category_name
from ky.dm_heavy_cargo.dm_customer_code_waybill_no_di
lateral view outer explode(service_prod_code) table_tmp as category_name --不可忽略虚拟表名哦
where inc_day = '20230730' limit 10
- 备注:
- lateral view需要在from后where前
- from后面可以有多个lateral view
- 生成的虚拟表名不可忽略
- 如果要拆分的字段有NULL,需要使用lateral view outer代替,避免数据缺失(未验证)
- explode() 里面是array会拆成一列多行、map会拆成对应的2列多行
- https://zhuanlan.zhihu.com/p/115918587
测试代码
-- 测试代码
with array_table as(
select 1 as id, array('apple','banana','pear') as items
union all
select 2 as id, array('watermelon','orange') as items
union all
select 3 as id, array(null) as items
)
-- explode()
-- id=3的空行会自动忽略掉,导致少行的情况
select explode(items) as item from array_table; -- hive可以跑、spark不行;结果就是一列,但原来的id没法挂到对应Item上
-- lateral view explode() tmp
-- 感觉可以理解为explode()就是新增一列,列名是item;lateral view是虚拟表,select、from、join都省略了,但是虚拟表名不可省略
-- 有id/items的情况下好像不会少空行(相当于没关联上)?
select id, item ,items
from array_table
lateral view explode(items) tmp as item
-- posexplode()
-- 相当于position + explode
select posexplode(items) as (item_index, item) from array_table;
-- 注意下posexplode()会有两列挂在一个as后,直接用,区分就行,前index后item
select id, item, item_index, items
from array_table
lateral view posexplode(items) tmp as item_index, item
-- lateral view outer explode() 保留空值
lateral view outer explode(items) tmp as item
-- 可以使用多个lateral view explode(),但相当于每行的两列值会全连接,即返回两列的笛卡尔积
-- 如果需要返回一行的两列顺序对应的值,可以用posexplode过滤,取索引相等(注意虚拟表名、列名不要重复)
with array_table as(
select 1 as id, array('apple','banana','pear') as items ,array(1,2,3) as title
union all
select 2 as id, array('watermelon','orange') as items ,array(1,2) as title
)
select id,item ,title2
from array_table
lateral view posexplode(items) tmp as idx ,item
lateral view posexplode(title) tmp2 as idx1,title2
where idx=idx1
-- 如果数据是类似 'apple,banana,pear',可以用split()拆成array
select id,item
from array_table
lateral view explode( split(items,',') ) tmp as item
日期补全
- datediff():统计日期差N,eg:DATEDIFF(“2023-03-22”, “2023-03-13”) -> N=9
- space(N):做N个空格
- split( N个空格 )并拆成array:拆成N+1个空串
- lateral view posexplode( ) tmp as pos,null_ele :空串行转列,得到:【索引列(0-N)】,null_ele(值为null,没用)
- DATE_ADD(START_DATE, pos):开始日期+索引,得到连续日期
-- 示例代码
SELECT
DATE_ADD(START_DATE, pos)
FROM (
SELECT DISTINCT
"2023-03-13" AS START_DATE,
"2023-03-22" AS END_DATE
) s1 lateral VIEW posexplode(split(SPACE(DATEDIFF(END_DATE, START_DATE)), " ")) s2 AS pos, null_ele
- 实际上,日期往往在同一列,同时需要保留/补全其他行的数据,算累计值
则思路应拆解为:
1. 用开窗排序取最大日期和最小日期,从而造出max和min两列
2. 取max和min日期差值,需要-1,因为对比上一个例子,同一列包含了原有的max和min(PS:space(-1)+split还是一个空串)
3. lateral view posexplode 造出索引列
4. mind+索引列,得到连续日期列,data列
5. 开窗用 substr月份分区 、 date_add(mind,idx) 正序排序,算月累计值
WITH test as (
SELECT '2024-01-31'as inc_day, 1 as data
UNION all
SELECT '2024-02-03'as inc_day,2 as data
)
SELECT
inc_day,data,date_add(mind,idx) as full_date
,sum(data) over (PARTITION by substr(date_add(mind,idx),1,7) ORDER by date_add(mind,idx) asc) as sum_data -- 月累计
,maxd,mind, datediff(maxd,mind) -1 as datedf, idx
from (
SELECT inc_day,data
,max(inc_day) over(ORDER by inc_day desc) as maxd
,min(inc_day) over(ORDER by inc_day desc) as mind
from test
) test
lateral view posexplode( split( space( datediff(maxd,mind) -1 ) , ' ' ) ) tmp as idx,ele
- 那么问题来了:
- 现在虚拟日期的data用的是补全的数,怎么还原原始数据并计算原始数据的月累计呢
- 思路1:如果虚拟日期等于实际日期,则保留原始数据
if(inc_day = date_add(inc_day,idx) , data , null) as origin_data
- 思路2:直接用虚拟日期列做为新主表,关联原表
- 思路1:如果虚拟日期等于实际日期,则保留原始数据
- 如果不止两个日期,取max和min就不对了
- 因此需要使用lead向后取日期(因为补全的后面的日期用前一天的数)
lead(col, step, bound)
- 因此需要使用lead向后取日期(因为补全的后面的日期用前一天的数)
- 现在虚拟日期的data用的是补全的数,怎么还原原始数据并计算原始数据的月累计呢
- 新示例代码
- 要求:求每个deptcode的真实日达成(如果为空则用0)、月累计(不能为空,且注意跨月时不能用上一个月的月累计)、目标达成率(月累计/月目标);其中目标值为手工数据,如果未录入则用上月目标值
- 详见explode函数test.sql
-- 注意两个点:日期是带-的格式;跑T-1任务时;'2024-07-05' 全部可以替换为T-0,'2024-07-02'可替换为yyyy-MM-02
with test as(
SELECT 'A1' as dept_code, 300 as income,'2024-06-30'as inc_day
union all
SELECT 'A1' as dept_code, 100 as income,'2024-07-02'as inc_day
union all
SELECT 'A1' as dept_code, 150 as income,'2024-07-04'as inc_day
union all
SELECT 'B2' as dept_code, 400 as income,'2024-06-29'as inc_day
union all
SELECT 'B2' as dept_code, 200 as income,'2024-07-03'as inc_day
),
target_table as(
SELECT 'A1' as dept_code, 300 as target,'2024-06'as inc_day
union all
SELECT 'A1' as dept_code, 400 as target,'2024-07'as inc_day
union all
SELECT 'B2' as dept_code, 500 as target,'2024-06'as inc_day
)
select
t.dept_code
,coalesce(a.income ,0) as income
,coalesce(a.income_y,0) as income_y
,coalesce(b.target ,0) as target
,coalesce(a.income_y,0)/coalesce(b.target ,0) as income_y_rate
,lag(coalesce(a.income,0),1) OVER(PARTITION by t.dept_code order by t.inc_day asc) as income_lastday
,t.inc_day
from(
-- 构造的主表(dept_code*日期)
select dept_code,date_add(inc_day,idx) as inc_day
from(
SELECT *,lead(inc_day,1,'2024-07-05') OVER(PARTITION by dept_code order by inc_day asc) as lead_day
FROM test
)
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
) t
left join (
select dept_code, if(inc_day=date_add(inc_day,idx),income,0 ) as income, income_y,date_add(inc_day,idx) as inc_day
from (
SELECT *
,sum(income) OVER (PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as income_y -- 计算月累计
,lead(inc_day,1,date_add( last_day( inc_day ), 1) ) OVER(PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as lead_day
-- 通过下月月初+分区限制月份。向下找maxday取到月初01号,即月累计等向下最多用到月末(月初-1)。但1号没有,下个月只能从有数的2号开始。。。
FROM test
) a
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
) a on t.inc_day = a.inc_day and t.dept_code = a.dept_code
left join(
-- 目标表
select DISTINCT dept_code,target ,substr(date_add(inc_day_concat,idx),1,7) as inc_day
from (
select *, concat(inc_day,'-01') as inc_day_concat
,lead(concat(inc_day,'-01'),1,'2024-07-02') OVER (PARTITION by dept_code order by inc_day asc) lead_day
-- 1号不行,得往后延一天(因为B2没有lastday,如果lead写1号会导致没有最新月)
from target_table
) a
lateral view posexplode(split( space( datediff(lead_day,concat(inc_day,'-01'))-1 ) ,' ')) tmp as idx,ele
) b on t.dept_code=b.dept_code and substr(t.inc_day,1,7)=b.inc_day
求每个deptcode的 真实日达成(如果为空则用0)、月累计(不能为空,且注意跨月时不能用上一个月的月累计)、
目标达成率(=月累计/月目标);其中目标值为手工数据,如果未录入则用上月目标值
with test as(
SELECT 'A1' as dept_code, 300 as income,'2024-06-30'as inc_day
union all
SELECT 'A1' as dept_code, 100 as income,'2024-07-02'as inc_day
union all
SELECT 'A1' as dept_code, 150 as income,'2024-07-04'as inc_day
union all
SELECT 'B2' as dept_code, 400 as income,'2024-06-29'as inc_day
union all
SELECT 'B2' as dept_code, 200 as income,'2024-07-03'as inc_day
)
,
target_table as(
SELECT 'A1' as dept_code, 300 as target,'2024-06'as inc_day
union all
SELECT 'A1' as dept_code, 400 as target,'2024-07'as inc_day
union all
SELECT 'B2' as dept_code, 500 as target,'2024-06'as inc_day
)
select a.*,b.*
from(
-- 收入表
select dept_code, if(inc_day=date_add(inc_day,idx),income,0 ), income_y,date_add(inc_day,idx) as inc_day
from (
SELECT *
,sum(income) OVER (PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as income_y -- 计算月累计
,lead(inc_day,1,'2024-07-05') OVER(PARTITION by dept_code order by inc_day asc) as lead_day
FROM test
) a
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
) a
left join(
-- 目标表
select DISTINCT dept_code,target ,substr(date_add(inc_day_concat,idx),1,7) as inc_day
from (
select *, concat(inc_day,'-01') as inc_day_concat
,lead(concat(inc_day,'-01'),1,'2024-07-02') OVER (PARTITION by dept_code order by inc_day asc) lead_day
-- 1号不行,得往后延一天(因为B2没有lastday,如果lead写1号会导致没有最新月)
from target_table
) a
lateral view posexplode(split( space( datediff(lead_day,concat(inc_day,'-01'))-1 ) ,' ')) tmp as idx,ele
) b on a.dept_code=b.dept_code and substr(a.inc_day,1,7)=b.inc_day
-- 还是不对!不满足跨月的月累计数量清零
-- 但是像晨会的业务场景就是对的,如果7月1日数据没出,就看6月的月累计数据(看板文字只有月达成,没有XX月,不会造成歧义)
-- 收入表,这样月累计对了,不会发生7.1月累计取到6.30的情况。但是1号没数的话就没法造出新行了
select dept_code, if(inc_day=date_add(inc_day,idx),income,0 ) as income, income_y,date_add(inc_day,idx) as inc_day
from (
SELECT *
,sum(income) OVER (PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as income_y -- 计算月累计
,lead(inc_day,1,date_add( last_day( inc_day ), 1) ) OVER(PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as lead_day
-- 通过下月月初+分区限制月份。向下找maxday取到月初01号,即月累计等向下最多用到月末(月初-1)。但1号没有,下个月只能从有数的2号开始。。。
FROM test
) a
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
order by dept_code,inc_day
-- 将上段进行优化,使得test表1-N号没数,收入和月累计都为0
-- 考虑将虚拟日期表作为主表,拼接收入表
-- 日数据直接取收入表,空取0
-- 月累计用上述写法,空取0
-- 注意两个点:日期是带-的格式;跑T-1任务时;'2024-07-05' 全部可以替换为T-0,'2024-07-02'可替换为yyyy-MM-02
with test as(
SELECT 'A1' as dept_code, 300 as income,'2024-06-30'as inc_day
union all
SELECT 'A1' as dept_code, 100 as income,'2024-07-02'as inc_day
union all
SELECT 'A1' as dept_code, 150 as income,'2024-07-04'as inc_day
union all
SELECT 'B2' as dept_code, 400 as income,'2024-06-29'as inc_day
union all
SELECT 'B2' as dept_code, 200 as income,'2024-07-03'as inc_day
),
target_table as(
SELECT 'A1' as dept_code, 300 as target,'2024-06'as inc_day
union all
SELECT 'A1' as dept_code, 400 as target,'2024-07'as inc_day
union all
SELECT 'B2' as dept_code, 500 as target,'2024-06'as inc_day
)
select
t.dept_code
,coalesce(a.income ,0) as income
,coalesce(a.income_y,0) as income_y
,coalesce(b.target ,0) as target
,coalesce(a.income_y,0)/coalesce(b.target ,0) as income_y_rate
,lag(coalesce(a.income,0),1) OVER(PARTITION by t.dept_code order by t.inc_day asc) as income_lastday
,t.inc_day
from(
-- 构造的主表(dept_code*日期)
select dept_code,date_add(inc_day,idx) as inc_day
from(
SELECT *,lead(inc_day,1,'2024-07-05') OVER(PARTITION by dept_code order by inc_day asc) as lead_day
FROM test
)
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
) t
left join (
select dept_code, if(inc_day=date_add(inc_day,idx),income,0 ) as income, income_y,date_add(inc_day,idx) as inc_day
from (
SELECT *
,sum(income) OVER (PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as income_y -- 计算月累计
,lead(inc_day,1,date_add( last_day( inc_day ), 1) ) OVER(PARTITION by dept_code, substr(inc_day,1,7) order by inc_day asc) as lead_day
-- 通过下月月初+分区限制月份。向下找maxday取到月初01号,即月累计等向下最多用到月末(月初-1)。但1号没有,下个月只能从有数的2号开始。。。
FROM test
) a
lateral view posexplode( split( space(datediff(lead_day,inc_day)-1 ) ,' ') ) tmp as idx,ele
) a on t.inc_day = a.inc_day and t.dept_code = a.dept_code
left join(
-- 目标表
select DISTINCT dept_code,target ,substr(date_add(inc_day_concat,idx),1,7) as inc_day
from (
select *, concat(inc_day,'-01') as inc_day_concat
,lead(concat(inc_day,'-01'),1,'2024-07-02') OVER (PARTITION by dept_code order by inc_day asc) lead_day
-- 1号不行,得往后延一天(因为B2没有lastday,如果lead写1号会导致没有最新月)
from target_table
) a
lateral view posexplode(split( space( datediff(lead_day,concat(inc_day,'-01'))-1 ) ,' ')) tmp as idx,ele
) b on t.dept_code=b.dept_code and substr(t.inc_day,1,7)=b.inc_day
-- 思路2:月累计直接用lag函数,按月向上取;取不到则0