with as常见问题,大概有两类:
1、为啥我用了with..as效率没有提高?
2、sql跑不动,改成with..as的写法,会不会更好些?
网上博客几乎都有结论with ... as语句会把数据放在内存:
前言1:
hive可以通过with查询来提高查询性能,因为先通过with语法将数据查询到内存(既然是加到内存中,如果数据量太大,估计会占用大量资源,可能会报内存溢出),然后后面其他查询可以直接使用,这种方法与创建临时表类似但是不需要创建临时表实体表,内存中的子查询结果再会话结束后会自动删除。使用with as 可提高代码可读性。
前言2:
公用表表达式(CTE)是从with 子句中指定的简单查询派生的临时结果表(会把查询的表数据放到内存中,供其他查询随时使用),该子句紧跟再select或者insert关键字之前,CTE仅在单个语句的执行范围内定义。可以在hive select,insert,create table as select或者create view as select语句中使用一个或多个CTE
使用hive-1180在hive 0.13.0中添加了通用表达式。
简介:
with ... as... 需要定义一个sql片段,会将这个片段产生的结果集保存在内存中
后续的sql均可以访问这个结果集合,作用与视图或临时表类似
语法说明
1.with...as...必须和其他语句一起使用
2.with...as...是一次性的
Hive-sql
在hive中有一个参数
hive.optimize.cte.materialize.threshold
这个参数在默认情况下是-1(关闭的);当开启(大于0),比如设置为2,则如果with..as语句被引用2次及以上时,会把with..as语句生成的table物化,从而做到with..as语句只执行一次,来提高效率。
测试
explain
with atable as (
SELECT id,source,channel
FROM test
)
select source from atable WHERE channel = '直播'
union ALL
select source from atable WHERE channel = '视频'
不设置该参数时,执行计划:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (channel = '直播') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: source (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
alias: test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (channel = '视频') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: source (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
从执行计划上看,test表被读两次。
设置set hive.optimize.cte.materialize.threshold=1
,执行计划:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-6 depends on stages: Stage-1 , consists of Stage-3, Stage-2, Stage-4
Stage-3
Stage-0 depends on stages: Stage-3, Stage-2, Stage-5
Stage-8 depends on stages: Stage-0
Stage-2
Stage-4
Stage-5 depends on stages: Stage-4
Stage-7 depends on stages: Stage-8
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: id (type: int), source (type: string), channel (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.atable
Stage: Stage-6
Conditional Operator
Stage: Stage-3
Move Operator
files:
hdfs directory: true
destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002
Stage: Stage-0
Move Operator
files:
hdfs directory: true
destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b
Stage: Stage-8
Map Reduce
Map Operator Tree:
TableScan
alias: atable
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (channel = '直播') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: source (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
alias: atable
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (channel = '视频') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: source (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.atable
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.atable
Stage: Stage-5
Move Operator
files:
hdfs directory: true
destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002
Stage: Stage-7
Fetch Operator
limit: -1
Processor Tree:
ListSink
可以看到test表被物化了。
源码:
从源码看,在获取元数据时,会做参数判断,判断参数阈值及cte的引用次数
spark-sql
spark对cte的操作比较少,在spark侧,现在还没发现有相关的优化参数
with atable as (
SELECT content_type,
channel,
channel_note,
enter_method,
enter_method_note
FROM search_dw.dim_ecom_enter_channel_df
)
select channel from atable WHERE content_type = '直播'
union ALL
select channel from atable WHERE content_type = '视频'