Presto/Trino中select语句的使用(union、intersect、except和grouping set、rollup、cube,with内联视图和子查询)、over开窗函数的使用

1. select查询

语法如下:

[ WITH [ RECURSIVE ] with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ WINDOW window_definition_list]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT { count | ALL } ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
  • having子句在group by之后计算,作用于分组聚合后的行
  • join默认是inner join,还是left outer join、right outer join、full outer join、cross join(对于from a, b where a.key = b.key,前半部分是cronss join)

基础的select例子如下:

trino>
trino> select upper(b.name) customer_name, round(sum(totalprice), 1) total_price 
    -> from tpch.sf1.orders a join tpch.sf1.customer b on a.custkey = b.custkey 
    -> where extract(year from a.orderdate) = 1997 or b.nationkey = 1 
    -> group by b.name 
    -> having round(sum(totalprice), 1) > 10000 
    -> order by total_price desc limit 3;
   customer_name    | total_price 
--------------------+-------------
 CUSTOMER#000111280 |   5776853.7 
 CUSTOMER#000126199 |   5682036.3 
 CUSTOMER#000015751 |   5594552.2 
(3 rows)

Query 20220109_222708_00100_jiymz, FINISHED, 2 nodes
Splits: 14 total, 14 done (100.00%)
4.40 [1.95M rows, 13.2MB] [443K rows/s, 2.99MB/s]

trino> 

1.1 union、intersect、except

  • union all:合并所有的行,不去除重复行
  • union:合并所有的行,去除重复行
  • intersect:返回两个查询中都有的行,但去除重复行
  • except:返回只有第一个查询中有的行,但去除重复行
trino>
trino> select * from (values (1, 2), (1, 2), (2, 3)) except select * from (values (2, 3), (3, 4));
 _col0 | _col1 
-------+-------
     1 |     2 
(1 row)

Query 20220109_223715_00107_jiymz, FINISHED, 2 nodes
Splits: 9 total, 9 done (100.00%)
0.98 [0 rows, 0B] [0 rows/s, 0B/s]

trino> 

1.2 grouping set、rollup、cube

分别对每个分组组合进行聚合计算,结果是所有计算结果的联合,没有值的列,则为NULL

例如对于聚合列(a, b, c):

  • rollup(a, b, c)生成基于层次结构的分组组合(a, b, c), (a, b), (a), ()
  • cube(a, b, c)生成所有可能的分组组合(a, b, c), (a, b), (a, c), (b, c), (a), (b), ©, ()
  • grouping sets可以自定义分组组合,如grouping sets((a, b, c), ())
trino>
trino> select orderstatus, orderpriority, shippriority, sum(totalprice) totalprice, grouping(orderstatus, orderpriority, shippriority) as group_id 
    -> from tpch.sf1.orders 
    -> group by grouping sets((orderstatus, orderpriority, shippriority), ()) 
    -> order by group_id;
 orderstatus |  orderpriority  | shippriority |      totalprice       | group_id 
-------------+-----------------+--------------+-----------------------+----------
 O           | 5-LOW           |            0 | 2.2075989903380028E10 |        0 
 O           | 1-URGENT        |            0 |  2.202843787702009E10 |        0 
 F           | 5-LOW           |            0 | 2.1997577898639805E10 |        0 
 F           | 4-NOT SPECIFIED |            0 |  2.194042389442989E10 |        0 
 O           | 2-HIGH          |            0 | 2.2042604367650177E10 |        0 
 F           | 3-MEDIUM        |            0 | 2.1777505918090096E10 |        0 
 P           | 4-NOT SPECIFIED |            0 |  1.4212495106299987E9 |        0 
 P           | 3-MEDIUM        |            0 |  1.4197204557000003E9 |        0 
 P           | 5-LOW           |            0 |   1.427590893769999E9 |        0 
 O           | 3-MEDIUM        |            0 | 2.1956381714669975E10 |        0 
 O           | 4-NOT SPECIFIED |            0 | 2.1914360578039955E10 |        0 
 P           | 1-URGENT        |            0 |   1.402424471659998E9 |        0 
 F           | 2-HIGH          |            0 | 2.1999039814129982E10 |        0 
 F           | 1-URGENT        |            0 | 2.1987867088400032E10 |        0 
 P           | 2-HIGH          |            0 |  1.4381320612500029E9 |        0 
 NULL        | NULL            |         NULL | 2.2682930644745822E11 |        7 
(16 rows)

Query 20220109_230647_00123_jiymz, FINISHED, 2 nodes
Splits: 11 total, 11 done (100.00%)
1:16 [1.5M rows, 0B] [19.9K rows/s, 0B/s]

trino>
  • grouping(col1, col2, col3)函数用于标识一行属于哪个分组,计算出来的group id可能不是连续的

1.3 with内联视图

用于在一个查询中定义一个内联视图,供多次查询访问

trino>
trino> with 
    -> total as (
    -> select mktsegment,
    -> sum(acctbal) as total_per_mktsegment
    -> from tpch.sf1.customer
    -> group by 1
    -> ),
    -> average as (
    -> select avg(total_per_mktsegment) as mktsegment_average
    -> from total
    -> )
    -> select mktsegment,
    -> total_per_mktsegment,
    -> mktsegment_average
    -> from total,
    -> average
    -> where total_per_mktsegment > mktsegment_average;
 mktsegment | total_per_mktsegment | mktsegment_average 
------------+----------------------+--------------------
 HOUSEHOLD  |  1.358733411699999E8 |    1.34865369948E8 
 BUILDING   | 1.3588862194000012E8 |    1.34865369948E8 
(2 rows)

Query 20220109_231949_00129_jiymz, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
3.74 [300K rows, 0B] [80.1K rows/s, 0B/s]

trino>

average内联视图引用了total内联视图,所以total内联视图被引用了两次。
Presto/Trino会基于代价的计算,判断是否将total内联视图的结果进行保存缓存

1.4 子查询

1. 标量子查询

子查询返回单个值(一行一列), 子查询只计算一次

trino>
trino> select name from tpch.sf1.nation 
    -> where regionkey = (
    -> select regionkey from tpch.sf1.region where name = 'AMERICA'
    -> );
     name      
---------------
 ARGENTINA     
 BRAZIL        
 CANADA        
 PERU          
 UNITED STATES 
(5 rows)

Query 20220109_233108_00136_jiymz, FINISHED, 2 nodes
Splits: 11 total, 11 done (100.00%)
1.15 [32 rows, 116B] [27 rows/s, 101B/s]

trino>

2. exists子查询

  • in子查询:如果一个列的值存在exists子查询的结果中,则为true
  • not in子查询:如果一个列的值不存在exists子查询的结果中,则为true
trino>
trino> select name from tpch.sf1.nation 
    -> where regionkey not in (
    -> select regionkey from tpch.sf1.region
    -> );
 name 
------
(0 rows)

Query 20220110_001101_00000_jiymz, FINISHED, 2 nodes
Splits: 9 total, 9 done (100.00%)
0.31 [40 rows, 244B] [128 rows/s, 785B/s]

trino>

3. 集合比较子查询

较的运算符除了=,还有<、>、<=、>=、=、<>

集合比较子查询的查询方式有两种:

  1. any:表示只需满足子查询中的一个值, = any等价于in子查询
trino> 
trino> select name from tpch.sf1.nation where regionkey = any (select regionkey from tpch.sf1.region);
      name      
----------------
 ALGERIA        
 ARGENTINA      
 ......省略部分......
 UNITED KINGDOM 
 UNITED STATES  
(25 rows)

Query 20220110_001814_00005_jiymz, FINISHED, 2 nodes
Splits: 13 total, 13 done (100.00%)
0.37 [40 rows, 328B] [107 rows/s, 879B/s]

trino> 
trino> select name from tpch.sf1.nation where regionkey in (select regionkey from tpch.sf1.region);
      name      
----------------
 ALGERIA        
 ARGENTINA      
 ......省略部分......     
 UNITED KINGDOM 
 UNITED STATES  
(25 rows)

Query 20220110_001821_00006_jiymz, FINISHED, 2 nodes
Splits: 13 total, 13 done (100.00%)
0.36 [40 rows, 328B] [111 rows/s, 916B/s]

trino> 

对于any, 如果子查询中的一个值为NULL,则对应的集合比较子查询的结果是NULL(不是true和false)。合并结果时一个false和NULL进行合并, 则计算结果为NULL

  1. all:表示必须满足子查询的所有值

对于all,如果子查询为空,则集合比较子查询的结果为true。如果子查询中的一个值为NULL,则对应的集合比较子查询的结果是NULL(不是true和false)。合并结果时一个true和NULL进行合并, 则计算结果为NULL

2. SQL函数

2.1 over (partition by col order by col)

生成测试数据

trino> create schema memory.test_db;
CREATE SCHEMA
trino> 
trino> 
trino> create table memory.test_db.test_tb(
    -> user_id bigint,
    -> buy_year int,
    -> buy_amount int
    -> );
CREATE TABLE
trino> 
trino> 
trino> insert into memory.test_db.test_tb(user_id, buy_year, buy_amount) 
    -> values(1, 2012, 30), 
    -> (1, 2010, 20), 
    -> (1, 2009, 10), 
    -> (2, 2022, 300), 
    -> (2, 2020, 200), 
    -> (2, 2019, 100);
INSERT: 6 rows

Query 20220210_001404_00009_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
0.24 [0 rows, 0B] [0 rows/s, 0B/s]

trino>

1. 开窗求迭代累加和

trino> 
trino> select user_id, buy_year, buy_amount, 
    -> sum(buy_amount) over (partition by user_id order by buy_year) as user_iteration_sum_buy_amount 
    -> from memory.test_db.test_tb;
 user_id | buy_year | buy_amount | user_iteration_sum_buy_amount 
---------+----------+------------+-------------------------------
       1 |     2009 |         10 |                            10 
       1 |     2010 |         20 |                            30 
       1 |     2012 |         30 |                            60 
       2 |     2019 |        100 |                           100 
       2 |     2020 |        200 |                           300 
       2 |     2022 |        300 |                           600 
(6 rows)

Query 20220210_001707_00010_jx84g, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
0.24 [6 rows, 114B] [24 rows/s, 469B/s]

trino> 

2. 开窗获取前N行的值

lag(col, offset)中的offset值为0或正整数,如果offset为0,表示当前行;offset为2,表示前面第二行

trino> 
trino> select user_id, buy_year, buy_amount, 
    -> lag(buy_amount, 1) over (partition by user_id order by buy_year) as previous_buy_amount 
    -> from memory.test_db.test_tb;
 user_id | buy_year | buy_amount | previous_buy_amount 
---------+----------+------------+---------------------
       1 |     2009 |         10 |                NULL 
       1 |     2010 |         20 |                  10 
       1 |     2012 |         30 |                  20 
       2 |     2019 |        100 |                NULL 
       2 |     2020 |        200 |                 100 
       2 |     2022 |        300 |                 200 
(6 rows)

Query 20220210_001915_00011_jx84g, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
0.27 [6 rows, 114B] [21 rows/s, 418B/s]

trino> 
  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值