FlinkSql,如何开窗,如何进行窗口内计算

目录

准备一个表

 开窗

对开窗后的数据进行查询

对窗口内的数据进行聚合运算

对窗口数据使用OVER聚合(窗口函数)

对两个开窗表进行join


注意flink版本,我之前用的1.13.0,这一部分代码可能会有bug,而且13版本不支持Window join 我的版本是1.14.3_2.12

准备一个表

Flink SQL>  CREATE TEMPORARY TABLE Bid( 
  bidtime TIMESTAMP,
  price DECIMAL(10, 2),
  item STRING,
  watermark for bidtime as bidtime  - INTERVAL '1' SECOND
) WITH (
  'connector' = 'kafka',  
  'topic' = 'flink_test', 
  'properties.bootstrap.servers' = 'hdp01:6667', 
  'properties.group.id' = 'park_test_group1', 
  'scan.startup.mode' = 'latest-offset',
  'value.format'='json',
  'value.json.fail-on-missing-field' = 'false',
  'value.json.ignore-parse-errors' = 'true',
)

Flink SQL> desc Bid;
+-------------+------------------------+------+-----+--------+---------------------------------+
|        name |                   type | null | key | extras |                       watermark |
+-------------+------------------------+------+-----+--------+---------------------------------+
|     bidtime | TIMESTAMP(3) *ROWTIME* | true |     |        | `bidtime` - INTERVAL '1' SECOND |
|       price |         DECIMAL(10, 2) | true |     |        |                                 |
|        item |                 STRING | true |     |        |                                 |
+-------------+------------------------+------+-----+--------+---------------------------------+

字段类型转换 bigint 如何作为watermark 

TO_TIMESTAMP_LTZ(ts,3) 

TO_TIMESTAMP_LTZ(numeric, precision)toTimestampLtz(NUMERIC, PRECISION)Converts a epoch seconds or epoch milliseconds to a TIMESTAMP_LTZ, the valid precision is 0 or 3, the 0 represents TO_TIMESTAMP_LTZ(epochSeconds, 0), the 3 represents TO_TIMESTAMP_LTZ(epochMilliseconds, 3).
create table page_log(
    page map<string,string>,
    ts bigint,
    rt TO_TIMESTAMP_LTZ(ts,3), 
    WATERMARK FOR `rt` As `rt` - INTERVAL 2 'SECOND
)

 开窗

将表应用一下TVF(table-valued function )函数,返回一个开窗的表,开窗TVF函数目前有三个

  • TUMBLE
    • TUMBLE(TABLE data, DESCRIPTOR(timecol), size)
  • HOP
    • HOP(TABLE data, DESCRIPTOR(timecol), slide, size [, offset ])
  • CUMULATE
    • ​CUMULATE(TABLE data, DESCRIPTOR(timecol), step, size)

对开窗后的数据进行查询,目前只有以下三种方式

准备数据

{"bidtime":"2020-04-15 08:05","price":4.00,"item":"C","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:07","price":2.00,"item":"A","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:09","price":5.00,"item":"D","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:11","price":3.00,"item":"B","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:13","price":1.00,"item":"E","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:17","price":6.00,"item":"F","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:27","price":6.00,"item":"F","supplier_id":"supplier2"}

以最简单的Tumble,滚动窗口作为例子,无论什么开窗方式,查询方法是一样的

3种使用方式

对窗口内的数据进行聚合运算

问题:每10分钟求最近10分钟的总交易额

SELECT window_start, window_end, SUM(price) as sum_price
FROM TABLE(
     TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)
)
-- group by 中必须携带 window_start, window_end
GROUP BY window_start, window_end

+----+-------------------------+-------------------------+------------------------------------------+
| op |            window_start |              window_end |                                sum_price |
+----+-------------------------+-------------------------+------------------------------------------+
| +I | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 |                                    11.00 |
| +I | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 |                                    10.00 |

对窗口数据使用OVER聚合(窗口函数)

问题:每10分钟求最近10分钟交易额最大的前2条记录

SELECT * from (SELECT 
    bidtime,
    price,
    item,
    supplier_id,
    window_start, 
    window_end, 
    row_number() over(partition by window_start, window_end order by price desc) as row_num 
 FROM TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
) where row_num < 3
+----+-------------------------+------+-----+------------+-------------------------+-------------------------+---------+
| op |                 bidtime |price |item |supplier_id |            window_start |              window_end | row_num |
+----+-------------------------+------+-----+------------+-------------------------+-------------------------+---------+
| +I | 2020-04-15 08:09:00.000 | 5.00 |   D |  supplier2 | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 |       1 |
| +I | 2020-04-15 08:05:00.000 | 4.00 |   C |  supplier1 | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 |       2 |
| +I | 2020-04-15 08:17:00.000 | 6.00 |   F |  supplier2 | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 |       1 |
| +I | 2020-04-15 08:11:00.000 | 3.00 |   B |  supplier2 | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 |       2 |

对两个开窗表进行join

Window JOIN | Apache Flink

Flink SQL> desc LeftTable;
+----------+------------------------+------+-----+--------+----------------------------------+
|     name |                   type | null | key | extras |                        watermark |
+----------+------------------------+------+-----+--------+----------------------------------+
| row_time | TIMESTAMP(3) *ROWTIME* | true |     |        | `row_time` - INTERVAL '1' SECOND |
|      num |                    INT | true |     |        |                                  |
|       id |                 STRING | true |     |        |                                  |
+----------+------------------------+------+-----+--------+----------------------------------+

Flink SQL> SELECT * FROM LeftTable;
+------------------+-----+----+
|         row_time | num | id |
+------------------+-----+----+
| 2020-04-15 12:02 |   1 | L1 |
| 2020-04-15 12:06 |   2 | L2 |
| 2020-04-15 12:03 |   3 | L3 |
+------------------+-----+----+

Flink SQL> desc RightTable;
+----------+------------------------+------+-----+--------+----------------------------------+
|     name |                   type | null | key | extras |                        watermark |
+----------+------------------------+------+-----+--------+----------------------------------+
| row_time | TIMESTAMP(3) *ROWTIME* | true |     |        | `row_time` - INTERVAL '1' SECOND |
|      num |                    INT | true |     |        |                                  |
|       id |                 STRING | true |     |        |                                  |
+----------+------------------------+------+-----+--------+----------------------------------+

Flink SQL> SELECT * FROM RightTable;
+------------------+-----+----+
|         row_time | num | id |
+------------------+-----+----+
| 2020-04-15 12:01 |   2 | R2 |
| 2020-04-15 12:04 |   3 | R3 |
| 2020-04-15 12:05 |   4 | R4 |
+------------------+-----+----+

Flink SQL> SELECT L.num as L_Num, L.id as L_Id, R.num as R_Num, R.id as R_Id, L.window_start, L.window_end
           FROM (
               SELECT * FROM TABLE(TUMBLE(TABLE LeftTable, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
           ) L
           FULL JOIN (
               SELECT * FROM TABLE(TUMBLE(TABLE RightTable, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
           ) R
           ON L.num = R.num AND L.window_start = R.window_start AND L.window_end = R.window_end;
+-------+------+-------+------+------------------+------------------+
| L_Num | L_Id | R_Num | R_Id |     window_start |       window_end |
+-------+------+-------+------+------------------+------------------+
|     1 |   L1 |  null | null | 2020-04-15 12:00 | 2020-04-15 12:05 |
|  null | null |     2 |   R2 | 2020-04-15 12:00 | 2020-04-15 12:05 |
|     3 |   L3 |     3 |   R3 | 2020-04-15 12:00 | 2020-04-15 12:05 |
|     2 |   L2 |  null | null | 2020-04-15 12:05 | 2020-04-15 12:10 |
|  null | null |     4 |   R4 | 2020-04-15 12:05 | 2020-04-15 12:10 |
+-------+------+-------+------+------------------+------------------+

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值