目录
官网例子 Windowing TVF | Apache Flink
注意flink版本,我之前用的1.13.0,这一部分代码可能会有bug,而且13版本不支持Window join 我的版本是1.14.3_2.12
官网例子 Windowing TVF | Apache Flink
准备一个表
Flink SQL> CREATE TEMPORARY TABLE Bid(
bidtime TIMESTAMP,
price DECIMAL(10, 2),
item STRING,
watermark for bidtime as bidtime - INTERVAL '1' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'flink_test',
'properties.bootstrap.servers' = 'hdp01:6667',
'properties.group.id' = 'park_test_group1',
'scan.startup.mode' = 'latest-offset',
'value.format'='json',
'value.json.fail-on-missing-field' = 'false',
'value.json.ignore-parse-errors' = 'true',
)
Flink SQL> desc Bid;
+-------------+------------------------+------+-----+--------+---------------------------------+
| name | type | null | key | extras | watermark |
+-------------+------------------------+------+-----+--------+---------------------------------+
| bidtime | TIMESTAMP(3) *ROWTIME* | true | | | `bidtime` - INTERVAL '1' SECOND |
| price | DECIMAL(10, 2) | true | | | |
| item | STRING | true | | | |
+-------------+------------------------+------+-----+--------+---------------------------------+
字段类型转换 bigint 如何作为watermark
TO_TIMESTAMP_LTZ(ts,3)
TO_TIMESTAMP_LTZ(numeric, precision) | toTimestampLtz(NUMERIC, PRECISION) | Converts a epoch seconds or epoch milliseconds to a TIMESTAMP_LTZ, the valid precision is 0 or 3, the 0 represents TO_TIMESTAMP_LTZ(epochSeconds, 0), the 3 represents TO_TIMESTAMP_LTZ(epochMilliseconds, 3). |
create table page_log(
page map<string,string>,
ts bigint,
rt TO_TIMESTAMP_LTZ(ts,3),
WATERMARK FOR `rt` As `rt` - INTERVAL 2 'SECOND
)
开窗
将表应用一下TVF(table-valued function )函数,返回一个开窗的表,开窗TVF函数目前有三个
- TUMBLE
-
TUMBLE(TABLE data, DESCRIPTOR(timecol), size)
-
- HOP
-
HOP(TABLE data, DESCRIPTOR(timecol), slide, size [, offset ])
-
- CUMULATE
-
CUMULATE(TABLE data, DESCRIPTOR(timecol), step, size)
-
对开窗后的数据进行查询,目前只有以下三种方式
准备数据
{"bidtime":"2020-04-15 08:05","price":4.00,"item":"C","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:07","price":2.00,"item":"A","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:09","price":5.00,"item":"D","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:11","price":3.00,"item":"B","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:13","price":1.00,"item":"E","supplier_id":"supplier1"}
{"bidtime":"2020-04-15 08:17","price":6.00,"item":"F","supplier_id":"supplier2"}
{"bidtime":"2020-04-15 08:27","price":6.00,"item":"F","supplier_id":"supplier2"}
以最简单的Tumble,滚动窗口作为例子,无论什么开窗方式,查询方法是一样的
3种使用方式
对窗口内的数据进行聚合运算
问题:每10分钟求最近10分钟的总交易额
SELECT window_start, window_end, SUM(price) as sum_price
FROM TABLE(
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)
)
-- group by 中必须携带 window_start, window_end
GROUP BY window_start, window_end
+----+-------------------------+-------------------------+------------------------------------------+
| op | window_start | window_end | sum_price |
+----+-------------------------+-------------------------+------------------------------------------+
| +I | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 | 11.00 |
| +I | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 | 10.00 |
对窗口数据使用OVER聚合(窗口函数)
问题:每10分钟求最近10分钟交易额最大的前2条记录
SELECT * from (SELECT
bidtime,
price,
item,
supplier_id,
window_start,
window_end,
row_number() over(partition by window_start, window_end order by price desc) as row_num
FROM TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
) where row_num < 3
+----+-------------------------+------+-----+------------+-------------------------+-------------------------+---------+
| op | bidtime |price |item |supplier_id | window_start | window_end | row_num |
+----+-------------------------+------+-----+------------+-------------------------+-------------------------+---------+
| +I | 2020-04-15 08:09:00.000 | 5.00 | D | supplier2 | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 | 1 |
| +I | 2020-04-15 08:05:00.000 | 4.00 | C | supplier1 | 2020-04-15 08:00:00.000 | 2020-04-15 08:10:00.000 | 2 |
| +I | 2020-04-15 08:17:00.000 | 6.00 | F | supplier2 | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 | 1 |
| +I | 2020-04-15 08:11:00.000 | 3.00 | B | supplier2 | 2020-04-15 08:10:00.000 | 2020-04-15 08:20:00.000 | 2 |
对两个开窗表进行join
Window JOIN | Apache Flink
Flink SQL> desc LeftTable;
+----------+------------------------+------+-----+--------+----------------------------------+
| name | type | null | key | extras | watermark |
+----------+------------------------+------+-----+--------+----------------------------------+
| row_time | TIMESTAMP(3) *ROWTIME* | true | | | `row_time` - INTERVAL '1' SECOND |
| num | INT | true | | | |
| id | STRING | true | | | |
+----------+------------------------+------+-----+--------+----------------------------------+
Flink SQL> SELECT * FROM LeftTable;
+------------------+-----+----+
| row_time | num | id |
+------------------+-----+----+
| 2020-04-15 12:02 | 1 | L1 |
| 2020-04-15 12:06 | 2 | L2 |
| 2020-04-15 12:03 | 3 | L3 |
+------------------+-----+----+
Flink SQL> desc RightTable;
+----------+------------------------+------+-----+--------+----------------------------------+
| name | type | null | key | extras | watermark |
+----------+------------------------+------+-----+--------+----------------------------------+
| row_time | TIMESTAMP(3) *ROWTIME* | true | | | `row_time` - INTERVAL '1' SECOND |
| num | INT | true | | | |
| id | STRING | true | | | |
+----------+------------------------+------+-----+--------+----------------------------------+
Flink SQL> SELECT * FROM RightTable;
+------------------+-----+----+
| row_time | num | id |
+------------------+-----+----+
| 2020-04-15 12:01 | 2 | R2 |
| 2020-04-15 12:04 | 3 | R3 |
| 2020-04-15 12:05 | 4 | R4 |
+------------------+-----+----+
Flink SQL> SELECT L.num as L_Num, L.id as L_Id, R.num as R_Num, R.id as R_Id, L.window_start, L.window_end
FROM (
SELECT * FROM TABLE(TUMBLE(TABLE LeftTable, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
) L
FULL JOIN (
SELECT * FROM TABLE(TUMBLE(TABLE RightTable, DESCRIPTOR(row_time), INTERVAL '5' MINUTES))
) R
ON L.num = R.num AND L.window_start = R.window_start AND L.window_end = R.window_end;
+-------+------+-------+------+------------------+------------------+
| L_Num | L_Id | R_Num | R_Id | window_start | window_end |
+-------+------+-------+------+------------------+------------------+
| 1 | L1 | null | null | 2020-04-15 12:00 | 2020-04-15 12:05 |
| null | null | 2 | R2 | 2020-04-15 12:00 | 2020-04-15 12:05 |
| 3 | L3 | 3 | R3 | 2020-04-15 12:00 | 2020-04-15 12:05 |
| 2 | L2 | null | null | 2020-04-15 12:05 | 2020-04-15 12:10 |
| null | null | 4 | R4 | 2020-04-15 12:05 | 2020-04-15 12:10 |
+-------+------+-------+------+------------------+------------------+