官方解释
Lookup Joins
Lookup Joins are a type of join in streaming queries. It is used to enrich a table with data that is queried from Paimon. The join requires one table to have a processing time attribute and the other table to be backed by a lookup source connector.
数据准备
//customers
{"id":1,"name":"001","country":"中国","zip":"zip01"}
{"id":2,"name":"002","country":"加拿大","zip":"zip02"}
{"id":3,"name":"003","country":"新加坡","zip":"zip03"}
{"id":4,"name":"004","country":"日本","zip":"zip04"}
{"id":5,"name":"005","country":"英国","zip":"zip05"}
{"id":6,"name":"006","country":"法国","zip":"zip06"}
//Orders
{"order_id":10001,"total":1000001,"customer_id":1}
{"order_id":10002,"total":1000002,"customer_id":2}
{"order_id":10003,"total":1000003,"customer_id":3}
{"order_id":10004,"total":1000004,"customer_id":4}
{"order_id":10005,"total":1000005,"customer_id":5}
{"order_id":10006,"total":1000006,"customer_id":6}
DDL、DML
CREATE TABLE if not exists customers (
id INT PRIMARY KEY NOT ENFORCED,
name STRING,
country STRING,
zip STRING,
create_time TIMESTAMP_LTZ(3) -- now()
);
insert into customers (
id,
name,
country,
zip,
create_time)
select
id,
name,
country,
zip,
now()
from InputTable;
CREATE TEMPORARY TABLE Orders (
order_id INT,
total INT,
customer_id INT,
order_time String, -- now()
proc_time AS PROCTIME()
) WITH (
'connector' = 'kafka',
'topic' = 'test-paimon',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
);
查询
//1:先在维表插入3条数据
+----+-----+------+---------+-------+-------------------------+
| op | id | name | country | zip | create_time |
+----+-----+------+---------+-------+-------------------------+
| +I | 1 | 001 | 中国 | zip01 | 2023-07-07 16:37:24.257 |
| +I | 2 | 002 | 加拿大 | zip02 | 2023-07-07 16:37:24.257 |
| +I | 3 | 003 | 新加坡 | zip03 | 2023-07-07 16:37:26.260 |
+----+-----+------+---------+-------+-------------------------+
/*
查询语句
SELECT o.order_id,
o.total,
c.country,
c.zip,
o.order_time,
c.create_time
FROM Orders AS o
JOIN customers
FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id
*/
//2:插入数据{"order_id":10001,"total":1000001,"customer_id":1}
//3:查询结果
+----+---------+-------------+---------+-------+--------------------+-------------------------+
| op |order_id | total | country | zip | order_time | create_time |
+----+---------+-------------+---------+-------+--------------------+-------------------------+
| +I | 10001 | 1000001 | 中国 | zip01 |2023-07-07 16:47:16 | 2023-07-07 16:37:24.257 |
+----+---------+-------------+---------+-------+--------------------+-------------------------+
//4:order 表继续插入数据
/*
{"order_id":10002,"total":1000002,"customer_id":2}
{"order_id":10003,"total":1000003,"customer_id":3}
{"order_id":10004,"total":1000004,"customer_id":4}
{"order_id":10005,"total":1000005,"customer_id":5}
*/
//5:查询结果
+----+----------+---------+---------+-------+---------------------+-------------------------+
| op | order_id | total | country | zip | order_time | create_time |
+----+----------+---------+---------+-------+---------------------+-------------------------+
| +I | 10001 | 1000001 | 中国 | zip01 | 2023-07-07 16:47:16 | 2023-07-07 16:37:24.257 |
| +I | 10003 | 1000003 | 新加坡 | zip03 | 2023-07-07 16:53:52 | 2023-07-07 16:37:26.260 |
| +I | 10002 | 1000002 | 加拿大 | zip02 | 2023-07-07 16:53:52 | 2023-07-07 16:37:24.257 |
+----+----------+---------+---------+-------+---------------------+-------------------------+
// 6:维表插入数据 4、5
+----+----+------+----------+------+-------------------------+
| op | id | name | country | zip | create_time |
+----+----+------+----------+------+-------------------------+
| +I | 1 | 001 | 中国 |zip01 | 2023-07-07 16:37:24.257 |
| +I | 2 | 002 | 加拿大 |zip02 | 2023-07-07 16:37:24.257 |
| +I | 3 | 003 | 新加坡 |zip03 | 2023-07-07 16:37:26.260 |
| +I | 4 | 004 | 日本 |zip04 | 2023-07-07 16:56:03.695 |
| +I | 5 | 005 | 英国 |zip05 | 2023-07-07 16:56:04.455 |
+----+----+------+----------+------+-------------------------+
//发现步骤5的数据没变化
结论
lookup join 关联不上的数据不会显示,即使后面数据又进到lookup表,同样不会再重新显示
如:id 4,5 两条数据先进到order表,此时customers没有4,5数据关联不上,不会输出4,5 数据。接下来
customers 表插入customer_id 4,5的数据发现控制台无变化
If the records of Orders (main table) join missing because the corresponding data of customers (lookup table) is not ready. You can consider using Flink’s Delayed Retry Strategy For Lookup.
开启 Delayed Retry Strategy For Lookup
SELECT /*+ LOOKUP('table'='c', 'async'='true', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='10s','max-attempts'='3')*/o.order_id,
o.total,
c.country,
c.zip,
o.order_time,
c.create_time
FROM Orders AS o
JOIN customers
FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id
1:清空数据
2:Orders 表插入数据{“order_id”:10001,“total”:1000001,“customer_id”:1}
3:customers表插入数据{“id”:1,“name”:“001”,“country”:“中国”,“zip”:“zip01”}
4:Orders 表插入数据{“order_id”:10002,“total”:1000002,“customer_id”:2}
5:10s之后customers表插入数据{“id”:2,“name”:“002”,“country”:“加拿大”,“zip”:“zip02”}
6:Orders 表插入数据{“order_id”:10003,“total”:1000003,“customer_id”:3}
7:30s之后customers表插入数据{“id”:3,“name”:“003”,“country”:“新加坡”,“zip”:“zip03”}
8:查询结果
+----+---------+--------+--------+------+--------------------+-------------------------+
| op |order_id | total |country | zip | order_time | create_time |
+----+---------+--------+--------+------+--------------------+-------------------------+
| +I | 10001 |1000001 | 中国 |zip01 |2023-07-07 17:29:03 | 2023-07-07 17:29:12.981 |
| +I | 10002 |1000002 | 加拿大 |zip02 |2023-07-07 17:30:11 | 2023-07-07 17:30:28.257 |
+----+---------+--------+--------+------+--------------------+-------------------------+
9:结论
在指定Delayed Retry Strategy之后可以在规则内正常处理晚到的数据,按照上述配置两条数据相差超过30s就会被丢弃