Deduplication
Batch Streaming
Deduplication removes rows that duplicate over a set of columns, keeping only the first or the last one. In some cases, the upstream ETL jobs are not end-to-end exactly-once; this may result in duplicate records in the sink in case of failover. Because duplicate records affect the correctness of downstream analytical jobs, such as SUM and COUNT, deduplication is needed before further analysis.
Flink uses ROW_NUMBER() to remove duplicates, in the same way as the Top-N query. In fact, deduplication is a special case of Top-N in which N is one and the rows are ordered by a processing time or event time attribute.
The following shows the syntax of the Deduplication statement:
SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1
Parameter Specification:
- ROW_NUMBER(): Assigns a unique, sequential number to each row within its partition, starting from one.
- PARTITION BY col1[, col2...]: Specifies the partition columns, i.e. the deduplication key.
- ORDER BY time_attr [asc|desc]: Specifies the ordering column; it must be a time attribute. Currently Flink supports the processing time attribute and the event time attribute. Ordering by ASC means keeping the first row; ordering by DESC means keeping the last row.
- WHERE rownum = 1: The rownum = 1 is required for Flink to recognize that this query is deduplication.
Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query.
The following examples show how to specify SQL queries with Deduplication on streaming tables.
CREATE TABLE Orders (
  order_id  STRING,
  user      STRING,
  product   STRING,
  num       BIGINT,
  proctime AS PROCTIME()
) WITH (...);

-- remove duplicate rows on order_id and keep the first occurrence row,
-- because there shouldn't be two orders with the same order_id.
SELECT order_id, user, product, num
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
  FROM Orders)
WHERE row_num = 1
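Ordering by DESC keeps the last row per key instead of the first. The following sketch shows deduplication on an event time attribute; the UserActions table, its column names, and the watermark interval are illustrative assumptions, not part of the example above:

```sql
-- Illustrative table: names and the watermark definition are assumptions.
CREATE TABLE UserActions (
  user_name   STRING,
  data        STRING,
  action_time TIMESTAMP(3),
  WATERMARK FOR action_time AS action_time - INTERVAL '5' SECOND
) WITH (...);

-- keep the latest action per user_name: ordering by the event time
-- attribute DESC retains the last row in each partition.
SELECT user_name, data, action_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY action_time DESC) AS row_num
  FROM UserActions)
WHERE row_num = 1
```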