Deduplication
Batch Streaming
Deduplication removes rows that duplicate over a set of columns, keeping only the first or the last one. In some cases, the upstream ETL jobs are not end-to-end exactly-once, which may produce duplicate records in the sink after a failover. Because duplicate records affect the correctness of downstream analytical jobs (e.g. SUM, COUNT), deduplication is needed before further analysis.
Flink uses ROW_NUMBER() to remove duplicates, in the same way as a Top-N query. In effect, deduplication is a special case of Top-N in which N is one and rows are ordered by processing time or event time.
The following shows the syntax of the Deduplication statement:
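Following the Top-N pattern described above, the general shape of a deduplication query can be sketched as:

```sql
SELECT [column_list]
FROM (
  SELECT [column_list],
    ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
      ORDER BY time_attr [asc|desc]) AS rownum
  FROM table_name)
WHERE rownum = 1
```

Here PARTITION BY specifies the deduplicate key columns, ORDER BY must be on a time attribute (ascending keeps the first row per key, descending keeps the last), and the outer filter on rownum = 1 restricts the result to one row per key.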