Deduplication
Batch Streaming
Deduplication removes rows that duplicate over a set of columns, keeping only the first or the last one. In some cases, the upstream ETL jobs are not end-to-end exactly-once; this may result in duplicate records in the sink in case of failover. Because duplicate records affect the correctness of downstream analytical jobs, such as SUM and COUNT, deduplication is needed before further analysis.
Flink uses ROW_NUMBER() to remove duplicates, in the same way as the Top-N query. In fact, deduplication is a special case of Top-N in which N is one and the rows are ordered by a processing time or event time attribute.
The following shows the syntax of the Deduplication statement:
SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1
Parameter Specification:
- ROW_NUMBER(): Assigns a unique, sequential number to each row within its partition, starting from one.
- PARTITION BY col1[, col2...]: Specifies the partition columns, i.e. the deduplication key.
- ORDER BY time_attr [asc|desc]: Specifies the ordering column; it must be a time attribute. Currently Flink supports the processing time attribute and the event time attribute. Ordering by ASC means keeping the first row; ordering by DESC means keeping the last row.
- WHERE rownum = 1: The rownum = 1 is required for Flink to recognize that this query is deduplication.
Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query.
The following examples show how to specify SQL queries with Deduplication on streaming tables.
CREATE TABLE Orders (
  order_id  STRING,
  user      STRING,
  product   STRING,
  num       BIGINT,
  proctime AS PROCTIME()
) WITH (...);

-- remove duplicate rows on order_id and keep the first occurrence row,
-- because there shouldn't be two orders with the same order_id.
SELECT order_id, user, product, num
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
  FROM Orders)
WHERE row_num = 1
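Ordering by DESC keeps the last row per key instead of the first. The following sketch shows deduplication on an event time attribute; the UserActions table, its column names, and the watermark interval are illustrative assumptions, not part of the example above:

```sql
-- Illustrative table: names and the watermark definition are assumptions.
CREATE TABLE UserActions (
  user_name   STRING,
  data        STRING,
  action_time TIMESTAMP(3),
  WATERMARK FOR action_time AS action_time - INTERVAL '5' SECOND
) WITH (...);

-- keep the latest action per user_name: ordering by the event time
-- attribute DESC retains the last row in each partition.
SELECT user_name, data, action_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY action_time DESC) AS row_num
  FROM UserActions)
WHERE row_num = 1
```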