Flink SQL: Queries (Deduplication)

Deduplication

Batch Streaming

Deduplication removes rows that are duplicated over a set of columns, keeping only the first or the last row. In some cases the upstream ETL jobs are not end-to-end exactly-once, which may produce duplicate records in the sink after a failover. Duplicate records affect the correctness of downstream analytical jobs (e.g. SUM, COUNT), so deduplication is needed before further analysis.

Flink uses ROW_NUMBER() to remove duplicates, in the same way as a Top-N query. In theory, deduplication is a special case of Top-N in which N is one and the rows are ordered by processing time or event time.

The following shows the syntax of the Deduplication statement:
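
A sketch of the general form, following the pattern given in the Apache Flink documentation. The bracketed names (column_list, col1, time_attr, table_name) are placeholders, not real identifiers:

```sql
SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1
```

Parameters:

ROW_NUMBER(): assigns a unique, sequential number to each row, starting at one.
PARTITION BY col1[, col2...]: the partition columns, i.e. the deduplication key.
ORDER BY time_attr [asc|desc]: the ordering column, which must be a time attribute (processing time or event time). Ordering ASC keeps the first row, ordering DESC keeps the last row.
WHERE rownum = 1: required for Flink to recognize the query as a deduplication.

As a usage example, the query below keeps the first row seen for each order. The Orders table, its columns, and the datagen connector are assumptions chosen for illustration only:

```sql
-- Hypothetical source table; the datagen connector only produces random test rows.
CREATE TABLE Orders (
  order_id STRING,
  user_id  STRING,
  product  STRING,
  num      BIGINT,
  proctime AS PROCTIME()  -- processing-time attribute used for ordering
) WITH (
  'connector' = 'datagen'
);

-- Keep only the first row per order_id, by processing time.
SELECT order_id, user_id, product, num
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
  FROM Orders)
WHERE row_num = 1;
```

Changing ASC to DESC would instead keep the most recent row for each order_id.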
