Flink SQL: Queries (Deduplication)

Deduplication

Batch Streaming

Deduplication removes rows that are duplicates over a set of columns, keeping only the first or the last one. In some cases, the upstream ETL jobs are not end-to-end exactly-once, which can produce duplicate records in the sink after a failover. Because duplicate records affect the correctness of downstream analytical jobs such as SUM and COUNT, deduplication is needed before further analysis.

Flink uses ROW_NUMBER() to remove duplicates, just as in a Top-N query. In theory, deduplication is a special case of Top-N in which N is one and rows are ordered by processing time or event time.
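
For comparison, the general Top-N pattern in Flink SQL looks roughly like the sketch below; deduplication is the special case where the filter is rownum = 1 and the ordering column is a time attribute:

SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
   FROM table_name)
WHERE rownum <= N [AND conditions]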

The following shows the syntax of the Deduplication statement:

SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1

Parameter Specification:

  • ROW_NUMBER(): Assigns a unique, sequential number to each row, starting with one.
  • PARTITION BY col1[, col2...]: Specifies the partition columns, i.e., the deduplication key.
  • ORDER BY time_attr [asc|desc]: Specifies the ordering column; it must be a time attribute. Currently Flink supports processing time attributes and event time attributes. Ordering by ASC means keeping the first row, and ordering by DESC means keeping the last row (see the event-time example at the end of this section).
  • WHERE rownum = 1: The rownum = 1 filter is required for Flink to recognize that this query performs deduplication.

Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query.

The following examples show how to specify SQL queries with Deduplication on streaming tables.

CREATE TABLE Orders (
  order_id STRING,
  user     STRING,
  product  STRING,
  num      BIGINT,
  proctime AS PROCTIME()
) WITH (...);

-- remove duplicate rows on order_id and keep the first occurrence row,
-- because there shouldn't be two orders with the same order_id.
SELECT order_id, user, product, num
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
  FROM Orders)
WHERE row_num = 1
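
The query above orders by processing time in ascending order, so it keeps the first row seen for each order_id. As a complementary sketch (the UserUpdates table, its columns, and the watermark interval are hypothetical, not part of the original example), the same pattern with an event-time attribute and descending order keeps the latest row per key:

CREATE TABLE UserUpdates (
  user_id     STRING,
  balance     BIGINT,
  update_time TIMESTAMP(3),
  WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (...);

-- remove duplicate rows on user_id and keep the latest update for each user,
-- because only the most recent balance matters downstream.
SELECT user_id, balance, update_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY update_time DESC) AS row_num
  FROM UserUpdates)
WHERE row_num = 1

Keeping the last row produces an updating result: when a newer row arrives for a key, the previously emitted row for that key is retracted and replaced.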