abp 通用sql
论数据工程 (On Data Engineering)
ROW_NUMBER
is one of the most valuable and versatile functions in SQL. It can be leveraged for different use cases, from ranking items, identifying data quality gaps, doing some minimization, handling preference queries, or helping with sessionization etc.
ROW_NUMBER
是SQL中最有价值和最通用的功能之一。 它可以用于不同的用例,例如对项目进行排名,识别数据质量差距,进行最小化,处理首选项查询或帮助进行会话化等。
The ROW_NUMBER
function isn’t, however, a traditional function. It is a window function. Window functions are an advanced kind of function, with specific properties. This article aims to go over how window functions, and more specifically, how the ROW_NUMBER
function work, and to go over some of the use cases for the ROW_NUMBER
function.
ROW_NUMBER
函数不是传统函数。 这是一个窗口功能。 窗口函数是具有特定属性的高级函数。 本文的目的是检查如何窗口的功能,更具体地说,是怎么ROW_NUMBER
功能工作,去了一些使用案例ROW_NUMBER
功能。
窗函数剖析 (Anatomy of a Window function)
To understand how a window function work, it is essential first to understand, what type of arguments it can take. The following is the syntax for providing an argument using the window function.
要了解窗口函数的工作原理,首先必须了解其可以采用的参数类型。 以下是使用window函数提供参数的语法。
ROW_NUMBER() OVER (window)
ROW_NUMBER()个(窗口)
As you can see, the row number doesn’t take a direct argument. We don’t have a ROW_NUMBER(a.columna)
, for instance, but takes arguments in the OVER
clause. The argument it takes is called a window. The window defines a subset of the dataset to use for the computation. Different arguments can be used to define this window, partitions, orders, rows between.
如您所见,行号没有直接的参数。 例如,我们没有ROW_NUMBER(a.columna)
,但是在OVER
子句中使用了参数。 它采用的参数称为窗口。 该窗口定义了要用于计算的数据集的子集。 可以使用不同的参数来定义此窗口,分区,顺序,之间的行。
Other window functions may also include direct arguments like traditional functions, such as the SUM
window function, e.g.: SUM(amount) OVER (window)
, in which case we would be summing the amount over a subset of the data as defined by the window.
其他窗口函数也可能包括直接参数,例如传统函数,例如SUM
窗口函数,例如: SUM(amount) OVER (window)
,在这种情况下,我们将对窗口定义的数据子集的数量求和。
PARTITION BY
分区依据
The PARTITION BY
argument allows us to split the dataset. The typical way to uses it is to specify the list of columns on which we would like to start a new count on:
PARTITION BY
参数允许我们分割数据集。 典型的使用方式是指定要在其上开始新计数的列的列表:
SELECT
a.client_id,
ROW_NUMBER() OVER(PARTITION BY a.country, a.city) AS rn_city
The above statement would, for instance, gives us, for each client, a row number from 1 to n (number of client in the city).
例如,上面的语句为每个客户提供了从1到n(城市中客户的数量)的行号。
The partition by clause can, however, accept more complicated expressions. This is better shown using a SUM
window function rather than a ROW_NUMBER
function.
但是,partition by子句可以接受更复杂的表达式。 使用SUM
窗口函数而不是ROW_NUMBER
函数可以更好地显示这一点。
-- SQLite Syntax
WITH
x AS (
SELECT 1 AS t
UNION ALL
SELECT 2 AS t
UNION ALL
SELECT 2 AS t
UNION ALL
SELECT 3 AS t)SELECT
t, SUM(t) OVER (PARTITION BY CASE WHEN t = 1 THEN t ELSE NULL END ) AS s
FROM x;
The result of the query is the following:
查询结果如下:
t|s|
2|7
2|7
3|7
1|1
What the query does is handling the SUM with a partition set for t=1, and another for the rest of the query (NULL). The split between the dataset happens after the evaluation from the case statement query.
该查询所执行的操作是处理SUM,并将其分区设置为t = 1,其余分区设置为查询的其余部分(NULL)。 数据集之间的拆分发生在案例语句查询的评估之后。
Since for t=1, we have one partition equal to t (which can have only one value), we have a first group (sub-dataset) to do computation on. For t != 1, there is only one option is to be part of the group with an output as a NULL
value. Since this group is composed of 2 records with t=2 and one record with t=3, the sum for the group is equal to 7.
因为对于t = 1,我们有一个等于t的分区(只能有一个值),所以我们有第一组(子数据集)进行计算。 对于t!= 1,只有一个选项是成为组的一部分,并且输出为NULL
值。 由于该组由t = 2的2条记录和t = 3的1条记录组成,因此该组的总和等于7。
If we replaced the window function with the following:
如果我们将窗口函数替换为以下内容:
PARTITION BY CASE WHEN t <= 2 THEN ELSE null END
We would generate three groups to split the data into t=1, t=2, and t>2. The respective sums would be 1,4 and 3.
我们将生成三组以将数据分为t = 1,t = 2和t> 2。 分别为1,4和3。
The same type of operations can also be performed to compute the row numbers.
还可以执行相同类型的操作来计算行号。
订购 (ORDER BY)
The order by argument will define, for the purpose of this specific function, how the dataset will be sorted. A simple ROW_NUMBER query such as the following will only be providing a sorted dataset by value with the associate row_number as if it was a full dataset:
为了实现此特定功能,按参数排序将定义如何对数据集进行排序。 像下面这样的简单ROW_NUMBER查询将仅提供按值排序的数据集以及相关联的row_number,就好像它是完整数据集一样:
SELECT
*,
ROW_NUMBER() OVER (ORDER BY amount) AS rn
The ORDER BY
window argument can like the general query order by support ascending (ASC) or descending modifiers (DESC).
通过支持升序(ASC)或降序修饰符(DESC), ORDER BY
窗口参数可以像一般查询顺序一样。
SELECT
*,
ROW_NUMBER() OVER (ORDER BY amount DESC) AS rn
Other supported modifiers are related to the treatment of null values. We can select if null values should be considered first (NULLS FIRST
)or last (NULLS LAST
).
其他受支持的修饰符与空值的处理有关。 我们可以选择是否应该首先考虑空值( NULLS FIRST
)或最后一个( NULLS LAST
)。
SELECT
*,
ROW_NUMBER() OVER (ORDER BY amount DESC NULLS LAST) AS rn
Some dialects, such as T-SQL or SQLite, allow for the use of aggregate functions within the window for ordering purposes.
一些方言,例如T-SQL或SQLite,允许在窗口内使用聚合函数进行排序。
-- SQLite Syntax
WITH
x AS (
SELECT 1 as t
UNION ALL
SELECT 2 as t
UNION ALL
SELECT 2 as t
)
SELECT
t, COUNT(1) AS cnt, ROW_NUMBER() OVER (ORDER BY SUM(t) DESC) AS rn
FROM x
GROUP BY T;
The result of which is the following:
结果如下:
t|cnt|rn|
2|2|1
1|1|2
This, however, requires the use of a group by aggregation.
但是,这需要通过聚合使用组。
框架规格 (FRAME SPECIFICATION)
Besides the partition by clause, it is also possible to specify the subset of data to use based on what is called a frame specification. The frame specification will either take a subset of data based on the row placement within the partition or a numeric or temporal value.
除了partition by子句,还可以根据所谓的帧规范指定要使用的数据子集。 帧规范将基于分区中行的位置或数字或时间值获取数据的子集。
The frame specification is typically placed after a ORDER BY
clause, and is generally started with either a ROW or RANGE operator. Each takes an indication of how many units before and after the current row to use to calculate the output of the function. Some examples of this are ROWS 5 PRECEDING AND 1 FOLLOWING
, RANGE 1 PRECEDING AND CURRENT ROW
or RANGE INTERVAL 5 DAY PRECEDING AND 0 DAY FOLLOWING.
It can also take unbounded arguments, for example:ROWS UNBOUNDED PRECEDING AND CURRENT ROW
. An example query making use of this frame specification is provided below using a SUM
window function for illustrative purpose:
帧规范通常放置在ORDER BY
子句之后,并且通常以ROW或RANGE运算符开始。 每个参数都指示当前行之前和之后多少个单位可用于计算函数的输出。 这样的一些示例是行ROWS 5 PRECEDING AND 1 FOLLOWING
,行RANGE 1 PRECEDING AND CURRENT ROW
或行RANGE INTERVAL 5 DAY PRECEDING AND 0 DAY FOLLOWING.
它还可以采用无限制的参数,例如: ROWS UNBOUNDED PRECEDING AND CURRENT ROW
。 以下出于说明目的,使用SUM
窗口函数提供了使用此帧规范的示例查询:
WITH
x AS (
SELECT 1 AS t
UNION ALL
SELECT 2 AS t
UNION ALL
SELECT 3 AS t
UNION ALL
SELECT 4 AS t
)SELECT
t,
SUM(t) OVER (ORDER BY t ASC RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS s
FROM x;
----
t|s
1|1 (1 + nothing)
2|3 (2+1)
3|5 (3+2)
4|7 (4+3)
窗口别名 (WINDOW ALIASING)
When leveraging multiple window functions in the same query, it is possible to render its content through a window alias. Windows can be aliased defining them after the HAVING statement (if used) or if not used, a used statement occurring just before in the SQL evaluation order (FROM/WHERE/GROUP BY). An example of window aliasing is shown below:
在同一查询中利用多个窗口函数时,可以通过窗口别名来呈现其内容。 Windows可以在HAVING语句之后(如果已使用)或如果未使用(在SQL评估顺序(FROM / WHERE / GROUP BY)之前出现的已使用语句)之后为Windows定义别名。 窗口别名的示例如下所示:
WITH
x AS (
SELECT 1 AS t
UNION ALL
SELECT 2 AS t
)
SELECT
t,
SUM(t) OVER (w) AS s,
ROW_NUMBER() OVER (w) AS rn
FROM x
WINDOW w AS (ORDER BY t DESC);
----
t|s|rn
3|3|1
2|5|2
1|6|3
排行 (Ranking)
One of the typical use cases of the ROW_NUMBER function is that of ranking records. Other functions exist to rank values in SQL, such as the RANK and DENSE_RANK functions. It is essential to understand their particularities and differences.
ROW_NUMBER函数的典型用例之一是对记录进行排名。 存在其他用于对SQL中的值进行排名的函数,例如RANK和DENSE_RANK函数。 必须了解它们的特殊性和差异。
ROW_NUMBER
: Provides a unique and strictly increasing by one ranking for each recordROW_NUMBER
:为每条记录提供唯一且严格按排名增加的排名RANK
: Non-unique ranking (multiple records can share the same rank), increasing by the number of records in the previous rank, this means that the function can skip some ranks.RANK
:非唯一排名(多个记录可以共享相同的排名),并增加前一个排名中的记录数量,这意味着该功能可以跳过某些排名。DENSE_RANK
: Non-unique ranking, increasing by 1 for each change of ranksDENSE_RANK
:非唯一排名,每更改一次排名就增加1
An example query shows how the different functions would behave:
查询示例显示了不同功能的行为:
WITH
x AS (
SELECT 1 AS t
UNION ALL
SELECT 2 AS t
UNION ALL
SELECT 2 AS t
)
SELECT
t,
SUM(t) OVER (w) AS s,
ROW_NUMBER() OVER (w) AS rn,
RANK() OVER (w) AS rank_n,
DENSE_RANK() OVER (w) AS drank_nFROM x
WINDOW w AS (ORDER BY t DESC);
---t|s|rn|rank_n|drank_n
2|4|1 |1 |1
2|4|2 |1 |1
1|5|3 |3 |2
The uniqueness property of ROW_NUMBER
is one of its’ most significant advantages. Since we know that there can be at most one record for a given value of a ROW_NUMBER
We should be able to join on it within worrying about cardinality safely.
ROW_NUMBER
的唯一性属性是其最重要的优势之一。 由于我们知道给定值ROW_NUMBER
最多可以有一个记录,因此我们应该可以在担心基数安全的情况下加入该记录。
最大化/最小化 (Maximization / Minimization)
The ROW_NUMBER
function can be used for minimization or maximization on the dataset. Using, it is possible to get some ARG MAX.
ROW_NUMBER
函数可用于数据集的最小化或最大化。 使用,可以获得一些ARG MAX 。
Let say we have been asked to find the vehicle that has been able to travel the fastest between the route of Paris to Amsterdam. We are interested in knowing the model and brand of the car that traveled the fastest. We can use the ROW_NUMBER
function to help us in this calculation. The following query would provide us with this type of calculation:
假设我们被要求找到能够在巴黎到阿姆斯特丹之间最快行驶的车辆。 我们有兴趣了解行驶速度最快的汽车的型号和品牌。 我们可以使用ROW_NUMBER
函数来帮助我们进行此计算。 以下查询将为我们提供这种类型的计算:
WITH
stg AS (
SELECT
a.vehicle_id,
b.model,
b.brand,
ROW_NUMBER() OVER (ORDER BY a.ts ASC) AS rn
FROM routes r
LEFT JOIN dim_vechile b
ON a.vehicle_id = b.vehicle_id
WHERE r.from = 'PARIS' r.to = 'AMSTERDAM'
)SELECT
vehicle_id, model, brand
FROM stg
WHERE rn = 1
偏好查询 (Preference Queries)
There can be cases where it is needed to have some mutually exclusive preference across the records. For instance, if you are provided a list of users’ contact details, and need to select them in the most cost-effective manner, preferring, for instance, to send them an email rather than giving them a phone call or preferring to phone them rather than to send them a snail mail.
在某些情况下,需要在记录之间具有某些互斥的首选项。 例如,如果向您提供了用户的详细联系信息列表,并且需要以最具成本效益的方式选择他们,则更喜欢例如向他们发送电子邮件,而不是给他们打电话或更喜欢给他们打电话而不是向他们发送蜗牛邮件。
It is possible to implement these types of queries without window functions. However, this can lead to relatively long, complex, and inefficient queries. Take a look at the following query:
可以在没有窗口函数的情况下实现这些类型的查询。 但是,这可能会导致相对较长,复杂且效率低下的查询。 看一下以下查询:
WITH
base AS (
SELECT
user_id,
contact_type, -- email / phone / mailing address
contact_value
FROM contact_details
),
email_users AS (
SELECT
user_id,
contact_type,
contact_value
FROM base
WHERE contact_type = 'email'
),
phone_users AS (
SELECT
user_id,
contact_type,
contact_value
FROM base b
LEFT JOIN email_users e ON b.user_id = b.user_id
WHERE
contact_type = 'phone'
AND e.user_id IS NULL
),
maling_users AS (
SELECT
user_id,
contact_type,
contact_value
FROM base b
LEFT JOIN email_users e ON b.user_id = b.user_id
LEFT JOIN phone_users p ON b.user_id = p.user_id
WHERE
contact_type = 'mailing'
AND e.user_id IS NULL
)SELECT * FROM email_users
UNION ALL
SELECT * FROM phone_users
UNION ALL
SELECT * FROM maling_users
Using the ROW_NUMBER window function, this query can be better expressed using a preference query:
使用ROW_NUMBER窗口函数,可以使用首选项查询更好地表达此查询:
WITH
base AS (
SELECT
user_id,
contact_type, -- email / phone / mailing address
contact_value
FROM contact_details
),
preference_q AS (
SELECT
user_id,
contact_type,
contact_value,
ROW_NUMBER () OVER (
PARTITION BY user_id
ORDER BY
CASE
WHEN contact_type = 'email' THEN 1
WHEN contact_type = 'phone' THEN 2
WHEN contact_type = 'mailing' THEN 3
END
DESC
) AS rn
FROM base
)SELECT * FROM preference_q WHERE rn = 1
This approach has the following advantages:
此方法具有以下优点:
Short: The query is significantly more condensed than without a ROW_NUMBER window function, making it easier to read or modify as requirements evolve.
简短:与没有ROW_NUMBER窗口函数的情况相比,查询的压缩程度显着提高,从而使得随着需求的发展更易于阅读或修改。
Simplicity: The query in itself is expressed in quite a simple way; no need to go back and forth to understand what is getting filtered or combined at different steps in the process.
简便性:查询本身以非常简单的方式表示。 无需来回了解在该过程的不同步骤中正在过滤或合并的内容。
Performance: In this query, instead of doing three pass-through the data + needing to join on these different tables, we merely need to sort through the data to obtain the records that we seek.
性能:在此查询中,我们无需对数据进行三遍传递,而无需在这些不同的表上进行联接,我们只需要对数据进行排序以获得所需的记录。
数据质量(重复数据删除) (Data Quality (Deduplication))
ROW_NUMBER
provides one of the best tools to deduplicate values, for instance, when needing to deal with duplicate data being loaded onto a table. It allows us to select only one record from each duplicate set. To deduplicate, the critical thing to do is to incorporate all the fields that are meant to represent the “uniqueness” within the PARTITION BY argument:
ROW_NUMBER
提供了最好的工具来对值进行重复数据删除,例如,当需要处理被加载到表上的重复数据时。 它使我们只能从每个重复集中选择一个记录。 要进行重复数据删除,关键的事情是在PARTITION BY参数中合并所有表示“唯一性”的字段:
WITH
stg AS (
SELECT
ts,
session_id,
customer_id,
country,
action_type,
action_value,
extra_data,
ROW_NUMBER() OVER (w) AS rn
FROM table
WINDOW w AS (
PARTITION BY
ts, session_id, customer_id, country,
action_type, action_value, extra_data
)
)
SELECT
*
FROM stg
WHERE rn = 1
数据质量(命中) (Data Quality (Hit))
In some cases, we can leverage the ROW_NUMBER function to identify data quality gaps. This is the case, for instance, when leveraging clickstream data making use of a “hit number” indicator. Vendor provided solutions, such as Google Analytics, to make use of the “hit count” generated client-side. These “hits” represent events that need to be sent to the server. Missing hits numbers therefore represent some events that should have been sent but did not end up being collected in the database.
在某些情况下,我们可以利用ROW_NUMBER函数来识别数据质量差距。 例如,在利用点击数指示器利用点击流数据时就是这种情况。 供应商提供了解决方案,例如Google Analytics(分析),以利用客户端生成的“ 点击计数 ”。 这些“命中”表示需要发送到服务器的事件。 因此,缺少的匹配数表示一些应发送但并未最终被收集到数据库中的事件。
Sometimes, it is possible to reconstruct these events artificially. The ROW_NUMBER function helps to identify where these data gaps occur. A test can be implemented leveraging the ROW_NUMBER
and LAG
window functions, to identify events within the data that first come out of sequence. The LAG
window function takes the N preceding value (by default 1) in the window. This is exemplified in the following query:
有时,可以人为地重建这些事件。 ROW_NUMBER函数有助于确定这些数据缺口发生的位置。 可以利用ROW_NUMBER
和LAG
窗口函数来实施测试,以识别数据中首先出现的事件。 LAG
窗口函数在窗口中采用N个前值(默认为1)。 在以下查询中对此进行了举例说明:
WITH
x AS (
SELECT 2 AS session_id, 1 AS hit_count
UNION ALL
SELECT 2 AS session_id, 3 AS hit_count
UNION ALL
SELECT 2 AS session_id, 4 AS hit_count
UNION ALL
SELECT 2 AS session_id, 6 AS hit_count
UNION ALL
SELECT 2 AS session_id, 7 AS hit_count
UNION ALL
SELECT 2 AS session_id, 9 AS hit_count
UNION ALL
SELECT 2 AS session_id, 10 AS hit_count
UNION ALL
SELECT 2 AS session_id, 11 AS hit_count
UNION ALL
SELECT 2 AS session_id, 12 AS hit_count
UNION ALL
SELECT 2 AS session_id, 14 AS hit_count
),
stg AS (
SELECT
session_id,
hit_count,
ROW_NUMBER() OVER (w) AS rn,
LAG(hit_count) OVER (w) AS lhc
FROM x
WINDOW w AS (PARTITION BY session_id)
)SELECT
session_id, hit_count, rn, lhc +1,
CASE
WHEN hit_count <> rn AND (hit_count <> lhc+1 OR lhc IS NULL) THEN 'NOK'
ELSE 'OK'
END AS test
FROM stg;
---
session_id |hit_count |rn |lhc |test
2 |1 |1 | |OK
2 |3 |2 |2 |NOK
2 |4 |3 |4 |OK
2 |6 |4 |5 |NOK
2 |7 |5 |7 |OK
2 |9 |6 |8 |NOK
2 |10 |7 |10 |OK
2 |11 |8 |11 |OK
2 |12 |9 |12 |OK
2 |14 |10 |13 |NOK
After having identified the events that are “out of sync,” it is possible to do a second pass on the dataset to apply a transformation fix. This is typically done by looking at the previous row available (preceding RN) and the current row to generate the artificial events that should have happened or were likely to have occurred.
识别出“不同步”的事件之后,可以对数据集进行第二遍处理以应用转换修复。 通常,通过查看可用的前一行(RN之前)和当前行以生成应该已经发生或可能已经发生的人工事件来完成此操作。
会话化 (Sessionization)
Another place where ROW_NUMBER
can help is in performing sessionization. Different rules can be implemented to generate the sessionization. One of the most straightforward rules is that the session needs to happen on the same calendar day.
ROW_NUMBER
可以提供帮助的另一个地方是执行会话化。 可以实施不同的规则来生成会话。 最直接的规则之一是会话需要在同一日历日进行。
An example of how we can use the ROW_NUMBER
function to create this event sessionization is provided in the query below:
以下查询提供了有关如何使用ROW_NUMBER
函数创建此事件会话化的示例:
WITH
x AS (
SELECT 946684800 AS ts, 1 AS hit_count --2000-01-01 00:00:00
UNION ALL
SELECT 946684850 AS ts, 3 AS hit_count
UNION ALL
SELECT 946684950 AS ts, 4 AS hit_count
UNION ALL
SELECT 946685000 AS ts, 6 AS hit_count
UNION ALL
SELECT 946685010 AS ts, 7 AS hit_count
UNION ALL
SELECT 946685050 AS ts, 9 AS hit_count
UNION ALL
SELECT 946685080 AS ts, 10 AS hit_count
UNION ALL
SELECT 946774800 AS ts, 11 AS hit_count --2000-01-02 01:00:00
UNION ALL
SELECT 946774850 AS ts, 12 AS hit_count
UNION ALL
SELECT 946774900 AS ts, 14 AS hit_count
),
stg AS (
SELECT
lower(hex(randomblob(16))) as id, -- (generate a random id)
datetime(ts, 'unixepoch') AS dt,
hit_count,
ROW_NUMBER() OVER (
PARTITION BY date(ts, 'unixepoch')
) AS rn
FROM x
)SELECT
CASE
WHEN rn = 1 THEN id
ELSE LAG(id, rn-1) OVER (ORDER BY dt ASC)
END AS session_id,
dt,
hit_count,
rn
FROM stg;
---
session_id |dt |hit_count|rn
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:00:00|1 |1
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:00:50|3 |2
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:02:30|4 |3
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:03:20|6 |4
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:03:30|7 |5
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:04:10|9 |6
3bfe5a7ceb53847ae59aaee1e86e7d80|2000-01-01 00:04:40|10 |7
31f192df8f9f54e3ef50e12c5b37899f|2000-01-02 01:00:00|11 |1
31f192df8f9f54e3ef50e12c5b37899f|2000-01-02 01:00:50|12 |2
31f192df8f9f54e3ef50e12c5b37899f|2000-01-02 01:01:40|14 |3
摘要 (Summary)
ROW_NUMBER
is one of the most useful SQL functions to master for data engineers. It has a wide range of applications and often provides a simple path to handle some of the typical data engineering problems such as deduplication, sessionization, or dealing with preference queries.
ROW_NUMBER
是供数据工程师掌握的最有用SQL函数之一。 它具有广泛的应用程序,并且通常提供一条简单的路径来处理一些典型的数据工程问题,例如重复数据删除,会话化或处理首选项查询。
More from me on Hacking Analytics:
我提供的有关Hacking Analytics的更多信息:
abp 通用sql