PostgreSQL's Powerful New Join Type: LATERAL


PostgreSQL 9.3 has a new join type! Lateral joins arrived without much fanfare, but they enable powerful new queries that were previously only tractable with procedural code. In this post, I'll walk through a conversion funnel analysis that wouldn't be possible in PostgreSQL 9.2.

What is a LATERAL join?

The best description in the documentation comes at the bottom of the list of FROM clause options:

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set's values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s).

This is a bit dense. Loosely, it means that a LATERAL join is like a SQL foreach loop, in which PostgreSQL iterates over each row in a result set and evaluates a subquery using that row as a parameter.
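To make the foreach intuition concrete, here is a minimal Python sketch of how a LEFT JOIN LATERAL ... ON true conceptually evaluates. The rows and the simplified time window are made up for illustration; this is a mental model, not how the planner actually executes the join.

```python
# Outer result set (e1): each user's first homepage view time (made-up rows).
e1 = [
    {"user_id": 1, "view_homepage_time": 100},
    {"user_id": 2, "view_homepage_time": 200},
]

# The event "table" that the lateral subquery scans (made-up rows).
events = [
    {"user_id": 1, "type": "enter_credit_card", "time": 150},
    {"user_id": 2, "type": "view_homepage", "time": 200},
]

def lateral_subquery(outer_row):
    """Parameterized subquery: first enter_credit_card event for this user
    at or after their homepage view (the two-week upper bound is omitted)."""
    matches = sorted(
        (e for e in events
         if e["user_id"] == outer_row["user_id"]
         and e["type"] == "enter_credit_card"
         and e["time"] >= outer_row["view_homepage_time"]),
        key=lambda e: e["time"],
    )
    return matches[:1]  # ORDER BY time LIMIT 1

result = []
for row in e1:  # the "foreach" over the outer rows
    inner = lateral_subquery(row)  # subquery re-evaluated per outer row
    # LEFT JOIN: keep the outer row even with no match (None ~ SQL NULL).
    match_time = inner[0]["time"] if inner else None
    result.append({**row, "enter_credit_card_time": match_time})

print(result)
# [{'user_id': 1, 'view_homepage_time': 100, 'enter_credit_card_time': 150},
#  {'user_id': 2, 'view_homepage_time': 200, 'enter_credit_card_time': None}]
```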

What can we do with this?

Consider a table of click events with the following schema:

CREATE TABLE event (
    user_id BIGINT,
    event_id BIGINT,
    time BIGINT NOT NULL,
    data JSON NOT NULL,
    PRIMARY KEY (user_id, event_id)
)

Each event is associated with a user and has an ID, a timestamp, and a JSON blob with the event's properties. At Heap, these properties might include the DOM hierarchy of a click, the window title, the session referrer, and so forth.
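To make the schema concrete, here is a hypothetical payload for the data column. Only the "type" key is actually used by the queries in this post; the other fields are invented examples of the kind of properties described above.

```python
import json

# Hypothetical JSON payload for one click event; only "type" is used by the
# queries in this post, the other keys are invented for illustration.
event_data = {
    "type": "view_homepage",
    "dom_hierarchy": ["html", "body", "div#hero", "a.signup"],
    "window_title": "Example Landing Page",
    "session_referrer": "https://www.google.com/",
}

# The data column stores this as JSON; data->>'type' in the queries below
# pulls the "type" key back out as text.
serialized = json.dumps(event_data)
print(json.loads(serialized)["type"])  # view_homepage
```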

Let's say we want to optimize our landing page to increase signups. The first step is to figure out where our conversion funnel is losing users.

[Figure] An example conversion funnel between four steps in a signup flow.

We'll assume that we've instrumented our frontend to log events along this flow and that all of the data lives in the event table specified above.[1] As an initial question, let's figure out how many people view our homepage and what percentage of them enter a credit card within two weeks of that initial homepage view. On an older version of PostgreSQL, we might have to write custom functions in PL/pgSQL, PostgreSQL's built-in procedural language. But in 9.3, we can use a lateral join to compute this in one efficient query, with no extensions or PL/pgSQL.

SELECT
    user_id,
    view_homepage,
    view_homepage_time,
    enter_credit_card,
    enter_credit_card_time
FROM (
    -- Get the first time each user viewed the homepage.
    SELECT
        user_id,
        1 AS view_homepage,
        min(time) AS view_homepage_time
    FROM event
    WHERE
        data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1 LEFT JOIN LATERAL (
    -- For each row, get the first time the user_id did the enter_credit_card
    -- event, if one exists within two weeks of view_homepage_time.
    SELECT
        1 AS enter_credit_card,
        time AS enter_credit_card_time
    FROM event
    WHERE
        user_id = e1.user_id AND
        data->>'type' = 'enter_credit_card' AND
        time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14)
    ORDER BY time
    LIMIT 1
) e2 ON true

Nobody likes a 30-line SQL query, so let's break this down into pieces. The first chunk is vanilla SQL:

SELECT
    user_id,
    1 AS view_homepage,
    min(time) AS view_homepage_time
FROM event
WHERE
    data->>'type' = 'view_homepage'
GROUP BY user_id

That is, get the first time each user performed a view_homepage event. Our lateral join then lets us iterate over each resulting row and evaluate a parameterized version of the next subquery. This is equivalent to running the query below once for each resulting row:

SELECT
    1 AS enter_credit_card,
    time AS enter_credit_card_time
FROM event
WHERE
    user_id = e1.user_id AND
    data->>'type' = 'enter_credit_card' AND
    time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14)
ORDER BY time
LIMIT 1

I.e., for each user, get the first time they performed the enter_credit_card event within two weeks of view_homepage_time. Because this is a lateral join, the subquery can reference the view_homepage_time result from the previous subquery. Otherwise, the subqueries would be evaluated independently, and there would be no way to access the results of one while evaluating the other.

Then we wrap the whole thing in a select, which returns something like this:

 user_id | view_homepage | view_homepage_time | enter_credit_card | enter_credit_card_time
---------+---------------+--------------------+-------------------+------------------------
     567 |             1 |         5234567890 |                 1 |             5839367890
     234 |             1 |         2234567890 |                   |
     345 |             1 |         3234567890 |                   |
     456 |             1 |         4234567890 |                   |
     678 |             1 |         6234567890 |                   |
     123 |             1 |         1234567890 |                   |
...

Because this is a left join, the query still produces result rows for users with no matching enter_credit_card event, as long as there is a view_homepage event. If we aggregate over the numerical columns, we get a tidy summary of this conversion funnel:

SELECT
    sum(view_homepage) AS viewed_homepage,
    sum(enter_credit_card) AS entered_credit_card
FROM (
    -- Get the first time each user viewed the homepage.
    SELECT
        user_id,
        1 AS view_homepage,
        min(time) AS view_homepage_time
    FROM event
    WHERE
        data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1 LEFT JOIN LATERAL (
    -- For each (user_id, view_homepage_time) tuple, get the first time that
    -- user did the enter_credit_card event, if one exists within two weeks.
    SELECT
        1 AS enter_credit_card,
        time AS enter_credit_card_time
    FROM event
    WHERE
        user_id = e1.user_id AND
        data->>'type' = 'enter_credit_card' AND
        time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*14)
    ORDER BY time
    LIMIT 1
) e2 ON true

… which produces:

 viewed_homepage | entered_credit_card
-----------------+---------------------
             827 |                 103
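The reason summing indicator columns works as a funnel count is that SQL's sum() skips the NULLs left behind by unmatched left-join rows. A small Python sketch of the same aggregation, with made-up rows:

```python
# Rows shaped like the LEFT JOIN output above (values made up): users with no
# matching enter_credit_card event carry None, standing in for SQL NULL.
rows = [
    {"view_homepage": 1, "enter_credit_card": 1},
    {"view_homepage": 1, "enter_credit_card": None},
    {"view_homepage": 1, "enter_credit_card": None},
]

def sql_sum(values):
    """SQL-style sum(): NULL (None) inputs are simply skipped."""
    return sum(v for v in values if v is not None)

viewed_homepage = sql_sum(r["view_homepage"] for r in rows)
entered_credit_card = sql_sum(r["enter_credit_card"] for r in rows)

print(viewed_homepage, entered_credit_card)  # 3 1
```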

 

We can add intermediate steps to this funnel with more lateral joins, to figure out which parts of the flow we should focus on improving.[2] Let's add a use_demo step between viewing the homepage and entering a credit card.

SELECT
    sum(view_homepage) AS viewed_homepage,
    sum(use_demo) AS use_demo,
    sum(enter_credit_card) AS entered_credit_card
FROM (
    -- Get the first time each user viewed the homepage.
    SELECT
        user_id,
        1 AS view_homepage,
        min(time) AS view_homepage_time
    FROM event
    WHERE
        data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1 LEFT JOIN LATERAL (
    -- For each row, get the first time the user_id did the use_demo
    -- event, if one exists within one week of view_homepage_time.
    SELECT
        user_id,
        1 AS use_demo,
        time AS use_demo_time
    FROM event
    WHERE
        user_id = e1.user_id AND
        data->>'type' = 'use_demo' AND
        time BETWEEN view_homepage_time AND (view_homepage_time + 1000*60*60*24*7)
    ORDER BY time
    LIMIT 1
) e2 ON true LEFT JOIN LATERAL (
    -- For each row, get the first time the user_id did the enter_credit_card
    -- event, if one exists within one week of use_demo_time.
    SELECT
        1 AS enter_credit_card,
        time AS enter_credit_card_time
    FROM event
    WHERE
        user_id = e2.user_id AND
        data->>'type' = 'enter_credit_card' AND
        time BETWEEN use_demo_time AND (use_demo_time + 1000*60*60*24*7)
    ORDER BY time
    LIMIT 1
) e3 ON true

This yields:

 viewed_homepage | use_demo | entered_credit_card
-----------------+----------+---------------------
             827 |      220 |                  86

This gives us the three-step conversion funnel from viewing the homepage, to using the demo within a week, to entering a credit card within a week of that.[3] From here, PostgreSQL's expressive power lets us drill down into these results and thoroughly analyze the performance of our website. We might follow up with:

  • Does using the demo increase the likelihood of signing up?

  • Do users who discover our homepage via an ad convert at the same rate as users from other sources?

  • How do these conversion rates change over time or with different A/B test variants?

The answers to these questions bear directly on product changes, and they can now be found within PostgreSQL, since it supports lateral joins.

Without lateral joins, we would have to resort to PL/pgSQL for this analysis. Or, if our data set were small, we might get away with complex, inefficient queries. In an exploratory data-science use case, you might just pull the data out of PostgreSQL and analyze it with your scripting language of choice. But there is considerable power in being able to express these questions in SQL, especially if you want to wrap it all in an approachable UI and expose the functionality to non-technical users.

Note that these queries can be tuned to be very efficient. In this example, if we create a btree index on (user_id, (data->>'type'), time), we can evaluate each funnel step for each user with a single indexed lookup. If you're using SSDs, on which seeks are cheap, that might be good enough. If not, you may need to schematize your data a bit differently, but I'll leave the details of that for another post.

Have a favorite new PostgreSQL feature, or a neat lateral join use case? Ping me at @danlovesproofs.

Interested in building systems that make powerful features easy to use? Shoot us a note at jobs@heapanalytics.com.

[1] Or we could just use Heap, which captures everything for us as soon as we install it! No need to write any logging code, and no risk of forgetting to log something you'll want to analyze later.
[2] Note that adding steps to a conversion funnel is particularly easy with a product like Heap, since we would already have the relevant data.
[3] The number of users with enter_credit_card events is lower in this query than in the previous one, because this query only counts enter_credit_card events that happen after a use_demo event, and 17 users signed up without using the demo.
 


(Translated from the original post: https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral)

