Chaining Multiple SQL Queries in BigQuery


BigQuery is a fantastic tool! It lets you do really powerful analytics work, all using SQL-like syntax.

But it lacks a way to chain SQL queries: we cannot run one SQL query right after another completes. There are many real-life applications where the output of one query is needed for the execution of another, and we want to run multiple queries to achieve the final result.

Here is one scenario: suppose you are doing RFM analysis using BigQuery ML. First you have to calculate the RFM values for all users, then apply k-means clustering to the result of the first query, and then merge the outputs of the first and second queries to generate the final data table.

In this scenario, each query depends on the output of the previous one, and the output of each query also needs to be stored in a table for other uses.

In this guide, I will show how to execute as many SQL queries as you want in BigQuery, one after another, creating a chaining effect to get the desired results.

Methods

I will demonstrate two approaches to chaining queries:

  1. The first uses Cloud Pub/Sub and Cloud Functions: this is the more sophisticated method, as it ensures the current query has finished executing before the next one starts. It also requires a bit of programming experience, so it may be better to reach out to someone with a technical background in your company.

  2. The second uses BigQuery's own query scheduler: however, the scheduler cannot ensure one query has completed before the next is triggered, so we will have to work around this using the query execution time. More on this later.

And if you want to get your hands dirty yourself, then here is an excellent course to start with.

Note: We will continue with the RFM example discussed above to give you an idea of the process, but the same approach can be applied to any scenario where triggering multiple SQL queries is needed.

Method 1

Method 1 uses a combination of Cloud Functions and Pub/Sub to chain the entire flow. The process starts with a query scheduler which, after executing the first query, sends a message to a Pub/Sub topic. That message triggers a cloud function responsible for running the second query, and once that query completes, the function sends a message to another Pub/Sub topic to start yet another cloud function. The process continues until the last query has been executed by a cloud function.

Let’s walk through the process with our RFM analysis use case.

Suppose we have three queries that need to run one after another to perform RFM analysis. The first calculates RFM values; we will call it RFM Values. The second creates the model; we will call it RFM Model. The third merges the model output with users' RFM values; we will call it RFM Final.
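For concreteness, the three queries could look roughly like the sketch below, kept here as string constants the way they would appear in a cloud function. The dataset, table, and column names (mydataset.orders, user_id, order_date, revenue) and the cluster count are my assumptions for illustration, not taken from the original article:

```javascript
// Hedged sketch of the three chained queries; all names are illustrative.
const queries = {
  // 1) RFM Values: recency, frequency, and monetary value per user.
  values: `
    CREATE OR REPLACE TABLE mydataset.rfm_values AS
    SELECT
      user_id,
      DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS recency,
      COUNT(*) AS frequency,
      SUM(revenue) AS monetary
    FROM mydataset.orders
    GROUP BY user_id`,

  // 2) RFM Model: k-means over the RFM values, via BigQuery ML.
  model: `
    CREATE OR REPLACE MODEL mydataset.rfm_model
    OPTIONS (model_type = 'kmeans', num_clusters = 5) AS
    SELECT recency, frequency, monetary
    FROM mydataset.rfm_values`,

  // 3) RFM Final: attach each user's cluster to their RFM values.
  final: `
    CREATE OR REPLACE TABLE mydataset.rfm_final AS
    SELECT *
    FROM ML.PREDICT(MODEL mydataset.rfm_model,
                    (SELECT * FROM mydataset.rfm_values))`,
};
```

Each statement writes its result to a table (or model), which is what lets the next query in the chain pick it up.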

Here is what the data pipeline looks like:

Chaining queries in a BigQuery data pipeline, by Muffaddal

Note: I will assume that the tables for all three queries have already been created.

1- We start by creating a Pub/Sub topic, as it will be needed when creating the RFM Values query scheduler. I have named it RFM_Model_Topic because it will trigger the cloud function responsible for executing our model query (i.e., RFM Model).

RFM_Model_Topic Pub/Sub topic, by Muffaddal

Copy the topic name; it will be needed when creating the RFM Values scheduler.

2- Next, go to BigQuery, paste the RFM Values query that calculates RFM values for our users into the query editor, and click the ‘Schedule query’ button to create a new query scheduler.

Create a scheduled query, by Muffaddal

3- Enter the required values in the scheduler creation menu to create the scheduler.

Query scheduler creation menu, by Muffaddal

This scheduler will execute at the specified time, calculate users' recency, frequency, and monetary values, and store them in the specified BigQuery table. Once the scheduler has finished executing the query, it sends a message to our RFM_Model_Topic, which triggers a cloud function that runs our model query. So next, let's create that cloud function.

4- Go to the RFM_Model_Topic Pub/Sub topic and click the ‘Trigger Cloud Function’ button at the top of the screen.

Create a cloud function from a Pub/Sub topic, by Muffaddal

5- Enter the settings as shown below and name the cloud function RFM_Model_Function.

Cloud function settings, by Muffaddal

6- Then paste the code below into the index.js file.

Cloud function to trigger RFM_Model Query, by Muffaddal
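The embedded gist with the original function code is not reproduced in this copy, so here is a hedged sketch of what such an index.js could look like, assuming the @google-cloud/bigquery and @google-cloud/pubsub client libraries. The dataset, table, and model names (mydataset.rfm_values, mydataset.rfm_model) and the cluster count are my assumptions; the RFM_Final topic name comes from the article:

```javascript
// Hedged sketch of RFM_Model_Function's index.js; table and model names
// below are illustrative assumptions, not taken from the original article.
const MODEL_QUERY = `
  CREATE OR REPLACE MODEL mydataset.rfm_model
  OPTIONS (model_type = 'kmeans', num_clusters = 5) AS
  SELECT recency, frequency, monetary
  FROM mydataset.rfm_values`;

// Entry point: fired by a message on RFM_Model_Topic.
async function runModelQuery(message, context) {
  // Clients are required lazily so the query text above can be inspected
  // without the GCP packages installed.
  const {BigQuery} = require('@google-cloud/bigquery');
  const {PubSub} = require('@google-cloud/pubsub');

  // Run the model query and wait until the job finishes.
  const [job] = await new BigQuery().createQueryJob({query: MODEL_QUERY});
  await job.getQueryResults();

  // Signal the next link in the chain.
  await new PubSub()
    .topic('RFM_Final')
    .publishMessage({data: Buffer.from('RFM model created')});
}

exports.runModelQuery = runModelQuery;
```

Waiting on getQueryResults() before publishing is what guarantees the next query only starts after this one has fully completed.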

Once the query has executed, the cloud function publishes a message to a new Pub/Sub topic named RFM_Final, which triggers the cloud function responsible for the last query, the one that combines the RFM values and the model results into one data set.

7- Therefore, next, create the RFM_Final topic in Pub/Sub and a cloud function, as we did in the previous steps. Copy and paste the code below into the cloud function so that it can run the last query.

Cloud function to trigger RFM_Final Query, by Muffaddal
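This gist is also missing from this copy; a hedged sketch follows, with the same assumed client library and illustrative table and model names as before:

```javascript
// Hedged sketch of the last cloud function; the table and model names are
// illustrative assumptions, not taken from the original article.
const FINAL_QUERY = `
  CREATE OR REPLACE TABLE mydataset.rfm_final AS
  SELECT *
  FROM ML.PREDICT(MODEL mydataset.rfm_model,
                  (SELECT * FROM mydataset.rfm_values))`;

// Entry point: fired by a message on the RFM_Final topic. This is the last
// query in the chain, so nothing is published afterwards.
async function runFinalQuery(message, context) {
  const {BigQuery} = require('@google-cloud/bigquery');
  const [job] = await new BigQuery().createQueryJob({query: FINAL_QUERY});
  await job.getQueryResults();
}

exports.runFinalQuery = runFinalQuery;
```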

And that is it!

We can use as many Pub/Sub topics and cloud functions as we want to chain as many SQL queries as we need.

Method 2

Now, the first approach is robust, but it requires a bit of a programming background. If that is not your strong suit, you can use method 2 to chain BigQuery queries.

BigQuery’s query scheduler can be used to run queries one after another.

The idea is that we start the process the same way as in method 1, i.e., trigger the first query using a scheduler, and estimate its completion time. Let's say the first query takes 5 minutes to complete. We then schedule the second query 10 minutes after the first query's start time. This way we ensure the second query is triggered only after the first has completely executed.

Let’s understand this with an example.

Chaining queries using the query scheduler, by Muffaddal

Suppose we schedule the first query at 12:30 am and it takes 10 minutes to complete, so we know the first query should be done by 12:40 am. We set the second query's scheduler to execute at 12:50 am (keeping a 10-minute gap between the two schedulers, just in case), trigger the third query at 1:10 am, and so on.
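The timing arithmetic above can be sketched as a small planning helper. This is purely illustrative (the function and its name are mine, not part of any BigQuery API): given the first start time, each query's estimated duration, and a safety buffer, it lists when each scheduler should fire.

```javascript
// Compute staggered start times ("HH:MM", 24-hour) for chained schedulers.
// Each next query starts after the previous one's estimated duration plus
// a safety buffer, wrapping around midnight if needed.
function staggeredStartTimes(firstStart, durationsMin, bufferMin) {
  const [h, m] = firstStart.split(':').map(Number);
  let t = h * 60 + m;                       // minutes since midnight
  const times = [firstStart];
  for (const d of durationsMin) {
    t = (t + d + bufferMin) % (24 * 60);    // previous duration + buffer
    const hh = String(Math.floor(t / 60)).padStart(2, '0');
    const mm = String(t % 60).padStart(2, '0');
    times.push(`${hh}:${mm}`);
  }
  return times;
}

// The example from the text: first query at 12:30 am, 10-minute queries,
// 10-minute buffer.
staggeredStartTimes('00:30', [10, 10], 10); // → ['00:30', '00:50', '01:10']
```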

Note: Since the query scheduler doesn’t work with BigQuery ML, method 2 won’t work for our RFM analysis case, but it should give you the idea of how to use the scheduler to chain queries.

Summary

Executing queries one after another helps achieve really great results, especially when the result of one query depends on the output of another and all the query results are also needed in table form. BigQuery doesn’t support this functionality out of the box, but using GCP’s components we can streamline the process to achieve it.

In this article, we went through two methods to do this: the first using Cloud Pub/Sub and Cloud Functions, and the second using BigQuery’s own query scheduler.

With this article, I hope I was able to convey the idea of the process so you can pick it up and tailor it to your particular business case.

Similar Reads You Would Like:

  1. Automate the RFM analysis using BigQuery ML.

  2. Store Standard Google Analytics Hit Level Data in BigQuery.

  3. Automate Data Import to Google Analytics using GCP.

Source: https://towardsdatascience.com/chaining-multiple-sql-queries-in-bigquery-8498d8885da5
