How to Track State with Type 2 Dimensional Models

Application databases are generally designed to only track current state. For example, a typical user’s data model will store the current settings for each user. This is known as a Type 1 dimension. Each time they make a change, their corresponding record will be updated in place:
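As a minimal sketch (the users table and language column here are illustrative, mirroring the examples later in this post), an update in a Type 1 model simply overwrites the previous value, leaving no history behind:

-- Type 1 behaviour: the old value is overwritten in place, so no history is kept
UPDATE users
SET
  language = 'fr',
  updated_at = CURRENT_TIMESTAMP
WHERE
  id = 2
;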


This makes a lot of sense for applications. They need to be able to rapidly retrieve settings for a given user in order to determine how the application behaves. An indexed table at the user grain accomplishes this well.


But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. How many users were using feature “X” 90 days ago? What is the 30 day retention rate of the feature? How often are users turning it off and on? To accomplish these use cases we need a data model that tracks historical state:


[Image: the same settings table modelled with one record per historical state]

This is known as a Type 2 dimensional model. I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).


Implementing Type 2 Dimensional Models at Shopify

I currently work as a data scientist in the International product line at Shopify. Our product line is focused on adapting and scaling our product around the world. One of the first major efforts we undertook was translating Shopify’s admin in order to make our software available to use in multiple languages.


[Image: Shopify admin translated]

At Shopify, data scientists work across the full stack, from data extraction and instrumentation to data modelling, dashboards, analytics, and machine learning powered products. As a product data scientist, I’m responsible for understanding how our translated versions of the product are performing. How many users are adopting them? How is adoption changing over time? Are they retaining the new language, or switching back to English? If we default a new user from Japan into Japanese, are they more likely to become a successful merchant than if they were first exposed to the product in English and given the option to switch? In order to answer these questions, we first had to figure out how our data could be sourced or instrumented, and then eventually modelled.


The functionality that decides which language to render Shopify in is based on the language setting our engineers added to the users data model.


[Image: the users data model, with a language setting stored per user]

User 1 will experience the Shopify admin in English, User 2 in Japanese, etc… Like most data models powering Shopify’s software, the users model is a Type 1 dimension. Each time a user changes their language, or any other setting, the record gets updated in place. As I alluded to above, this data model doesn't allow us to answer many of our questions, as they involve knowing what language a given user was using at a particular point in time. Instead, we needed a data model that tracked users' languages over time. There are several ways to approach this problem.


Options For Tracking State

Modify Core Application Model Design

In an ideal world, the core application database model will be designed to track state. Rather than having a record be updated in place, the new settings are instead appended as a new record. Because the data is tracked directly in the source of truth, you can fully trust its accuracy. If you’re working closely with engineers prior to the launch of a product or new feature, you can advocate for this data model design. However, you will often run into two challenges with this approach:


  1. Engineers will be very reluctant to change the data model design to support analytical use cases. They want the application to be as performant as possible (as should you), and having a data model which keeps all historical state is not conducive to that.


  2. Most of the time, new features or products are built on top of pre-existing data models. As a result, modifying an existing table design to track history will come with an expensive and risky migration process, along with the aforementioned performance concerns.


In the case of rendering languages for the Shopify admin, the language field was added to the pre-existing users model, and updating this model design was out of the question.


Stitch Together Database Snapshots

[Image: a system that extracts newly created or updated records from the application databases on a fixed schedule]

At most technology companies, snapshots of application database tables are extracted into the data warehouse or data lake. At Shopify, we have a system that extracts newly created or updated records from the application databases on a fixed schedule.


These snapshots can be leveraged as an input source for building a Type 2 dimension. However, given the fixed-schedule nature of the data extraction system, it is possible to miss updates that happen between one extract and the next.


If you are using dbt for your data modelling, you can leverage their nice built-in solution for building Type 2’s from snapshots!
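As a rough sketch of what that looks like, a dbt snapshot is defined in a .sql file wrapped in a snapshot block. The source name shopify.users below is hypothetical, and the exact configuration (target schema, strategy) will depend on your project; treat this as an illustration rather than a drop-in file:

{% snapshot users_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- snapshot the application's users table as-is
select * from {{ source('shopify', 'users') }}

{% endsnapshot %}

dbt materializes this as a table with dbt_valid_from and dbt_valid_to columns, which is essentially a Type 2 dimension built from successive snapshot runs. Like any snapshot-based approach, it can still miss changes that happen between runs.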


Add Database Event Logging

[Image: each newly created or updated record is stored in a log in Kafka]

Another alternative is to add a new event log. Each newly created or updated record is stored in this log. At Shopify, we rely heavily on Kafka as a pipeline for transferring real-time data between our applications and data land, which makes it an ideal candidate for implementing such a log.


If you work closely with engineers, or are comfortable working in your application codebase, you can get new logging in place that will stream any new or updated record to Kafka. Shopify is built on the Ruby on Rails web framework. Rails has something called "Active Record Callbacks", which allows you to trigger logic before or after an alteration of an object's (read "database record's") state. For our use case, we can leverage the after_commit callback to log a record to Kafka after it has been successfully created or updated in the application database.


While this option isn’t perfect, and comes with a host of other caveats I will discuss later, we ended up choosing it as it was the quickest and easiest solution to implement that provided the required granularity.


Type 2 Modelling Recipes

Below, I’ll walk through some recipes for building Type 2 dimensions from the event logging option discussed above. We’ll stick with our example of modelling users' languages over time and work with the case where we’ve added event logging to our database model from day 1 (i.e. when the table was first created). Here’s an example of what our user_update event log would look like:


id | language | created_at          | updated_at
1  | en       | 2019-01-01 12:14:23 | 2019-01-01 12:14:23
2  | en       | 2019-02-02 11:00:35 | 2019-02-02 11:00:35
2  | fr       | 2019-02-02 11:00:35 | 2019-02-02 12:15:06
2  | fr       | 2019-02-02 11:00:35 | 2019-02-02 13:01:17
2  | en       | 2019-02-02 11:00:35 | 2019-02-02 14:10:01

This log describes the full history of the users data model.


  1. User 1 gets created at 2019-01-01 12:14:23 with English as the default language.


  2. User 2 gets created at 2019-02-02 11:00:35 with English as the default language.


  3. User 2 decides to switch to French at 2019-02-02 12:15:06.


  4. User 2 changes some other setting that is tracked in the users model at 2019-02-02 13:01:17.


  5. User 2 decides to switch back to English at 2019-02-02 14:10:01.


Our goal is to transform this event log into a Type 2 dimension that looks like this:


id | language | valid_from          | valid_to            | is_current
1  | en       | 2019-01-01 12:14:23 | null                | true
2  | en       | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 | false
2  | fr       | 2019-02-02 12:15:06 | 2019-02-02 14:10:01 | false
2  | en       | 2019-02-02 14:10:01 | null                | true

We can see that the current state for all users can easily be retrieved with a SQL query that filters for WHERE is_current. These records also have a null value for the valid_to column, since they are still in use. However, it is common practice to fill these nulls with something like the timestamp at which the job last ran, since the actual values may have changed since then.


PySpark

Due to Spark’s ability to scale to massive datasets, we use it at Shopify for building our data models that get loaded to our data warehouse. To avoid the mess that comes with installing Spark on your machine, you can leverage a pre-built docker image with PySpark and Jupyter notebook pre-installed. If you want to play around with these examples yourself, you can pull down this docker image with docker pull jupyter/pyspark-notebook:c76996e26e48 and then run docker run -p 8888:8888 jupyter/pyspark-notebook:c76996e26e48 to spin up a notebook where you can run PySpark locally.


We’ll start with some boilerplate code to create a Spark dataframe containing our sample of user update events:


from datetime import datetime as dt

from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window

sc = SparkContext(appName="local_spark", conf=SparkConf())
sqlContext = SQLContext(sparkContext=sc)


def get_dt(ts_str):
    return dt.strptime(ts_str, '%Y-%m-%d %H:%M:%S')


user_update_rows = [
    (1, "en", get_dt('2019-01-01 12:14:23'), get_dt('2019-01-01 12:14:23')),
    (2, "en", get_dt('2019-02-02 11:00:35'), get_dt('2019-02-02 11:00:35')),
    (2, "fr", get_dt('2019-02-02 11:00:35'), get_dt('2019-02-02 12:15:06')),
    (2, "fr", get_dt('2019-02-02 11:00:35'), get_dt('2019-02-02 13:01:17')),
    (2, "en", get_dt('2019-02-02 11:00:35'), get_dt('2019-02-02 14:10:01')),
]

user_update_schema = T.StructType([
    T.StructField('id', T.IntegerType()),
    T.StructField('language', T.StringType()),
    T.StructField('created_at', T.TimestampType()),
    T.StructField('updated_at', T.TimestampType()),
])

user_update_events = sqlContext.createDataFrame(user_update_rows, schema=user_update_schema)

With that out of the way, the first step is to filter our input log to only include records where the columns of interest were updated. With our event instrumentation, we log an event whenever any record in the users model is updated. For our use case, we only care about instances where the user’s language was updated (or created for the first time). It’s also possible that you will get duplicate records in your event logs, since Kafka clients typically support “at-least-once” delivery. The code below will also filter out these cases:


window_spec = Window.partitionBy('id').orderBy('updated_at')
change_expression = (F.col('row_num') == F.lit(1)) | (F.col('language') != F.col('prev_language'))
job_run_time = F.lit(dt.now())

user_language_changes = (
    user_update_events
    .withColumn(
        'prev_language',
        F.lag(F.col('language')).over(window_spec)
    )
    .withColumn(
        'row_num',
        F.row_number().over(window_spec)
    )
    .where(change_expression)
    .select(['id', 'language', 'updated_at'])
)

user_language_changes.show()

We now have something that looks like this:


id | language | updated_at
1  | en       | 2019-01-01 12:14:23
2  | en       | 2019-02-02 11:00:35
2  | fr       | 2019-02-02 12:15:06
2  | en       | 2019-02-02 14:10:01

The last step is fairly simple; we produce one record per period for which a given language was enabled:


user_language_type_2_dimension = (
    user_language_changes
    .withColumn(
        'valid_to',
        F.coalesce(
            F.lead(F.col('updated_at')).over(window_spec),
            # fill nulls with job run time
            # can also use the timestamp of your last event
            job_run_time
        )
    )
    .withColumnRenamed('updated_at', 'valid_from')
    .withColumn(
        'is_current',
        F.when(F.col('valid_to') == job_run_time, True).otherwise(False)
    )
)

user_language_type_2_dimension.show()
[Output: the same Type 2 dimension as above, with the valid_to nulls filled in with the job run time and is_current flagged accordingly]

dbt

dbt is an open source tool that lets you build new data models in pure SQL. It’s a tool we are currently exploring at Shopify to supplement modelling in PySpark, which I am really excited about. When writing PySpark jobs, you’re typically writing the SQL in your head and then figuring out how to translate it to the PySpark API. Why not just build them in pure SQL? dbt lets you do exactly that:


WITH
-- create our sample data
user_update_events (id, language, created_at, updated_at) AS (VALUES
  (1, 'en', timestamp '2019-01-01 12:14:23', timestamp '2019-01-01 12:14:23'),
  (2, 'en', timestamp '2019-02-02 11:00:35', timestamp '2019-02-02 11:00:35'),
  (2, 'fr', timestamp '2019-02-02 11:00:35', timestamp '2019-02-02 12:15:06'),
  (2, 'fr', timestamp '2019-02-02 11:00:35', timestamp '2019-02-02 13:01:17'),
  (2, 'en', timestamp '2019-02-02 11:00:35', timestamp '2019-02-02 14:10:01')
),
users_with_previous_state AS (
  SELECT
    id,
    language,
    updated_at,
    LAG(language) OVER (PARTITION BY id ORDER BY updated_at ASC) AS prev_language,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at ASC) AS row_num
  FROM user_update_events
),
-- filter to instances where the column of interest (language) actually changed
-- or we are seeing a user record for the first time
user_language_changes AS (
  SELECT *
  FROM users_with_previous_state
  WHERE row_num = 1
    OR language <> prev_language
),
-- build the type 2!
user_language_type_2_dimension_base AS (
  SELECT
    id,
    language,
    updated_at AS valid_from,
    LEAD(updated_at) OVER (PARTITION BY id ORDER BY updated_at ASC) AS valid_to
  FROM user_language_changes
)
-- fill "valid_to" nulls with job run time
-- or, you could instead use the timestamp of your last update event/extract
SELECT
  id,
  language,
  valid_from,
  COALESCE(valid_to, CURRENT_TIMESTAMP) AS valid_to,
  CASE
    WHEN valid_to IS NULL THEN True
    ELSE False
  END AS is_current
FROM user_language_type_2_dimension_base

With this SQL, we have replicated the exact same steps done in the PySpark example and will produce the same output shown above.


Gotchas, Lessons Learned, and The Path Forward

I’ve leveraged the approaches outlined above with multiple data models now. Here are a few of the things I’ve learned along the way.


1. It took us a few tries before we landed on the approach outlined above.


In some initial implementations, we were logging the record changes before they had been successfully committed to the database, which resulted in some mismatches in the downstream Type 2 models. Since then, we’ve been sure to always leverage the after_commit callback based approach.


2. There are some pitfalls with logging changes from within the code:


  • Your event logging becomes susceptible to future code changes. For example, an engineer refactors some code and removes the after_commit call. These are rare, but can happen. A good safeguard against this is to leverage tooling like the CODEOWNERS file, which notifies you when a particular part of the codebase is being changed.


  • You may miss record updates that are not triggered from within the application code. Again, these are rare, but it is possible to have an external process that is not using the Rails User model when making changes to records in the database.


3. It is possible to lose some events in the Kafka process.


For example, if one of the Shopify servers running the Ruby code were to fail before the event was successfully emitted to Kafka, you would lose that update event. Same thing if Kafka itself were to go down. Again, rare, but nonetheless something you should be willing to live with. There are a few ways you can mitigate the impact of these events:


  • Have some continuous data quality checks running that compare the Type 2 dimensional model against the current state and check for discrepancies (a sketch of such a check follows this list).

  • If & when any discrepancies are detected, you could augment your event log using the current state snapshot.

    如果&在检测到任何差异时,可以使用当前状态快照扩展事件日志。

4. If deletes occur in a particular data model, you need to implement a way to handle this.


Otherwise, the deleted events will be indistinguishable from normal create or update records with the logging setup I showed above. Here are some ways around this:


  • Have your engineers modify the table design to use soft deletes instead of hard deletes.


  • Add a new field to your Kafka schema and log the type of event that triggered the change, i.e. (create, update, or delete), and then handle accordingly in your Type 2 model code (see the sketch after this list).

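To make the second option concrete, here is a sketch of how a hypothetical event_type field could be handled when building the Type 2 model. The idea is to keep delete events around long enough to close out whichever record was open when the delete happened, without ever letting them open a record of their own. The sample data and column names below are illustrative:

WITH
user_update_events (id, language, event_type, updated_at) AS (VALUES
  (3, 'en', 'create', timestamp '2019-03-01 09:00:00'),
  (3, 'fr', 'update', timestamp '2019-03-05 10:30:00'),
  (3, 'fr', 'delete', timestamp '2019-03-10 18:45:00')
),
events_with_previous_state AS (
  SELECT
    id,
    language,
    event_type,
    updated_at,
    LAG(language) OVER (PARTITION BY id ORDER BY updated_at ASC) AS prev_language,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at ASC) AS row_num
  FROM user_update_events
),
-- keep language changes, first-time records, and deletes
-- (deletes are needed so they can close out the record that was open at the time)
language_changes_and_deletes AS (
  SELECT *
  FROM events_with_previous_state
  WHERE row_num = 1
    OR language <> prev_language
    OR event_type = 'delete'
),
with_valid_to AS (
  SELECT
    id,
    language,
    event_type,
    updated_at AS valid_from,
    LEAD(updated_at) OVER (PARTITION BY id ORDER BY updated_at ASC) AS valid_to
  FROM language_changes_and_deletes
)
-- delete events close the previous record via LEAD(), but never open one themselves
SELECT
  id,
  language,
  valid_from,
  COALESCE(valid_to, CURRENT_TIMESTAMP) AS valid_to,
  (valid_to IS NULL) AS is_current
FROM with_valid_to
WHERE event_type <> 'delete'

With this sample data, user 3's French record gets a valid_to equal to the delete timestamp, and the user ends up with no is_current record at all.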

Implementing Type 2 dimensional models for Shopify’s admin languages was truly an iterative process and took investment from both data and engineering to successfully implement. With that said, we have found the analytical value of the resulting Type 2 models well worth the upfront effort.


Looking ahead, there’s an ongoing project at Shopify by one of our data engineering teams to store the MySQL binary logs (binlogs) in data land. Binlogs are a much better source for a log of data modifications, as they are directly tied to the source of truth (the MySQL database), and are much less susceptible to data loss than the Kafka based approach. With binlog extractions in place, you don’t need to add separate Kafka event logging to every new model as changes will be automatically tracked for all tables. You don’t need to worry about code changes or other processes making updates to the data model since the binlogs will always reflect the changes made to each table. I am optimistic that with binlogs as a new, more promising source for logging data modifications, along with the recipes outlined above, we can produce Type 2s out of the box for all new models. Everybody gets a Type 2!


Additional Information

SQL Query Recipes

Once we have our data modelled as a Type 2 dimension, there are a number of questions we can start easily answering:


/*
The following queries were run in Postgres version 11.5.
user_language_type_2_dimension was created using the mock data from above.
*/

-- How many users are currently using Japanese?
SELECT
  COUNT(*) AS num_users
FROM user_language_type_2_dimension
WHERE is_current
  AND language = 'ja';

-- How many users were using Japanese 30 days ago?
SELECT
  COUNT(*) AS num_users
FROM user_language_type_2_dimension
WHERE CURRENT_DATE - INTERVAL '30' DAY >= valid_from
  AND CURRENT_DATE - INTERVAL '30' DAY < valid_to
  AND language = 'ja';

-- How many users per language, per day?
WITH
-- dynamically generate a distinct list of languages
-- based on what is actually in the model
all_languages AS (
  SELECT language
  FROM user_language_type_2_dimension
  GROUP BY 1
),
-- generate a range of dates we are interested in
-- leverage the database's built-in calendar functionality
-- if you don't have a date_dimension in your warehouse
date_range AS (
  SELECT date::date AS dt
  FROM GENERATE_SERIES(DATE '2019-01-01', CURRENT_DATE, INTERVAL '1' DAY) AS t(date)
)
SELECT
  dr.dt,
  al.language,
  COUNT(DISTINCT ld.id) AS num_users
FROM date_range AS dr
CROSS JOIN all_languages AS al
LEFT JOIN user_language_type_2_dimension AS ld
  ON dr.dt >= ld.valid_from
  AND dr.dt < ld.valid_to
  AND al.language = ld.language
GROUP BY 1, 2;

-- What is the 30-day retention rate of each language?
WITH
user_languages AS (
  SELECT
    id,
    language,
    MIN(valid_from) AS first_enabled_at,
    MIN(valid_from) + INTERVAL '30' DAY AS first_enabled_at_plus_30d
  FROM user_language_type_2_dimension
  GROUP BY 1, 2
),
user_retention_inds AS (
  SELECT
    ul.id,
    ul.language,
    first_enabled_at_plus_30d,
    CASE WHEN ld.id IS NOT NULL THEN 1 ELSE 0 END AS still_enabled_after_30d
  FROM user_languages AS ul
  LEFT JOIN user_language_type_2_dimension AS ld
    ON ul.first_enabled_at_plus_30d >= ld.valid_from
    AND ul.first_enabled_at_plus_30d < ld.valid_to
    AND ul.language = ld.language
    AND ul.id = ld.id
)
SELECT
  language,
  COUNT(*) AS num_users_enabled_language_ever,
  100.0 * SUM(still_enabled_after_30d) / COUNT(*) AS language_30d_retention_rate
FROM user_retention_inds
-- only consider users where 30 days have passed since they enabled the language
WHERE first_enabled_at_plus_30d < CURRENT_DATE
GROUP BY 1;

Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.


Originally published at https://engineering.shopify.com.


Translated from: https://medium.com/data-shopify/how-to-track-state-with-type-2-dimensional-models-c5881b8de553
