Change Data Capture: near real-time data streaming with an Azure Data Factory pipeline


Transferring data from one data source to a specific target in near real-time usually implies the use of a message broker that ingests and distributes the data.


The scenario I’m going to describe here, instead, is based on a near real-time end-to-end solution that uses Azure Data Factory as the orchestrator.


For both solutions, though, you need a Change Data Capture feature, or something similar depending on your sources, to track and store data changes.


As I’ve spent some time figuring out how to get all the table changes on a big database without overloading the database server itself with jobs looping to catch the changes as they happen, I thought you might be interested in a valid alternative to the approaches already available online, using the Change Data Capture feature in a slightly more modern fashion.


Let’s consider these assumptions, though:


  1. Change Data Capture is a SQL Server feature that can be enabled easily for each table we want to track. The CDC configuration is out of scope for this article, and there is plenty of documentation online you can use.

  2. The database server is running on an Azure Cloud PaaS delivery model, and a SQL Managed Instance is basically the PaaS equivalent of a SQL Server implementation you could have had on-premises. The Cloud PaaS approach has several benefits, though, such as hiding the underlying management of the server itself and easing the scalability and high availability of the solution.

  3. A strict requirement is to avoid overloading the SQL Server with computation that is not related to the source platform.

  4. The scope of my solution is to build a Data Hub with data coming from multiple platforms, such as APIs, databases and general events happening in the context of the data that I need to collect.


Hence, here I’m going to explain a small part of a bigger Data Hub solution in the context of the Azure Cloud, and what follows below is what I’m going to highlight.


From SQL Managed Instance to Azure SQL Database through Change Data Capture, using Azure Data Factory

As Google is a wonderful resource for avoiding re-inventing the wheel, I’m not loath to admit I’ve googled a lot, but the only suggestions I found valuable were:


  1. An old but still very interesting article about running a piece of code for each table that would push table changes to an Event Hub. Nice article, but I prefer other approaches. Besides, that solution was meant to run a job on the source SQL Server side, which would have overloaded the server itself.


  2. Debezium, a wonderful component based on Apache Kafka, with connectors to any kind of database, that listens to changes as they happen and transfers them all to an Apache Kafka implementation. This solution is wonderful if you don’t have to stick to PaaS services from a specific cloud provider and rather prefer to go for an open-source approach. I like this solution the most, but my primary constraint was to stick to an Azure PaaS delivery model using Data Factory, Azure SQL Database, Event Hub and Azure Stream Analytics.


  3. A Data Factory approach to store data in storage accounts.


That being said, I had to find a solution myself, and a nice article from Microsoft pointed me in the right direction, obviously!


So, let’s get started.


First, enable the CDC feature on the source database and track your tables. This feature will create several system tables, one per tracked table. Don’t worry about the extra amount of data stored: by default, change data is retained for 3 days before being cleaned up, and you can change this configuration easily. A minimal enable script is sketched below.

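For reference, and with illustrative schema/table names, enabling CDC boils down to two system stored procedures run against the source database:

EXEC sys.sp_cdc_enable_db;   -- enable CDC at the database level, once per database

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',            -- schema of the table to track (illustrative)
    @source_name = N'Customer',         -- table to track (illustrative)
    @role_name = NULL,                  -- NULL: do not restrict access to the change data to a specific role
    @supports_net_changes = 0;          -- all-changes queries are enough for this scenario

-- The default cleanup retention is 4320 minutes (3 days); it can be changed with
-- EXEC sys.sp_cdc_change_job @job_type = N'cleanup', @retention = <minutes>;

With the default capture instance name, this creates the cdc.dbo_Customer_CT change table and the cdc.fn_cdc_get_all_changes_dbo_Customer function that the pipeline below relies on.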

Create your Data Factory environment. There is plenty of documentation on how to do so.


Create a self-hosted integration runtime (IR) if you have specific needs. Otherwise, you can opt for the Azure IR, but you have to make sure it can connect to your data sources. In my case, I had to spin up a virtual machine with a dedicated IR, as the two solutions were running on different private networks and the networking behind the scenes is a bit complex, with firewall and routing rules. For less complex environments, the Azure IR is quite good.


Create your linked services. A linked service is basically a connection to your data source, in our scenario the SQL Managed Instance, and to the target storage. In our case, the target storages are, respectively, an Azure Data Lake Storage Gen2 account and an Azure SQL Database.


A linked service is just the connector specification for your data source/target. A linked service can be used multiple times with different specifications based on your needs. To do so, you have to create a dataset specification on top of your linked service.


For this scenario, you should now have three linked services and three dataset specifications.


Nice! Let’s move on. From now on, up until the final copy of the data to a parquet file in the Data Lake, the Microsoft documentation is pretty good and easy to follow. I’ll try to walk you through it quickly with the steps to take.


It’s also important to note that I wanted to avoid additional staging tables in the target solution and preferred to use in-memory tables instead. I don’t want staging tables, as I already have my parquet files as the history of the changes. This choice may have added a bit of complexity to the stored procedures to define, but I think it’s much cleaner and smoother.


  • First, create a Stored Procedure for each table you want to get the changes from. These SPs will take care of upserting the target table based on the __$operation column coming from the CDC. These SPs are not very standard ones: they expect a recordset of the changes and, for each change, they update, insert or delete records in the target table. To do so, we also need another small object in the target database: a table type. Basically, you need to define a data type for the recordset you are receiving from the CDC, in order to map each field and be able to use it as an in-memory table. A sketch of both objects follows below.

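As a rough sketch, assuming a hypothetical dbo.Customer target table with an Id key and a Name column, the table type and the upsert stored procedure could look like the following (the CDC metadata columns are carried along so the procedure can decide what to do with each row; with the 'all' row filter, __$operation is 1 for deletes, 2 for inserts and 4 for updates):

CREATE TYPE dbo.CustomerChangeType AS TABLE
(
    __$start_lsn   binary(10),
    __$seqval      binary(10),
    __$operation   int,
    __$update_mask varbinary(128),
    Id             int,
    Name           nvarchar(100)
);
GO

CREATE PROCEDURE dbo.usp_UpsertCustomerChanges
    @changes dbo.CustomerChangeType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Deletes (__$operation = 1)
    DELETE t
    FROM dbo.Customer AS t
    INNER JOIN @changes AS c ON c.Id = t.Id
    WHERE c.__$operation = 1;

    -- Updates (__$operation = 4, new values)
    UPDATE t
    SET t.Name = c.Name
    FROM dbo.Customer AS t
    INNER JOIN @changes AS c ON c.Id = t.Id
    WHERE c.__$operation = 4;

    -- Inserts (__$operation = 2)
    INSERT INTO dbo.Customer (Id, Name)
    SELECT c.Id, c.Name
    FROM @changes AS c
    WHERE c.__$operation = 2
      AND NOT EXISTS (SELECT 1 FROM dbo.Customer AS t WHERE t.Id = c.Id);
END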

In order to keep this pipeline as decoupled and reusable as possible, I will be using several pipeline parameters. This will allow you to run this pipeline independently of the table you are tracking.


  • Create your pipeline, assign it a name and add a Lookup activity. The Lookup activity will count the changes in the table you’re tracking in the source system using a simple query. Note the pipeline parameters I’m using here; I will do this along the entire chain in order to have one pipeline that I can use for any CDC table I’m tracking.


@{concat('DECLARE @from_lsn binary(10), @to_lsn binary(10); SET @from_lsn = sys.fn_cdc_get_min_lsn(''', pipeline().parameters.CDC_Schema, '_', pipeline().parameters.CDC_Table, '''); SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than or equal'', GETDATE()); SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_', pipeline().parameters.CDC_Schema, '_', pipeline().parameters.CDC_Table, '(@from_lsn, @to_lsn, ''all'')')}


  • Add an If Condition and evaluate the output of the Lookup activity, with an expression like the one sketched below. If the count is greater than 0, then click on the “true statement” edit icon and add a Copy Data activity.

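Assuming the Lookup activity is called GetChangeCount (an illustrative name), the If Condition expression can be as simple as:

@greater(int(activity('GetChangeCount').output.firstRow.changecount), 0)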
  • In the If Condition’s true activities, I would suggest freezing a timestamp and creating a defined file structure that can be used along the whole chain of the true branch. Something like the sketch below:

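For instance, a Set Variable activity could freeze the run timestamp into a string variable, say triggerStart (illustrative name):

@{utcnow()}

and the destination folder/file name for the parquet output could then be derived from that frozen value and the pipeline parameters, so that every activity in the true branch points at the same file:

@{concat(pipeline().parameters.CDC_Schema, '/', pipeline().parameters.CDC_Table, '/', formatDateTime(variables('triggerStart'), 'yyyyMMddHHmmss'), '.parquet')}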
  • The Copy Data activity is used to get the changes from the CDC table and store them all in a parquet file in the Data Lake container. Why this step? Well, it is good practice to store the data you transfer from a source to a target solution, as this provides multiple benefits: you can rerun the whole solution in case something goes south for any reason, basically reproducing the whole chain of changes that happened during the history of events. Another benefit is that you can use big data tools, such as Databricks, Azure Synapse with Apache Spark, EMR and so on, to query and analyse all the data and come up with whatever fancy requirement the business may have requested. Last but not least, you have historical data.


  • In the Source tab, choose your data source dataset and add this query:


@{concat(… 'SELECT * FROM cdc.fn_cdc_get_all_changes_', pipeline().parameters.CDC_Schema, '_', pipeline().parameters.CDC_Table, '(@from_lsn, @to_lsn, ''all'')')}


  • The above is basically the same query as before with a simple substitution: I’ve replaced the SELECT count(1) with SELECT * to get the actual changes. The full expression is spelled out below for completeness.

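Spelled out in full, the Copy Data source query is the Lookup expression with the count replaced by SELECT *:

@{concat('DECLARE @from_lsn binary(10), @to_lsn binary(10); SET @from_lsn = sys.fn_cdc_get_min_lsn(''', pipeline().parameters.CDC_Schema, '_', pipeline().parameters.CDC_Table, '''); SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than or equal'', GETDATE()); SELECT * FROM cdc.fn_cdc_get_all_changes_', pipeline().parameters.CDC_Schema, '_', pipeline().parameters.CDC_Table, '(@from_lsn, @to_lsn, ''all'')')}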

  • In the Sink of this activity, just choose the storage account dataset created at the beginning of this journey and set the parameters required to keep this pipeline fully decoupled and reusable for any other solution you may have. :)

  • Now that our changes have been saved into a parquet file, we can add another Copy Data activity to transfer the changes from the parquet file to the final database, a serverless Azure SQL Database. Here comes a bit of complexity, though.

  • The Source is the same as the Sink from the previous Copy Data activity. No differences.

  • The Sink of this new activity, instead, is the final database. This time, though, you have to use the stored procedure created in the target database, passing a couple of parameters. Note all the pipeline parameters I’m using to keep this pipeline’s activities fully decoupled.

  • The parameters I’m passing to this stored procedure are required to tell the SP which type of recordset we are passing to it and how to deal with it. If you want to try the SP on its own, see the usage sketch below.

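If you want to exercise the stored procedure outside Data Factory, a quick manual test with the table type (reusing the illustrative names from the earlier sketch) looks like this:

DECLARE @changes dbo.CustomerChangeType;

-- __$operation = 2 simulates an insert coming from the CDC feed
INSERT INTO @changes (__$start_lsn, __$seqval, __$operation, __$update_mask, Id, Name)
VALUES (NULL, NULL, 2, NULL, 42, N'New customer');

EXEC dbo.usp_UpsertCustomerChanges @changes = @changes;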

Now you can run your pipeline in debug mode and check for any errors. Once you’ve fixed them all, the final pipeline is ready to be embedded into a new one that will take care of parallel runs, configuration management and some trickier things you should consider before moving this pipeline to a production environment. ;)


However, the final result is astonishing if you consider that this is based on a table with more than 2M records, on a target database without indexes. It took just 1 minute and 4 seconds end-to-end for 4 operations (3 updates and 1 insert). These results can be improved a bit further by scaling up the serverless Azure SQL Database target, adding one very simple index during the nightly load of this table, and tuning the real-time daily streaming to run faster.



Hope this article gave you an alternative to the approaches you may have already come across online for near real-time data transfer using CDC, through a “relatively” modern approach.


Good luck!


If you liked this article and want to know more about me and the weird things I usually write, you can find me at my personal blog here, where you can also find this article!


Translated from: https://medium.com/@dario.palladino/change-data-capture-near-real-time-data-streaming-with-an-azure-data-factory-pipeline-bee6d3573e0b
