Heterogeneous Database Replication to TiDB

This article is based on a talk given by Tianshuang Qin at TiDB DevCon 2020.

When we convert from a standalone system to a distributed one, one of the challenges is migrating the database. We’re faced with questions such as:

  • Should I migrate full or incremental data?
  • Should I perform an online or offline migration?
  • Should I use a ready-made data replication tool or develop a new one?

When it comes to data migration, users are often faced with many options. At PingCAP, we’ve probably tried most of them. Over the years, we’ve migrated many heterogeneous databases between different database platforms and application scenarios. Today, I’ll save you some time by sharing with you the approaches that worked best.

A typical database migration process

1. Application adaptation development

Almost all early TiDB users have gone through this step. In version 3.0 or earlier, TiDB supports optimistic concurrency control and the Repeatable Read (RR) isolation level, and its transaction size is limited to about 100 MB. Given these constraints, users need to put a lot of effort into adapting their applications. In contrast, TiDB 4.0 supports pessimistic concurrency control, Read Committed (RC) isolation, and large transactions with a maximum size of 10 GB, so users can adapt their applications to TiDB at a much lower cost.

2. Application verification testing

There are two ways to perform application verification testing. You can combine the two methods to effectively verify your application.

(Figure: Application verification)

The first method is to test your application with production data. To do this, you must first use database replication technology to replicate the data from the production database to TiDB. Then, you use a testing application to perform a stress test. To stress TiDB and ensure that it will be stable in your production environment, apply a workload 10 to 20 times higher than your real production traffic. One of the advantages of replicating data from the production database to TiDB is that you avoid wide variations between test data and production data, which may cause many problems. For example, an SQL query, which has been tuned in the testing environment to achieve its highest performance, may become a slow SQL query in the production environment if the data is not replicated to TiDB for testing.

The second way to verify your application is to test it with production traffic. In this case, you must adopt a service bus (similar to the enterprise service bus (ESB) used by banks) or message queuing technology. For example, you can use the Kafka message queuing mechanism to implement multi-path replication of production traffic. The existing production database stays on the main path, on which the application’s success in production still depends; the replicated traffic travels along a bypass, where we load an application that has been adapted to TiDB and connect it to TiDB for verification.

3. Migration testing

Migration testing mainly involves verifying the operations to be completed during the maintenance window. For example, you must rehearse the migration activities specified in the migration manual in advance, both to verify that the manual is correct and to determine whether the maintenance window is long enough to perform the migration. You also need to perform rollback testing, because if the deployment to production fails, you may need to roll back to the previous database.

4. Data migration and production database switchover

Your applications may run 24/7 or you may only have a short maintenance window to switch over databases, so you must migrate your data before the maintenance window ends. To do that, you must use heterogeneous database replication technology. During the maintenance window, you can stop all running applications, replicate incremental data to the secondary database, perform a comparison to ensure that the secondary database is synchronized with the primary database, and then verify applications. Once the application verification testing is completed, database switchover starts. If the switchover is successful, TiDB will run as a primary database in the production environment.
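To make the comparison step concrete, here is a minimal sketch of a table-level consistency check, assuming both databases are reachable over a standard SQL interface. SQLite stands in for both sides, and the table names are hypothetical:

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table):
    """Row count plus an order-insensitive checksum of a table's rows."""
    digest = 0
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    for row in rows:
        # XOR per-row hashes so the result does not depend on row order.
        # (Identical duplicate rows would cancel out; real tools chunk
        # tables by primary key ranges instead.)
        h = hashlib.sha256(repr(row).encode("utf-8")).digest()
        digest ^= int.from_bytes(h[:8], "big")
    return len(rows), digest

def tables_match(primary, secondary, table):
    """Compare one table's fingerprint across the two databases."""
    return table_fingerprint(primary, table) == table_fingerprint(secondary, table)
```

In practice, dedicated tools such as sync-diff-inspector split tables into chunks and compare checksums in parallel; this sketch only illustrates the idea.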

Application scenarios of database replication

Migrating data

We’ve talked about this application scenario in the previous section.

Creating a disaster recovery database

If you use Oracle as the primary database, you can use TiDB as its disaster recovery database. If you’ve just deployed a TiDB database in the production environment without sufficient verification, you can use an Oracle database as the disaster recovery database for TiDB. That way, if the TiDB database suddenly crashes, you can promptly migrate the data back to the original production database.

Creating a read-only or archive database

First, let’s look at the application scenario of building a read-only database. This is applicable to some bank clients. A bank’s core services run in a closed system, and it may be impossible to migrate them to an open platform or a distributed database in a short time. However, some read-only applications, such as querying account details, bills, or monthly statements on the app client, can be completed without accessing the core production database, which only processes real transactions. Therefore, we can use database replication technology to replicate such read-only application data from the production database to the TiDB database and perform the read-only operations only in the TiDB database.

Another scenario is building an archive database. If you use a traditional standalone database for production and its capacity is limited, but your application data is growing quickly, the data cannot be migrated to a distributed database in a short time. A solution is to save data in the production database for a specific period (for example, 30 or 40 days), delete expired data from the production database, and store the deleted data in TiDB. That is, the deletion operation is performed only in the production database, and TiDB is used as an archive database.
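A minimal sketch of such an archive job, assuming a hypothetical `orders` table with a `created_at` column (SQLite stands in for both the production and archive databases):

```python
import datetime
import sqlite3

RETENTION_DAYS = 30  # example retention window from the text

def archive_expired_rows(prod, archive, table, today=None):
    """Copy rows older than the retention window into the archive
    database, then delete them from production. Deletion happens only
    in the production database."""
    today = today or datetime.date.today()
    cutoff = (today - datetime.timedelta(days=RETENTION_DAYS)).isoformat()
    rows = prod.execute(
        f"SELECT * FROM {table} WHERE created_at < ?", (cutoff,)
    ).fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        archive.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        archive.commit()
        prod.execute(f"DELETE FROM {table} WHERE created_at < ?", (cutoff,))
        prod.commit()
    return len(rows)
```

In a real deployment, the copy and delete would be batched and driven off an indexed time column to avoid long transactions in the production database.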

Aggregating data from multiple sources

You can use TiDB as a data hub. You might run multiple applications in Online Transactional Processing (OLTP) databases and want to use database replication technology to aggregate data from multiple sources into one TiDB database. Then, you can perform in-depth analysis or read-only queries in the TiDB database. The main challenge for multi-source aggregation lies in the cross-database query after data is successfully aggregated into the TiDB database. The data may come from heterogeneous databases, and it is impossible to create database links among them, as database links can only be created among Oracle databases. To solve this problem, you can use heterogeneous database replication and use the TiDB database in a role similar to a widely deployed operational data store (ODS) for aggregating data.

Replicating heterogeneous databases

This section discusses some commonly used heterogeneous database replication methods.

Data transfer via interface files

This method is widely used when transferring data between OLTP and Online Analytical Processing (OLAP) systems. Because the data transfer involves two different systems, it’s difficult to connect the two database networks; databases are backend systems, and for security reasons it’s not advisable to connect them to each other directly.

A comma-separated values (CSV) file is a typical interface file. The interface file here refers to a file generated by an application, based on a predefined format and agreed rules for delimiters and line breaks. After receiving a generated interface file, the receiving end parses it based on the agreed format, converts the rows into INSERT statements, and inserts them into the target database.

The advantage of this method is that it applies to any database. As long as the upstream and downstream databases support standard SQL interfaces, you can transfer data through an interface file.
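As a minimal sketch of the receiving end (the table layout and the header-row convention are assumptions about the agreed format), the parse-and-insert step might look like this:

```python
import csv
import io
import sqlite3  # stand-in for any database with a standard SQL interface

def load_interface_file(conn, table, fileobj, delimiter=","):
    """Parse a CSV interface file and insert its rows in one transaction.

    Agreed format assumed here: the first line lists the column names,
    and every following line is one row.
    """
    reader = csv.reader(fileobj, delimiter=delimiter)
    header = next(reader)
    placeholders = ", ".join("?" for _ in header)
    sql = f"INSERT INTO {table} ({', '.join(header)}) VALUES ({placeholders})"
    with conn:  # commit on success, roll back on error
        count = conn.executemany(sql, reader).rowcount
    return count
```

Using parameterized statements rather than string concatenation sidesteps quoting problems when delimiters or line breaks appear inside field values.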

However, this approach has several disadvantages:

  • It requires additional development in your application code. For example, if the application was originally developed in Java, you need to add more programming logic: logic in the upstream application to generate the interface file, and logic in the downstream application to import it. Moreover, to improve performance, you may need to control the concurrency of the file import.
  • Interface files are only useful for full refresh and append write operations. It is difficult to capture the data changes generated by UPDATE and DELETE operations through an interface file.
  • Data may not be timely. As 5G technology gradually rolls out, terminal devices require lower latency. For example, banks are gradually moving from traditional T+1 analytics to T+0 or even near real-time analytics. When you transfer data using an interface file, it’s hard to keep the data timely, because file loading is triggered only at specific times, with low frequency and efficiency.
  • When data is exported, the upstream database must be scanned extensively to access the data through the SQL interface, which may affect performance. Therefore, as a common practice, the upstream application exports the file from a secondary, read-only database rather than from the primary.

Developing an ETL job and scheduling the job for data transfer

You can develop an extract, transform, load (ETL) job and schedule the job on a regular basis to transfer data. This method is commonly applied to data transfer and processing between OLTP and OLAP systems.

If you need to transfer data over a long period, you must run the ETL job repeatedly: each run obtains the latest incremental data and writes it to the target database. This requires the ability to schedule ETL jobs, which involves additional development.

Using an ETL job has the following advantages:

  • Just like an interface file, an ETL job uses SQL interfaces and is applicable to any database. As long as the upstream and downstream databases support SQL standards, you can use ETL jobs.
  • Additionally, you can process data during the ETL job. If the upstream and downstream databases have different table schemas, or if you need to apply transformation logic between them, ETL jobs are the best choice.

The disadvantages of an ETL job are similar to those of using an interface file:

  • An ETL job requires additional development. You need to create a set of independent SQL jobs and build a scheduling system for them.
  • The data changes incurred by UPDATE and DELETE operations are difficult to obtain via ETL jobs. Compared to using an interface file, the timeliness of ETL may be slightly better, but it depends on the scheduling frequency, which in turn is limited by the processing time each run requires. For example, if each run takes 5 minutes to process the imported data, the replication delay may be as long as 5 to 10 minutes.
  • To access the data through the SQL interface, extensive scanning of the upstream database is required. This may affect performance.
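To illustrate these trade-offs, here is a sketch of a single scheduled ETL run driven by an `updated_at` watermark (hypothetical schema; SQLite stands in for both systems). Inserts and updates since the last run are picked up, but deleted rows simply stop appearing upstream and are never propagated downstream:

```python
import sqlite3

def run_incremental_etl(source, target, watermark):
    """One scheduled ETL run: pull rows changed since the last watermark
    and upsert them into the target. Returns the new watermark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM items WHERE updated_at > ? "
        "ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    target.executemany(
        # SQLite-style upsert; the exact syntax varies by target database.
        "INSERT INTO items (id, payload, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload, "
        "updated_at = excluded.updated_at",
        rows,
    )
    target.commit()
    return rows[-1][2] if rows else watermark
```

The replication delay is at least the scheduling interval plus the run’s own processing time, which is why the text puts the delay at 5 to 10 minutes for a 5-minute job.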

Using a CDC tool

We recommend that you use change data capture (CDC) tools to replicate heterogeneous databases. There are many CDC tools, such as Oracle GoldenGate (OGG), IBM InfoSphere CDC, and TiDB Data Migration (DM).

The following summarizes the advantages and disadvantages of using CDC tools. As you can see, the advantages far outweigh the disadvantages.

Advantages:

  • Your application requires no additional development.
  • CDC tools can capture all DML changes, such as DELETE and UPDATE.
  • Because the workload is distributed throughout the day, these tools offer higher performance.
  • CDC tools provide low latency and near real-time replication.
  • Upstream data is obtained by reading redo logs, which does not impact SQL performance.

Disadvantages:

  • CDC tools are mostly commercial products, and you need to purchase them.
  • Most CDC tools only support a specific database as the upstream database. If you have multiple types of upstream databases, you need to use multiple CDC tools.

Best practices for heterogeneous database replication in TiDB

I’d like to offer some best practice tips for heterogeneous database replication in TiDB:

Tips based on replication tasks:

  • If you want to replicate incremental changes incurred by operations such as UPDATE and DELETE, a CDC tool is your best choice.
  • If you want full data replication, you can use lightweight ETL tools dedicated to data migration, such as Kettle or DataX. You do not need to purchase CDC tools or build other architectures to complete the replication. Instead, you only need to ensure that the ETL tool can access the upstream and downstream databases simultaneously and perform ETL jobs to do the full data replication.

Tips based on scenarios:

If your upstream database is a MySQL-like database or a MySQL-based database (such as Amazon RDS on the public cloud, including Aurora and some sharding products developed based on MySQL), you can use the TiDB DM tool for data transfer. For more information about DM, see TiDB Data Migration Overview.
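As an illustration, a minimal DM task file can look roughly like this. The field names follow DM’s task configuration format, but the hosts, credentials, and rule names below are placeholders; consult the DM documentation for the full format:

```yaml
# Hypothetical minimal DM task configuration.
name: "migrate-from-mysql"
task-mode: all            # full export/import plus incremental replication

target-database:          # the downstream TiDB cluster
  host: "tidb.example.com"
  port: 4000
  user: "dm_user"
  password: "******"

mysql-instances:
  - source-id: "mysql-replica-01"   # upstream source registered with dm-master
    block-allow-list: "bw-rule-1"

block-allow-list:
  bw-rule-1:
    do-dbs: ["app_db"]    # replicate only this schema
```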

If you have any questions on any of the topics we covered today, you can join our community on Slack, and send us your feedback.

Originally published at www.pingcap.com on July 30, 2020.

Translated from: https://medium.com/swlh/heterogeneous-database-replication-to-tidb-c10478d11b29
