具有始终可用可用性组的灾难恢复计划

最新推荐文章于 2022-02-09 10:20:19 发布

culuo4781

最新推荐文章于 2022-02-09 10:20:19 发布

阅读量419

点赞数

文章标签：数据库大数据分布式 python mysql

原文链接：https://www.sqlshack.com/disaster-recovery-planning-always-availability-groups/

版权

When I configured my first Always-On Availability Group, I setup a Windows Cluster and started with SQL Server Management Studio’s New Availability Group Wizard, scripting out the steps along the way. This entire process took only a matter of minutes. The minimum required steps for configuring the cluster and getting a database into an Availability Group (AG) are very few. This process, however, is deceptively simple. What some don’t realize is that the majority of work required for setting up an AG needs to occur in the planning phase, before a server is even requisitioned. Let us examine the questions that you should ask before implementing an AG.

当我配置了第一个Always-On可用性组时，我设置了Windows群集，并从SQL Server Management Studio的New Availability Group Wizard开始 ，在此过程中编写了步骤。整个过程仅需几分钟。配置群集并将数据库放入可用性组（AG）所需的最少步骤很少。但是，此过程看似简单。有些人没有意识到，设置AG所需的大部分工作都需要在计划阶段进行，甚至还不需要服务器。让我们检查一下在实施AG之前应该问的问题。

您将如何连接到SQL Server？ (How will you connect to SQL Server?)

When it comes to fail-over technologies, one critical aspect is how you will connect to the database instance after the fail-over occurs. Whether it is the application configurations, DNS aliasing, or cluster resources, something must identify where to route new connections. When using an Availability Group you may choose to handle this routing separate from SQL Server or manually reconfigure connection strings, but that is not necessary. The AG Listener is a wonderful tool for connection routing that is fail-over aware.

对于故障转移技术，一个关键方面是发生故障转移后如何连接到数据库实例。无论是应用程序配置，DNS别名还是群集资源，都必须标识出将新连接路由到何处。使用可用性组时，您可以选择与SQL Server分开处理此路由，也可以手动重新配置连接字符串，但这不是必需的。 AG侦听器是用于故障转移意识的连接路由的绝佳工具。

Here is a quick overview of the feature.
这是此功能的快速概述。
Here is a detailed breakdown of the MultiSubnetFailover and ApplicationIntent parameters.
这是 MultiSubnetFailover和ApplicationIntent参数的详细分类。
Here are procedures for connecting with SQL Server Management Studio.
以下是与SQL Server Management Studio连接的过程。

The concern with this feature is that, in many architectures, you need the use of the MultiSubnetFailover and / or ApplicationIntent parameters. Unfortunately, these parameters are not supported with certain drivers, such as, OLEDB and the .NET framework versions below 3.5 with SP1 missing hotfix KB2654347 . See this blog post for more details. There is also additional connection overhead, if you are using ApplicationIntent for read-only routing.

此功能的问题在于，在许多体系结构中，您需要使用MultiSubnetFailover和/或ApplicationIntent参数。不幸的是，某些驱动程序不支持这些参数，例如OLEDB和低于3.5且SP1缺少修补程序KB2654347的.NET Framework版本。有关更多详细信息，请参见此博客文章。如果您将ApplicationIntent用于只读路由，则还存在额外的连接开销。

Provider	Multi-Subnet Failover	ApplicationIntent	Read-Only Routing
SQL Naïve Client 11.0 ODBC	Yes	Yes	Yes
SQL Native Client 11.0 OLEDB	No	Yes	Yes
ADO.NET w/ .NET Fx 4.0 Up. 4.0.3	Yes	Yes	Yes
ADO.NET with .NET Fx 3.5	No	Yes	Yes
Microsoft JDBC 4.0 for SQL Server	Yes	Yes	Yes

提供者	多子网故障转移	申请意向	只读路由
SQL天真客户端11.0 ODBC	是	是	是
SQL Native Client 11.0 OLEDB	没有	是	是
带有.NET Fx 4.0的ADO.NET。 4.0.3	是	是	是
带有.NET Fx 3.5的ADO.NET	没有	是	是
用于SQL Server的Microsoft JDBC 4.0	是	是	是

您可以卸载读物吗？ (Can you off-load your reads?)

Availability Groups allow you to off-load some reads to secondary replicas, but can you use this feature? The first concern has already been mentioned above. Are your clients using drivers that support the ApplicationIntent parameter? Often the core of a system is easy enough to identify. Your application might be written with .NET 4.0 or later but the investigation cannot end there. Integration points, both current and future, must be considered. For example, at the time of publishing, this article OLEDB is still a commonly used driver for SQL Server connections from SQL Server Integration Service packages. If ETL processes are going to induce heavy read loads on your server, it is ideal to have them read from a secondary replica. Driver usage must be considered for these auxiliary connections.

可用性组允许您将某些读取卸载到辅助副本，但是您可以使用此功能吗？上面已经提到了第一个问题。您的客户端使用的驱动程序是否支持ApplicationIntent参数？通常，系统的核心很容易识别。您的应用程序可能是用.NET 4.0或更高版本编写的，但调查无法在那里结束。必须考虑当前和将来的集成点。例如，在发布时，本文OLEDB仍然是来自SQL Server Integration Service包SQL Server连接的常用驱动程序。如果ETL进程将在服务器上引起大量读取负载，则理想的是从辅助副本读取它们。这些辅助连接必须考虑使用驱动程序。

The second aspect to consider is whether your application and processes are designed in a manner which can take advantage of the read operation off-loading. Taking advantage of off-loaded reads requires the ApplicationIntent parameter which means that a separate connection string must be used. In addition to the connection string, your transactions will behave differently.

要考虑的第二个方面是，您的应用程序和流程的设计方式是否可以利用读取操作的卸载优势。利用卸载的读取需要ApplicationIntent参数，这意味着必须使用单独的连接字符串。除了连接字符串之外，您的交易行为也会有所不同。

As per MSDN , “read-only workloads for disk-based tables use row versioning to remove blocking contention on the secondary databases. All queries that run against the secondary databases are automatically mapped to snapshot isolation transaction level, even when other transaction isolation levels are explicitly set. Also, all locking hints are ignored. This eliminates reader/writer contention.”

根据MSDN ， 基于磁盘的表的只读工作负载使用行版本控制来消除辅助数据库上的阻塞争用。 即使明确设置了其他事务隔离级别，针对辅助数据库运行的所有查询也会自动映射到快照隔离事务级别。 此外，所有锁定提示都将被忽略。 这消除了读者/作家的争执。 ”

With the mandate of snapshot isolation being used on your secondary replicas, it is critical that the business, and the application doing the reads, are OK with the potential for phantom inserts and transactions which do not guarantee repeatable reads.

由于在辅助副本上使用了快照隔离的要求，因此至关重要的是，业务和执行读取的应用程序必须能够进行幻像插入和事务处理，从而不能保证可重复读取。

您可以卸载备份吗？ (Can you off-load your backups?)

The idea of off-loading your backups to a secondary replica is particularly enticing. When I first heard about this feature I was already spending server resources in my mind to get backup times reduced. However, there are a couple of very important caveats with backup off-loading.

将备份卸载到辅助副本的想法特别诱人。当我第一次听说此功能时，我已经在脑海中花费服务器资源以减少备份时间。但是，在卸载备份方面有两个非常重要的警告。

The most disappointing caveat is that only COPY_ONLY full backups are allowed on the secondary replicas. See Microsoft’s explanation here . The reason for this is that writes cannot occur on the read-only replica. This means that the log chain can’t be impacted and the differential bitmap cannot be cleared. This spoils backup off-loading for me because your full backups are likely to be your most resource intense, run the longest, and are required for point-in-time recovery and for the ability to append differential and/or log backups. The final caveat is that differential backups cannot be taken on secondary replicas, leaving us with only log backups which can be taken in a normal manner.

最令人失望的警告是，辅助副本上仅允许COPY_ONLY完整备份。请参阅Microsoft的解释在这里。原因是写操作不能在只读副本上发生。这意味着日志链不会受到影响，差分位图也无法清除。这对我来说不利于备份的卸载，因为您的完整备份可能最耗费资源，运行时间最长，并且是时间点恢复以及追加差异和/或日志备份功能所必需的。最后要说明的是，差异备份不能在辅助副本上进行，而只剩下可以正常方式进行的日志备份。

您将如何管理服务器对象？ (How will you manage server objects?)

Availability Groups establish mirroring which is scoped to the database. If you are using contained databases and do not use the SQL Agent, linked servers, credentials, proxies, or any other server level object, then jump down to the next question. For the rest of us, these server objects are important and must be synchronized between the replicas manually. This can be a management challenge if not planned well.

可用性组建立镜像范围，该镜像作用于数据库。如果您使用的是包含的数据库，并且不使用SQL代理，链接的服务器，凭据，代理或任何其他服务器级别的对象，请跳至下一个问题。对于我们其他人来说，这些服务器对象很重要，必须在副本之间手动进行同步。如果规划不当，这可能是管理上的挑战。

One company that I worked with was using SSDT database projects for deployment. What we ended up doing was creating a PowerShell script which first deployed a T-SQL script to all nodes which included all of the server object creation statements wrapped in a large number of IF NOT EXISTS and DROP / CREATE statements. Then the dacpac would deploy to the primary node since all of the server dependencies had just been created. However it gets accomplished, it is a process which you will have to establish yourself and it is required. The last thing that you want is for an automatic fail-over to occur and you learn that the secondary replica didn’t have your service account logins on it. Suddenly the illusion of high-availability disappears.

与我合作的一家公司正在使用SSDT数据库项目进行部署。我们最终要做的是创建一个PowerShell脚本，该脚本首先将T-SQL脚本部署到所有节点，这些节点包括用大量IF NOT EXISTS和DROP / CREATE语句包装的所有服务器对象创建语句。然后，由于所有服务器依赖项都已创建，因此dacpac将部署到主节点。无论它完成了多少，这都是您必须建立自己的过程并且是必需的。您需要做的最后一件事是进行自动故障转移，并且您了解到辅助副本上没有您的服务帐户登录名。突然，高可用性的幻想消失了。

我的中学和我的中学一样吗？ (Is my secondary the same as my primary?)

Your secondary replicas are transactionally consistent (exact copies) of the primary. But exact might not be as exact as you think. Take the asynchronous commit mode for a moment. This mode means that the transactions will be asynchronously sent from the primary replica to all secondary replicas. At any given moment, your secondary replicas might be different due to latency. When it is small, delay can be understood and easily accepted for most applications. However, a 1 second latency in data replication can play tricks on any application that is designed to perform reads and writes and is instructed to use two different replicas for these operations. This small difference between replicas can cause unintentional reprocessing of records. Process flow should be examined and not all read workloads should be pushed to a secondary replica.

您的辅助副本与主副本在事务上是一致的（精确副本）。但是精确程度可能不如您想象的那么精确。暂时使用异步提交模式。此模式意味着事务将从主副本异步发送到所有辅助副本。在任何给定时刻，由于延迟，您的辅助副本可能会有所不同。当它很小时，对于大多数应用程序，延迟可以理解并且容易接受。但是，数据复制中1秒钟的延迟会在设计为执行读写的任何应用程序上发挥作用，并被指示使用两个不同的副本进行这些操作。副本之间的微小差异可能会导致记录的意外重新处理。应该检查流程，而不是将所有读取的工作负载都推送到辅助副本。

Statistics stored in the database are technically identical to the primary as well. Statistics created on the primary will replicate over to the secondary replicas and all statistic maintenance will have to occur on the primary. However, it is best practice to allow auto update and auto creation of statistics based on the queries that a SQL Server instance receives. If the secondary replicas could not create statistics, then performance of read-only workloads could suffer or you would have to spend additional time manually identifying missing statistics and creating them. To mitigate this issue, SQL Server will create statistics in tempdb.

技术上，存储在数据库中的统计信息也与主数据库相同。在主数据库上创建的统计信息将复制到辅助副本，并且所有统计信息维护都必须在主数据库上进行。但是，最佳做法是允许基于SQL Server实例接收的查询自动更新和自动创建统计信息。如果辅助副本无法创建统计信息，则只读工作负载的性能可能会受到影响，或者您将不得不花费额外的时间手动识别丢失的统计信息并创建它们。为了缓解此问题，SQL Server将在tempdb中创建统计信息。

As per MSDN , “the secondary replica creates and maintains temporary statistics for secondary databases in tempdb. The suffix _readonly_database_statistic is appended to the name of temporary statistics to differentiate them from the permanent statistics that are persisted from the primary database.” This behavior can put your system into a condition where statistics are different on each secondary replica which can complicate query performance tuning.

按照MSDN的要求，“ 辅助副本创建并维护tempdb中辅助数据库的临时统计信息。 后缀_readonly_database_statistic附加到临时统计信息的名称上，以将其与从主数据库保留的永久统计信息区分开。 ”这种行为会使您的系统处于一种状况，即每个辅助副本上的统计信息都不同，这会使查询性能调整变得复杂。

With tempdb statistics the database is still a mirror of the primary but its behavior may not be, hopefully for the better. Another slight difference that might occur is with the file structure of your database. As noted in, Troubleshooting a Failed Add-File Operation , all file operations executed with T-SQL on the primary will be replicated to the secondary replicas. These operations send over the literal file path that is used on the primary, without transformation. Therefore, if the file structures do not match between the replicas, then a synchronization failure occurs. This operation succeeds on the primary but the secondary’s failure stops data traveling to that replica, causes the primary to begin stock piling transaction log records, and must be re-initialized from backup or removed from the AG. When initializing an Availability Group replica with different file structures (including drive letter) you will need to utilize the RESTORE command’s MOVE clause and follow the procedures for Manually Prepare a Secondary Database for an Availability Group (SQL Server). I highly recommend avoiding this problem, if you can, by making sure the drive letters and folder structures match on all replicas of an AG.

使用tempdb统计信息，数据库仍然是主数据库的镜像，但它的行为可能不是，希望更好。可能发生的另一个细微差别是数据库的文件结构。如对失败的添加文件操作进行故障排除中所述，在主数据库上使用T-SQL执行的所有文件操作都将被复制到辅助副本。这些操作通过主服务器上使用的原义文件路径发送，而无需转换。因此，如果副本之间的文件结构不匹配，则会发生同步失败。此操作在主服务器上成功完成，但是辅助服务器的故障停止了数据传输到该副本，导致主服务器开始进行库存堆事务日志记录，并且必须从备份中重新初始化或从AG中删除。在初始化具有不同文件结构（包括驱动器号）的可用性组副本时，您将需要利用RESTORE命令的MOVE子句，并遵循为可用性组手动准备辅助数据库（SQL Server）的过程。我强烈建议通过确保驱动器号和文件夹结构在AG的所有副本上都匹配来避免此问题。

为什么需要集群，这对我有什么影响？ (Why do I need a cluster and how does that affect me?)

A Windows Fail-over Cluster was not a requirement for Database Mirroring, the feature that serves as the foundation for Availability Groups. With AGs, a typical multi- data center architecture might look like this.

Windows故障转移群集不是数据库镜像的要求，该功能是可用性组的基础。对于AG，典型的多数据中心架构可能看起来像这样。

The most important aspect of this change in requirements is how you configure quorum. With Database Mirroring, your database was either available or not, and your witness either could communicate with the replicas or could not. Clustering provides advanced configurations for various servers, file shares, or disks to vote on who has quorum but also can be unforgiving. Misconfigured quorum can cause a complete cluster outage even when normal events take place. Using the example image above, this architecture crosses into two sub-nets because replicas A and B are in data center 1 and replicas C and D are in data center 2. If connectivity between the data centers is less than ideal, a minor blip in the connection could cause a loss of heartbeat singles. When quorum is configured correctly there will also be an odd number of votes still online but, if not planned out thoroughly, this normal event could cause the cluster service on all nodes to shut down. This is a safety feature designed to prevent a split-brain effect where two halves of the cluster each think that they are the real cluster.

需求更改中最重要的方面是如何配置仲裁。使用数据库镜像，您的数据库要么可用，要么不可用，见证人可以与副本进行通信，也可以不与副本进行通信。群集可为各种服务器，文件共享或磁盘提供高级配置，以投票决定谁拥有法定人数，但也有可能是无情的。错误配置的仲裁可能导致整个群集中断，即使发生正常事件也是如此。使用上面的示例图像，由于副本A和B在数据中心1中，副本C和D在数据中心2中，因此该体系结构跨越了两个子网。如果数据中心之间的连接性不理想，则出现一个小问题连接可能会导致心跳单打丢失。如果正确配置了仲裁，那么仍然会有奇数个投票保持联机状态，但是，如果没有彻底计划，此正常事件可能会导致所有节点上的群集服务关闭。这是一项安全功能，旨在防止出现裂脑效应，在这种情况下，群集的两半都认为它们是真实的群集。

The next consideration is operating system versions. Naturally, making sure that you can comply with supported versions is required. Windows Server 2008 and above all support AGs but, for versions below 2012, you will need to install service packs and hotfixes in order to move forward. See this table of prerequisites for more details. Also, with Database Mirroring, it is not a requirement that all OS’s match. With AGs you need a cluster, and with a cluster you need matching OSs. This, alone, might throw a wrench in your plans, but if it doesn’t, you still need to pre-plan your upgrade paths and make sure that you are OK with the limitations. At some point a new OS version will release and you will, eventually, want to upgrade. You should plan in advance how this can happen. Would you attempt in-place upgrades? I probably wouldn’t but there is still the choice between side-by-side installations with database migrations or Cross-cluster Migration of AlwaysOn Availability Groups for Operating System Upgrades .

下一个考虑因素是操作系统版本。自然，需要确保您可以符合支持的版本。 Windows Server 2008及更高版本均支持AG，但对于2012年以下的版本，您需要安装Service Pack和修补程序才能继续。有关更多详细信息，请参阅此先决条件表。同样，对于数据库镜像，也不要求所有操作系统都匹配。使用AG，您需要一个集群，而使用集群，则需要匹配的OS。仅此一项，可能会给您的计划带来麻烦，但是如果没有，您仍然需要预先计划升级路径，并确保您对这些限制表示满意。在某个时候，将发布新的OS版本，您最终将需要升级。您应该提前计划如何发生。您会尝试就地升级吗？我可能不会，但是在通过数据库迁移的并行安装或用于操作系统升级的AlwaysOn可用性组的跨集群迁移之间仍然可以选择。

Finally we come to a decision around data duplication and SQL Server Fail-over Cluster Instances (FCI). My preferred architecture for an AG is to have stand-alone instances using locally attached storage. In that case, the SQL Server services would not be a part of a cluster resource, only the AG and its listener. However, I have had clients who have very large databases (VLDB) and they do not want to spare the expense of having data duplicated within a single data center. In that case, a hybrid approach of AGs and FCIs can be configured so that shared storage is used within each datacenter but AG replication is used to synchronize the other data center. That architecture would look something like this

最后，我们决定围绕数据复制和SQL Server故障转移群集实例（FCI）。对于AG，我首选的体系结构是使用本地连接的存储具有独立实例。在这种情况下，SQL Server服务将不会成为群集资源的一部分，而只是AG及其侦听器。但是，我有一些拥有非常大的数据库（VLDB）的客户，他们不想浪费在单个数据中心内复制数据的开销。在那种情况下，可以配置AG和FCI的混合方法，以便在每个数据中心内使用共享存储，但使用AG复制来同步另一个数据中心。该架构看起来像这样

Using a hybrid approach of this type adds much complexity. Beyond the manageability of a more complicated architecture, AG automatic fail-over is not possible in this configuration. The only automatic fail-over that is supported will be at the instance level, performed by the FCI. This means that automatic fail-over cannot be configured to move different AGs from the same instance to separate instances. The whole instance will move. Another concern is that only a single instance with an AG can ever reside on a node of the cluster. To state it another way, if you have more than one instance in the cluster with an AG, then they must be configured to only move between nodes which are not possible owners of the other. To achieve automatic fail-over in a cluster with two FCIs, you will need a minimum of four nodes which might make a strong case for simply splitting them into two separate clusters to begin with.

使用这种类型的混合方法会增加很多复杂性。除了更复杂的体系结构的可管理性之外，在这种配置中，AG自动故障转移是不可能的。支持的唯一自动故障转移将是在实例级别，由FCI执行。这意味着无法配置自动故障转移以将不同的AG从同一实例移动到单独的实例。整个实例将移动。另一个问题是，只有具有AG的单个实例才能驻留在集群的节点上。换句话说，如果群集中具有AG的实例中有多个，则必须将它们配置为仅在不可能成为另一个所有者的节点之间移动。为了在具有两个FCI的群集中实现自动故障转移，您至少需要四个节点，这可能是一个很好的理由，可以将它们简单地分为两个独立的群集。

从这往哪儿走 (Where to go from here)

After asking all those questions, it is now clear that the T-SQL commands or the SSMS wizard to configure an Availability Group are only a small part of the process required to be successful with this technology. Building out an AG is about paying a man hour cost for planning up-front in exchange for long-term stability. When thought through thoroughly, AGs provide faster recovery times then log shipping, far less maintenance than replication, no dependency on shared storage unlike FCIs, and can fail-over database groups unlike Database Mirroring.

在问完所有这些问题之后，现在很明显，使用T-SQL命令或SSMS向导配置可用性组只是成功使用此技术所需过程的一小部分。建立一个AG就是要为进行长期计划而付出工时成本，以换取长期稳定性。经过深思熟虑，AG提供的恢复时间比日志传送更快，维护远少于复制，并且不像FCI那样依赖共享存储，并且可以像数据库镜像那样对数据库组进行故障转移。

Use these questions and the issues addressed to augment your research and planning before implementing an Availability Group. Then you should do full tests in a non-production environment. This includes building out the full cluster and having some nodes at your disaster recovery site with fail-over tests under load.

在实施可用性组之前，请使用这些问题和所解决的问题来扩大您的研究和计划。然后，您应该在非生产环境中进行全面测试。这包括构建整个群集，并在灾难恢复站点上使某些节点具有负载下的故障转移测试。