了解SQL Server数据管理生命周期

最新推荐文章于 2023-02-27 18:24:59 发布

culuo4781

最新推荐文章于 2023-02-27 18:24:59 发布

阅读量501

点赞数

文章标签：数据库 python java linux 大数据

原文链接：https://www.sqlshack.com/understanding-the-data-management-life-cycle/

版权

This is the second article in the “Backup and Restore (or Recovery) in SQL Server” stairway series (see the full TOC below). This article deals with the different phases of data management life cycle and it encompasses the following topics:

这是“ SQL Server的备份和还原（或恢复）”阶梯系列中的第二篇文章（请参见下面的完整目录）。本文介绍了数据管理生命周期的不同阶段，其中包括以下主题：

Introduction to data corruption
数据损坏简介
Defining data corruption and its causes
定义数据损坏及其原因
Discussion on the impact of data corruption
讨论数据损坏的影响
Explaining data prevention mechanisms
解释数据预防机制
Data protection
数据保护
And more…
和更多…

数据损坏简介 (Introduction to Data Corruption)

Data is said to be corrupted when it attains an undesirable state different from the original data. Data corruption also refers to data errors that can occur during any of the stages of data writing, data processing or even reading, at the storage, or during transmission.

当数据达到与原始数据不同的不良状态时，就称该数据已损坏。数据损坏还指在存储或传输过程中，在数据写入，数据处理甚至读取的任何阶段都可能发生的数据错误。

We all know that at the very core, data is nothing but a series of 1s and 0s. When data is written, what actually happens is that the 1s and 0s are written to the destination. This sequence of 1s and 0s is interpreted during the data reading operation, giving it some meaning. If these bits don’t get written in the intended sequence, the data is said to have been corrupted.

我们都知道，从根本上讲，数据只不过是一系列的1和0。写入数据时，实际发生的是将1和0写入目标。在数据读取操作期间会解释1和0的顺序，使其具有一定的含义。如果未按预期顺序写入这些位，则表示数据已损坏。

Small data means smaller sequences, and relatively lower chances of corruption. When the amount of data increases, the chances of corruption also increases. The chances are particularly high when data generation is high, and the subsequent storage operations hit unprecedented rates.

小数据意味着较小的序列，并且损坏的机会相对较低。当数据量增加时，损坏的可能性也会增加。当数据生成量很高时，机会特别高，随后的存储操作达到空前的速度。

Some of the most common reasons for data corruption include hardware or software issues with regard to I/O subsystems.

数据损坏的一些最常见原因包括与I / O子系统有关的硬件或软件问题。

数据损坏的原因 (Causes of Data Corruption)

Hardware issues or failures
硬件问题或故障
Memory issues
记忆问题
Power failure or outages
停电或停电
- SAN controllers
  SAN控制器
- RAID controllers
  RAID控制器
- Disk drivers
  磁盘驱动器
- Bad sectors on the disks
  磁盘上的坏扇区
Operating System errors
操作系统错误
Virus attacks
病毒攻击
Antivirus, defraggers, even data encryption,
杀毒软件，碎片整理程序甚至数据加密
SQL Server bugs
SQL Server错误
Human errors
人为错误
Improper shutdown
关闭不当
Hard reset
硬重置

腐败的影响 (Effects of corruption)

In general, SQL Server isn’t aware of data corruption until it comes across errors when accessing a resource that’s corrupt. Therefore, the issue is that if the data wasn’t backed up before the corruption occurred, there would be no way to retrieve the data, which would mean permanent data loss.

通常，SQL Server在访问损坏的资源时遇到错误之前，不会意识到数据损坏。因此，问题在于，如果未在损坏发生之前备份数据，则将无法检索数据，这将意味着永久性数据丢失。

Let’s consider a scenario where a mission-critical application database encountered data corruption. Let us remember that SQL Server doesn’t notice corrupt data until it tries to access it. Suppose a certain chunk of data was not read for a few weeks, because there was no necessity to, and data backup retention period was over resulting in the backup being purged. And then one day, after the backup containing the data was purged, someone queried the data, and SQL Server reported that it was corrupt. The value of the data chunk and the impact of the loss to the business depend on the nature of that data. Had the issue been detected within the span when the data could’ve been restored, the process may have even been considered insignificant. Now, however, this may be a loss to the business or a loss of job for the database administrator. Performing required data consistency checks on the database at regular intervals is a critical responsibility of a database administrator.

让我们考虑一个关键任务应用程序数据库遇到数据损坏的情况。让我们记住，SQL Server在尝试访问数据之前不会注意到损坏的数据。假设几周内没有读取某些数据，因为没有必要，而且数据备份保留期已过，导致清除了备份。然后一天，清除包含数据的备份后，有人查询了数据，SQL Server报告该数据已损坏。数据块的价值以及损失对业务的影响取决于数据的性质。如果在可以恢复数据的时间范围内检测到问题，则该过程甚至可能被认为无关紧要。但是现在，这可能是业务损失或数据库管理员的工作损失。定期对数据库执行所需的数据一致性检查是数据库管理员的一项重要职责。

预防腐败 (Preventing corruption)

SQL Server has several built-in commands to mitigate the effects of corruption on the data during data operations.

SQL Server具有多个内置命令，以减轻数据操作期间数据损坏的影响。

监视数据库中是否有损坏迹象 (Monitor the databases for signs of corruptions)

One of the ways to look for data corruption is by enabling the CHECKSUM option for page verification.

查找数据损坏的方法之一是通过启用CHECKSUM选项进行页面验证。

  ALTER DATABASE ProdSQLShackDemo SET PAGE_VERIFY CHECKSUM

This automatically checks the integrity of the data even as it is read from the table. All page verification operations are performed in the buffer pool.

即使从表中读取数据，这也会自动检查数据的完整性。所有页面验证操作均在缓冲池中执行。

实施备份策略 (Implementing a backup strategy)

If the data is stored over an extended period of time, then CHECKSUM is not a good-enough option to validate the data. In those cases, we need to predominantly depend on the database consistency checker commands.

如果数据存储时间较长，则CHECKSUM不是验证数据的理想选择。在这些情况下，我们主要需要依赖数据库一致性检查器命令。

Early detection is the key for successful data management. The sooner we notice issues, the easier it would be to manage these issues and the easier would it be to control the situation.

早期检测是成功进行数据管理的关键。我们越早发现问题，就越容易管理这些问题，就越容易控制局势。

Always plan for a good backup strategy. A good backup strategy is always helpful, and helps recover from any disaster. Not only does it guarantee smooth functioning of the business, but also protects the data against user-caused errors, along with other types of failures.

始终计划好的备份策略。一个好的备份策略总是有帮助的，并有助于从任何灾难中恢复。它不仅可以保证业务的平稳运行，而且可以保护数据免受用户引起的错误以及其他类型的故障的影响。

有选择 (Having options)

It’s best to never place all of your eggs in the same basket. That is, it’s always better to have multiple options of recovery—as many as possible. However, no matter how many options we have, they’ll be helpful only when the issue is detected at an early stage. If not, the corruption may become too expensive to handle.

最好不要将所有鸡蛋放在同一个篮子里。就是说，有尽可能多的恢复选项总是更好。但是，无论我们有多少选择，它们只有在早期发现问题时才会有用。如果不是这样，腐败可能变得太昂贵而无法处理。

SQL Server has a number of built-in tools to check the structure of the databases and their internals using DBCC commands that provide useful information to troubleshoot corruption issues. These commands can be divided into four categories:

SQL Server具有许多内置工具，它们使用DBCC命令检查数据库的结构及其内部，这些命令提供了有用的信息来解决损坏问题。这些命令可以分为四类：

Informational commands
信息命令
Validation commands
验证命令
Maintenance commands
维护命令
Miscellaneous commands
杂项命令

The most comprehensive check for data corruption can be done using the DBCC CHECKDB command.

可以使用DBCC CHECKDB命令对数据损坏进行最全面的检查。

It runs DBCC CHECKALLOC to check the consistency of disk space allocation structures, runs CHECKTABLE to check the table structure and pages for corruption, and then, runs CHECKCAALOG to validate the catalog consistency.
它运行DBCC CHECKALLOC来检查磁盘空间分配结构的一致性，运行CHECKTABLE来检查表结构和页面是否损坏，然后运行CHECKCAALOG来验证目录的一致性。
It validates every index view
它验证每个索引视图
It validates the links between the files on the disk, and in the metadata store of the table for filestream-enabled databases.
它为启用文件流的数据库验证磁盘上文件以及表的元数据存储中文件之间的链接。
It validates the service broker data.
它验证服务代理数据。

We have three options for the DBCC CHECKDB command:

对于DBCC CHECKDB命令，我们有三个选项：

Repair allow data loss, which deallocates pages and wipes out the data 修复会导致数据丢失，从而导致页面分配不足并擦除数据
Repair Fast, which is used for backward compatibility 快速修复 ，用于向后兼容
Repair Build, which yields to no data loss. It performs quick repairs on missing rows and non-clustered indexes. 修复Build ，不会丢失任何数据。它对丢失的行和非聚集索引执行快速修复。

It’s interesting to know that the DBCC commands actually create an internal snapshot of the database, and then performs checks against that snapshot. This prevents concurrency issues. There are a few instances where the snapshot isn’t created, as well, for example, in a read-only database or a database in single-user mode or using hints.

有趣的是，DBCC命令实际上创建了数据库的内部快照，然后针对该快照执行检查。这样可以防止并发问题。在某些情况下，也不会创建快照，例如，在只读数据库或单用户模式或使用提示的数据库中。

On a VLDB database, CHECKDB command may take a while to complete. If the database design uses multiple file groups, we can reduce the load by initiating the CHECKDB process on individual file groups. It can also be scheduled to run at a larger interval, such as alternate days in place of every day.

在VLDB数据库上，CHECKDB命令可能需要一段时间才能完成。如果数据库设计使用多个文件组，则可以通过在单个文件组上启动CHECKDB进程来减少负载。还可以安排它以较大的间隔运行，例如以隔天代替每天。

Let’s now dive into some examples to understand the DBCC concepts and syntax:

现在，让我们深入研究一些示例以了解DBCC概念和语法：

-- Chnge the databases 
USE ProdSQLShackDemo;
GO
-- run DBCC command
DBCC CHECKDB;
GO
 
-- The CHECKDB runs the following three console commands
DBCC CHECKALLOC;

DBCC CHECKTABLE ('SQLShackAuthor') WITH PHYSICAL_ONLY; 
DBCC CHECKCATALOG WITH NO_INFOMSGS
DBCC CHECKCATALOG (ProdSQLShackDemo);
-- Checktable can be further simplified for specific index
DECLARE @indid int;    
SET @indid = (SELECT index_id   
              FROM sys.indexes    
              WHERE object_id = OBJECT_ID('SQLShackAuthor') AND name like '%PK__SQLShack__3214EC27014935CB%')    
DBCC CHECKTABLE ('SQLShackAuthor',@indid);

-- If corruptions are identified the  repair options are available
-- database must be in single user mode
DBCC CHECKDB ('ProdSQLShackDemo', REPAIR_ALLOW_DATA_LOSS); -- REPAIR_FAST or REPAIR_REBUILD
-- CHECKFILEGROUP for specific filegroup
USE ProdSQLShackDemo;
GO
DBCC CHECKFILEGROUP (0, NOINDEX)
    WITH PHYSICAL_ONLY,
         ESTIMATEONLY;
GO

With the CHECKFILEGROUP command, we need to specify the name of the file group or its filegroupID. In this example, 0 is used, which represents primary filegroups. The NOINDEX option ignores the non-clustered index and runs only for clustered indexes and heap. The physical_only option validates the structure of the data pages, page headers and structure of the files on the disk. The estimate_only option figures out how much space is required in tempdb to create the snapshot.

使用CHECKFILEGROUP命令，我们需要指定文件组的名称或其文件组ID。在此示例中，使用0表示主要文件组。 NOINDEX选项将忽略非聚集索引，并且仅对聚集索引和堆运行。 physical_only选项将验证数据页的结构，页头以及磁盘上文件的结构。 estimate_only选项指出在tempdb中创建快照需要多少空间。

保护数据库 (Protecting the database)

Database protection is the process of safeguarding the data from corruption. We are in the era of little-to-no tolerance for outages and downtime. Hence, data protection is the key for many successful organizations.

数据库保护是保护数据免受损坏的过程。我们正处于对故障和停机几乎没有容忍的时代。因此，数据保护是许多成功组织的关键。

Data protection can be classified further into Availability and Management

数据保护可以进一步分为可用性和管理

Data availability is a process to ensure to smoothly run the business without any interruptions or data loss.

数据可用性是确保业务平稳运行而不会造成任何中断或数据丢失的过程。

Data management is a process used to understand the nature of the data, govern the data in such a way that the data is available at any cost irrespective of any forms corruption and it’s agreed with the framework of business policy and government bodies. Here are a few points to keep in mind:

数据管理是用于理解数据性质，以某种方式管理数据的过程，以使数据能够以任何成本获得，而不管其是否存在任何形式的损坏，并且已与商业政策和政府机构框架达成一致。请记住以下几点：

Detect issues early
尽早发现问题
Have a good backup strategy
有一个好的备份策略
Run DBCC commands frequently
经常运行DBCC命令

SQL Server has several built-in mechanisms called page protection options to automatically identify data corruption.

SQL Server具有几种内置的机制，称为页面保护选项，可以自动识别数据损坏。

TORN_PAGE_DETECTION allows SQL Server to identify incomplete writes.
TORN_PAGE_DETECTION允许SQL Server识别不完整的写入。
CHECKSUM uses negligible CPU and is an error-detection mechanism, however, is not a protection mechanism. It does a comprehensive evaluation of the data pages and is the recommended option to use while deciding the data protection and backup protection process.
CHECKSUM使用的CPU可以忽略不计，它是一种错误检测机制，但不是保护机制。它对数据页进行了全面评估，是在确定数据保护和备份保护过程时推荐使用的选项。

摘要 (Summary)

It’s important for any database administrator to understand the data lifecycle and the nature of the particular business, in order to have the ability to recover artifacts of business value from any sort of data disruptions.

任何数据库管理员都必须了解数据生命周期和特定业务的性质，以便能够从任何类型的数据中断中恢复业务价值工件，这一点很重要。

There are a number of ways in which corruption can creep into the system, and what makes it worse is its random nature, on when and where it occurs. In other words, corruption can never be eliminated from the systems. The administrators should perform proactive monitoring and guard the data against undesirable but inevitable issues of corruption. If corruption gets more frequent, then you should start working on a migration plan to move the database to a new hardware.

腐败可以通过多种方式蔓延到系统中，更糟糕的是它的随机性，发生时间和地点。换句话说，腐败永远无法从系统中消除。管理员应执行主动监视，并保护数据免受不良但不可避免的损坏问题的影响。如果损坏变得更加频繁，那么您应该开始制定迁移计划，以将数据库移至新硬件。

Educate yourself—As the preceding points have noted, many causes of data corruption and loss can be avoided by taking appropriate measures to protect your data and avoid the common causes of data corruption.

自我教育—正如前面提到的那样，可以通过采取适当措施保护您的数据并避免造成数据损坏的常见原因来避免造成数据损坏和丢失的许多原因。

Implement a comprehensive Backup and Restore/Recovery solution and test it regularly to ensure everything is squeaky clean.

实施全面的备份和还原/恢复解决方案并定期对其进行测试，以确保所有内容都干净整洁。

Repair options should be considered as the last resort. It is advisable to make sure that you’re working on a cloned copy of the database; that way you can investigate what data, if any, was actually lost or modified.

维修选择应被视为万不得已。建议确保您正在处理数据库的克隆副本。这样，您可以调查实际上丢失或修改了哪些数据（如果有）。

Understand the scope of the business before tailoring the high availability feature set for the enterprise. No matter what high availability features we enable, ultimately, in most of the cases, recovering from a corruption relies on having a good backup strategy and routine monitoring.

在为企业定制高可用性功能集之前，请先了解业务范围。无论我们启用哪种高可用性功能，最终，在大多数情况下，从损坏中恢复都取决于拥有良好的备份策略和常规监视。