Logging SQL Server Database Errors

We receive many database alerts, and many of them log the same alerts or information to files or tables. Over time, we've found that this logging now costs us a significant amount of resources. Our logging server (where both file and table logging are stored) has had a few outages related to conflicting messages from other servers. We've considered scaling the alerting by environment type, but we've also considered that we may simply be logging too much information about our databases. In addition, because we receive so many alerts each day, we can't resolve them all and still assist with other issues that arise. What techniques can we use to address too much logged information and too many alerts?


Overview

We often pair alerting with error logging, where some errors become alerts. While there are situations where we want a detailed log of what happened, especially with certain errors or application processes, we can over-log information or retain it well past the point we need it, and end up with logging that consumes significant resources. In a few cases, there may be a next step we can run or try immediately when an event occurs, before logging anything. We'll look at scenarios where intensive logging is useful and scenarios where we can reduce or eliminate it.


Logged messages require multiple steps to resolve

Suppose we run a delete script that removes 1 million rows of data in a database. This delete signals one of our applications to remove further related data in other databases, since our initial database is a starting point for our applications. Eventually, over 25 million rows of data are removed and other records in other databases are updated for various applications, but then we find an error: only 30% of the 1 million rows that were removed actually needed to be removed. This means we have a significant number of deletes to revert, along with re-updating the related information.


When we consider how we would log a delete that impacts multiple databases, we would use intensive logging that costs many resources; for example:


  • Log the deletes of the records in the first database
  • Log the deletes in all further databases that use the first database as a resource
  • Log the updates to all related database records

We could use change data tracking or queue tables to keep this information (the below code shows an example of using a queue table for this), but we have to keep at least one logging table of the delete operation so that further transactions can execute this same delete. The same is true for an insert or update if we have further tables that must also be updated or added to with new or changed data.


---- Save the keys being removed to a queue table prior to the transaction
INSERT INTO tbQueue
SELECT 
	ID --- id of the values being removed
	, 'D' --- operation (delete)
	, GETDATE() --- time of the operation
	, 0 --- not yet processed in the dependent tables
FROM tbReport
WHERE ID BETWEEN 13 AND 41
 
---- Execute the transaction
DELETE
FROM tbReport
WHERE ID BETWEEN 13 AND 41 
 
---- Use the queue table to execute removals or updates in tables dependent on tbReport
DELETE
FROM OtherDatabase.dbo.DependentTableOne
WHERE ID IN (SELECT ID FROM tbQueue WHERE Processed = 0)
 
DELETE
FROM OtherDatabase.dbo.DependentTableTwo
WHERE ID IN (SELECT ID FROM tbQueue WHERE Processed = 0)
 
---- Mark the queued rows as processed once the dependent tables are handled
UPDATE tbQueue
SET Processed = 1
WHERE Processed = 0


This carries significant costs to log (not just in space; we must also design this in a manner in which we lose none of these records, or else we can't revert a change), but without it, how would we accurately revert the original removal? Even if we did a multi-table removal, update or insert, passing in one removal, update or insert at a time for all tables, upon reversion we'd still need logging that allows us to revert the change. A database restore is not always an option if a change creates further changes in other databases or tables.


In these situations, we must keep extensive logs of information, and I suggest that our logging tables use data compression techniques, such as row or page level compression or columnstore compression, which can reduce disk usage. The below code applies columnstore compression to a table that retains logging information from the error log:


CREATE CLUSTERED COLUMNSTORE INDEX CCI_tbSavedErrorLog ON tbSavedErrorLog
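If columnstore isn't a fit (for example, for logging tables that also receive frequent singleton updates), row or page compression can be applied instead; a sketch, assuming the same tbSavedErrorLog table from above:

```sql
---- Page-level compression on the logging table (row compression is the lighter option)
ALTER TABLE tbSavedErrorLog REBUILD WITH (DATA_COMPRESSION = PAGE)
 
---- Estimate the savings before committing to a compression setting
EXEC sp_estimate_data_compression_savings
	@schema_name = 'dbo'
	, @object_name = 'tbSavedErrorLog'
	, @index_id = NULL
	, @partition_number = NULL
	, @data_compression = 'PAGE'
```

Running the estimate first helps confirm whether page compression is worth the extra CPU cost for a given logging table.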

Logged messages have a couple of solutions

In some cases, we may get extensive log messages on an error, or success messages that may not need to be kept, especially over longer periods of time. For example, most of us frequently run CHECKDB against all our databases on a schedule. Because our environment may be limited in resources, we may only retain the final CHECKDB message if it passes – whether by saving the output or retaining it from the error log – or we may only retain the latest successful CHECKDB result. When we find an error, we want the full output, even the objects that passed, as we may have multiple steps to follow, or we may want additional information even after we fix the error – what caused the corruption in the first place is a question we may be asking and want more information about.


When CHECKDB fails, we need to know why it failed and how to resolve it. In this case, logging becomes very useful. The fact that CHECKDB passed seven years ago for a database where we run the command every 8 hours is far less important to know, and not something we need to retain. Since failed events like this require several possible solutions along with related information, we want to keep extensive logging for these types of events. As for successful CHECKDB messages and similar logged messages, DBAs can decide how much and how long to keep them based on their experience of when, and under what circumstances, they may need them.
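One way to cut the successful-run noise at the source is to suppress informational messages so only errors produce output; a sketch, where DatabaseName stands in for one of our databases and tbCheckDbLog is a hypothetical logging table used for illustration:

```sql
---- Only errors are reported; a clean run produces no informational output to retain
DBCC CHECKDB ('DatabaseName') WITH NO_INFOMSGS
 
---- Prune older successful results from the (hypothetical) logging table on a schedule
DELETE
FROM tbCheckDbLog
WHERE Passed = 1
	AND LogDate < DATEADD(DAY, -30, GETDATE())
```

The retention window (30 days here) is a choice each DBA can adjust based on how far back successful results are ever consulted.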


In these situations, we may want to keep failures and related information about failures, keep a reduced message on success, and archive or remove old logging on a shorter schedule. These situations require less logging than the first case, except in some situations, and the logs may also not need to be retained for a long period of time.


Logged messages have one solution or outcome

In some cases, if an error occurs, we have one solution or outcome after the error, or we may not need logging at all, only an immediate resolution. Suppose we have a configuration table that we update on deployments – adding new records, updating existing records, and removing records. We could log the success or failure of the events after the script runs. We could also validate the script as we run it, so that we avoid this additional logging or significantly reduce it.


In the below example, we add one record and update two records in a configuration table, validate that 3 total rows were added or changed (the @count variable), and test that the report location we added is valid in our report dump table, where we keep track of these locations:


DECLARE @count INT = 0, @rptcount INT
BEGIN TRAN
 
---- Add one record and update two records, tracking the affected row counts
INSERT INTO tbConfigurables
VALUES ('D:\Files\fields.rpt',GETDATE(),GETDATE())
 
SELECT @count += @@ROWCOUNT
 
UPDATE tbConfigurables
SET LocFile = '<data>17</data>'
	, EditDate = GETDATE()
WHERE LocFile = '<data>12</data>'
 
SELECT @count += @@ROWCOUNT
 
---- Confirm the new report location exists in the report dump table
SELECT @rptcount = COUNT(*) FROM tbReportLocations WHERE FileDump = 'D:\Files\fields.rpt'
 
---- Testing the changes: commit only if all 3 rows changed and the location is valid
IF ((@count = 3) AND (@rptcount >= 1))
BEGIN
	COMMIT TRAN
END
ELSE
BEGIN
	ROLLBACK TRAN
END

In some cases, we may not be able to perform this validation with the values we change, as we may require application validation. An alternative is to include a rollback script with every change. For example, to roll back the above script, we could add a commented section below it:


/*
 
---- Roll back script
DELETE
FROM tbConfigurables
WHERE LocFile = 'D:\Files\fields.rpt'
 
UPDATE tbConfigurables
SET LocFile = '<data>12</data>'
	, EditDate = GETDATE()
WHERE LocFile = '<data>17</data>'
 
*/

The result is that we don't end up with extra information that doesn't help us, or that forces more decisions when we only have one decision to make – the script and (or) its validation failed, which means the script needs to be rolled back. This approach holds true for events that require one or two solutions when they occur, such as an event that always requires a restart or an event that always requires a cache flush. We may want to log only what was executed and whether it passed, or alert with the immediate next step, and avoid information that uses space for no reason.


In these situations, we want to solve as much as we can within the transaction, or act immediately when there is only one action to take after a message. Logging the summary may be useful, but logging every step, when the response to an error is always the same one action, consumes unnecessary resources.
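A minimal sketch of summary-only logging – one row per script run rather than one row per step (tbScriptLog and its columns are hypothetical names for illustration):

```sql
---- One summary row per deployment script: what ran and whether it passed
---- (a failed run would insert Passed = 0, signaling the single next step: roll back)
INSERT INTO tbScriptLog (ScriptName, Passed, RunDate)
VALUES ('ConfigurablesDeployScript', 1, GETDATE())
```

This keeps the audit trail of what was executed without accumulating per-step detail that we would never act on.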


See also

Interested in reading the SQL Server transaction log? Consider ApexSQL Log, a third-party SQL Server transaction log reader that renders transaction log information into a searchable, sortable grid and can produce Undo and Redo scripts.


Translated from: https://www.sqlshack.com/logging-sql-server-database-errors/
