在线修复Alwayson的重做日至过大

  1. 察看AG group的面板,察看AG组是否正常,如果面板显示正常执行2的步骤,面板显示异常的请根据异常提示来执行操作,本篇内容不涉及处理这些异常
  2. 执行以下语句

点击(此处)折叠或打开

  1. select n.group_name,    
  2. n.replica_server_name,    
  3. n.node_name,    
  4. rs.role_desc,
  5. CASE WHEN rs.is_local = 1 THEN N'LOCAL' ELSE 'REMOTE' END as is_local,    
  6. DB_NAME(drs.database_id) as 'database_name',    
  7. rs.connected_state_desc,    
  8. CASE WHEN drs.is_suspended=0 THEN 'RESUMED' ELSE
  9. CASE WHEN drs.suspend_reason_desc='SUSPEND_FROM_USER' THEN '用户手动挂起数据移动'    
  10. WHEN drs.suspend_reason_desc='SUSPEND_FROM_PARTNER' THEN '在强制故障转移后挂起数据库副本'    
  11. WHEN drs.suspend_reason_desc='SUSPEND_FROM_REDO' THEN '在重做阶段中出错'    
  12. WHEN drs.suspend_reason_desc='SUSPEND_FROM_APPLY' THEN '在将日志写入文件时出错(请参阅错误日志)'    
  13. WHEN drs.suspend_reason_desc='SUSPEND_FROM_CAPTURE' THEN '在捕获主副本上的日志时出错'    
  14. WHEN drs.suspend_reason_desc='SUSPEND_FROM_RESTART' THEN '在重新启动数据库前挂起数据库副本(请参阅错误日志)'    
  15. WHEN drs.suspend_reason_desc='SUSPEND_FROM_UNDO' THEN '在撤消阶段中出错(请参阅错误日志)'    
  16. WHEN drs.suspend_reason_desc='SUSPEND_FROM_REVALIDATION' THEN '在重新连接时检测到了日志更改不匹配(请参阅错误日志)'    
  17. WHEN drs.suspend_reason_desc='SUSPEND_FROM_XRF_UPDATE' THEN '找不到公共日志点(请参阅错误日志)'    
  18. ELSE '' END
  19. END as is_suspended, --是否数据库挂起及原因
  20. drs.synchronization_state_desc,    
  21. drs.synchronization_health_desc,    
  22. ISNULL(drs.log_send_queue_size,0) as log_send_queue_size, --主数据库中尚未发送到辅助数据库的日志记录量 (KB)    
  23. ISNULL(drs.redo_queue_size,0) as redo_queue_size, --辅助副本的日志文件中尚未重做的日志记录量 (KB)    
  24. CASE WHEN ISNULL(drs.log_send_rate,0)=0 THEN 0
  25. ELSE CAST(ISNULL(drs.log_send_queue_size,0)*1./ISNULL(drs.log_send_rate,0) AS DECIMAL(18,2))    
  26. END as log_send_need_time, --主数据库中日志记录发送完成需要时间(秒)    
  27. CASE WHEN ISNULL(drs.redo_rate,0)=0 THEN 0
  28. ELSE CAST(ISNULL(drs.redo_queue_size,0)*1./ISNULL(drs.redo_rate,0) AS DECIMAL(18,2))
  29. END as redo_need_time --辅助副本中日志记录重做完成需要时间(秒)    
  30. from sys.dm_hadr_availability_replica_cluster_nodes n
  31. join sys.dm_hadr_availability_replica_cluster_states cs
  32. on n.replica_server_name = cs.replica_server_name
  33. join sys.dm_hadr_availability_replica_states rs
  34. on rs.replica_id = cs.replica_id
  35. join sys.dm_hadr_database_replica_states drs
  36. on rs.replica_id=drs.replica_id
  37. where rs.connected_state_desc<>'CONNECTED'    
  38. or drs.synchronization_health_desc<>'HEALTHY'    
  39. or drs.is_suspended<>0    
  40. or CASE WHEN ISNULL(drs.log_send_rate,0)=0 THEN 0 ELSE CAST(ISNULL(drs.log_send_queue_size,0)*1./ISNULL(drs.log_send_rate,0) AS DECIMAL(18,2)) END>100
确认
connected_state_desc=CONNECTED
is_suspended=RESUMED
synchronization_state_desc=SYNCHRONIZING
synchronization_health_desc=HEALTHY

这几个表示
AG组运行正常,同步也在继续,即可以排除网络和磁盘读写的问题
一般这些问题排除后可以定位为人为操作造成的

3. 用以下语句察看当前blocked sql

点击(此处)折叠或打开

  1. ;with cte as (select replace(hostname,' ','') as hostname ,''''+replace(program_name,' ','')+'''' as program_name
  2. , loginame, db_name(a.dbid) AS DBname,spid,blocked,waittime/1000 as waittime,a.status,a.lastwaittype,a.cmd
  3. ,Replace(substring(b.text,1,340),'''','''') as sqlmessage,cpu

  4. from sys.sysprocesses as a with(nolock)
  5. cross apply sys.dm_exec_sql_text(sql_handle) as b
  6. where a.blocked>0 and sql_handle<>0x0000000000000000000000000000000000000000
  7. and waittime>2000 )
  8. select replace(hostname,' ','') as hostname ,''''+replace(program_name,' ','')+'''' as program_name
  9. , loginame, db_name(a.dbid) AS DBname,spid,blocked,waittime/1000 as waittime,a.status,a.lastwaittype,a.cmd
  10. ,Replace(substring(b.text,1,340),'''','''') as sqlmessage,cpu

  11. from sys.sysprocesses as a with(nolock)
  12. cross apply sys.dm_exec_sql_text(sql_handle) as b
  13. where exists(select blocked from cte where cte.blocked=a.spid)
  14. and not exists (select spid from cte where cte.spid=a.spid)
  15. union all
  16. select * from cte
4.一般来讲我们会看到用户大量的对数据库进行的操作,大致观察一下,和用户沟通后kill掉堵塞的进程。
用脚本可以直接kill所有进程,供参考,

点击(此处)折叠或打开

  1. USE master;
  2. GO
  3. DECLARE @SQL VARCHAR(MAX);
  4. declare @i int;
  5. set @i=3600 --大小可以先设置为redo_need_time(s)
  6. SET @SQL=''
  7. while @i>0
  8. begin

  9. SELECT @SQL=@SQL+'; KILL '+RTRIM(SPID)
  10. FROM master..sysprocesses
  11. WHERE dbid=DB_ID('datayesdb');
  12. EXEC(@SQL);
  13. set @i=@i-1
  14. print @i
  15. end
  16. waitfor delay '0:0:1'
请注意该脚本只有在alwayson 两边延迟非常严重时才可以使用,一般redo_need_time(s) >3600可以酌情考虑
5.打开另一查询窗口用2的语句时时监控alwayson的状态
一般来讲,以下几个值都会慢慢减小,如果还在增大可能需要继续判断是否是其他问题,此篇不做讨论
log_send_queue_size  
redo_queue_size      
log_send_need_time                      
redo_need_time
6.当AG恢复正常,需要去重新启用之前被kill的job

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/16131092/viewspace-2138293/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/16131092/viewspace-2138293/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值