解决dba_outstanding_alerts误报表空间使用率的问题

前段时间通过dba_outstanding_alerts发现表空间增长超过了85%,然后加了个裸设备做为数据文件,虽然这个数据文件可用,但dba_outstanding_alerts中的警告信息并没有消失:

SQL> col reason for a50
SQL> SELECT REASON
2  , METRIC_VALUE
3  , TO_CHAR(CREATION_TIME,'DD-MON-YYYY HH24:MI:SS') cdate
4  FROM SYS.DBA_OUTSTANDING_ALERTS;

REASON                                             METRIC_VALUE CDATE
-------------------------------------------------- ------------ -----------------------
Tablespace [ERP_INDEX] is [94 percent] full          94.5023148 26-MAR-2010 19:45:05

可见Oracle在这里认为这个表空间已经使用了94%以上,查看一下预警阀值的设定:

SQL> col METRICS_NAME for a50

SQL> col WARN_VAL for a10
SQL> col CRIT_VAL for a10
SQL> col OBJ_TYPE for a20
SQL> SELECT METRICS_NAME
2  , WARNING_VALUE WARN_VAL
3  , CRITICAL_VALUE CRIT_VAL
4  , OBJECT_TYPE OBJ_TYPE
5  FROM SYS.DBA_THRESHOLDS
6  WHERE metrics_name LIKE '%Tablespace%';

METRICS_NAME                   WARN_VAL   CRIT_VAL   OBJ_TYPE
------------------------------ ---------- ---------- --------------------
Tablespace Bytes Space Usage   0          0          TABLESPACE
Tablespace Space Usage         85         97         TABLESPACE

超过85%就提示,似乎也是顺利成章的,可是实际表空间到底占用了多少呢:

SQL> select sum(bytes/1024/1024) from dba_data_files
2  where TABLESPACE_NAME='ERP_INDEX';

SUM(BYTES/1024/1024)
--------------------
31744

SQL> select sum(bytes/1024/1024) from sys.dba_free_space where tablespace_name='ERP_INDEX'
2  /

SUM(BYTES/1024/1024)
--------------------
3936

SQL>
SQL> select (31744-3936)/31744 from dual;

(31744-3936)/31744
------------------
.876008065

可见表空间实际的使用率是87%左右,那么为什么dba_outstanding_alerts中会提示是使用了94%呢?从DBA_TABLESPACE_USAGE_METRICS这个oracle 10g新增的未publish的视图来看,使用率也是94%:

SQL> SELECT TABLESPACE_NAME TBSP_NAME
2  , USED_SPACE*8/1024
3  , TABLESPACE_SIZE*8/1024 TBSP_SIZE
4  , USED_PERCENT
5  FROM SYS.DBA_TABLESPACE_USAGE_METRICS where TABLESPACE_NAME='ERP_INDEX';

TBSP_NAME                      USED_SPACE*8/1024  TBSP_SIZE USED_PERCENT
------------------------------ ----------------- ---------- ------------
ERP_INDEX                                  30016      31744   94.5564516

而这里显示为94%的原因是因为它统计了recyclebin里占用的空间:

SQL> connect banping/banping
Connected.
SQL> select sum(space*8/1024) from user_recyclebin where ts_name='ERP_INDEX';

SUM(SPACE*8/1024)
-----------------
2240

SQL> select (31744-3936+2240)/31744 from dual;

(31744-3936+2240)/31744
-----------------------
.946572581

可见,实际使用空间加上recyclebin里的这个表空间的对象大小正好是94%左右。那么purge了recyclebin后,是不是就解决了问题呢?

SQL> purge recyclebin;

Done

SQL> SELECT TABLESPACE_NAME TBSP_NAME
2  , USED_SPACE*8/1024
3  , TABLESPACE_SIZE*8/1024 TBSP_SIZE
4  , USED_PERCENT
5  FROM SYS.DBA_TABLESPACE_USAGE_METRICS where TABLESPACE_NAME='ERP_INDEX';

TBSP_NAME                      USED_SPACE*8/1024  TBSP_SIZE USED_PERCENT
------------------------------ ----------------- ---------- ------------
ERP_INDEX                                  27776      31744   87.5

SQL> col reason for a50
SQL> SELECT REASON
2  , METRIC_VALUE
3  , sequence_id,reason_id
4  FROM SYS.DBA_OUTSTANDING_ALERTS;

REASON                                             METRIC_VALUE  sequence_id reason_id
-------------------------------------------------- ------------ --------------- ----------
Tablespace [ERP_INDEX] is [94 percent] full          94.5023148  566360 9

可见这样操作能够消除DBA_TABLESPACE_USAGE_METRICS视图的错误统计,而dba_outstanding_alerts中的警告却依然如故。尝试看一下dba_outstanding_alerts的基表:

CREATE OR REPLACE VIEW SYS.DBA_OUTSTANDING_ALERTS AS
SELECT sequence_id,
reason_id,
owner,
object_name,
subobject_name,
typnam_keltosd AS object_type,
dbms_server_alert.expand_message(userenv('LANGUAGE'),
mid_keltsd,
reason_argument_1,
reason_argument_2,
reason_argument_3,
reason_argument_4,
reason_argument_5) AS reason,
time_suggested,
creation_time,
dbms_server_alert.expand_message(userenv('LANGUAGE'),
amid_keltsd,
action_argument_1,
action_argument_2,
action_argument_3,
action_argument_4,
action_argument_5)
AS suggested_action,
advisor_name,
metric_value,
decode(message_level, 32, 'Notification', 'Warning')
AS message_type,
nam_keltgsd AS message_group,
message_level,
hosting_client_id,
mdid_keltsd AS module_id,
process_id,
host_id,
host_nw_addr,
instance_name,
instance_number,
user_id,
execution_context_id,
error_instance_id
FROM wri$_alert_outstanding, X$KELTSD, X$KELTOSD, X$KELTGSD,
dba_advisor_definitions
WHERE reason_id = rid_keltsd
AND otyp_keltsd = typid_keltosd
AND grp_keltsd = id_keltgsd
AND aid_keltsd = advisor_id(+)

信息来自wri$_alert_outstanding表,和X$KELTSD, X$KELTOSD, X$KELTGSD等底层表都有关联,担心直接删掉的话会出问题,于是尝试修改下阀值:

SQL> exec DBMS_SERVER_ALERT.SET_THRESHOLD(
metrics_id=>DBMS_SERVER_ALERT.TABLESPACE_PCT_FULL,
warning_operator=>DBMS_SERVER_ALERT.OPERATOR_GE,
warning_value=>95,
critical_operator=>DBMS_SERVER_ALERT.OPERATOR_GE,
critical_value=>97,
observation_period=>1,
consecutive_occurrences=>1,
instance_name=>NULL,
object_type=>DBMS_SERVER_ALERT.OBJECT_TYPE_TABLESPACE,
object_name=>NULL);

PL/SQL procedure successfully completed

因为告警信息消除后,会从dba_outstanding_alerts转移到DBA_ALERT_HISTORY视图,查看这两个视图,果然发现了信息已经转移到了DBA_ALERT_HISTORY,而且提示的空间使用率已经变成了正确的87%,看来在这个过程中oracle会去某个地方重新取一次这个信息:

select * from dba_alert_history where sequence_id=566360 and reason_id=9

SQL> SELECT REASON
2  , METRIC_VALUE
3  FROM SYS.DBA_ALERT_HISTORY where sequence_id=566360 and reason_id=9;

REASON                                             METRIC_VALUE
-------------------------------------------------- ------------
Tablespace [ERP_INDEX] is [87 percent] full          87.5

再把告警阀值更改回原来的85%,发现告警信息又会出现在dba_outstanding_alerts中,不过已经是正确的87%了,而DBA_ALERT_HISTORY中的历史信息还是存在的,至此曲线解决了这个问题。

SQL> exec DBMS_SERVER_ALERT.SET_THRESHOLD(
metrics_id=>DBMS_SERVER_ALERT.TABLESPACE_PCT_FULL,
warning_operator=>DBMS_SERVER_ALERT.OPERATOR_GE,
warning_value=>85,
critical_operator=>DBMS_SERVER_ALERT.OPERATOR_GE,
critical_value=>97,
observation_period=>1,
consecutive_occurrences=>1,
instance_name=>NULL,
object_type=>DBMS_SERVER_ALERT.OBJECT_TYPE_TABLESPACE,
object_name=>NULL);

PL/SQL procedure successfully completed

SQL> col reason for a50
SQL> SELECT REASON
2  , METRIC_VALUE
3  , TO_CHAR(CREATION_TIME,'DD-MON-YYYY HH24:MI:SS') cdate
4  FROM SYS.DBA_OUTSTANDING_ALERTS;

REASON                                             METRIC_VALUE CDATE
-------------------------------------------------- ------------ -----------------------
Tablespace [ERP_INDEX] is [87 percent] full          87.5  8-JUN-2010 16:45:05

metalink上有个文档列举了一些解决表空间使用率提示的问题,不过都不适用于我这个案例:

Troubleshooting a Database Tablespace Used(%) Alert problem [ID 403264.1]

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/9390331/viewspace-709682/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/9390331/viewspace-709682/

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
可以使用 Prometheus 监控 Pod 使用率,并设置告警规则进行告警。以下是具体步骤: 1. 在 Kubernetes 集群中安装 Prometheus Operator。 2. 在 Prometheus Operator 中创建一个 ServiceMonitor,用于监控 Pod 使用率。 3. 在 Prometheus 中配置相应的指标收集规则,例如: ``` - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_container_name] action: keep regex: my-container - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: pod_name metric_relabel_configs: - source_labels: [__name__] action: replace target_label: metric regex: container_memory_usage_bytes - source_labels: [pod_name] action: replace target_label: pod regex: (.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: namespace ``` 其中,`__meta_kubernetes_pod_container_name`和`__meta_kubernetes_pod_name`是 Kubernetes 的标准标签,可以用来选择要监控的 Pod 和容器。`container_memory_usage_bytes`是用来收集 Pod 内存使用率的指标。 4. 设置告警规则,例如: ``` groups: - name: pod-alerts rules: - alert: PodMemoryUsageHigh expr: sum(container_memory_usage_bytes{namespace="my-namespace", pod=~"my-pod.*", container="my-container"}) / sum(container_spec_memory_limit_bytes{namespace="my-namespace", pod=~"my-pod.*", container="my-container"}) > 0.8 for: 1m labels: severity: warning annotations: summary: "Memory usage is high in Pod {{ $labels.pod }}" description: "Memory usage is high ({{ printf "%.2f" $value }}%) in container {{ $labels.container }}" ``` 这个告警规则用于监控 Pod 内存使用率,当内存使用率超过 80% 时,触发告警,并在告警信息中包含 Pod 和容器的名称。 5. 在 Prometheus Operator 中配置 Alertmanager,用于接收和处理告警信息。 通过以上步骤,就可以使用 Prometheus 监控 Pod 使用率,并设置告警规则进行告警。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值