Cleaning Up a Failed Node in a Dameng (DM) DSC Cluster



Preface

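After I finished building a two-node DSC cluster (I will call the two nodes A and B), I tried to dynamically add a third node, C. Starting the dmcss service on node C ran into problems, so I reverted the virtual machine snapshots of A and B to the freshly built two-node DSC environment. But when I restarted the dmasmsvr service, it failed with the error shown below.
The error says that the configuration for ASM2, the ASM instance of my node C, cannot be found. After reverting the snapshots I checked every configuration file: they contain only ASM0 and ASM1, with no trace of ASM2. So where does ASM2 come from? Why does the service, after a snapshot revert, read information that none of my files contain?
How could I get the original two-node DSC environment to start again? After some investigation I worked it out, and I am recording the problem here for reference.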

[root@czk1 bin]# ./DmASMSvrServicesvr start
Starting DmASMSvrServicesvr: Last login: Mon May 16 11:21:34 CST 2022
                                                           [ FAILED ]
instance(ASM2) mal config not found in /opt/dmdbms/data/DAMENG/dmasvrmal.ini
mal cfg sys init error, code:[-9501], desc:[MAL sys has not configured or server is not enterprise version].
Here is what the dmcssm monitor reports. What we need to do is remove the CSS2, ASM2, and DSC2 entries.
ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 13:02:21    CSS0          0         9341    Control Node OPEN               WORKING      OK           TRUE         1054675793        1054676229      
	2022-05-16 13:02:21    CSS1          1         9343    Normal Node  OPEN               WORKING      OK           TRUE         1054684694        1054685105      
	2022-05-16 13:02:21    CSS2          2         9344    Normal Node  SHUTDOWN           UNKNOWN      OK           FALSE        0                 0               

=================== group[name = GRP_ASM, seq = 1, type = ASM, Control Node = 0] ========================================

n_ok_ep = 2
ok_ep_arr(index, seqno):
(0, 0)
(1, 1)

sta = OPEN, sub_sta = STARTUP
break ep = NULL
recover ep = NULL

crash process over flag is TRUE
ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 13:02:21    ASM0          0         9349    Control Node OPEN               WORKING      OK           TRUE         1054791798        1054791865      
	2022-05-16 13:02:21    ASM1          1         9351    Normal Node  OPEN               WORKING      OK           TRUE         1054798516        1054798565      
	2022-05-16 13:02:21    ASM2          2         9352    Normal Node  SHUTDOWN           UNKNOWN      ERROR        FALSE        0                 0               

=================== group[name = GRP_DSC, seq = 2, type = DB, Control Node = 255] ========================================

n_ok_ep = 3
ok_ep_arr(index, seqno):
(0, 0)
(1, 1)
(2, 2)

sta = OPEN, sub_sta = STARTUP
break ep = NULL
recover ep = NULL

crash process over flag is FALSE
ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 13:02:21    DSC0          0         5236    Normal Node  OPEN               WORKING      OK           FALSE        807630687         807634940       
	2022-05-16 13:02:21    DSC1          1         5237    Normal Node  OPEN               WORKING      OK           FALSE        807542710         807547246       
	2022-05-16 13:02:21    DSC2          2         5238    Normal Node  SHUTDOWN           UNKNOWN      OK           FALSE        0                 0               

==================================================================================================================


I. Why the DMASMSVR service failed to start

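The failure after the snapshot revert happens because my earlier attempt at dynamic expansion had already written the new node's information to the shared disks. Reverting a snapshot only restores files at the operating-system level, so the configuration files are all correct, but the leftover information on the shared disks is still there. When the service starts, it scans the shared disks, finds the old information, loads it, and then fails. The rest of this article clears the failed node's information out of the DSC cluster.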

II. Cleaning up the failed DSC node

1. First start the dmasmcmd tool and export the DCR disk information of your current DSC cluster to dmdcr_cfg_bak.ini

[root@czk bin]# ./dmasmcmd 
DMASMCMD V8
ASM>export dcrdisk '/dev/raw/raw1' to '/opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini'
ASMCMD export DCRDISK success.
Used time: 6.069(ms).
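
With the export in hand, the diagnosis above can be checked by comparing the endpoint names stored on the DCR disk with the instance names in the local MAL file. The following is only a minimal sketch: the function names are my own, and it assumes the MAL file lists instances under MAL_INST_NAME keys while the export lists them under DCR_EP_NAME, as in the listing in step 4.

```python
import re


def names(text, key):
    """Return the set of values assigned to `key` in INI-style text."""
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=\s*(\S+)", re.MULTILINE)
    return set(pattern.findall(text))


def stale_endpoints(dcr_text, mal_text, prefix="ASM"):
    """List endpoints present in the DCR export but missing from the MAL file.

    Assumption: MAL files name their instances with MAL_INST_NAME.
    """
    dcr = {n for n in names(dcr_text, "DCR_EP_NAME") if n.startswith(prefix)}
    mal = names(mal_text, "MAL_INST_NAME")
    return sorted(dcr - mal)
```

Calling stale_endpoints() with the contents of dmdcr_cfg_bak.ini and dmasvrmal.ini should name any ASM instance that exists only on the shared disk, which in my case would be ASM2.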

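2. Start the dmasmtool tool and delete the log files that were added earlier
DSC2_log01.log and DSC2_log02.log are the log files I created earlier while adding the node, so I delete them here. (If you also configured archiving for the new node, remember to remove it in the tool as well. I had not, so deleting the logs was enough.)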

[root@czk bin]# ./dmasmtool dcr_ini=/opt/dmdbms/data/DAMENG/dmdcr.ini
DMASMTOOL V8
ASM>ls
	file : dsc0_log01.log
	file : dsc0_log02.log
	file : dsc1_log01.log
	file : dsc1_log02.log
	file : DSC2_log01.log
	file : DSC2_log02.log
total count 6.
Used time: 5.116(ms).
ASM>rm -rf DSC2_log01.log
Used time: 4.959(ms).
ASM>rm -rf DSC2_log02.log 
Used time: 5.512(ms).

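Note: dmasmtool can only start when the dmcss and dmasmsvr services are running normally; otherwise it reports an ASM connection error, as shown below. (My dmasmsvr service would not start because it could not find ASM2, so I temporarily added an ASM2 entry to the MAL file named in the error message, dmasvrmal.ini, just to let the service come up.)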

[root@czk bin]# ./dmasmtool dcr_ini=/opt/dmdbms/data/DAMENG/dmdcr.ini
DMASMTOOL V8
[code : -11041] ASM连接异常

3. Shut down all services, including the database (dmserver), dmcss, and dmasmsvr
I simply killed the processes with kill.

[root@czk bin]# netstat -ntulp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd           
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN      1897/dnsmasq        
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1396/sshd           
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1772/master         
tcp        0      0 127.0.0.1:6010          0.0.0.0:*               LISTEN      2138/sshd: root@pts 
tcp6       0      0 :::4236                 :::*                    LISTEN      1547/dmap           
tcp6       0      0 :::7246                 :::*                    LISTEN      2401/dmasmsvr       
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd           
tcp6       0      0 :::80                   :::*                    LISTEN      1405/httpd          
tcp6       0      0 :::5236                 :::*                    LISTEN      1542/dmserver       
tcp6       0      0 :::22                   :::*                    LISTEN      1396/sshd           
tcp6       0      0 ::1:25                  :::*                    LISTEN      1772/master         
tcp6       0      0 ::1:6010                :::*                    LISTEN      2138/sshd: root@pts 
tcp6       0      0 :::9341                 :::*                    LISTEN      1554/dmcss          
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           946/avahi-daemon: r 
udp        0      0 0.0.0.0:1830            0.0.0.0:*                           1217/dhclient       
udp        0      0 0.0.0.0:53266           0.0.0.0:*                           946/avahi-daemon: r 
udp        0      0 192.168.122.1:53        0.0.0.0:*                           1897/dnsmasq        
udp        0      0 0.0.0.0:67              0.0.0.0:*                           1897/dnsmasq        
udp        0      0 0.0.0.0:68              0.0.0.0:*                           1217/dhclient       
udp        0      0 127.0.0.1:323           0.0.0.0:*                           977/chronyd         
udp6       0      0 :::25482                :::*                                1217/dhclient       
udp6       0      0 :::69                   :::*                                1/systemd           
udp6       0      0 ::1:323                 :::*                                977/chronyd         
[root@czk bin]# kill -9 1542 1554 2401
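
The three PIDs passed to kill above are the dmserver, dmcss, and dmasmsvr entries in the PID/Program name column; dmap is left running. As a rough sketch of how that selection could be automated (the helper name dm_pids is hypothetical), the netstat output can be filtered like this:

```python
def dm_pids(netstat_output, programs=("dmserver", "dmcss", "dmasmsvr")):
    """Collect the PIDs of the given programs from `netstat -ntulp` output."""
    pids = set()
    for line in netstat_output.splitlines():
        for field in line.split():
            # The last column looks like "1542/dmserver".
            pid, sep, name = field.partition("/")
            if sep and pid.isdigit() and name in programs:
                pids.add(int(pid))
    return sorted(pids)
```

For the listing above this should return the same three PIDs that I killed by hand.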

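4. Edit the dmdcr_cfg_bak.ini file
In step 1 we exported this file with the dmasmcmd tool; now we edit it (the listing below is my already-edited version). The goal is to delete the information of the added node and keep the DSC information of the original two nodes.
The specific changes are:
Change every DCR_GRP_N_EP = 3 to DCR_GRP_N_EP = 2
Change every DCR_GRP_EP_ARR = {0,1,2} to DCR_GRP_EP_ARR = {0,1}
Delete every endpoint section that belongs to the added node, such as CSS2, ASM2, and DSC2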
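
The three changes above can also be applied mechanically. This is only a sketch under stated assumptions: prune_dcr_cfg is a name I made up, it treats every line starting with "[" as the start of a section, and it assumes the surviving endpoints are numbered 0 to n_ep - 1, which holds for my cluster. Review the result by hand before initializing any disk from it.

```python
import re


def prune_dcr_cfg(text, removed_names, n_ep):
    """Drop endpoint sections named in removed_names and shrink group counters."""
    # Split the file into sections; each "[...]" header starts a new one.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("[") and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)

    # Assumption: remaining endpoints are numbered 0..n_ep-1.
    ep_arr = "{" + ",".join(str(i) for i in range(n_ep)) + "}"
    kept = []
    for sec in sections:
        body = "\n".join(sec)
        m = re.search(r"^DCR_EP_NAME\s*=\s*(\S+)", body, re.MULTILINE)
        if m and m.group(1) in removed_names:
            continue  # endpoint section of the removed node
        body = re.sub(r"^(DCR_GRP_N_EP\s*=\s*)\d+", rf"\g<1>{n_ep}",
                      body, flags=re.MULTILINE)
        body = re.sub(r"^(DCR_GRP_EP_ARR\s*=\s*)\{[^}]*\}",
                      lambda mm: mm.group(1) + ep_arr,
                      body, flags=re.MULTILINE)
        kept.append(body)
    return "\n".join(kept) + "\n"
```

For my case the call would be prune_dcr_cfg(text, {"CSS2", "ASM2", "DSC2"}, 2). The DCR_GRP_N_ERR_EP and DCR_GRP_ERR_EP_ARR lines are deliberately left alone.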

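Note! Because my environment was restored from snapshots, I did not touch the other configuration files. If yours was not restored from snapshots, remember to also check dmmal.ini, dmasmsvr.ini, and dmcfg on each machine for leftover node entries.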

[dmdba@czk ~]$ cat /opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini 
# the file is auto-created by system, self edit is invalid!
#DCR HDR
DCR_N_GRP              = 3
DCR_VTD_PATH           = /dev/raw/raw2
DCR_OGUID              = 63635

[GRP]
DCR_GRP_TYPE           = CSS
DCR_GRP_NAME           = GRP_CSS
DCR_GRP_N_EP           = 2
DCR_GRP_EP_ARR         = {0,1}
DCR_GRP_N_ERR_EP       = 0
DCR_GRP_ERR_EP_ARR     = {}
DCR_GRP_DSKCHK_CNT     = 60

[GRP]
DCR_GRP_TYPE           = ASM
DCR_GRP_NAME           = GRP_ASM
DCR_GRP_N_EP           = 2
DCR_GRP_EP_ARR         = {0,1}
DCR_GRP_N_ERR_EP       = 0
DCR_GRP_ERR_EP_ARR     = {}
DCR_GRP_DSKCHK_CNT     = 60

[GRP]
DCR_GRP_TYPE           = DB
DCR_GRP_NAME           = GRP_DSC
DCR_GRP_N_EP           = 2
DCR_GRP_EP_ARR         = {0,1}
DCR_GRP_N_ERR_EP       = 0
DCR_GRP_ERR_EP_ARR     = {}
DCR_GRP_DSKCHK_CNT     = 60

[GRP_CSS]
DCR_EP_NAME        = CSS0
DCR_EP_HOST        = 192.168.17.133
DCR_EP_PORT        = 9341

[GRP_CSS]
DCR_EP_NAME        = CSS1
DCR_EP_HOST        = 192.168.17.132
DCR_EP_PORT        = 9343

[GRP_ASM]
DCR_EP_NAME        = ASM0
DCR_EP_SHM_KEY     = 93360
DCR_EP_SHM_SIZE    = 20
DCR_EP_HOST        = 192.168.17.133
DCR_EP_PORT        = 9349
DCR_EP_ASM_LOAD_PATH  = /dev/raw

[GRP_ASM]
DCR_EP_NAME        = ASM1
DCR_EP_SHM_KEY     = 93361
DCR_EP_SHM_SIZE    = 20
DCR_EP_HOST        = 192.168.17.132
DCR_EP_PORT        = 9351
DCR_EP_ASM_LOAD_PATH  = /dev/raw

[GRP_DSC]
DCR_EP_NAME        = DSC0
DCR_EP_SEQNO       = 0
DCR_EP_PORT        = 5236
DCR_CHECK_PORT     = 9741

[GRP_DSC]
DCR_EP_NAME        = DSC1
DCR_EP_SEQNO       = 1
DCR_EP_PORT        = 5237
DCR_CHECK_PORT     = 9742

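5. Re-initialize the DCR and VOTE disks from the modified dmdcr_cfg_bak.ini
This step writes the information back to the shared disks. Make sure the services are still stopped, otherwise the command fails with a "disk in use" error like the following: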

ASM>init dcrdisk '/dev/raw/raw1' from '/opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini' identified by 'abcd'
[Trace]DG 126 alloc one extent for inodes, addr(disk_id, disk_auno, extent_no):(0,0,1).
[Trace]DG 126 allocate 4 extents for file 0xfe000002.
[Trace]DG 126 alloc 4 extents for 0xfe000002, addr(disk_id, disk_auno, extent_no):(0, 0, 2)->(0, 0, 5), need_init = 1.
init /dev/raw/raw1 from /opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini failed!
[code: -11034], 磁盘[/dev/raw/raw1]正在使用中

Initialize the DCR and vote disks:

ASM>init dcrdisk '/dev/raw/raw1' from '/opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini' identified by 'abcd'
[Trace]DG 126 alloc one extent for inodes, addr(disk_id, disk_auno, extent_no):(0,0,1).
[Trace]DG 126 allocate 4 extents for file 0xfe000002.
[Trace]DG 126 alloc 4 extents for 0xfe000002, addr(disk_id, disk_auno, extent_no):(0, 0, 2)->(0, 0, 5), need_init = 1.
Used time: 00:00:14.488.
ASM>int votedisk '/dev/raw/raw2' from '/opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini'
syntax error
asmcmd parse failed!
ASM>init votedisk '/dev/raw/raw2' from '/opt/dmdbms/data/DAMENG/dmdcr_cfg_bak.ini'
[Trace]DG 125 alloc one extent for inodes, addr(disk_id, disk_auno, extent_no):(0,0,1).
[Trace]DG 125 allocate 4 extents for file 0xfd000002.
[Trace]DG 125 alloc 4 extents for 0xfd000002, addr(disk_id, disk_auno, extent_no):(0, 0, 2)->(0, 0, 5), need_init = 1.
Used time: 00:00:14.568.

6. Then start the services again. Remember to start them on both machines in turn, dmcss first and then dmasmsvr

[root@czk bin]# ./DmCSSServicecss start
Starting DmCSSServicecss: Last login: Mon May 16 14:13:31 CST 2022
                                                           [ OK ]
[root@czk bin]# ./DmASMSvrServicesvr start
Starting DmASMSvrServicesvr: Last login: Mon May 16 14:27:28 CST 2022
                                                           [ OK ]

7. Check the DSC information with the dmcssm monitor: the failed node's entries are gone and the original DSC cluster environment has come up

[root@czk bin]# ./dmcssm ini_path=/opt/dmdbms/data/DAMENG/dmcssm.ini 
[monitor]         2022-05-16 14:31:25: CSS MONITOR V8
[monitor]         2022-05-16 14:31:25: CSS MONITOR SYSTEM IS READY.

[monitor]         2022-05-16 14:31:25: Wait CSS Control Node choosed...
[monitor]         2022-05-16 14:31:26: Wait CSS Control Node choosed succeed.

show

monitor current time:2022-05-16 14:31:35, n_group:3
=================== group[name = GRP_CSS, seq = 0, type = CSS, Control Node = 0] ========================================

[CSS0] auto check = TRUE, global info:
[ASM0] auto restart = FALSE
[DSC0] auto restart = FALSE

[CSS1] auto check = TRUE, global info:
[ASM1] auto restart = FALSE
[DSC1] auto restart = FALSE


ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 14:31:34    CSS0          0         9341    Control Node OPEN               WORKING      OK           TRUE         1056417364        1056417608      
	2022-05-16 14:31:34    CSS1          1         9343    Normal Node  OPEN               WORKING      OK           TRUE         1056425684        1056425904      

=================== group[name = GRP_ASM, seq = 1, type = ASM, Control Node = 0] ========================================

n_ok_ep = 2
ok_ep_arr(index, seqno):
(0, 0)
(1, 1)

sta = OPEN, sub_sta = STARTUP
break ep = NULL
recover ep = NULL

crash process over flag is TRUE
ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 14:31:34    ASM0          0         9349    Control Node OPEN               WORKING      OK           TRUE         1056443218        1056443380      
	2022-05-16 14:31:34    ASM1          1         9351    Normal Node  OPEN               WORKING      OK           TRUE         1056449693        1056449837      

=================== group[name = GRP_DSC, seq = 2, type = DB, Control Node = 255] ========================================

n_ok_ep = 2
ok_ep_arr(index, seqno):
(0, 0)
(1, 1)

sta = OPEN, sub_sta = STARTUP
break ep = NULL
recover ep = NULL

crash process over flag is FALSE
ep:	css_time               inst_name     seqno     port    mode         inst_status        vtd_status   is_ok        active       guid              ts              
	2022-05-16 14:31:34    DSC0          0         5236    Normal Node  SHUTDOWN           WORKING      OK           FALSE        807630687         807634940       
	2022-05-16 14:31:34    DSC1          1         5237    Normal Node  SHUTDOWN           WORKING      OK           FALSE        807542710         807547246       

==================================================================================================================
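
To confirm this without reading the tables by eye, the endpoint rows of the show output can be parsed. A small sketch, assuming every endpoint row starts with a css_time timestamp followed by inst_name, seqno, and port, as in the output above (the names ROW and endpoints are my own):

```python
import re

# An endpoint row: "<date> <time>  <inst_name>  <seqno>  <port>  ..."
ROW = re.compile(
    r"^\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+(\S+)\s+(\d+)\s+(\d+)\s+"
)


def endpoints(show_output):
    """Return (inst_name, seqno, port) for every endpoint row in `show` output."""
    eps = []
    for line in show_output.splitlines():
        m = ROW.match(line)
        if m:
            eps.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return eps
```

After the cleanup, none of the returned names should be CSS2, ASM2, or DSC2, and no row should carry seqno 2.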

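8. Next we start the DSC dmserver service again and find that it still fails to start
The startup log complains that a DSC2 log file does not exist, which means this part of the information has still not been cleaned up completely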

[root@czk log]# cat DmServicedsc.log
file dm.key not found, use default license!
version info: develop
DM Database Server x64 V8 1-2-94-21.11.11-150650-10038-ENT  startup...
Normal of FAST
Normal of DEFAULT
Normal of RECYCLE
Normal of KEEP
Normal of ROLL
Database mode = 0, oguid = 0
+DMLOG/log/DSC2_log01.log not exist, can not startup

9. Use the dmctlcvt tool to convert the dm.ctl file into a text file for editing

[root@czk bin]# ./dmctlcvt type=1 src=+DMDATA/data/dsc/dm.ctl dest=/opt/dmdbms/data/DAMENG/dmctrl.txt dcr_ini=/opt/dmdbms/data/DAMENG/dmdcr.ini
DMCTLCVT V8
convert ctl to txt success!

Then open /opt/dmdbms/data/DAMENG/dmctrl.txt with vim, find the part that refers to DSC2_log01.log, and delete the entries for DSC2_log01.log and DSC2_log02.log.
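
Before editing, it can help to list exactly where the converted file mentions the stale log files. The sketch below only searches for the file names and makes no assumption about the layout of dmctrl.txt; find_references is a hypothetical helper, and the deletion itself is still done by hand.

```python
def find_references(ctl_text, needles=("DSC2_log01.log", "DSC2_log02.log")):
    """Return (line_number, line) pairs that mention any of the given names."""
    hits = []
    for lineno, line in enumerate(ctl_text.splitlines(), start=1):
        if any(needle in line for needle in needles):
            hits.append((lineno, line.strip()))
    return hits
```

Running it again on the edited file should return an empty list before the file is converted back.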

10. Use the dmctlcvt tool to convert the text file back into the dm.ctl control file

[root@czk bin]# ./dmctlcvt type=2 src=/opt/dmdbms/data/DAMENG/dmctrl.txt dest=+DMDATA/data/dsc/dm.ctl dcr_ini=/opt/dmdbms/data/DAMENG/dmdcr.ini
DMCTLCVT V8
convert txt to ctl success!

11. Restart the dmserver service. It starts successfully!

[root@czk bin]# systemctl status DmServicedsc
● DmServicedsc.service - DM Instance Service(DmServicedsc).
   Loaded: loaded (/usr/lib/systemd/system/DmServicedsc.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2022-05-16 16:05:29 CST; 9min ago
  Process: 4202 ExecStart=/opt/dmdbms/bin/DmServicedsc start (code=exited, status=0/SUCCESS)
 Main PID: 4229 (dmserver)
   CGroup: /system.slice/DmServicedsc.service
           └─4229 /opt/dmdbms/bin/dmserver path=/opt/dmdbms/data/DAMENG/dsc0_config/dm.ini dcr_ini=/opt/dmdbms/data/DAM...

May 16 16:05:13 czk systemd[1]: Starting DM Instance Service(DmServicedsc)....
May 16 16:05:14 czk DmServicedsc[4202]: Starting DmServicedsc: connnect dmasmtool successfully.
May 16 16:05:29 czk DmServicedsc[4202]: [11B blob data]
May 16 16:05:29 czk systemd[1]: Started DM Instance Service(DmServicedsc)..


Summary

If you have other questions, feel free to ask them in the Dameng community.
Community address: https://eco.dameng.com
