TiDB: PD files corrupted after a sudden power loss, cluster fails to start

This post records a case where a sudden power loss corrupted a TiDB cluster's PD data files, causing the cluster to fail on startup.

Error output in pd.log:

["run server failed"] [error="[PD:leveldb:ErrLevelDBOpen]leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000030]"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:122\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"] 

Repairing PD with pd-recover

Detailed documentation: PD Recover User Guide | PingCAP Docs

Step 1. Get the Cluster ID

Get the Cluster ID from the PD log:

cd /TiDB/tidb-deploy/pd-2379/log/

cat pd.log | grep "init cluster id"

[2022/04/20 12:23:07.079 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7088536805883498676]

Step 2. Get the maximum allocated ID

# extract the id= values from the allocation log lines, sort numerically, keep the largest
cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1

3500

Step 3. Deploy a new PD cluster

Before deploying the new PD cluster, stop the current cluster and move the old PD data directory out of the way.

1. Stop the old cluster

tiup cluster stop tidb

2. Move the old cluster's PD data directory to a backup location

mv /TiDB/tidb-data/pd-2379 /TiDB/tidb-data/pd-2379.bak

3. Edit the old cluster's meta.yaml and comment out the PD-related configuration (a sketch of the edit follows the file path below)

/root/.tiup/storage/cluster/clusters/tidb/meta.yaml
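A rough sketch of the edit, assuming meta.yaml stores a topology section that mirrors the deployment topology shown later in this post (the host and directories are the ones from this deployment; the exact surrounding structure may differ slightly by TiUP version):

# excerpt of meta.yaml with the pd_servers block commented out
#  pd_servers:
#  - host: 10.66.0.135
#    deploy_dir: /TiDB/tidb-deploy/pd-2379
#    data_dir: /TiDB/tidb-data/pd-2379
#    log_dir: /TiDB/tidb-deploy/pd-2379/log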

4. Deploy the new PD

Create a new topology.yaml with the following configuration:

# # Global variables are applied to all deployments and used as the default value of
# # the deployments if a specific deployment value is missing.
global:
  # # The user who runs the tidb cluster.
  user: "tidb"
  # # group is used to specify the group name the user belong to if it's not the same as user.
  # group: "tidb"
  # # SSH port of servers in the managed cluster.
  ssh_port: 22
  # # Storage directory for cluster deployment files, startup scripts, and configuration files.
  deploy_dir: "/TiDB/tidb-deploy"
  # # TiDB Cluster data storage directory
  data_dir: "/TiDB/tidb-data"
  # # Supported values: "amd64", "arm64" (default: "amd64")
  arch: "amd64"
  # # Resource Control is used to limit the resource of an instance.
  # # See: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
  # # Supports using instance-level `resource_control` to override global `resource_control`.
  # resource_control:
  #   # See: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryLimit=bytes
  #   memory_limit: "2G"
  #   # See: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#CPUQuota=
  #   # The percentage specifies how much CPU time the unit shall get at maximum, relative to the total CPU time available on one CPU. Use values > 100% for allotting CPU time on more than one CPU.
  #   # Example: CPUQuota=200% ensures that the executed processes will never get more than two CPU time.
  #   cpu_quota: "200%"
  #   # See: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#IOReadBandwidthMax=device%20bytes
  #   io_read_bandwidth_max: "/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 100M"
  #   io_write_bandwidth_max: "/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0 100M"
# # Server configs are used to specify the runtime configuration of TiDB components.
# # All configuration items can be found in TiDB docs:
# # - TiDB: https://pingcap.com/docs/stable/reference/configuration/tidb-server/configuration-file/
# # - TiKV: https://pingcap.com/docs/stable/reference/configuration/tikv-server/configuration-file/
# # - PD: https://pingcap.com/docs/stable/reference/configuration/pd-server/configuration-file/
# # - TiFlash: https://docs.pingcap.com/tidb/stable/tiflash-configuration
# #
# # All configuration items use points to represent the hierarchy, e.g:
# #   readpool.storage.use-unified-pool
# #           ^       ^
# # - example: https://github.com/pingcap/tiup/blob/master/examples/topology.example.yaml.
# # You can overwrite this configuration via the instance-level `config` field.
# server_configs:
  # tidb:
  # tikv:
  # pd:
  # tiflash:
  # tiflash-learner:
monitored:
  # # The communication port for reporting system information of each node in the TiDB cluster.
  node_exporter_port: 9120
  # # Blackbox_exporter communication port, used for TiDB cluster port monitoring.
  blackbox_exporter_port: 9125
  # # Storage directory for deployment files, startup scripts, and configuration files of monitoring components.
  deploy_dir: "/TiDB/tidb-deploy/monitored-9120"
  # # Data storage directory of monitoring components.
  data_dir: "/TiDB/tidb-data/monitored-9120"
  # # Log storage directory of the monitoring component.
  log_dir: "/TiDB/tidb-deploy/monitored-9120/log"
# # Server configs are used to specify the configuration of PD Servers.
pd_servers:
  # # The IP address of the PD Server.
  - host: 10.66.0.135
    deploy_dir: "/TiDB/tidb-deploy/pd-2379"
    data_dir: "/TiDB/tidb-data/pd-2379"
    log_dir: "/TiDB/tidb-deploy/pd-2379/log"

Deploy the new PD cluster.

The version must be the same as the old cluster's.

tiup cluster deploy tidb-test v5.4.0 ./topology.yaml --user root -p

Start the new PD cluster:

tiup cluster start tidb-test
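Before running pd-recover, it can help to confirm that the new PD is up, using tiup's standard display subcommand (tidb-test is the cluster name used above):

tiup cluster display tidb-test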

Step 4. Repair with the pd-recover tool

Toolkit download URL: https://download.pingcap.org/tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz
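For example, the toolkit can be fetched with wget (any download tool works; the URL is the one above):

wget https://download.pingcap.org/tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz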

1. Install pd-recover

tar -xf tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz

cd tidb-community-toolkit-v5.4.1-linux-amd64/bin

cp pd-recover /TiDB/tidb-deploy/pd-2379/bin

2. Run pd-recover, passing the Cluster ID obtained in Step 1 and an -alloc-id value larger than the maximum allocated ID found in Step 2 (3500), e.g. 35000:

cd /TiDB/tidb-deploy/pd-2379/bin

./pd-recover -endpoints http://10.66.0.141:2379 -cluster-id 7088536805883498676 -alloc-id 35000

Step 5. Restart the old cluster

1. Stop the new PD cluster

tiup cluster stop tidb-test

2. Delete or rename the new PD cluster's meta.yaml file

mv /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml.bak

3. In the old cluster's meta.yaml, uncomment the PD configuration (the pd_servers block commented out in Step 3)

/root/.tiup/storage/cluster/clusters/tidb/meta.yaml

4. Restart the old cluster

tiup cluster restart tidb

If some TiKV instances fail to come up, run tiup cluster start tidb again.
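To confirm that every component is back online, the cluster status can be checked with tiup's standard display subcommand:

tiup cluster display tidb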

PD recovery is complete.
