Notes from an incident: a sudden power loss corrupted a TiDB cluster's PD db files, and the cluster failed to start
Error output in pd.log:
["run server failed"] [error="[PD:leveldb:ErrLevelDBOpen]leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000030]"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:122\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"]
Repairing PD with pd-recover
Full documentation: PD Recover User Guide | PingCAP Docs
1. Get the Cluster ID
Get the Cluster ID from the PD log:
cd /TiDB/tidb-deploy/pd-2379/log/
cat pd.log | grep "init cluster id"
[2022/04/20 12:23:07.079 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7088536805883498676]
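If you want just the numeric ID rather than the whole log line (for example, to feed it to pd-recover later), the bracketed field can be stripped with awk. This is a sketch run on a copy of the log line above, not live output:

```shell
# Extract the numeric cluster-id from a PD log line.
# The sample line is copied from the log output above.
line='[2022/04/20 12:23:07.079 +08:00] [INFO] [server.go:358] ["init cluster id"] [cluster-id=7088536805883498676]'
echo "$line" | awk -F'cluster-id=' '{print $2}' | tr -d ']'
# prints 7088536805883498676
```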
2. Get the largest allocated ID
cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
3500
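To see how the pipeline above takes an allocator log line apart: the first awk splits on `=` to isolate the allocated ID, the second strips the trailing `]`, and `sort -r -n | head -n 1` then keeps the largest value across all matching lines. Here is the field extraction run on a single hypothetical sample line (the `id.go:123` source location is made up for illustration):

```shell
# A made-up sample line in the shape of PD's id allocator log output;
# only the message text and the [alloc-id=N] field matter here.
line='[2022/04/20 12:30:00.000 +08:00] [INFO] [id.go:123] ["idAllocator allocates a new id"] [alloc-id=3500]'
echo "$line" \
  | grep "idAllocator allocates a new id" \
  | awk -F'=' '{print $2}' \
  | awk -F']' '{print $1}'
# prints 3500
```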
3. Deploy a new PD cluster
Before deploying the new PD cluster, stop the current PD cluster, then move the old data directory out of the way.
1. Stop the old cluster
tiup cluster stop tidb
2. Move the old cluster's PD data directory aside (renaming it keeps a backup instead of deleting it)
mv /TiDB/tidb-data/pd-2379 /TiDB/tidb-data/pd-2379.bak
3. Edit the old cluster's meta.yaml and comment out the PD-related configuration
/root/.tiup/storage/cluster/clusters/tidb/meta.yaml
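The exact layout of meta.yaml depends on your deployment, but in a typical tiup meta file the PD instances live under `topology.pd_servers`. Commenting that block out might look like the sketch below; the host and paths mirror this cluster, but treat it as an illustration, not the literal file contents:

```yaml
# Sketch only: comment out every line of the pd_servers block in meta.yaml.
topology:
  # pd_servers:
  # - host: 10.66.0.135
  #   deploy_dir: /TiDB/tidb-deploy/pd-2379
  #   data_dir: /TiDB/tidb-data/pd-2379
  #   log_dir: /TiDB/tidb-deploy/pd-2379/log
```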
4. Deploy the new PD
Create a new topology.yaml with the following configuration:
# # Global variables are applied to all deployments and used as the default value of
# # the deployments if a specific deployment value is missing.
global:
  # # The user who runs the tidb cluster.
  user: "tidb"
  # # SSH port of servers in the managed cluster.
  ssh_port: 22
  # # Storage directory for cluster deployment files, startup scripts, and configuration files.
  deploy_dir: "/TiDB/tidb-deploy"
  # # TiDB cluster data storage directory.
  data_dir: "/TiDB/tidb-data"
  # # Supported values: "amd64", "arm64" (default: "amd64")
  arch: "amd64"

# # Monitoring ports and directories for each node in the TiDB cluster.
monitored:
  # # The communication port for reporting system information of each node in the TiDB cluster.
  node_exporter_port: 9120
  # # Blackbox_exporter communication port, used for TiDB cluster port monitoring.
  blackbox_exporter_port: 9125
  # # Storage directory for deployment files, startup scripts, and configuration files of monitoring components.
  deploy_dir: "/TiDB/tidb-deploy/monitored-9120"
  # # Data storage directory of monitoring components.
  data_dir: "/TiDB/tidb-data/monitored-9120"
  # # Log storage directory of the monitoring component.
  log_dir: "/TiDB/tidb-deploy/monitored-9120/log"

# # Server configs are used to specify the configuration of PD Servers.
pd_servers:
  # # The ip address of the PD Server.
  - host: 10.66.0.135
    deploy_dir: "/TiDB/tidb-deploy/pd-2379"
    data_dir: "/TiDB/tidb-data/pd-2379"
    log_dir: "/TiDB/tidb-deploy/pd-2379/log"
Deploy the new PD. The version must be the same as the old cluster's version:
tiup cluster deploy tidb-test v5.4.0 ./topology.yaml --user root -p
Start the new PD:
tiup cluster start tidb-test
4. Repair with the pd-recover tool
Toolkit download: https://download.pingcap.org/tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz
1. Install pd-recover
tar -xf tidb-community-toolkit-v5.4.1-linux-amd64.tar.gz
cd tidb-community-toolkit-v5.4.1-linux-amd64/bin
cp pd-recover /TiDB/tidb-deploy/pd-2379/bin
2. Run pd-recover, passing the Cluster ID obtained in step 1 and an -alloc-id safely larger than the largest allocated ID found in step 2 (here 35000 > 3500), so that IDs handed out by the recovered PD cannot collide with IDs already in use:
cd /TiDB/tidb-deploy/pd-2379/bin
./pd-recover -endpoints http://10.66.0.141:2379 -cluster-id 7088536805883498676 -alloc-id 35000
5. Restart the old cluster
1. Stop the new PD cluster
tiup cluster stop tidb-test
2. Delete or rename the new PD cluster's meta.yaml file
mv /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml /root/.tiup/storage/cluster/clusters/tidb-test/meta.yaml.bak
3. Uncomment the PD configuration in the old cluster's meta.yaml file
/root/.tiup/storage/cluster/clusters/tidb/meta.yaml
4. Restart the old cluster
tiup cluster restart tidb
If some TiKV instances fail to come up, run tiup cluster start tidb again.
PD recovery is complete.