生产在线清理大表实战分享
@Ene_Takane
未经允许严禁转载
1、背景介绍
1.1、背景陈述
这是我在实际生产环境中遇到的一个问题,我先介绍一下:
在某公司的生产环境中,他们将告警信息存放在alerts
、alerts_count
两个表中,以timestamp
时间戳列作为主要的查询依据,告警平台,从这两个表中读取信息,来触发告警。
因为告警设置的问题,导致产生大量的告警,两个表共有超过千万行的告警记录,导致后台的select...left join...
查询语句超时,所以告警无法显示。
两个表的告警,会有一个保留天数,也就是说后台会自动执行清理语句
delete from alerts where timestamp < ? ;
delete from alerts_count where timestamp < ? ;
1.2、处理过程
在查明告警大量产生的原因之后,进行修改,已不再非正常产生告警。
因为是生成环境,担心delete会hang住MySQL,清理脚本采用limit进行限制,循环删除。
delete from table_name where timestamp < ? limit n ;
发现问题
alerts_counter
的timestamp
列没有索引,导致每次delete
都会全表扫描,这也是select
超时的罪魁祸首- 后台的
select
告警查询,delete
自动清理都无法关闭,导致和我们的清理产生锁冲突,进一步降低了速度
2、方案简介
2.1、 truncate方案
给alerts_counter表加排他锁,暂时阻塞后台查询以及自动清理
由于要直接清除掉alerts_counter表中的所有记录,所以直接执行truncate命令。
truncate table universe.alerts_counter;
truncate影响:
1、truncate 执行时要全程添加排他锁,阻塞alerts_counter读写。此时新产生的告警将无法写入
2、 truncate 会扫描 buffer pool LRU list,当 buffer pool很大,并且数据页都有数据时(即Innodb_buffer_pool_pages_data 很大时),truncate 耗时会很长,且会极大影响 MySQL 的整体性能。
2.2、drop方案
首先新建一个用来替换原alerts_counter表的临时表alerts_counter_temp;给新旧两个表加排他锁;然后使用rename命令,将两表相互替换
drop删除掉原alerts_counter表的操作,既可以在空闲时间,去操作。
新表中还会加入timestamp索引,优化以后的查询。
2.3、操作方法
详见2.3操作过程
3、测试方法
3.1、测试环境
CPU:4
memory:4GB
disk size:20GB / 40GB
innodb_buffer_pool_size:1GB
OS:CentOS 7
MySQL:MySQL 5.7.27
3.2、数据准备
准备300万测试数据
delimiter ;;
create procedure truncate_test()
begin
declare i int;
set i=1;
while(i<=3000000)do
insert into universe.alerts_counter values(i,now(),"ef3457cd",123456,"fdfgkulh");
set i=i+1;
end while;
end;;
delimiter ;
call truncate_test();
mysql> select count(1) from alerts_counter;
+----------+
| count(1) |
+----------+
| 3000000 |
+----------+
3.3、操作过程
3.3.1、truncate方案测试
1、锁表
mysql> lock table alerts_counter write;
Query OK, 0 rows affected (1 min 16.53 sec)
mysql> show processlist \G
# 查看锁表之后,所有和此表相关的线程,是否全部处于`State: Waiting for table metadata lock`状态,如未处于此状态,将其kill
2、直接删除
truncate table universe.alerts_counter;
Query OK, 0 rows affected (2 min 16.26 sec)
3、释放锁
unlock tables;
3.3.2、drop方案测试
1、传建临时表
use universe;
CREATE TABLE `alerts_counter_temp` (
......
......
INDEX (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
建表的时候,对原alerts_counter表加了timestamp索引,优化之后的查询。
2、锁表
lock table alerts_counter write;
Query OK, 0 rows affected (1 min 6.53 sec)
# 等待时间较长,在等待查询、清理线程释放锁资源,此期间不影响业务,也可手动关闭持有锁的线程来缩短此时间
lock table alerts_counter_temp write;
Query OK, 0 rows affected (0.00 sec)
show processlist \G
# 查看锁表之后,所有和此表相关的线程,是否全部处于`State: Waiting for table metadata lock`状态,如未处于此状态,将其kill
3、替换旧表
ALTER TABLE alerts_counter RENAME TO alerts_counter_old, ALGORITHM=INPLACE;
Query OK, 0 rows affected (0.05 sec)
ALTER TABLE alerts_counter_temp RENAME TO alerts_counter, ALGORITHM=INPLACE;
Query OK, 0 rows affected (0.07 sec)
替换完成后,手动添加的锁,已不会影响正常查询,清理业务。
至此,我们已经可以算是将alerts_counter表清理完成了。
4、释放锁
unlock tables;
5、删除旧表
alerts_counter_old可以在空闲时间删除,drop占用资源少,对性能影响小
drop table alerts_counter_old;
Query OK, 0 rows affected (0.08 sec)
3.4、测试说明
- 本次测试,存在自动清理机制,告警查询机制的影响,通过lock table添加排他锁,阻塞他们,使其处于
Waiting for table metadata lock
状态,避免影响后续操作。
-
加锁期间,将阻塞告警查询、新告警写入的动作,锁释放后继续执行。
-
清理操作,不会对其他组件的可用性产生影响。
4、结果
最终确定使用drop方案,执行期间只有在加锁时略有延时,通过kill掉阻塞加锁的查询、自动清理线程,快速完成加锁。
总体耗费时间3s左右,成功清除剩余的500多万行数据,期间未对业务产生不良影响。
5、清理脚本
以下仅供参考
5.1、alerts_counter清理
#! /bin/bash
###########
# VARIABLES
###########
MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'
STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out
###########
# functions
###########
# usage: log level msg
function log() {
echo "`date +'%F %T'` $1 $2"
}
# usage: executeSql sql extra_flags...
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
sql="$1"
shift 1
extraFlags=$@
connStr="-S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS"
if ! $MYSQL_CLIENT $connStr $extraFlags -e"$sql" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
log "ERROR" "Error in executing sql, see details belows: $sql"
cat $STDERR_LOG
return 1
fi
return 0
}
########
# main()
########
for table in "alerts_counter" ; do
deleted_rows=0
deleted_rows_total=0
total_rows=0
new_table="${table}_new"
temp_table="${table}_temp"
set +o posix
log "INFO" "Start to clean table ${table}"
# 1. create the new empty table with same structure
log "INFO" "Start to create the new empty table with same structure"
if ! executeSql "CREATE TABLE IF NOT EXISTS ${new_table} LIKE ${table}" ; then
exit 1
fi
log "INFO" "Finished creation of the new empty table with same structure"
# 2.1. LOCK-RENAME-UNLOCK
{
sql="LOCK TABLES ${table} WRITE, ${new_table} WRITE"
sql="${sql};ALTER TABLE ${table} RENAME TO ${temp_table}, ALGORITHM=INPLACE"
sql="${sql};ALTER TABLE ${new_table} RENAME TO ${table}, ALGORITHM=INPLACE"
log "INFO" "Start to LOCK-RENAME-UNLOCK"
if ! executeSql "${sql}" ; then
exit 1
fi
log "INFO" "Finished LOCK-RENAME-UNLOCK"
} &
# 2.2. kill all running slow queries
sleep 2
log "INFO" "Start to kill all running slow quries (SELECT ... LEFT JOIN ... and DELETE w/o LIIMIT)"
findSlowSql="SELECT CONCAT('KILL QUERY ',ID, ';') FROM information_schema.processlist WHERE"
findSlowSql="${findSlowSql} INFO LIKE '%LEFT JOIN%'"
findSlowSql="${findSlowSql} OR ( INFO LIKE 'DELETE%' AND INFO NOT LIKE '%LIMIT%')"
findSlowSql="${findSlowSql} AND ID != CONNECTION_ID()"
while read sql ; do
executeSql "sql" "-N" "-s"
done < <(executeSql "${findSlowSql}" "-N" "-s")
log "INFO" "Finished killing all running slow quries (SELECT ... LEFT JOIN ... and DELETE w/o LIIMIT)"
log "INFO" "Waiting for LOCK-RENAME-UNLOCK to finish"
wait %1
# 3. drop useless table
log "INFO" "Start to drop useless table(${temp_table})"
if ! executeSql "DROP TABLE IF EXISTS ${temp_table}" ; then
exit 1
fi
log "INFO" "Finished dropping of useless table(${temp_table})"
log "INFO" "Finished cleaning of table ${table}"
done
5.2、alerts清理
#! /bin/bash
###########
# VARIABLES
###########
MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'
STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out
TIME_CONDITION="timestamp < '2019-10-30 23:59:59'"
###########
# functions
###########
# usage: log level msg
function log() {
echo "`date +'%F %T'` $1 $2"
}
# usage: executeSql sql
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
if ! $MYSQL_CLIENT --show-warnings -S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS -e"$1" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
log "ERROR" "Error in executing sql, see details belows: $sql"
cat $STDERR_LOG
return 1
fi
return 0
}
########
# main()
########
for table in "alerts" ; do
deleted_rows=0
deleted_rows_total=0
total_rows=0
log "INFO" "Start to clean table ${table}"
# 1. get total rows
if ! executeSql "SELECT COUNT(1) FROM ${table} WHERE ${TIME_CONDITION}" ; then
exit 1
fi
total_rows="`sed -nre '2p' $STDOUT_LOG`"
log "INFO" "Found $total_rows rows in ${table} need to delete."
# 2. do batch delete repeatly
while true ; do
if ! executeSql "DELETE FROM ${table} WHERE ${TIME_CONDITION} LIMIT 100; SELECT ROW_COUNT();" ; then
exit 1
fi
deleted_rows="`sed -nre '2p' $STDOUT_LOG`"
let deleted_rows_total+=deleted_rows
if (($deleted_rows_total % 1000 > 0 && $deleted_rows != 0)) ; then
continue
fi
log "INFO" "deleted ${deleted_rows} rows, ${deleted_rows_total}/${total_rows} in total"
if (($deleted_rows == 0)); then
log "INFO" "finished cleaning of $table"
break
fi
done
done
5.3、循环kill线程
#! /bin/bash
###########
# VARIABLES
###########
MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'
STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out
INTERVAL='5'
###########
# functions
###########
# usage: log level msg
function log() {
echo "`date +'%F %T'` $1 $2"
}
# usage: executeSql sql extra_flags...
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
sql="$1"
shift 1
extraFlags=$@
connStr="-S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS"
if ! $MYSQL_CLIENT $connStr $extraFlags -e"$sql" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
log "ERROR" "Error in executing sql, see details belows: $sql"
cat $STDERR_LOG
return 1
fi
return 0
}
########
# main()
########
findSlowSql="SELECT CONCAT('KILL QUERY ',ID, ';') FROM information_schema.processlist WHERE"
findSlowSql="${findSlowSql} INFO LIKE '%LEFT JOIN%'"
findSlowSql="${findSlowSql} OR ( INFO LIKE 'DELETE%' AND INFO NOT LIKE '%LIMIT%')"
findSlowSql="${findSlowSql} AND ID != CONNECTION_ID()"
while sleep $INTERVAL ; do
log "INFO" "do killing"
while read sql ; do
executeSql "sql" "-N" "-s"
done < <(executeSql "${findSlowSql}" "-N" "-s")
log "INFO" finished killing
done