生产在线清理大表实战分享

生产在线清理大表实战分享

@Ene_Takane

未经允许严禁转载


1、背景介绍

1.1、背景陈述

这是我在实际生产环境中遇到的一个问题,我先介绍一下:

在某公司的生产环境中,他们将告警信息存放在alertsalerts_count两个表中,以timestamp时间戳列作为主要的查询依据,告警平台,从这两个表中读取信息,来触发告警。

因为告警设置的问题,导致产生大量的告警,两个表共有超过千万行的告警记录,导致后台的select...left join...查询语句超时,所以告警无法显示。

两个表的告警,会有一个保留天数,也就是说后台会自动执行清理语句

delete from alerts where timestamp < ? ;
delete from alerts_count where timestamp < ? ;

1.2、处理过程

在查明告警大量产生的原因之后,进行修改,已不再非正常产生告警。

因为是生成环境,担心delete会hang住MySQL,清理脚本采用limit进行限制,循环删除。

delete from table_name where timestamp < ? limit n ;

发现问题

  • alerts_countertimestamp列没有索引,导致每次delete都会全表扫描,这也是select超时的罪魁祸首
  • 后台的select告警查询,delete自动清理都无法关闭,导致和我们的清理产生锁冲突,进一步降低了速度

2、方案简介

2.1、 truncate方案

给alerts_counter表加排他锁,暂时阻塞后台查询以及自动清理

由于要直接清除掉alerts_counter表中的所有记录,所以直接执行truncate命令。

truncate table universe.alerts_counter;

truncate影响:

1、truncate 执行时要全程添加排他锁,阻塞alerts_counter读写。此时新产生的告警将无法写入

2、 truncate 会扫描 buffer pool LRU list,当 buffer pool很大,并且数据页都有数据时(即Innodb_buffer_pool_pages_data 很大时),truncate 耗时会很长,且会极大影响 MySQL 的整体性能。

2.2、drop方案

首先新建一个用来替换原alerts_counter表的临时表alerts_counter_temp;给新旧两个表加排他锁;然后使用rename命令,将两表相互替换

drop删除掉原alerts_counter表的操作,既可以在空闲时间,去操作。

新表中还会加入timestamp索引,优化以后的查询。

2.3、操作方法

详见2.3操作过程

3、测试方法

3.1、测试环境

CPU:4

memory:4GB

disk size:20GB / 40GB

innodb_buffer_pool_size:1GB

OS:CentOS 7

MySQL:MySQL 5.7.27

3.2、数据准备

准备300万测试数据

delimiter ;;
create procedure truncate_test()
begin
declare i int;
set i=1;
while(i<=3000000)do
 insert into universe.alerts_counter values(i,now(),"ef3457cd",123456,"fdfgkulh");
 set i=i+1;
end while;
end;;
delimiter ;

call truncate_test();

mysql> select count(1) from alerts_counter;
+----------+
| count(1) |
+----------+
|  3000000 |
+----------+

3.3、操作过程

3.3.1、truncate方案测试

1、锁表

mysql> lock table alerts_counter write;
Query OK, 0 rows affected (1 min 16.53 sec)
mysql> show processlist \G      
# 查看锁表之后,所有和此表相关的线程,是否全部处于`State: Waiting for table metadata lock`状态,如未处于此状态,将其kill

2、直接删除

truncate table universe.alerts_counter;
Query OK, 0 rows affected (2 min 16.26 sec)

3、释放锁

unlock tables;
3.3.2、drop方案测试

1、传建临时表

use universe;
CREATE TABLE `alerts_counter_temp` (
......
......
  INDEX (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

建表的时候,对原alerts_counter表加了timestamp索引,优化之后的查询。

2、锁表

lock table alerts_counter write;
Query OK, 0 rows affected (1 min 6.53 sec)
# 等待时间较长,在等待查询、清理线程释放锁资源,此期间不影响业务,也可手动关闭持有锁的线程来缩短此时间
lock table alerts_counter_temp write;
Query OK, 0 rows affected (0.00 sec)
show processlist \G
# 查看锁表之后,所有和此表相关的线程,是否全部处于`State: Waiting for table metadata lock`状态,如未处于此状态,将其kill

3、替换旧表

ALTER TABLE alerts_counter RENAME TO alerts_counter_old, ALGORITHM=INPLACE;
Query OK, 0 rows affected (0.05 sec)
ALTER TABLE alerts_counter_temp RENAME TO alerts_counter, ALGORITHM=INPLACE;
Query OK, 0 rows affected (0.07 sec)

替换完成后,手动添加的锁,已不会影响正常查询,清理业务。

至此,我们已经可以算是将alerts_counter表清理完成了。

4、释放锁

unlock tables;

5、删除旧表

alerts_counter_old可以在空闲时间删除,drop占用资源少,对性能影响小

drop table alerts_counter_old;
Query OK, 0 rows affected (0.08 sec)

3.4、测试说明

  • 本次测试,存在自动清理机制,告警查询机制的影响,通过lock table添加排他锁,阻塞他们,使其处于Waiting for table metadata lock状态,避免影响后续操作。

在这里插入图片描述

  • 加锁期间,将阻塞告警查询、新告警写入的动作,锁释放后继续执行。

  • 清理操作,不会对其他组件的可用性产生影响。

4、结果

最终确定使用drop方案,执行期间只有在加锁时略有延时,通过kill掉阻塞加锁的查询、自动清理线程,快速完成加锁。

总体耗费时间3s左右,成功清除剩余的500多万行数据,期间未对业务产生不良影响。

5、清理脚本

以下仅供参考

5.1、alerts_counter清理

#! /bin/bash

###########
# VARIABLES
###########

MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'

STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out

###########
# functions
###########

# usage: log level msg
function log() {
  echo "`date +'%F %T'` $1 $2"
}

# usage: executeSql sql extra_flags...
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
  sql="$1"
  shift 1
  extraFlags=$@

  connStr="-S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS"

  if ! $MYSQL_CLIENT $connStr $extraFlags -e"$sql" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
    log "ERROR" "Error in executing sql, see details belows: $sql"
    cat $STDERR_LOG
    return 1
  fi

  return 0
}

########
# main()
########

for table in "alerts_counter" ; do

  deleted_rows=0
  deleted_rows_total=0
  total_rows=0
  new_table="${table}_new"
  temp_table="${table}_temp"

  set +o posix

  log "INFO" "Start to clean table ${table}"
  
  # 1. create the new empty table with same structure
  log "INFO" "Start to create the new empty table with same structure"
  if ! executeSql "CREATE TABLE IF NOT EXISTS ${new_table} LIKE ${table}" ; then
    exit 1
  fi
  log "INFO" "Finished creation of the new empty table with same structure"
  
  # 2.1. LOCK-RENAME-UNLOCK
  {
    sql="LOCK TABLES ${table} WRITE, ${new_table} WRITE"
    sql="${sql};ALTER TABLE ${table} RENAME TO ${temp_table}, ALGORITHM=INPLACE"
    sql="${sql};ALTER TABLE ${new_table} RENAME TO ${table}, ALGORITHM=INPLACE"
    log "INFO" "Start to LOCK-RENAME-UNLOCK"
    if ! executeSql "${sql}" ; then
      exit 1
    fi
    log "INFO" "Finished LOCK-RENAME-UNLOCK"
  } &

  # 2.2. kill all running slow queries
  sleep 2
  log "INFO" "Start to kill all running slow quries (SELECT ... LEFT JOIN ... and DELETE w/o LIIMIT)"
  findSlowSql="SELECT CONCAT('KILL QUERY ',ID, ';') FROM information_schema.processlist WHERE"
  findSlowSql="${findSlowSql} INFO LIKE '%LEFT JOIN%'"
  findSlowSql="${findSlowSql} OR ( INFO LIKE 'DELETE%' AND INFO NOT LIKE '%LIMIT%')"
  findSlowSql="${findSlowSql} AND ID != CONNECTION_ID()"
  while read sql ; do
    executeSql "sql" "-N" "-s"
  done < <(executeSql "${findSlowSql}" "-N" "-s")
  log "INFO" "Finished killing all running slow quries (SELECT ... LEFT JOIN ... and DELETE w/o LIIMIT)"

  log "INFO" "Waiting for LOCK-RENAME-UNLOCK to finish"
  wait %1

  # 3. drop useless table
  log "INFO" "Start to drop useless table(${temp_table})"
  if ! executeSql "DROP TABLE IF EXISTS ${temp_table}" ; then
    exit 1
  fi
  log "INFO" "Finished dropping of useless table(${temp_table})"

  log "INFO" "Finished cleaning of table ${table}"

done

5.2、alerts清理

#! /bin/bash

###########
# VARIABLES
###########

MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'

STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out

TIME_CONDITION="timestamp < '2019-10-30 23:59:59'"

###########
# functions
###########

# usage: log level msg
function log() {
  echo "`date +'%F %T'` $1 $2"
}
# usage: executeSql sql
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
  if ! $MYSQL_CLIENT --show-warnings -S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS -e"$1" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
    log "ERROR" "Error in executing sql, see details belows: $sql"
    cat $STDERR_LOG
    return 1
  fi

  return 0
}

########
# main()
########

for table in "alerts" ; do

  deleted_rows=0
  deleted_rows_total=0
  total_rows=0

  log "INFO" "Start to clean table ${table}"
  
  # 1. get total rows
  if ! executeSql "SELECT COUNT(1) FROM ${table} WHERE ${TIME_CONDITION}" ; then
    exit 1
  fi
  total_rows="`sed -nre '2p' $STDOUT_LOG`"
  log "INFO" "Found $total_rows rows in ${table} need to delete."

  # 2. do batch delete repeatly
  while true ; do
    if ! executeSql "DELETE FROM ${table} WHERE ${TIME_CONDITION} LIMIT 100; SELECT ROW_COUNT();" ; then
      exit 1
    fi
    deleted_rows="`sed -nre '2p' $STDOUT_LOG`"
  
    let deleted_rows_total+=deleted_rows
    if (($deleted_rows_total % 1000 > 0 && $deleted_rows != 0)) ; then
      continue
    fi
    log "INFO" "deleted ${deleted_rows} rows, ${deleted_rows_total}/${total_rows} in total"

    if (($deleted_rows == 0)); then
      log "INFO" "finished cleaning of $table"
      break
    fi
  done

done

5.3、循环kill线程

#! /bin/bash

###########
# VARIABLES
###########
MYSQL_CLIENT='/data/mysql/base/bin/mysql'
MYSQL_SOCKET='/data/mysql/data/mysqld.sock'
MYSQL_USER='root'
MYSQL_PASS='password'
MYSQL_DB='DB_name'

STDOUT_LOG=/tmp/std.out
STDERR_LOG=/tmp/err.out

INTERVAL='5'

###########
# functions
###########

# usage: log level msg
function log() {
  echo "`date +'%F %T'` $1 $2"
}

# usage: executeSql sql extra_flags...
# return: 0 for success, others for failure
# remarks: stdout and stderr stores in $STDOUT_LOG and $STDERR_LOG, respectively
function executeSql() {
  sql="$1"
  shift 1
  extraFlags=$@

  connStr="-S $MYSQL_SOCKET -u$MYSQL_USER -p$MYSQL_PASS"

  if ! $MYSQL_CLIENT $connStr $extraFlags -e"$sql" $MYSQL_DB 1>${STDOUT_LOG} 2>$STDERR_LOG ; then
    log "ERROR" "Error in executing sql, see details belows: $sql"
    cat $STDERR_LOG
    return 1
  fi

  return 0
}

########
# main()
########

findSlowSql="SELECT CONCAT('KILL QUERY ',ID, ';') FROM information_schema.processlist WHERE"
findSlowSql="${findSlowSql} INFO LIKE '%LEFT JOIN%'"
findSlowSql="${findSlowSql} OR ( INFO LIKE 'DELETE%' AND INFO NOT LIKE '%LIMIT%')"
findSlowSql="${findSlowSql} AND ID != CONNECTION_ID()"

while sleep $INTERVAL ; do
  log "INFO" "do killing"
  while read sql ; do
    executeSql "sql" "-N" "-s"
  done < <(executeSql "${findSlowSql}" "-N" "-s")
  log "INFO" finished killing
done
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值