无索引如何删除亿级数据
一 业务需求
某业务表a
数据量大约4.7亿,单表物理大小为370G,其中某些指定xxid='xxx’值的记录大约2亿。受限于磁盘空间比较紧张,需要对在无索引的情况下删除无效数据。如何优雅的删除呢?
二 思路
2.1 xxid本身有索引
存在索引的情况下就比较简单,直接利用索引进行删除,写一个for 循环语句 每次删除500行,每次判断delete 影响的行数可以累加计算删除了多少行,直到删除结果为0行。
delete from a where ‘xxid’=‘xxx’ limit 500
如果要求不能创建索引怎么处理?
2.2 xxid 字段无索引
因为表占用的空间已经比较大 370G ,再添加索引会更大。因为没有索引,故我们不能直接像方法一 那样 根据 where xxxid=‘xxx’ 删除数据,那样更慢,可能会引发故障。
我们采取分而治之的方式,基于主键把表的数据分段,比如每段1000行-2000行(如果主键id不连续 则实际数据量会小于指定分段数据)。然后在这1000行里面删除指定的数据,这样delete的执行效率会比直接依赖 xxxid=‘xxx’ 好很多。
1 select min(a.id) min_id,max(a.id) max_id from(selectid from a where id>{init_id} order by id limit 1000) a
2 delete from a where xxid=‘xxx’ and id >=min_id and id <=max_id
3 init_id = max_id
代码如下:
def get_current_max_id ( ) : """
获取当前最大的id
:return:
""" get_max_id = """select max(a.id) max_id from a""" try : mydb = pymysql.connect (
HOST = IP,
PORT = INT ( PORT ),
USER = USER,
read_timeout = 5,
write_timeout = 5,
charset = 'utf8',
autocommit = TRUE
) CURSOR = mydb.CURSOR ( pymysql.cursors.DictCursor ) CURSOR.EXECUTE ( get_max_id ) DATA = CURSOR.fetchall ( ) except
Exception AS e : print traceback.format_exc ( e ) EXIT ( 0 ) finally : mydb.CLOSE ( ) print "we get max id of table : %s" % ( DATA [ 0 ][ 'max_id' ] ) RETURN DATA [ 0 ][ 'max_id' ] def get_min_max_id ( min_id ) : """
:param min_id:
:return:
""" get_ids = """select min(a.id) min_id,max(a.id) max_id from
(select id from a where id>{init_id} order by id limit 2000) a
""".format ( init_id = min_id ) try : mydb = pymysql.connect (
HOST = IP,
PORT = INT ( PORT ),
USER = USER,
read_timeout = 5,
write_timeout = 5,
charset = 'utf8',
DATABASE = 'test',
autocommit = TRUE
) CURSOR = mydb.CURSOR ( pymysql.cursors.DictCursor ) CURSOR.EXECUTE ( get_ids ) DATA = CURSOR.fetchall ( ) except
Exception AS e : print traceback.format_exc ( e ) EXIT ( 0 ) finally : mydb.CLOSE ( ) RETURN DATA [ 0 ][ 'min_id' ],
DATA [ 0 ][ 'max_id' ] def del_tokens ( min_id, max_id ) : """
:param min_id:
:param max_id:
:return:
""" del_token = """delete from a
where client_id in ('xxx','yyy') and id>=%s and id<=%s """ try : mydb = pymysql.connect (
HOST = IP,
PORT = INT ( PORT ),
USER = USER,
read_timeout = 5,
write_timeout = 5,
charset = 'utf8',
DATABASE = 'test',
autocommit = TRUE
) CURSOR = mydb.CURSOR ( pymysql.cursors.DictCursor ) rows = CURSOR.EXECUTE ( del_token, ( min_id, max_id ) ) except
Exception AS e : print traceback.format_exc ( e ) EXIT ( 0 ) finally : mydb.CLOSE ( ) RETURN rows def get_last_del_id ( file_name ) :
IF
NOT os.path.EXISTS ( file_name ) : print "{file} is not exist ,exit .".format ( file = file_name ) EXIT ( - 1 ) WITH OPEN ( file_name, 'r' ) AS fh : del_id = fh.readline ( ).strip ( )
IF
NOT del_id.isdigit ( ) : print "it is '{delid}', not a num , exit ".format ( delid = del_id ) EXIT ( - 1 ) RETURN del_id def main ( ) : file_name = '/tmp/del_aid.id' rows_deleted = 0 maxid = get_current_max_id ( ) init_id = get_last_del_id ( file_name )
WHILE
TRUE : min_id,
max_id = get_min_max_id ( init_id )
IF
max_id > maxid : WITH OPEN ( '/tmp/del_aid.id', 'w' ) AS f : f.WRITE ( str ( min_id ) ) print "delete end at : {end_id}".format ( end_id = init_id ) EXIT ( 0 ) rows = del_tokens ( INT ( min_id ), INT ( max_id ) ) init_id = max_id rows_deleted += rows print "delete at %d ,and we have deleted %d rows " % ( max_id, rows_deleted ) time.sleep ( 0.3 ) ### 可以控制每秒删除的速度
IF
__name__ == '__main__' : main ( )
这个脚本可以记录上一次的id,用上一次id 作为 init_id进行删除。第一次使用的时候需要手工初始化/tmp/del_aid.id 比如写入 0 或者符合条件的最小主键 id。