分布式集群下数据可见性问题分析记录
1. 问题背景介绍
AntDB 3.1在做benchmarksql压测时,一直有个core down 问题。起初一直以为是触发器问题,因触发器的逻辑不太熟悉,一直先紧着别的问题修正。最近定位这个问题时,根据core文件的堆栈信息发现,是foreign key检查的问题。
可通过如下脚本复现该问题。
1-1. 集群搭建
为了简化测试环境,排除其他干扰项,此处搭建一个1C1D的AntDB集群。
postgres=# table pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+-----------
cd2 | C | 38000 | 169.254.3.127 | f | f | 455674804
dn1 | D | 39001 | 169.254.3.127 | f | f | -560021589
(2 rows)
1-2. 建表SQL
创建复制表test, testa,其中test(sid)外键关联testa的id字段。
drop table if exists testa, test;
create table testa(id serial primary key, sex varchar(2)) DISTRIBUTE BY REPLICATION;
insert into testa(sex) values('m');
create table test(id serial primary key, name varchar(20), sid integer references testa on delete cascade) DISTRIBUTE BY REPLICATION;
1-2. 测试脚本
#!/bin/bash
for((i=0;i<=1000;i++));
do
psql -X -p 38000 -U postgres <<EOF
begin;
insert into testa(sex) values('m');
update testa set sex = 'f' where id = 1 returning *;
insert into test(name, sid) values('zl', 1) returning *;
end;
EOF
done
1-3. 测试方法
后台同时运行2次以上脚本,一段时间(很快,约1min)后即可在dn1上产生core文件。
[dev@CentOS-7 ~/workspace_adb31]$ sh test.sh >test.result 2>&1 &
[1] 27726
[dev@CentOS-7 ~/workspace_adb31]$ sh test.sh >test.result 2>&1 &
[2] 27748
[dev@CentOS-7 ~/workspace_adb31]$ sh test.sh >test.result 2>&1 &
[3] 29055
1-4. core 文件堆栈
[dev@CentOS-7 ~/workspace_adb31]$ gdb postgres /data/dev/adb31/dn/dn1/core.22968
(gdb) bt
#0 0x00007fe831fd71f7 in raise () from /lib64/libc.so.6
#1 0x00007fe831fd88e8 in abort () from /lib64/libc.so.6
#2 0x0000000000aa730d in ExceptionalCondition (conditionName=0xd41db0 "!(CritSectionCount > 0 || TransactionIdIsCurrentTransactionId(( (!((tup)->t_infomask & 0x0800) && ((tup)->t_infomask & 0x1000) && !((tup)->t_infomask & 0x0080)) ? HeapTupleGetUpdateXid(tup) : ( (tup)-"..., errorType=0xd41cc4 "FailedAssertion", fileName=0xd41c80 "/home/dev/workspace_adb31/adb_sql/src/backend/utils/time/combocid.c", lineNumber=132) at /home/dev/workspace_adb31/adb_sql/src/backend/utils/error/assert.c:54
#3 0x0000000000af1cad in HeapTupleHeaderGetCmax (tup=0x7fe819dfd360) at /home/dev/workspace_adb31/adb_sql/src/backend/utils/time/combocid.c:131
#4 0x00000000004de247 in heap_lock_tuple (relation=0x7fe833bf3aa0, tuple=0x7fff69b0dc10, cid=3, mode=LockTupleKeyShare, wait_policy=LockWaitBlock, follow_updates=1 '\001', buffer=0x7fff69b0dc58, hufd=0x7fff69b0dc40) at /home/dev/workspace_adb31/adb_sql/src/backend/access/heap/heapam.c:5100
#5 0x00000000006ebe94 in ExecLockRows (node=0x1fe9950) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/nodeLockRows.c:179
#6 0x00000000006c8577 in ExecProcNode (node=0x1fe9950) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/execProcnode.c:585
#7 0x00000000006c4436 in ExecutePlan (estate=0x1fe9748, planstate=0x1fe9950, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=1, direction=ForwardScanDirection, dest=0x101de40 <spi_printtupDR>) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/execMain.c:1624
#8 0x00000000006c2379 in standard_ExecutorRun (queryDesc=0x1fe7798, direction=ForwardScanDirection, count=1) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/execMain.c:352
#9 0x00000000006c21e8 in ExecutorRun (queryDesc=0x1fe7798, direction=ForwardScanDirection, count=1) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/execMain.c:297
#10 0x000000000070a5a5 in _SPI_pquery (queryDesc=0x1fe7798, fire_triggers=0 '\000', tcount=1) at /home/dev/workspace_adb31/adb_sql/src/backend/executor/sp