PostgreSQL运维案例--check约束超长导致pg_pathman分区表无法查询

一、 Pathman简介

由于以前PostgreSQL社区版本的分区表功能比较弱,需要通过继承和初始化或RULE来实现分区表的功能,查询和更新涉及约束的检查,插入则涉及转换或规则重构,导致分区功能性能较弱差。Postgrespro公司开发了pg_pathman插件,适用于9.5及之后的版本,与传统方式不同的是,pg_pathman将分区的定义放置在一张元数据表中,表的信息会缓存在内存中,同时使用HOOK来实现关系的替换,所以效率非常高。

二、 问题背景

1.版本信息
Postgresql版本:9.6.6
Pg_Pathman版本:1.4
2.问题现象
一测试库查询报错如下:

postgres=# select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;
ERROR:  constraint "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" of partition "qsump_pacloud_oscginfo_activity_detail_info_day_10" does not exist
HINT:  pg_pathman will be disabled to allow you to resolve this issue

三、 问题根因

由于表名较长,存储在pg_constraint表中的约束名不完整。因此对表查询时pathman拼接出的约束名,和pg_constraint表中的不一致,抛出错误。

补充其他类似场景:近期又遇到了另外一个场景,就是不同pathman版本数据库间数据迁移(1.3数据迁移至1.4或以后版本),出现同样的报错。

用户是直接把1.3版本的pathman分区表,导入到1.5版本pathman的数据库中。

原因就是1.3版本的代码中生成的约束名,相比1.4及以后版本多了%u (实际为attno),因此在pg_constraint中存储的是"pathman_%s_%u_check",而查表时pathman拼接的为"pathman_%s_check",不一致所以报错了。

最简单的解决办法就是将从1.3版本导出的表结构文件进行正则匹配,去除掉约束名中的u%

例如:cat pgdump.txt| sed ‘s/[0-9]_check/check/g’ | sed ‘s/[0-9]check/check/g’ > pgdump_new.txt

代码:
pg_pathman 1.3

char *
build_check_constraint_name_relname_internal(const char *relname, AttrNumber attno)
{
	return psprintf("pathman_%s_%u_check", relname, attno);
}

pg_pathman 1.4

/*
 * Generate check constraint name for a partition.
 * NOTE: this function does not perform sanity checks at all.
 */
char *
build_check_constraint_name_relname_internal(const char *relname)
{
	return psprintf("pathman_%s_check", relname);
}

四、 分析过程
  1. 这个报错从字面解释来看,是分区表的约束不存在。查询pg_constraint表,发现通过报错中的约束名可以查到相关信息
postgres=# select conname,contype,convalidated from pg_constraint where conname='pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check';
-[ RECORD 1 ]+---------------------------------------------------------
conname      | pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec
contype      | c
convalidated | t	

  1. 可以发现查询结果的conname和报错中的相比,末尾的check缺少了字母k。为什么会出现这样的情况?这个只能在源代码中寻找答案
  2. 报错函数如下:
get_partition_constraint_expr(Oid partition)
{
	Oid			conid;			/* constraint Oid */
	char	   *conname;		/* constraint name */
	HeapTuple	con_tuple;
	Datum		conbin_datum;
	bool		conbin_isnull;
	Expr	   *expr;			/* expression tree for constraint */
	conname = build_check_constraint_name_relid_internal(partition); //“拼接”conname
	conid = get_relation_constraint_oid(partition, conname, true);//从pg_constrain中查找conname是否存在
	/*若不存在则进入报错分支*/
	if (!OidIsValid(conid)) 
	{
		DisablePathman(); /* disable pg_pathman since config is broken */
		ereport(ERROR,
				(errmsg("constraint \"%s\" of partition \"%s\" does not exist",
						conname, get_rel_name_or_relid(partition)),
				 errhint(INIT_ERROR_HINT)));
	}}

显然是OidIsValid(conid)函数返回了false,才进入报错分支
OidIsValid函数定义为:#define OidIsValid(objectId) ((bool) ((objectId) != InvalidOid)),因此推断conid=InvalidOid

  1. 而conid是函数**get_relation_constraint_oid(partition, conname, true)的返回值,其中conname是build_check_constraint_name_relid_internal(partition)**的返回值,需要分析这两个函数逻辑以及变量的值

首先分析conname:

char *
build_check_constraint_name_relid_internal(Oid relid)
{
	AssertArg(OidIsValid(relid));
return build_check_constraint_name_relname_internal(get_rel_name(relid));
}

/*
 * Generate check constraint name for a partition.
 * NOTE: this function does not perform sanity checks at all.
 */
char *
build_check_constraint_name_relname_internal(const char *relname)
{
    /*拼接conname*/
	return psprintf("pathman_%s_check", relname);
}

可以看到conname是在build_check_constraint_name_relname_internal函数中拼接的,%s传递的是表名,从报错来看,拼接的conname为:“pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check

接着看conid:

Oid
get_relation_constraint_oid(Oid relid, const char *conname, bool missing_ok)
{
	Relation	pg_constraint;
	HeapTuple	tuple;
	SysScanDesc scan;
	ScanKeyData skey[1];
	Oid			conOid = InvalidOid;   //conOid初始值为InvalidOid

	/*
	 * Fetch the constraint tuple from pg_constraint.  There may be more than
	 * one match, because constraints are not required to have unique names;
	 * if so, error out.
	 */
	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);    

	ScanKeyInit(&skey[0],
				Anum_pg_constraint_conrelid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(relid));

	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
							  NULL, 1, skey); //读取pg_constraint

	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
	{
		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
        /*比较从pg_constraint中获取的conname和上一步拼接传入的conname是否一致*/
        /*当conname一致时,也就是说存在这个constraint,获取其conOid*/
		if (strcmp(NameStr(con->conname), conname) == 0)
		{
			if (OidIsValid(conOid))
				ereport(ERROR,
						(errcode(ERRCODE_DUPLICATE_OBJECT),
				 errmsg("table \"%s\" has multiple constraints named \"%s\"",
						get_rel_name(relid), conname)));
			conOid = HeapTupleGetOid(tuple);
		}
	}

	systable_endscan(scan);
     
	/* If no such constraint exists, complain */
	/*当conname不一致时,conOid还为初始值InvalidOid*/
	if (!OidIsValid(conOid) && !missing_ok)
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("constraint \"%s\" for table \"%s\" does not exist",
						conname, get_rel_name(relid))));

	heap_close(pg_constraint, AccessShareLock);

	return conOid;        //返回conOid的值
}

可以看到conoid的初值为InvalidOid,然后从pg_constraint系统表里获取到的conname和上一步传入的conname不一致,最终conoid没有经过赋值,函数返回了InvalidOid,因此最后导致了报错。刚才已经发现pg_constraint中查询到的conname为:
pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec确实如此,pg_constraint中存储的conname是个错误值,导致报错发生。

  1. 看似是conname在pg_constraint中被“截断了”,也就是说,应该是该conname超过了表pg_constraint定义的最大长度。
NameData	conname;		/* name of this constraint */

typedef struct nameData
{
	char		data[NAMEDATALEN];
} NameData;
typedef NameData *Name;

#define NAMEDATALEN 64

postgres=# select char_length('pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec');
 char_length
-------------
          63
(1 row)

可以看到conname的最大长度为64,而pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec的长度为63,再加结束符‘\0’,刚好是64。

  1. 初步的代码走读,证明了是约束名conname长度超过字段最大长度了,导致存储的约束名不完整,在查询校验约束时报错了。后边的GDB调试也证实了确实是这样,见文章最后。
五、 解决方案
1. 方案讨论

1)约束名过长是由于表名太长,可以压缩一下表名,将某些地方缩写。这个方案看起来最简单快捷,但是开发兄弟不一定认可接受,可能对于整个应用来说并不是修改一个表名那么简单
站在应用的角度,可以认为是数据库给定的字段长度不够,那么就得考虑是否可以从数据库自身去修改:
2)修改字段最大长度,从64调整为128,但是 NAMEDATALEN
这个宏,涉及到所有的表、函数、触发器等对象的名称,牵一发则动全身,这个方案风险大,不可行。
3)继续走读代码,发现约束名是在创建分区表的同时创建的,也是通过调用build_check_constraint_name_relname_internal函数,拼接为pathman_%s_check,%s为表名。不难发现仅pathman和check合起来就占了12个字符长度,为了保证可读性check保持不变,可以将pathman缩写为pm,这样能够节省出5个字符长度,那么以现在的表名长度,可以容纳99万个分区表,足够了。
在这里插入图片描述

2. 方案实施

经过讨论,考虑到改动数据库的风险较大,最终还是选择了压缩表名,把几个单词全拼做了简写,并重建了分区。

3. 其他测试

作为dba,数据库侧的修改,我们还是要自己玩一玩的,验证下其余两个方案
方案2)修改后查询成功,未出现报错。篇幅有限这里不讨论细节
方案3)如下:
修改代码

/*
 * Generate check constraint name for a partition.
 * NOTE: this function does not perform sanity checks at all.
 */
char *
build_check_constraint_name_relname_internal(const char *relname)
{
        /* Modified Begin pathman to pm */
        return psprintf("pm_%s_check", relname);
        /* End 2020-01-10 */
}

编译
cd $pg_pathmansrcdir
Make && make install

登录数据库执行 drop extension pg_pathman ;

重启数据库并create extension pg_pathman ;

创建分区表,插入数据,查询成功未报错

postgres=# create table qsump_pacloud_oscginfo_activity_detail_info_day(id int primary key,add_time timestamp without time zone not null);
CREATE TABLE
postgres=# select create_range_partitions('qsump_pacloud_oscginfo_activity_detail_info_day'::REGCLASS,
postgres(#                         'add_time',
postgres(#                         '2020-01-01 00:00:00'::timestamp without time zone  ,
postgres(#                          interval  '7 days'      ,
postgres(#                         9,
postgres(#                         true  );
 create_range_partitions
-------------------------
                       9
(1 row)

postgres=# insert into qsump_pacloud_oscginfo_activity_detail_info_day values (10,'2020-03-12 00:00:00'::timestamp without time zone);
INSERT 0 1
postgres=# select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;
 id |      add_time
----+---------------------
 10 | 2020-03-12 00:00:00
(1 row)

postgres=# \d+ qsump_pacloud_oscginfo_activity_detail_info_day_10
             Table "public.qsump_pacloud_oscginfo_activity_detail_info_day_10"
  Column  |            Type             | Modifiers | Storage | Stats target | Description
----------+-----------------------------+-----------+---------+--------------+-------------
 id       | integer                     | not null  | plain   |              |
 add_time | timestamp without time zone | not null  | plain   |              |
Indexes:
    "qsump_pacloud_oscginfo_activity_detail_info_day_10_pkey" PRIMARY KEY, btree (id)
Check constraints:
    "pm_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" CHECK (add_time >= '2020-03-04 00:00:00'::timestamp without time zone AND add_time < '2020-03-11 00:00:00'::timestamp without time zone)
Inherits: qsump_pacloud_oscginfo_activity_detail_info_day

postgres=# select char_length('pm_qsump_pacloud_oscginfo_activity_detail_info_day_10_check');
 char_length
-------------
          59
(1 row)

postgres=#

六、 GDB调试

Session 1:

建立一个连接,查询pid

[postgres@postgres_zabbix ~]$ psql
psql (9.6.6)
Type "help" for help.
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           4240

Session 2:

gdb调试session
[postgres@postgres_zabbix ~]$ gdb
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) attach 4240                                         //attach session 1      
Attaching to process 4240
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Loaded symbols for /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00007f995cd6a5e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) b get_partition_constraint_expr                          //设置断点
Breakpoint 1 at 0x7f99562fc3f0: file src/relation_info.c, line 1283.
(gdb) b get_relation_constraint_oid                            //设置断点
Breakpoint 2 at 0x544991: file pg_constraint.c, line 771.
Session 2:
postgres=#  select * from qsump_pacloud_oscginfo_activity_detail_info_day where id=10;                                                   //执行查询
session 1(gdb) n                                                        //开始调试
//省略了一些过程,从第一个分区表开始
Breakpoint 1, get_partition_constraint_expr (partition=16589) at src/relation_info.c:1283
1283            conname = build_check_constraint_name_relid_internal(partition);
(gdb)
1284            conid = get_relation_constraint_oid(partition, conname, true);
(gdb)
Breakpoint 2, get_relation_constraint_oid (relid=16589,
    conname=0x13997e8 "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_1_check", missing_ok=1 '\001') at pg_constraint.c:771                //传入拼接的conname
771             Oid                     conOid = InvalidOid;
(gdb)
778             pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
(gdb)
780             ScanKeyInit(&skey[0],
(gdb)
785             scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);             //获取pg_constraint中的数据
(gdb)
(gdb) n
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb) p *con
$2 = {conname = {data = "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_1_check"}, connamespace = 2200, contype = 99 'c',  condeferrable = 0 '\000', condeferred = 0 '\000', convalidated = 1 '\001', conrelid = 16589, contypid = 0, conindid = 0,
  confrelid = 0, confupdtype = 32 ' ', confdeltype = 32 ' ', confmatchtype = 32 ' ', conislocal = 1 '\001', coninhcount = 0,
  connoinherit = 0 '\000'} //pg_constraint中的conname和查询时拼接的一致

(gdb) n
794                             if (OidIsValid(conOid))
(gdb) n
799                             conOid = HeapTupleGetOid(tuple);
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
803             systable_endscan(scan);
(gdb)
806             if (!OidIsValid(conOid) && !missing_ok)
(gdb)
812             heap_close(pg_constraint, AccessShareLock);
(gdb)
814             return conOid;
(gdb) p conOid
$5 = 16788                                      //返回的conoid为16788
//省略一些过程,直接看报错的这个分区表
(gdb)
Breakpoint 1, get_partition_constraint_expr (partition=16643) at src/relation_info.c:1283
1283            conname = build_check_constraint_name_relid_internal(partition);
(gdb)
1284            conid = get_relation_constraint_oid(partition, conname, true);
(gdb)

Breakpoint 2, get_relation_constraint_oid (relid=16643,
    conname=0x13997e8 "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check", missing_ok=1 '\001') at pg_constraint.c:771                //传入拼接的conname
771             Oid                     conOid = InvalidOid;
(gdb)
778             pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
(gdb)
780             ScanKeyInit(&skey[0],
(gdb)
785             scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);                                 //读取pg_constraint的数据
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb) p *con
$4 = {conname = {data = "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_chec"}, connamespace = 2200, contype = 99 'c',  condeferrable = 0 '\000', condeferred = 0 '\000', convalidated = 1 '\001', conrelid = 16643, contypid = 0, conindid = 0,
  confrelid = 0, confupdtype = 32 ' ', confdeltype = 32 ' ', confmatchtype = 32 ' ', conislocal = 1 '\001', coninhcount = 0,
  connoinherit = 0 '\000'}//pg_constraint中读取的conname和查询时拼接的不同
(gdb) n
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
790                     Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
(gdb)
792                     if (strcmp(NameStr(con->conname), conname) == 0)
(gdb)
788             while (HeapTupleIsValid(tuple = systable_getnext(scan)))
(gdb)
803             systable_endscan(scan);
(gdb)
806             if (!OidIsValid(conOid) && !missing_ok)
(gdb)
812             heap_close(pg_constraint, AccessShareLock);
(gdb)
814             return conOid;
(gdb) p conOid
$5 = 0                                     //因此返回的conoid为0即InvalidOid
(gdb) n
815     }
(gdb)

Session 2:

ERROR:  constraint "pathman_qsump_pacloud_oscginfo_activity_detail_info_day_10_check" of partition "qsump_pacloud_oscginfo_activity_detail_info_day_10" does not exist
HINT:  pg_pathman will be disabled to allow you to resolve this issue
postgres=#

这时session2已经报错了,以上gdb过程,证实了之前的代码分析。

参考:

https://github.com/digoal/blog/blob/d7336aeb9fc9cc82714189f16d67d22e47f9d369/201610/20161024_01.md

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值