问题背景:
在批量从ORACLE迁移到DWS的场景中,经常会结构迁移过程中,只迁移了表结构而没有根据ORACLE源库情况同时设置分布列字段的问题,这样会导致所有的表均自动使用第一个字段(可用作为分布列类型)作为默认分布列,但是很多时候使用默认第一个字段作为分布列会导致严重的数据倾斜,这个时候就需要批量修改分布列字段,但是DWS并不支持直接修改分布列字段,手工去修改每个表会非常麻烦。
Hash分布表的分布列设置不合理,会导致严重的数据倾斜,进而导致查询性能严重劣化,而且个别dn性能下降造成的短板效应会阻塞整个集群的计算能力明显下降。
从实际经验来看,对ORACLE进行整库迁移以后,使用源库的主键和唯一健作为分布列字段是一个比较好的方法,能够快速且相对比较合理的解决大部分表的数据倾斜问题。
首先,将ORACLE源库的主键或唯一健导出:
expdp数据泵导出索引方法:
expdp sys/passwd directory=data_dir dumpfile=expdp.dmp CONTENT=metadata_only include=index sqlfile=expidx.sql
impdp sys/passwd directory=data_dir dumpfile=expdp.dmp include=index sqlfile=expidx.sql
生成导出SQL
-- CONNECT SYSTEM
ALTER SESSION SET EVENTS '10150 TRACE NAME CONTEXT FOREVER, LEVEL 1';
ALTER SESSION SET EVENTS '10904 TRACE NAME CONTEXT FOREVER, LEVEL 1';
ALTER SESSION SET EVENTS '25475 TRACE NAME CONTEXT FOREVER, LEVEL 1';
ALTER SESSION SET EVENTS '10407 TRACE NAME CONTEXT FOREVER, LEVEL 1';
ALTER SESSION SET EVENTS '10851 TRACE NAME CONTEXT FOREVER, LEVEL 1';
ALTER SESSION SET EVENTS '22830 TRACE NAME CONTEXT FOREVER, LEVEL 192 ';
-- new object type path: SCHEMA_EXPORT/TABLE/INDEX/INDEX
-- CONNECT SAPSR3
CREATE UNIQUE INDEX "SAPSR3"."S031~0" ON "SAPSR3"."S031" ("MANDT", "SSOUR", "VRSIO", "SPMON", "SPTAG", "SPWOC", "SPBUP", "WERKS", "MATNR", "LGORT")
PCTFREE 10 INITRANS 2 MAXTRANS 255
TABLESPACE "PSAPSR3" PARALLEL 1 ;
ALTER INDEX "SAPSR3"."S031~0" NOPARALLEL;
CREATE UNIQUE INDEX "SAPSR3"."ZPS00T_ZSBZCXX~0" ON "SAPSR3"."ZPS00T_ZSBZCXX" ("MANDT", "PSPID", "TAB_YEAR", "TAB_MONTH", "EQUNR", "ZFSRQ", "ZFSSJ")
PCTFREE 10 INITRANS 2 MAXTRANS 255
TABLESPACE "PSAPSR3USR" PARALLEL 1 ;
通过shell脚本简单处理:
cat expidx.sql |grep "CREATE UNIQUE INDEX" |sed 's/"//ig'|awk 'BEGIN{FS=" ON "}{print tolower($2)}'|awk '{f=$1;$1="";print f"#"$0}'|awk -F "." '{print $1"#"$2}'|sed 's/[ |(|)]//ig' > idx.txt
生成主键映射表数据:
sapsr3#s031#mandt,ssour,vrsio,spmon,sptag,spwoc,spbup,werks,matnr,lgort
sapsr3#zps00t_zsbzcxx#mandt,pspid,tab_year,tab_month,equnr,zfsrq,zfssj
sapsr3#zps13t_ghjh_zhjh#mandt,zpspid
sapsr3#zps18t0001#mandt,pspid,posid
sapsr3#zps18t0002#mandt,bednr
在DWS中创建映射表:
SET search_path = public;
CREATE TABLE tab_uniq_key (
schema_name text,
table_name text,
uniq_key text
)
WITH (orientation=row, compression=no)
DISTRIBUTE BY HASH(schema_name, table_name)
导入映射表(delimiter为#):
接下来通过下面的SQL,就可以直接生成批量整改SQL:
select
schema_name
,table_name
,uniq_key
,hash_key
,hash_key_upper
,uniq_key_upper
,'START TRANSACTION;'||
chr(10)||'CREATE SCHEMA '||schema_name||'_tmp;'||
chr(10)||'SET search_path='||schema_name||'_tmp;'||
chr(10)||replace(replace(tabledef,'DISTRIBUTE BY HASH('||
case
when hash_key=lower(hash_key)
then hash_key
else hash_key_upper end
||')','DISTRIBUTE BY HASH('|| case when hash_key=lower(hash_key) then uniq_key else uniq_key_upper end||')'),'SET search_path = ','--SET search_path = ')||
chr(10)||'INSERT INTO '||schema_name||'_tmp."'||table_name||'" SELECT * FROM '||schema_name||'."'||table_name||'";'||
chr(10)||'DROP TABLE '||schema_name||'."'||table_name||'";'||
chr(10)||'ALTER TABLE '||schema_name||'_tmp."'||table_name||'" SET SCHEMA '||schema_name||';'||
chr(10)||'DROP SCHEMA '||schema_name||'_tmp CASCADE;'||
chr(10)||'COMMIT;' as tab_def
from
(
select
n.nspname as schema_name
,c.relname as table_name
,getdistributekey(c.oid) hash_key
,u.uniq_key
,'"'||replace(replace(upper(uniq_key),'"',''),',','","')||'"' uniq_key_upper
,'"'||replace(replace(getdistributekey(c.oid),'"',''),',','","')||'"' hash_key_upper
,pg_get_tabledef(c.oid) tabledef
from
pg_class c
,pg_namespace n
,tab_uniq_key u
where
c.relnamespace = n.oid
and lower(n.nspname) = lower(u.schema_name)
and lower(c.relname) = lower(u.table_name)
and c.relkind = 'r'
and n.nspowner != '10'
)
where
replace(replace (lower(hash_key),'"' , '' ),' ','') != replace ( REPLACE(lower(uniq_key),'"' , ''),' ','' )
order by schema_name,table_name;
查询结果中tab_def字段即为批量整改SQL语句