简单说明:
本博文不面向Oracle行列转换零基础的筒子,这是一个高级列转行之后的报表实例:
有张表存在冗余数据
id1,url1
id2,url1&url2&url3&url3&url2...
id3,url1&url2&url3&url3&url2...
id4,url1&url4&url5&url5...
现在要根据url进行记录滤重,将重复出现的url对应的id查出来拼接在一起
url是不确定的,可能有上万个url甚至更多,一条记录中的url可能存在重复,相邻的记录之间的url可能是完全重复的
准备测试数据:
create table id_urls(id number,urls varchar2(255));
set define off
insert into ID_URLS values(1,'URL1');
insert into ID_URLS values(2,'URL1');
insert into ID_URLS values(3,'URL1&URL2&URL3');
insert into ID_URLS values(4,'URL1&URL2&URL3');
insert into ID_URLS values(5,'URL2&URL3&URL2');
insert into ID_URLS values(6,'URL2&URL3&URL2');
insert into ID_URLS values(7,'URL6&URL7&URL8&URL9&URL0');
insert into ID_URLS values(8,'URL10');
commit;
set define on
查询推导过程:
1° 将URL替换为空,查看该字段最多有几个URL
select regexp_replace(urls, '[^&]+', '') from ID_URLS;
select length(regexp_replace(urls, '[^&]+', '')) from ID_URLS;
select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS;
-- 最大数量至少要加1,如果URL数量最多的记录以分隔符&为行尾,加1会导致最大URL个数比实际个数大1
-- 求出来的最大URL个数是做记录笛卡尔积的依据
-- 比如当前记录只有一个URL,最大URL个数为10个,那么该行记录会笛卡尔积10次
-- 然后将10次笛卡尔积的结果集根据&符切分出第1组到第10组,会切出9组空记录,滤空即可解决
-- 如果求出的最大URL个数比实际小的话,则会造成切分时丢失数据
-- 比如当前记录有10个URL,但是求出的最大URL个数为5,那么该记录只会笛卡尔积5次
-- 然后将5次笛卡尔积的结果集根据&符切分出第1组到第5组,就会丢失5组数据
-- 因此求出的最大URL个数比实际个数大是可以的,小是不行的
-- 只有判断表中所有记录包含或者不包含以&分隔符结尾的情况才能得出正好相等的最大URL个数
2° 如果要将所有的URL做列传行的还原,至少需要每条记录重复max_len次
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS)
select level as lvl from dual connect by level <= (select max_len from max_length);
3° 每条记录都重复max_len次,列传行切出来所有的URL
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS),
repeat as
(select level as lvl from dual connect by level <= (select max_len from max_length))
select id,regexp_substr(urls,'[^&]+',1,lvl) as single_url from ID_URLS,repeat order by id;
4° 因为并不是所有的记录的URL都有max_len个,所以存在URL为空的记录,过滤掉
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS),
repeat as
(select level as lvl from dual connect by level <= (select max_len from max_length)),
cut_url as
(select id,regexp_substr(urls,'[^&]+',1,lvl) as single_url from ID_URLS,repeat)
select id, single_url from cut_url where single_url is not null order by id;
5° 因为同一条记录的URL有可能存在重复,因此需要滤重
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS),
repeat as
(select level as lvl from dual connect by level <= (select max_len from max_length)),
cut_url as
(select id,regexp_substr(urls,'[^&]+',1,lvl) as single_url from ID_URLS,repeat)
select distinct id, single_url from cut_url where single_url is not null order by id;
6° 查看一下有多少URL重复,每个URL重复了多少次
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS),
repeat as
(select level as lvl from dual connect by level <= (select max_len from max_length)),
cut_url as
(select id,regexp_substr(urls,'[^&]+',1,lvl) as single_url from ID_URLS,repeat),
distinct_record as
(select distinct id, single_url from cut_url where single_url is not null order by id)
select single_url,count(*) from distinct_record where single_url is not null group by single_url having count(*)>1;
7° 将重复的URL对应的ID拼接在一起,丢到需求人脸上
with max_length as
(select max(length(regexp_replace(urls, '[^&]+', ''))) + 1 as max_len from ID_URLS),
repeat as
(select level as lvl from dual connect by level <= (select max_len from max_length)),
cut_url as
(select id,regexp_substr(urls,'[^&]+',1,lvl) as single_url from ID_URLS,repeat),
distinct_record as
(select distinct id, single_url from cut_url where single_url is not null order by id)
select single_url,to_char(wm_concat(id)) from distinct_record where single_url is not null group by single_url having count(*)>1;
收尾:
需求产生原因:
这是一个真实的生产需求,该表存的URL是图片URL
用户前台页面上传图片,一次上传的图片可能是多个
图片上传到第三方,第三方将上传的图片存储的URL返回给后台
因为网络原因或者第三方的BUG,造成多次将用户的一次上传行为返回给后台
且返回信息中的图片URL存在重复
后台未做处理直接将这些图片URL入库,造成了本次需求的产生
解决办法:
后台将收到的图片URL存入Redis,一个图片URL存一个KEY,一分钟超时
当收到图片URL时检索Redis中的当前一分钟内的图片URL,如果存在则不再重复入库
如果不存在,新建该图片URL的KEY,设置一分钟超时,入库
题外话:
从解决办法上可以明显看出锅到底是谁的
该难度的SQL业务开发无法写出(他们少点BUG就谢天谢地了)
普通的维护向DBA无法写出,专司BI报表的DBA或SQL开发能够写出
但是作为DBA,应当追求熟练揉捏手中数据的能力
[TOC]