PostgreSQL: 如何获取一维数组的相同元素并根据相似度排序

最新推荐文章于 2024-05-28 09:07:56 发布

jamst8522127

最新推荐文章于 2024-05-28 09:07:56 发布

阅读量1.2k

点赞数

分类专栏：大数据行业信息文章标签：数据库 runtime

本文链接：https://blog.csdn.net/jamst8522127/article/details/84628414

版权

大数据行业信息专栏收录该内容

20 篇文章 0 订阅

订阅专栏

PostgreSQL: 如何获取一维数组的相同元素并根据相似度排序

2011-11-02 22:24:09| 分类： Postgres基础 |举报|字号订阅

今天开发有个需求，表中有一个列为一维数组类型，现在需要找出表中具有相同元素的数据，描述起来
可能有点费力，下面举个例子就明白了。

一需求演示
--1.1测试表
mydb=> \d test_array;
Table "mydb.test_array"
Column | Type | Modifiers
--------+----------+-----------
id | integer |
phone | bigint[] |

mydb=> select * from test_array;
id | phone
----+-------------
1 | {1,2}
2 | {1,2,3}
3 | {2,3}
4 | {1,2,3,4}
5 | {1,2,3,4,5}
6 | {4,5,6}

备注: 给出一个 id, 然后找出与这个 id 对应的 phone 数组含有相同元素的记录，相同的元素越多，我们
就认为这两个元素越相似，并根据相似度降序排序。

--1.2 找出与 id=1 的 phone 数组含有相同的元素的记录
mydb=> select id,phone from test_array where phone && (select phone from test_array where id=1) and id!=1;
id | phone
----+-------------
2 | {1,2,3}
3 | {2,3}
4 | {1,2,3,4}
5 | {1,2,3,4,5}

备注：上面SQL虽然能成功找出具有相同元素的记录，但是不能根据相似度排序，今天总结了下，
有以下方法实现上面功能。

方法一：使用intarray模块比较 int4[] 的数组类型

--2.1 安装 intarray 模块
mydb=# create extension intarray;
CREATE EXTENSION

备注：intarray模块里有个 "&" 函数，可以找到数组元素的相同部分，具体信息可查阅手册
http://www.postgresql.org/docs/9.1/static/intarray.html

--2.2 & 操作符使用
mydb=> select array[1,2,3] & array[1,2];
?column?
----------
{1,2}
(1 row)

--2.3 不支持 int8 类型的数组
mydb=> select array[11111111111,2,3] & array[11111111111,2];
ERROR: operator does not exist: bigint[] & bigint[]
LINE 1: select array[11111111111,2,3] & array[11111111111,2];
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

备注：intarray 模块虽然能比较并获得数组的相同元素，但仅支持 int4 数组类型。

--2.4 源码 & 操作符代码
CREATE OPERATOR & (
LEFTARG = _int4,
RIGHTARG = _int4,
COMMUTATOR = &,
PROCEDURE = _int_inter
);
备注：可以在 $PGHOME/share/extension 目录下查阅 intarray--1.0.sql 文件。

方法二：创建 intersection 函数，对 int8[] 数组类型进行比较
--3.1 create function
CREATE OR REPLACE FUNCTION intersection(anyarray, anyarray) RETURNS anyarray as $$
SELECT ARRAY(
SELECT $1[i]
FROM generate_series( array_lower($1, 1), array_upper($1, 1) ) i
WHERE ARRAY[$1[i]] && $2
);
$$ language sql;

备注：这里我们开发组的一名同事找到的，感谢这位同事。

--3.2 测试
mydb=> select intersection(array[11111111111,2,3],array[11111111111,2,3]);
intersection
-------------------
{11111111111,2,3}
(1 row)

备注：这次果然没报错了，这种方法虽然功能实现了，但效率如何呢？下面简单测试下。

四性能测试
--4.1创建测试表并插入数据
mydb=> create table array_test (skyid serial primary key,phone_list int8[]);
NOTICE: CREATE TABLE will create implicit sequence "array_test_skyid_seq" for serial column "array_test.skyid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "array_test_pkey" for table "array_test"
CREATE TABLE

mydb=> insert into array_test(phone_list) select regexp_split_to_array(id1||';'||id2||';'||id3||';'||id4,';')::int8[] from phone ;
INSERT 0 100000

mydb=> select * from array_test limit 10;
skyid | phone_list
-------+---------------
1 | {1,2,3,4}
2 | {2,3,4,5}
3 | {3,4,5,6}
4 | {4,5,6,7}
5 | {5,6,7,8}
6 | {6,7,8,9}
7 | {7,8,9,10}
8 | {8,9,10,11}
9 | {9,10,11,12}
10 | {10,11,12,13}
(10 rows)

--4.2 查询SQL
mydb=> select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=1 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> ;
skyid | phone_list | array_length
-------+------------+--------------
2 | {2,3,4,5} | 3
3 | {3,4,5,6} | 2
4 | {4,5,6,7} | 1
(3 rows)

--4.3 查看执行计划

mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=8 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1) desc;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=3743.94..3745.19 rows=500 width=110) (actual time=1279.393..1279.423 rows=6 loops=1)
Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
Sort Method: quicksort Memory: 17kB
-> Nested Loop (cost=0.00..3721.53 rows=500 width=110) (actual time=0.651..1279.292 rows=6 loops=1)
Join Filter: ((t1.skyid <> t2.skyid) AND (t1.phone_list && t2.phone_list))
-> Index Scan using array_test_pkey on array_test t1 (cost=0.00..8.28 rows=1 width=57) (actual time=0.236..0.275 rows=1 loops=1)
Index Cond: (skyid = 8)
-> Seq Scan on array_test t2 (cost=0.00..2087.00 rows=100000 width=57) (actual time=0.013..608.045 rows=100000 loops=1)
Total runtime: 1279.619 ms
(9 rows)

--4.4创建 gin 索引
mydb=> create index concurrently idx_array_test_phone_list on array_test using gin (phone_list);
CREATE INDEX

--4.5 再次查看PLAN
mydb=> explain analyze select t2.skyid,t2.phone_list, array_length(intersection(t1.phone_list,t2.phone_list),1)
mydb-> from array_test t1, array_test t2
mydb-> where t1.skyid=7 and t1.skyid!=t2.skyid and t1.phone_list && t2.phone_list
mydb-> order by array_length(intersection(t1.phone_list,t2.phone_list),1) desc;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1070.18..1071.43 rows=500 width=110) (actual time=1.185..1.215 rows=6 loops=1)
Sort Key: (array_length(intersection(t1.phone_list, t2.phone_list), 1))
Sort Method: quicksort Memory: 17kB
-> Nested Loop (cost=19.88..1047.77 rows=500 width=110) (actual time=0.854..1.117 rows=6 loops=1)
Join Filter: (t1.skyid <> t2.skyid)
-> Index Scan using array_test_pkey on array_test t1 (cost=0.00..8.28 rows=1 width=57) (actual time=0.231..0.239 rows=1 loops=1)
Index Cond: (skyid = 7)
-> Bitmap Heap Scan on array_test t2 (cost=19.88..905.74 rows=500 width=57) (actual time=0.226..0.264 rows=7 loops=1)
Recheck Cond: (t1.phone_list && phone_list)
-> Bitmap Index Scan on idx_array_test_phone_list (cost=0.00..19.75 rows=500 width=0) (actual time=0.123..0.123 rows=7 loops=1)
Index Cond: (t1.phone_list && phone_list)
Total runtime: 1.399 ms
(12 rows)

备注：由于测试是在虚拟机上进行，数据量并不大，但从上面看出上面的SQL在创建了 gin 类型索引后，
执行时间在 1.3 毫秒左右，效率显著提高。

五参考
http://blog.163.com/digoal@126/blog/static/163877040201192624726272/
http://www.itfingers.com/Question/756871/postgres-function-to-return-the-intersection-of-2-arrays/zh
http://www.postgresql.org/docs/9.1/static/intarray.html
http://www.postgresql.org/docs/9.1/static/indexes-types.html

jamst8522127

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
PostgreSQL: 如何获取一维数组的相同元素并根据相似度排序

PostgreSQL: 如何获取一维数组的相同元素并根据相似度排序 2011-11-02 22:24:09| 分类： Postgres基础 |举报|字号订阅今天开发有个需求，表中有一个列为一维数组类型，现在需要找出表中具有相同元素的数据，描述起来可能有点费力，下面举个例子就明白了。一需求演示--1.1测试表my...
复制链接

扫一扫