TB级大表秒级任意维度分析 - 采样估值满足高效TOP N等分析需求

最新推荐文章于 2023-04-10 06:00:48 发布

云栖精选

最新推荐文章于 2023-04-10 06:00:48 发布

阅读量659

点赞数

文章标签：数据库

原文地址

背景

估值计算是统计学的常用手段。因为数据量庞大，求精确数值需要耗费巨大的资源，而统计分析并不要求完全精确的数据，因此估值计算是一种折中的方法，广泛应用于统计分析场景。

PostgreSQL是一个功能强大的数据库，在估值统计方面，提供了很多方法。

1、PostgreSQL中，求估计的UV，增量UV等(即count distinct)，可以通过HLL插件来实现。

《Greenplum 最佳实践 - 估值插件hll的使用(以及hll分式聚合函数优化)》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 3》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 2》

《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 1》

2、求任意字段的TOP VALUE(包括数组字段的TOP 元素)，以及COUNT，可以通过统计信息柱状图得到。

《PostgreSQL pg_stats used to estimate top N freps values and explain rows》

3、求全表记录数可以通过pg_class.reltuples得到。

4、求任意SQL的返回记录数(例如求分页)，或者求COUNT(*)的估值(将SQL转换为select 1 from ...即可)，可以通过explain的估值得到，例子如下。

《论count与offset使用不当的罪名和分页的优化》

5、求多个字段的唯一值个数，可以通过定义自定义统计信息得到非常准确的估值。

《PostgreSQL 10 黑科技 - 自定义统计信息》

6、求带条件的查询的估值，比如某个省的TOP N电影明星，我们可以通过先采样，然后在采样中进行计算的方法得到。

《PostgreSQL 巧妙的数据采样方法》

《PostgreSQL 数据采样与脱敏》

本文将介绍采样估值的方法。

场景设计

之前写过一个场景，是泛内容网站的透视分析，数据量比较庞大，计算全量需要扫描的数据量较大。看看采样的方法是否满足需求？

《音视图(泛内容)网站透视分析 DB设计 - 阿里云(RDS、HybridDB) for PostgreSQL最佳实践》

1、表结构

create table tbl (  
  id int8,  -- 序列    
  tag1 int[],   -- 数组  
  c1 int,       -- 1-100  
  c2 int,       -- 1-10000  
  c3 timestamp   -- 时间戳  
);

2、生成随机值函数

取值范围$1-$2, 取$3个随机值的数组  
  
create or replace function gen_rand_ints(int, int, int) returns int[] as $$  
  select array(select (random()*($2-$1))::int+$1 from generate_series(1,$3));  
$$ language sql strict;  
  
postgres=# select gen_rand_ints(10,25,5);  
  gen_rand_ints     
------------------  
 {20,19,24,22,21}  
(1 row)

3、写入测试数据

-- 写入热点数组，5000万条  
insert into tbl select id, gen_rand_ints(1,1000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,50000000) t(id);  
  
-- 写入非热点数组，1亿条  
insert into tbl select id, gen_rand_ints(1,1000000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,100000000) t(id);

数据样式如下

postgres=# select * from tbl limit 10;  
    id    |                   tag1                    | c1 |  c2  |             c3               
----------+-------------------------------------------+----+------+----------------------------  
 38931521 | {424,448,91,420,382,657,677,60,530,503}   | 59 | 6120 | 2017-09-11 14:32:06.610512  
 38931522 | {66,87,468,207,79,780,307,714,520,149}    | 44 | 7848 | 2017-09-11 14:32:06.610522  
 38931523 | {99,628,798,558,415,74,863,839,522,953}   | 26 | 9032 | 2017-09-11 14:32:06.610531  
 38931524 | {610,935,962,140,438,551,752,503,636,220} | 71 | 7136 | 2017-09-11 14:32:06.61054  
 38931525 | {998,16,428,518,164,868,303,263,496,102}  | 82 | 9102 | 2017-09-11 14:32:06.61055  
 38931526 | {175,683,749,696,637,8,599,247,942,561}   | 39 | 3796 | 2017-09-11 14:32:06.610559  
 38931527 | {112,138,882,747,356,591,461,355,605,888} | 87 | 7684 | 2017-09-11 14:32:06.610568  
 38931528 | {756,175,31,252,276,850,162,450,533,910}  | 15 | 1691 | 2017-09-11 14:32:06.610578  
 38931529 | {917,744,416,860,306,801,240,416,937,122} | 16 | 2927 | 2017-09-11 14:32:06.610587  
 38931530 | {712,623,647,317,511,519,86,267,693,116}  | 52 | 9676 | 2017-09-11 14:32:06.610596  
(10 rows)

求任意条件下的tag1的TOP N元素。

4、分析表，生成柱状图。

postgres=# analyze tbl;  
ANALYZE

表大小 16 GB。

postgres=# \dt+ tbl  
                   List of relations  
 Schema | Name | Type  |  Owner   | Size  | Description   
--------+------+-------+----------+-------+-------------  
 public | tbl  | table | postgres | 16 GB |   
(1 row)

5、求某个条件下的精确TOP N元素，实际上有1000个热点ID，所以返回TOP 10的COUNT结果非常近似，后面在使用估值时，得到的TOP 10可能就没这么准了，但是一定是在1000个ID以内的。

-- 开启32个并行的查询时间  
  
postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
 tag1 | count   
------+-------  
  134 | 50935  
  768 | 50915  
  663 | 50876  
  567 | 50821  
  146 | 50821  
  332 | 50814  
  450 | 50807  
  884 | 50789  
   58 | 50781  
  605 | 50774  
(10 rows)  
  
Time: 23441.247 ms (00:23.441)  
  
-- 不开并行的查询时间  
postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
 tag1 | count   
------+-------  
  134 | 50935  
  768 | 50915  
  663 | 50876  
  567 | 50821  
  146 | 50821  
  332 | 50814  
  450 | 50807  
  884 | 50789  
   58 | 50781  
  605 | 50774  
(10 rows)  
  
Time: 154935.686 ms (02:34.936)

6、求同样条件下的采样TOP N

采样算法参考文章末尾，PostgreSQL内置了2种采样方法，同时支持扩展采样方法，其中有两个内置的扩展采样方法，实际上内置总共有4种采样方法。

使用块级采样（目前采样不支持并行）。

postgres=# select unnest(tag1) tag1, (count(*))*20      -- 乘以100/采样系数  
from   
(select * from tbl TABLESAMPLE system (5)) t     
where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
 tag1 | ?column?   
------+----------  
  724 |    53380  
  798 |    52680  
   24 |    52640  
  371 |    52480  
  569 |    52400  
  531 |    52280  
  979 |    52160  
  429 |    52140  
  980 |    52080  
  350 |    51920  
(10 rows)  
  
-- 采样5%，约7秒。  
Time: 6887.745 ms (00:06.888)   
  
postgres=# select unnest(tag1) tag1, (count(*))*50    -- 乘以100/采样系数  
from   
(select * from tbl TABLESAMPLE system (2)) t     
where c1 between 1 and 10 group by 1 order by 2 desc limit 10;  
 tag1 | ?column?   
------+----------  
  324 |    55450  
  435 |    55150  
  720 |    55050  
  943 |    54950  
  475 |    54750  
  958 |    54600  
   13 |    54400  
  742 |    54300  
  739 |    54100  
  301 |    53950  
(10 rows)  
  
-- 采样2%， 约3秒。  
Time: 2720.140 ms (00:02.720)  
  
采样越多，精确度越高

采样的方法，得到的TOP N是很准确的，因为例子用了1000个随机值，并且每个随机值的概率是一样的，如果返回TOP 1000，那就准确无疑了。

大表例子

重新设计热点，总共写入40亿测试数据：

一级热点，1，5亿

二级热点，2-4，5亿

三级热点，5-10，5亿

四级热点，11-30，5亿

普通数据，1-100000，20亿

1、表结构设计

create table tbl1 (  
  id int8,  -- 序列  
  c1 int8,  -- 目标字段  
  c2 int8,  -- 1-100  
  c3 int8,  -- 1-100000  
  c4 timestamp  -- 时间戳  
);

2、写入测试数据

nohup psql -c "insert into tbl1 select id, 1, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
nohup psql -c "insert into tbl1 select id, random()*(4-2)+2, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
nohup psql -c "insert into tbl1 select id, random()*(10-5)+5, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
nohup psql -c "insert into tbl1 select id, random()*(30-11)+11, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &  
nohup psql -c "insert into tbl1 select id, random()*100000, random()*100, random()*100000, clock_timestamp() from generate_series(1,2000000000) t(id);" >/dev/null 2>&1 &

3、分析表

postgres=# analyze tbl1;  
ANALYZE  
Time: 502.421 ms

表大小，254 GB。

postgres=# \dt+ tbl1  
                    List of relations  
 Schema | Name | Type  |  Owner   |  Size  | Description   
--------+------+-------+----------+--------+-------------  
 public | tbl1 | table | postgres | 254 GB |   
(1 row)

4、精确TOP 30

-- 开启32个并行的查询时间  
  
postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
 c1 |  count     
----+----------  
  1 | 49991259  
  3 | 25006580  
  2 | 12502559  
  4 | 12498741  
  9 | 10004285  
  6 | 10002597  
  8 |  9999530  
  7 |  9999215  
  5 |  5003219  
 10 |  4998870  
 29 |  2636193  
 18 |  2635457  
 13 |  2635344  
 17 |  2634693  
 26 |  2633965  
 19 |  2633690  
 28 |  2633526  
 14 |  2633512  
 15 |  2633363  
 24 |  2633260  
 20 |  2633014  
 25 |  2632926  
 16 |  2632779  
 22 |  2632508  
 27 |  2632288  
 23 |  2632216  
 21 |  2631443  
 12 |  2631315  
 11 |  1318483  
 30 |  1318451  
(30 rows)  
  
Time: 20845.738 ms (00:20.846)  
  
-- 不开启并行的查询时间  
  
postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
  
 c1 |  count     
----+----------  
  1 | 49991259  
  3 | 25006580  
  2 | 12502559  
  4 | 12498741  
  9 | 10004285  
  6 | 10002597  
  8 |  9999530  
  7 |  9999215  
  5 |  5003219  
 10 |  4998870  
 29 |  2636193  
 18 |  2635457  
 13 |  2635344  
 17 |  2634693  
 26 |  2633965  
 19 |  2633690  
 28 |  2633526  
 14 |  2633512  
 15 |  2633363  
 24 |  2633260  
 20 |  2633014  
 25 |  2632926  
 16 |  2632779  
 22 |  2632508  
 27 |  2632288  
 23 |  2632216  
 21 |  2631443  
 12 |  2631315  
 11 |  1318483  
 30 |  1318451  
(30 rows)  
  
Time: 471112.827 ms (07:51.113)

5、采样TOP 30

select c1,(count(*))*20 from   -- 乘以100/采样系数  
(select * from tbl1 TABLESAMPLE system (5)) t     
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
  
 c1 | ?column?   
----+----------  
  1 | 50068840  
  3 | 25108820  
  2 | 12558680  
  4 | 12513080  
  7 | 10009300  
  9 | 10006260  
  6 | 10005400  
  8 |  9987220  
  5 |  5008280  
 10 |  5007980  
 17 |  2652940  
 16 |  2648640  
 25 |  2646800  
 28 |  2646600  
 15 |  2642480  
 20 |  2642220  
 14 |  2641620  
 26 |  2640500  
 23 |  2639420  
 29 |  2637740  
 22 |  2637320  
 13 |  2636900  
 19 |  2636100  
 18 |  2635120  
 24 |  2634440  
 12 |  2631480  
 27 |  2629880  
 21 |  2624940  
 11 |  1330140  
 30 |  1316480  
(30 rows)  
  
Time: 31884.725 ms (00:31.885)  
  
-- 采样5%，约32秒。  
  
select c1,(count(*))*50 from   -- 乘以100/采样系数  
(select * from tbl1 TABLESAMPLE system (2)) t     
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
  
 c1 | ?column?   
----+----------  
  1 | 50173200  
  3 | 24993550  
  2 | 12487100  
  4 | 12474100  
  6 |  9998250  
  8 |  9980450  
  7 |  9973950  
  9 |  9960450  
 10 |  4999050  
  5 |  4995000  
 29 |  2642700  
 28 |  2640900  
 16 |  2640300  
 26 |  2630250  
 24 |  2627500  
 23 |  2623700  
 19 |  2622350  
 27 |  2622000  
 18 |  2621200  
 12 |  2619450  
 20 |  2616200  
 17 |  2616050  
 21 |  2615800  
 15 |  2613200  
 22 |  2612200  
 14 |  2607700  
 13 |  2605900  
 25 |  2604150  
 30 |  1312300  
 11 |  1311950  
(30 rows)  
  
Time: 12942.455 ms (00:12.942)  
  
-- 采样2%，约13秒。  
  
postgres=# select c1,(count(*))*1000 from   -- 乘以100/采样系数  
(select * from tbl1 TABLESAMPLE system (0.1)) t     
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;  
 c1 | ?column?   
----+----------  
  1 | 48077000  
  3 | 25061000  
  2 | 12762000  
  4 | 12262000  
  8 |  9851000  
  6 |  9789000  
  7 |  9718000  
  9 |  9654000  
  5 |  4971000  
 10 |  4885000  
 18 |  2731000  
 28 |  2727000  
 29 |  2710000  
 23 |  2697000  
 15 |  2687000  
 27 |  2681000  
 22 |  2672000  
 17 |  2672000  
 25 |  2670000  
 19 |  2637000  
 20 |  2632000  
 12 |  2628000  
 14 |  2628000  
 21 |  2622000  
 26 |  2618000  
 13 |  2601000  
 24 |  2522000  
 16 |  2513000  
 11 |  1406000  
 30 |  1301000  
(30 rows)  
  
Time: 863.604 ms  
  
-- 采样0.1%，约0.86秒。

OK，采样千分之一的时候（仅需约扫描254MB数据），只花了不到1秒，就算出了准确的TOP 30，而且准确度相当的高。

如果在Greenplum中支持这个功能，会很爽，一万亿的数据，秒级任意维度钻取透视不是梦。