Greenplum Log Analysis Case Study (from the book "Greenplum Enterprise Application in Practice")

用心一

Tags: postgresql

Article link: https://blog.csdn.net/qq_28849581/article/details/123439446

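3.2 Log Analysis

Log analysis is the foundation of website analytics. Analyzing the site's page-view logs provides data to support site optimization and helps to understand the user base and its browsing behavior.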

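3.2.1 Scenario Description

Compute the site-wide PV and UV for each minute, export them to Excel, and draw a line chart.
Parse the URL and extract its parameter list.
Extract member_id from the URL, then compute how users are distributed by number of page views for the day, that is, how many users fall into each range.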

3.2.2 Sample Data

Create the table:

drop table if exists log_path;
create table log_path(
	log_time timestamp(0)		--time of the page view
	,cookie_id	varchar(256)	--cookie_id of the visitor
	,url		varchar(1024)	--URL of the page viewed
	,ip			varchar(64)		--user IP
	,refer_url	varchar(1024)	--referrer URL; only the domain is kept
)distributed by(cookie_id);

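3.2.3 Log Analysis in Practice

PV and UV distribution
cookie_id can be treated as a unique user identifier, so UV is the number of distinct cookie_id values, while PV is the number of log rows. The SQL is as follows: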

select to_char(log_time,'yyyy-mm-dd HH24:mi:00')
	,count(1) pv
	,count(distinct cookie_id) uv
from log_path
group by 1
order by 1;
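
To make the grouping logic easy to check without a database, here is a minimal Python sketch of the same per-minute PV/UV computation. The sample rows and the function name pv_uv_per_minute are hypothetical, invented only for illustration. The query itself remains the way to compute this on log_path.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (log_time, cookie_id)
rows = [
    ("2024-01-01 10:00:05", "c1"),
    ("2024-01-01 10:00:40", "c1"),
    ("2024-01-01 10:00:59", "c2"),
    ("2024-01-01 10:01:10", "c1"),
]

def pv_uv_per_minute(rows):
    """Return {minute: (pv, uv)}, truncating each timestamp to the minute."""
    pv = defaultdict(int)        # page views per minute: one per row
    cookies = defaultdict(set)   # distinct cookie_ids seen per minute
    for log_time, cookie_id in rows:
        ts = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S")
        minute = ts.strftime("%Y-%m-%d %H:%M:00")  # same idea as to_char(..., 'HH24:mi:00')
        pv[minute] += 1
        cookies[minute].add(cookie_id)
    return {m: (pv[m], len(cookies[m])) for m in sorted(pv)}

result = pv_uv_per_minute(rows)
for minute, (page_views, unique_visitors) in result.items():
    print(minute, page_views, unique_visitors)
```

In the sample, cookie c1 views two pages in the first minute, so it adds two to PV but only one to UV.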

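Export the data in CSV format so it can be charted in Excel. The copy command below assumes that the result of the query above has been saved in a table named log_pv_uv_result: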

copy log_pv_uv_result to '/tmp/log_pv_uv.csv' csv;

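Parsing URL parameters
substring with a regular expression extracts the host name from the URL.
split_part splits a string on a delimiter string and returns one of the resulting pieces.
regexp_split_to_array splits a string on a delimiter pattern and returns the pieces as an array.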

drop table if exists log_path_tmp1;
create table log_path_tmp1 as
select
	log_time
	,cookie_id
	,substring(url,E'\\w+://([\\w.]+)') AS host
	,split_part(url,'?',1) as url
	--[iI] matches either case; the earlier [i|I] also matched a literal '|'
	,substring(url,E'member[_]?[iI]d=(\\w+)') AS member_id
	,regexp_split_to_array(split_part(url,'?',2),'&') AS paras
	,ip
	,refer_url
FROM log_path
DISTRIBUTED BY (cookie_id);
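
The regular expressions can be tried outside the database. The following Python sketch mirrors the four derived columns. The function name parse_url and the example URLs are hypothetical. Python's re module is not identical to the PostgreSQL regex engine, but it behaves the same way for these simple patterns.

```python
import re

def parse_url(url):
    """Split a URL into host, path part, member_id, and a list of query parameters."""
    host_match = re.search(r"\w+://([\w.]+)", url)            # scheme://host
    member_match = re.search(r"member_?[iI]d=(\w+)", url)     # member_id=, memberid=, memberId=
    # str.partition keeps everything after the first '?'. split_part(url,'?',2)
    # would stop at a second '?', which differs only for unusual URLs.
    path, _, query = url.partition("?")
    return {
        "host": host_match.group(1) if host_match else None,
        "url": path,
        "member_id": member_match.group(1) if member_match else None,
        # Choice made by this sketch: no query string gives an empty list.
        "paras": query.split("&") if query else [],
    }

example = parse_url("http://www.example.com/item/detail.htm?member_id=abc123&page=2")
print(example)
```

A URL without a query string yields None for member_id and an empty parameter list in this sketch.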

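Distribution of users by page-view count
First aggregate by cookie_id to get the number of page views for each cookie_id, then bucket the counts with case when and aggregate again. The SQL: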

select case when cnt > 100 then '100+'
			when cnt > 50  then '51-100'
			when cnt > 10  then '11-50'
			when cnt > 5   then '6-10'
			else '<=5' end tab
			, count(1) as number
from
(
	select cookie_id,count(1) cnt
	from log_path_tmp1
	group by 1
) t
group by 1;
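
The two-stage aggregation (count per cookie, then count per bucket) can be sketched in Python as follows. The sample data and the names bucket and views_distribution are hypothetical. The bucket labels are written to match the boundary conditions exactly, so a count of 51 falls in '51-100' and a count of 5 falls in '<=5'.

```python
from collections import Counter

def bucket(cnt):
    """Map a page-view count to its range label, checked in the same order as the case when."""
    if cnt > 100:
        return "100+"
    if cnt > 50:
        return "51-100"
    if cnt > 10:
        return "11-50"
    if cnt > 5:
        return "6-10"
    return "<=5"

def views_distribution(cookie_ids):
    """cookie_ids holds one entry per page view; returns {bucket label: number of users}."""
    per_cookie = Counter(cookie_ids)                        # stage 1: views per cookie_id
    return Counter(bucket(c) for c in per_cookie.values())  # stage 2: users per bucket

# Hypothetical log: five users with 3, 7, 120, 51 and 50 page views
sample = ["a"] * 3 + ["b"] * 7 + ["c"] * 120 + ["d"] * 51 + ["e"] * 50
distribution = views_distribution(sample)
print(dict(distribution))
```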

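3.3 Data Distribution

This section looks at how Greenplum spreads data across the data nodes, and at how data skew affects data loading, data analysis, and data export.

3.3.1 Checking How Data Is Distributed

Generate the test data: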

create table test_distribute_1
as 
select a as id
,round(random()) as flag
,repeat('a',1024) as value
from generate_series(1,5000000)a;

The 5 million rows are spread across 6 data nodes. The following SQL shows how many rows each node holds:

select gp_segment_id,count(*)
from test_distribute_1
group by 1;

3.3.2 Impact on Data Load Speed

Test how load speed changes with different distribution keys.
(1) Loading data under skew
1) Prepare the test data by exporting it:

copy test_distribute_1 to '/home/gpadmin/data/test_distribute.dat' with delimiter '|';

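2) Create the test table with the flag column as the distribution key. Because flag takes only the values 0 and 1, all rows end up on at most two data nodes: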

create table test_distribute_2 as select * from test_distribute_1 limit 0 distributed by(flag);
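
The effect of a low-cardinality distribution key can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: segment_for is a hypothetical stand-in that hashes the key with MD5 and takes the remainder modulo the number of data nodes. Greenplum uses its own hash function, so the actual node numbers will differ. The point is only that two distinct key values can reach at most two nodes.

```python
from collections import Counter
import hashlib

NUM_SEGMENTS = 6  # number of data nodes in this section's test cluster

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Stand-in for hash distribution (assumption: not Greenplum's real hash)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_segments

def rows_per_segment(keys):
    """Count how many rows land on each data node for the given key values."""
    return Counter(segment_for(k) for k in keys)

n = 60000
by_id = rows_per_segment(range(1, n + 1))                  # high-cardinality key
by_flag = rows_per_segment(i % 2 for i in range(1, n + 1)) # key with values 0 and 1 only

print("distributed by id:  ", sorted(by_id.items()))
print("distributed by flag:", sorted(by_flag.items()))
```

With id as the key every node receives a share of the rows. With flag as the key at least four of the six nodes stay empty, so the load and every later scan are carried by one or two nodes.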

3) Run the data load:

time psql -h localhost -d testDB -c "copy test_distribute_2 from stdin with delimiter '|'" < /home/gpadmin/data/test_distribute.dat

3.3.3 Impact on Query Speed

(1) Query with skewed data

select gp_segment_id,count(*),max(length(value)) from test_distribute_2 group by 1;

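(2) Query with evenly distributed data (test_distribute_3 is presumably the same data with an evenly distributed key such as id; its creation is not shown here)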

select gp_segment_id,count(*) ,max(length(value)) from test_distribute_3 group by 1;

3.4 Data Compression

3.4.1 Impact on Data Load Speed

Loading becomes faster.

3.4.2 Impact on Query Speed

Queries become faster.

3.5 Indexes

Greenplum supports B-tree, bitmap, and function-based indexes, among others. This section looks at B-tree indexes.

create table test_index_1 as select * from test_distribute_1;

select id,flag from test_index_1 where id = 100;

Next, create a B-tree index on the id column, which is the column the query filters on:

create index test_index_2_idx on test_index_1 (id);

Check the execution plan again; it now uses an index scan:

explain select id , flag from test_index_1 where id=100;

The query time drops from 2606 ms to 23 ms.
