PostgreSQL 的Distinct on

今天上午处理一个SQL语句的问题,情况是,有一个clicks表,以及geoip和geoip_location表。
clicks表中存取的是客户的点击数,geoip和geoip_location分别存储了各个国家的ip网段和地址信息。
通过如下SQL可以看出每一个ip符合条件的点击数

select count(click_ip) as count,click_ip from
fb_fy_clicks
where earn_id = 191 AND if_pay='t' AND date(crdate)='2007-12-24'
group by click_ip;

下面通过ip找到所在的国家

select * from
(select count(click_ip) as count,click_ip from fb_fy_clicks
where earn_id = 191 AND if_pay='t' AND date(crdate)='2007-12-24'
group by click_ip) as tb
left join geoip G
on ip4(tb.click_ip) <<= G.network
left join geoip_location as geoip
on G.location_id = geoip.id

这样虽然说通过ip找到国家了,但是有的ip是别人通过代理或者伪装的ip,所以有多条ip的重复记录,尽管这些只是少数,但是我们需要把这些对应多个国家的ip按照1个来算。如果仅仅使用distinct我们无法过滤掉多出来的国家。正好postgresql有distinct on

文档记载:

我们还可以用任意表达式来判断什么行可以认为是独立的:

SELECT DISTINCT ON (expression [, expression ...]) select_list ...

这里 expression 是任意值表达式, 它为所有行计算。如果一个行集合里所有行计算出的该表达式的值是一样的, 那么我们认为它们是重复的并且因此只有第一行保留在输出中。 请注意这里的一个集合的"第一行"是不可预料的, 除非你在足够多的字段上对该查询排了序,保证到达DISTINCT过滤器的行的顺序是唯一的。 (DISTINCT ON处理是发生在ORDER BY排序后面的。)

DISTINCT ON子句不是 SQL 标准的一部分, 有时候有人认为它是一个糟糕的风格,因为它的结果是不可判定的。 如果用有选择的GROUP BY和在FROM中的子查询,那么我们可以避免使用这个构造, 但是通常它是更方便的候选方法。

正好符合需求:

将SQL改为:


select distinct on (click_ip) click_ip,network,country,count from (
(select count(click_ip) as count,click_ip from fb_fy_clicks
where earn_id = 191 AND if_pay='t' AND date_trunc('day',crdate)='2007-12-24'
group by click_ip) as tb left join geoip G
on ip4(tb.click_ip) <<= G.network
left join geoip_location as geoip
on G.location_id = geoip.id) as tmp


还有一种情况就是,有些client_ip重复的记录不需要被过滤掉,比如一些client_ip重复的记录,但是他的data_platform字段是不一样的,这样的记录是不应该被过滤掉的。这时,我们使用

SELECT DISTINCT ON (client_ip,data_platform) client_ip,SUM(impressions) AS impressions,crdate,data_platform from views a,ads b
WHERE a.ad_id = b.id
GROUP BY client_ip,data_platform,crdate


真实例子:


SELECT ad_id,country,SUM(impressions) AS impressions,dates,source FROM
(SELECT DISTINCT ON (client_ip,ad_id,source) client_ip,ad_id,COALESCE(country,'OTHER') AS country,impressions,dates,source FROM
(((SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.fb_uid) AS impressions, 0 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id AND a.fb_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc)
UNION ALL
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.bebo_uid) AS impressions, 1 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id AND a.bebo_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc )UNION ALL
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.ms_uid) AS impressions, 2 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id AND a.ms_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc )
UNION ALL
(SELECT DATE_TRUNC('DAY',a.crdate) AS dates,b.ad_id,a.client_ip,COUNT(a.hi5_uid) AS impressions, 3 AS source FROM views_20081020 a,ads b WHERE a.ad_id = b.id AND a.hi5_uid IS NOT NULL GROUP BY b.ad_id,a.client_ip,dates ORDER BY b.ad_id asc)) AS tmp_ads_imprs_country_stats
LEFT JOIN geoip G ON ip4(tmp_ads_imprs_country_stats.client_ip) <<= G.network LEFT JOIN geoip_location AS geoip_loc ON G.location_id = geoip_loc.id ) AS tmp_ads_imprs_country_stats_t ) AS FOO GROUP BY ad_id,country,dates,source ORDER BY ad_id ASC;
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值