首先,使用标准SQL,给出您的假设:
有一个简单布局的表EVENTS:
EVENTS
-----------------------------
SESION_ID , EVENT_NAME , TMST
要在某个时间获得执行步骤#1的会话:
-- QUERY 1
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;
在这里,我假设event1可以在每个会话中发生一次以上.结果是一个在某个时间演示了event1的唯一会话列表.
为了获得step2和step3,我可以这样做:
-- QUERY 2
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID;
-- QUERY 3
SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;
现在,您要按顺序选择执行step1,step2和step3的会话.
更准确地说,您需要计算执行步骤1的会话,然后计算执行步骤2的会话,然后计算执行步骤3的会话.
基本上我们只需要将上面的3个查询与左连接组合在一起,列出进入漏斗的会话以及它们执行的步骤:
-- FUNNEL FOR S1/S2/S3
SELECT
SESSION_ID,
Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM
-- QUERY 1
(SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1,
LEFT JOIN
-- QUERY 2
(SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2,
LEFT JOIN
-- QUERY 3
(SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q3
-- Q2 & Q3
ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST
-- Q1 & Q2
ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST
结果是在步骤1进入漏斗的唯一会话列表,并且可能继续到step2和step3 ……例如:
SESSION_ID_1,TRUE,TRUE,TRUE
SESSION_ID_2,TRUE,TRUE,FALSE
SESSION_ID_3,TRUE,FALSE,FALSE
...
现在我们只需要计算一些统计数据,例如:
SELECT
STEP1_COUNT,
STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1,
STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2,
STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2,
STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3,
STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE
FROM
(-- QUERY TO COUNT session at each step
SELECT
SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
FROM
[... insert the funnel query here ...]
) AS COMPUTE_STEPS
Etvoilà!
现在进行讨论.
第一点,考虑到你采用“集合”(或功能)思维方式而不是“程序化”方法,结果非常简单.不要将数据库可视化为具有列和行的固定表的集合……这是它的实现方式,但它不是您与它交互的方式.这是所有套装,您可以按照您需要的方式安排套装!
第二点,如果您使用MPP数据库,查询将自动优化为并行运行.您甚至不需要以不同方式编写查询,使用map-reduce或其他…我在我的测试数据集上运行了相同的查询,其中包含超过1亿个事件,并在几秒钟内获得结果.
最后但并非最不重要的是,查询打开了无限可能.只需按引用,关键字,登录页面,用户信息和分析的结果进行分组,例如提供最佳转换率!