近日在做一个数据分析时,发现DISTINCT比较慢,想通过group by来替代。然而替代并非一帆风顺,在替代过程中,发现对于重复数据,group by会取第一次出现的记录。为得到我想要的统计数据,折腾了一番。下面用实例来描述我的问题及调整过程。
场景:有一张表,记录手机用户的常规信息,每天每个用户一条记录
CREATE TABLE `userinfo_test` (
`day` date NOT NULL DEFAULT '2016-06-01',
`username` varchar(64) NOT NULL,
`phone` varchar(16) NOT NULL DEFAULT '',
`cv` varchar(16) NOT NULL DEFAULT '',
PRIMARY KEY (`day`,`username`),
KEY `ix_day_username_phone_cv` (`day`,`username`,`phone`,`cv`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
现在需要统计每个月,每个机型每个版本的活跃人数(对于同一用户,如果升级了版本,只统计升级后的版本)
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000001','A', '1001');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000002','A', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000003','A', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000004','B', '1001');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000005','B', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000006','B', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000007','C', '1001');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000008','C', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-01','10000009','C', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000001','A', '1001');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000002','A', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000003','A', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000004','B', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000005','B', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000006','B', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000007','C', '1003');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000008','C', '1002');
INSERT INTO `userinfo_test` VALUES('2016-05-02','10000009','C', '1003');
记录每天的用户常规信息,若当天信息有变化,只记录最后的信息,每个用户每天只有一条记录信息
mysql> SELECT * FROM `userinfo_test`;
+------------+----------+-------+------+
| day | username | phone | cv |
+------------+----------+-------+------+
| 2016-05-01 | 10000001 | A | 1001 |
| 2016-05-01 | 10000002 | A | 1002 |
| 2016-05-01 | 10000003 | A | 1003 |
| 2016-05-01 | 10000004 | B | 1001 |
| 2016-05-01 | 10000005 | B | 1002 |
| 2016-05-01 | 10000006 | B | 1003 |
| 2016-05-01 | 10000007 | C | 1001 |
| 2016-05-01 | 10000008 | C | 1002 |
| 2016-05-01 | 10000009 | C | 1003 |
| 2016-05-02 | 10000001 | A | 1001 |
| 2016-05-02 | 10000002 | A | 1003 |
| 2016-05-02 | 10000003 | A | 1003 |
| 2016-05-02 | 10000004 | B | 1002 |
| 2016-05-02 | 10000005 | B | 1002 |
| 2016-05-02 | 10000006 | B | 1003 |
| 2016-05-02 | 10000007 | C | 1003 |
| 2016-05-02 | 10000008 | C | 1002 |
| 2016-05-02 | 10000009 | C | 1003 |
| 2016-05-02 | 10000019 | C | 1003 |
+------------+----------+-------+------+
19 rows in set (0.00 sec)
汇总每天各机型各版本的人数
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v, COUNT(cv)daynum FROM `userinfo_test` WHERE DAY = '2016-05-01' GROUP BY phone,cv WITH ROLLUP;
mysql> SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v, COUNT(cv)daynum FROM `userinfo_test` WHERE DAY = '2016-05-01' GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+--------+
| p | v | daynum |
+-------------------+-------------------+--------+
| A | 1001 | 1 |
| A | 1002 | 1 |
| A | 1003 | 1 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1001 | 1 |
| B | 1002 | 1 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1001 | 1 |
| C | 1002 | 1 |
| C | 1003 | 1 |
| C | WITH_ROLLUP_TOTAL | 3 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 9 |
+-------------------+-------------------+--------+
13 rows in set (0.00 sec)
mysql> SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v, COUNT(cv)daynum FROM `userinfo_test` WHERE DAY = '2016-05-02' GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+--------+
| p | v | daynum |
+-------------------+-------------------+--------+
| A | 1001 | 1 |
| A | 1003 | 2 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1002 | 2 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1002 | 1 |
| C | 1003 | 3 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+--------+
10 rows in set (0.00 sec)
汇总每月各机型各版本的人数,若该月同一人有多个机型版本信息,则以最后记录的机型版本信息为准,只汇总这一条记录
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v,COUNT(cv) monthnum FROM (SELECT phone,cv FROM (SELECT * FROM (SELECT cv,phone,username FROM `userinfo_test` GROUP BY cv,phone,username) c ORDER BY cv DESC) d GROUP BY username) e GROUP
BY phone,cv WITH ROLLUP;
+-------------------+-------------------+----------+
| p | v | monthnum |
+-------------------+-------------------+----------+
| A | 1001 | 1 |
| A | 1003 | 2 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1002 | 2 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1002 | 1 |
| C | 1003 | 3 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+----------+
10 rows in set (0.00 sec)
注意,上述sql 对cv 做了降序排列(考虑到通常都是升级了版本,即认为版本越大,代表了同一用户的最后的记录),group by 默认统计第一次出现的记录
如果不对cv做降序
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v,COUNT(cv) monthnum FROM (SELECT phone,cv,username FROM `userinfo_test` GROUP BY username) c GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+----------+
| p | v | monthnum |
+-------------------+-------------------+----------+
| A | 1001 | 1 |
| A | 1002 | 1 |
| A | 1003 | 1 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1001 | 1 |
| B | 1002 | 1 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1001 | 1 |
| C | 1002 | 1 |
| C | 1003 | 2 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+----------+
13 rows in set (0.00 sec)
可以看到,对于同一用户,统计到的是第一次出现的记录。
如果用DISTINCT,下面的sql并不能得到我们想要的数据。
此外对于重复数据比较多时,group by要比DISTINCT快很多
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v, COUNT(DISTINCT username)daynum FROM `userinfo_test` GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+--------+
| p | v | daynum |
+-------------------+-------------------+--------+
| A | 1001 | 1 |
| A | 1002 | 1 |
| A | 1003 | 2 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1001 | 1 |
| B | 1002 | 2 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1001 | 1 |
| C | 1002 | 1 |
| C | 1003 | 3 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+--------+
13 rows in set (0.01 sec)
改用如下sql可以获得想要的结果
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v, COUNT(cv)daynum FROM(SELECT * FROM (SELECT DISTINCT username,phone,cv FROM `userinfo_test` ORDER BY cv DESC) c GROUP BY username)d GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+--------+
| p | v | daynum |
+-------------------+-------------------+--------+
| A | 1001 | 1 |
| A | 1003 | 2 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1002 | 2 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1002 | 1 |
| C | 1003 | 3 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+--------+
10 rows in set (0.00 sec)
group by 支持排序,group by的排序是按最后一个字段排序的
SELECT phone,cv,COUNT(cv)num FROM `userinfo_test` GROUP BY phone,cv WITH ROLLUP;
+-------+------+-----+
| phone | cv | num |
+-------+------+-----+
| A | 1001 | 2 |
| A | 1002 | 1 |
| A | 1003 | 3 |
| A | NULL | 6 |
| B | 1001 | 1 |
| B | 1002 | 3 |
| B | 1003 | 2 |
| B | NULL | 6 |
| C | 1001 | 1 |
| C | 1002 | 2 |
| C | 1003 | 4 |
| C | NULL | 7 |
| NULL | NULL | 19 |
+-------+------+-----+
13 rows in set (0.00 sec)
SELECT phone,cv,COUNT(cv)num FROM `userinfo_test` GROUP BY phone,cv DESC WITH ROLLUP;
+-------+------+-----+
| phone | cv | num |
+-------+------+-----+
| A | 1003 | 3 |
| A | 1002 | 1 |
| A | 1001 | 2 |
| A | NULL | 6 |
| B | 1003 | 2 |
| B | 1002 | 3 |
| B | 1001 | 1 |
| B | NULL | 6 |
| C | 1003 | 4 |
| C | 1002 | 2 |
| C | 1001 | 1 |
| C | NULL | 7 |
| NULL | NULL | 19 |
+-------+------+-----+
13 rows in set (0.01 sec)
explain SELECT cv,phone,COUNT(phone)num FROM `userinfo_test` GROUP BY cv,phone DESC WITH ROLLUP;
+------+-------+-----+
| cv | phone | num |
+------+-------+-----+
| 1001 | C | 1 |
| 1001 | B | 1 |
| 1001 | A | 2 |
| 1001 | NULL | 4 |
| 1002 | C | 2 |
| 1002 | B | 3 |
| 1002 | A | 1 |
| 1002 | NULL | 6 |
| 1003 | C | 4 |
| 1003 | B | 2 |
| 1003 | A | 3 |
| 1003 | NULL | 9 |
| NULL | NULL | 19 |
+------+-------+-----+
13 rows in set (0.00 sec)
如果用到group by 的排序,前面月活跃用户数的统计语句
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v,COUNT(cv) monthnum FROM (SELECT phone,cv FROM (SELECT * FROM (SELECT cv,phone,username FROM `userinfo_test` GROUP BY cv,phone,username) c ORDER BY cv DESC) d GROUP BY username) e GROUP
BY phone,cv WITH ROLLUP;
可以修改为
SELECT IFNULL(phone,'WITH_ROLLUP_TOTAL') p, IFNULL(cv,'WITH_ROLLUP_TOTAL') v,COUNT(cv) monthnum FROM (SELECT phone,cv FROM (SELECT cv,phone,username FROM `userinfo_test` GROUP BY username,phone,cv DESC) c GROUP BY username) d GROUP BY phone,cv WITH ROLLUP;
+-------------------+-------------------+----------+
| p | v | monthnum |
+-------------------+-------------------+----------+
| A | 1001 | 1 |
| A | 1003 | 2 |
| A | WITH_ROLLUP_TOTAL | 3 |
| B | 1002 | 2 |
| B | 1003 | 1 |
| B | WITH_ROLLUP_TOTAL | 3 |
| C | 1002 | 1 |
| C | 1003 | 3 |
| C | WITH_ROLLUP_TOTAL | 4 |
| WITH_ROLLUP_TOTAL | WITH_ROLLUP_TOTAL | 10 |
+-------------------+-------------------+----------+
10 rows in set (0.01 sec)
后记:
1.由于天与天之间重复数据较多,因此group by 比DISTINCT 相对来说有优势
2.感觉sql还是写的比较复杂,欢迎指点,优化sql或者有更好的方法