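NULLs in DISTINCT columns
1. When a column used with DISTINCT contains NULL, counts built on it can be inaccurate, for two reasons: (1) all NULL values are collapsed into a single entry; (2) COUNT does not count that NULL entry.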
SELECT DISTINCT id, value FROM table;
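The two effects described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption made only to keep the example runnable; the table name t and its sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 10), (None, 10), (None, 10), (2, None)],
)

# DISTINCT treats NULLs as duplicates of each other,
# so the two (NULL, 10) rows collapse into one.
distinct_rows = conn.execute("SELECT DISTINCT id, value FROM t").fetchall()

# COUNT(DISTINCT col) skips NULL entirely, while COUNT(*) counts every row.
distinct_ids, total_rows = conn.execute(
    "SELECT COUNT(DISTINCT id), COUNT(*) FROM t"
).fetchone()

print(len(distinct_rows), distinct_ids, total_rows)
```

Here the DISTINCT query returns three rows (one of them holding the NULL id), yet COUNT(DISTINCT id) reports only two distinct ids.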
2. The COALESCE function can be used to work around this
SELECT DISTINCT coalesce(id, 0), coalesce(value, 0) FROM table;
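A minimal sketch of the COALESCE workaround, again assuming sqlite3 as a stand-in engine and a hypothetical table t. The sentinel -1 is an assumption: the replacement value has to be one that never occurs in the real data, otherwise NULL rows are silently merged with genuine ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (None,)])

# Without COALESCE, the NULL group is ignored by COUNT(DISTINCT ...).
(without_coalesce,) = conn.execute("SELECT COUNT(DISTINCT id) FROM t").fetchone()

# -1 is a hypothetical sentinel that does not occur in the data,
# so the NULL group is now counted as one extra distinct value.
(with_coalesce,) = conn.execute(
    "SELECT COUNT(DISTINCT COALESCE(id, -1)) FROM t"
).fetchone()

# A sentinel that already exists in the column (here 1) merges NULL
# into that real value and the count stays too low.
(colliding,) = conn.execute(
    "SELECT COUNT(DISTINCT COALESCE(id, 1)) FROM t"
).fetchone()

print(without_coalesce, with_coalesce, colliding)
```

The same caveat applies to a sentinel of 0: it is only safe if 0 is not a legitimate value of the column.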
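3. DISTINCT versus GROUP BY
On engines that execute with reducers (for example Hive on MapReduce), a DISTINCT aggregation, typically COUNT(DISTINCT col), can shuffle all of the data to a single reducer. That reducer becomes severely skewed, and the query runs slowly on large data sets. GROUP BY spreads the work across reducers and is therefore usually the more efficient choice.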
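The rewrite from DISTINCT to GROUP BY can be sketched as follows. sqlite3 is used only to check that both forms return the same result; it does not reproduce the reducer behaviour, which depends on the engine. The table t and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (None, 30), (None, 30)],
)

# De-duplication: DISTINCT and GROUP BY on the same columns are equivalent.
distinct_rows = conn.execute("SELECT DISTINCT id, value FROM t").fetchall()
grouped_rows = conn.execute("SELECT id, value FROM t GROUP BY id, value").fetchall()

# Counting: group first, then count the groups in an outer query.
# COUNT(id) rather than COUNT(*) keeps the NULL group out of the total,
# matching what COUNT(DISTINCT id) returns.
(count_distinct,) = conn.execute("SELECT COUNT(DISTINCT id) FROM t").fetchone()
(count_grouped,) = conn.execute(
    "SELECT COUNT(id) FROM (SELECT id FROM t GROUP BY id) AS g"
).fetchone()

print(sorted(distinct_rows, key=repr) == sorted(grouped_rows, key=repr))
print(count_distinct, count_grouped)
```

Using COUNT(*) in the outer query instead would count the NULL group as well, which is another way to include NULL without a COALESCE sentinel.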
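NULLs in JOIN ON columns
If a column in the join condition is NULL on either side, the equality comparison does not evaluate to true, so the row finds no match. In an outer join, the columns from the other table are then all returned as NULL.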
MySQL [dbs]> select * from test1;
+----+--------+---------+
| id | stu_id | stu_age |
+----+--------+---------+
|  1 |      1 |      25 |
|  2 |      1 |    NULL |
+----+--------+---------+
2 rows in set (0.00 sec)
MySQL [dbs]> select * from test2;
+----+--------+---------+
| id | stu_id | stu_age |
+----+--------+---------+
|  1 |      1 |      25 |
|  2 |      1 |    NULL |
+----+--------+---------+
2 rows in set (0.00 sec)
MySQL [dbs]> select * from test1
-> left join test2
-> on test1.stu_id = test2.stu_id
-> and test1.stu_age = test2.stu_age;
+----+--------+---------+------+--------+---------+
| id | stu_id | stu_age | id   | stu_id | stu_age |
+----+--------+---------+------+--------+---------+
|  1 |      1 |      25 |    1 |      1 |      25 |
|  2 |      1 |    NULL | NULL |   NULL |    NULL |
+----+--------+---------+------+--------+---------+
2 rows in set (0.00 sec)
MySQL [dbs]> select * from test1
-> left join test2
-> on test1.stu_id = test2.stu_id
-> and test1.stu_age = test2.stu_age
-> union all
-> select * from test1
-> right join test2
-> on test1.stu_id = test2.stu_id
-> and test1.stu_age = test2.stu_age;
+------+--------+---------+------+--------+---------+
| id   | stu_id | stu_age | id   | stu_id | stu_age |
+------+--------+---------+------+--------+---------+
|    1 |      1 |      25 |    1 |      1 |      25 |
|    2 |      1 |    NULL | NULL |   NULL |    NULL |
|    1 |      1 |      25 |    1 |      1 |      25 |
| NULL |   NULL |    NULL |    2 |      1 |    NULL |
+------+--------+---------+------+--------+---------+
+------+--------+---------+------+--------+---------+
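The left join above can be reproduced outside MySQL with a short sketch, together with one possible way to make NULL ages match each other. It assumes sqlite3 as a stand-in engine, where the null-safe comparison is written with the IS operator; in MySQL the corresponding operator is <=>. Whether NULL should match NULL at all depends on what the data means, so treat the second query as an option rather than a fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = [(1, 1, 25), (2, 1, None)]  # same data as test1 and test2 above
for name in ("test1", "test2"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER, stu_id INTEGER, stu_age INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

# NULL = NULL evaluates to NULL (unknown), not true.
(null_eq,) = conn.execute("SELECT NULL = NULL").fetchone()

# Plain equality: the row with a NULL stu_age finds no partner.
plain = conn.execute(
    """
    SELECT * FROM test1
    LEFT JOIN test2
      ON test1.stu_id = test2.stu_id
     AND test1.stu_age = test2.stu_age
    ORDER BY test1.id
    """
).fetchall()

# Null-safe comparison (SQLite: IS, MySQL: <=>): NULL now matches NULL.
null_safe = conn.execute(
    """
    SELECT * FROM test1
    LEFT JOIN test2
      ON test1.stu_id = test2.stu_id
     AND test1.stu_age IS test2.stu_age
    ORDER BY test1.id
    """
).fetchall()

print(null_eq)
print(plain)
print(null_safe)
```

With the null-safe condition, the second row of test1 is paired with the second row of test2 instead of being padded with NULLs.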