Task06:秋招秘籍 C
教程地址
https://github.com/datawhalechina/wonderful-sql
https://gitee.com/datawhalechina/wonderful-sql
1. 行转列
建表
mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> CREATE TABLE IF NOT EXISTS exercise
-> ( index_id INTEGER PRIMARY KEY,
-> cdate VARCHAR(10),
-> result VARCHAR(5)
-> );
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO exercise VALUES
-> (1, '2021-01-01', '胜'),
-> (2, '2021-01-01', '胜'),
-> (3, '2021-01-01', '负'),
-> (4, '2021-01-03', '胜'),
-> (5, '2021-01-03', '负'),
-> (6, '2021-01-03', '负')
-> ;
Query OK, 6 rows affected (0.01 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM exercise;
+----------+------------+--------+
| index_id | cdate | result |
+----------+------------+--------+
| 1 | 2021-01-01 | 胜 |
| 2 | 2021-01-01 | 胜 |
| 3 | 2021-01-01 | 负 |
| 4 | 2021-01-03 | 胜 |
| 5 | 2021-01-03 | 负 |
| 6 | 2021-01-03 | 负 |
+----------+------------+--------+
6 rows in set (0.00 sec)
查询
mysql> SELECT
-> cdate,
-> SUM(CASE WHEN result = '胜' THEN 1 ELSE 0 END) AS '胜',
-> SUM(CASE WHEN result = '负' THEN 1 ELSE 0 END) AS '负'
-> FROM
-> exercise
-> GROUP BY cdate
-> ;
+------------+------+------+
| cdate | 胜 | 负 |
+------------+------+------+
| 2021-01-01 | 2 | 1 |
| 2021-01-03 | 1 | 2 |
+------------+------+------+
2 rows in set (0.00 sec)
行转列的一般思路,先group,再用 case 设置条件值,最后 sum
2. 列转行
mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected (0.03 sec)
mysql> CREATE TABLE IF NOT EXISTS exercise
-> ( index_id INTEGER PRIMARY KEY,
-> cdate VARCHAR(10),
-> win INTEGER,
-> lose INTEGER
-> );
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO exercise VALUES
-> (1, '2021-01-01', 2, 1),
-> (2, '2021-01-03', 1, 2)
-> ;
Query OK, 2 rows affected (0.02 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql>
mysql> SELECT * FROM exercise;
+----------+------------+------+------+
| index_id | cdate | win | lose |
+----------+------------+------+------+
| 1 | 2021-01-01 | 2 | 1 |
| 2 | 2021-01-03 | 1 | 2 |
+----------+------------+------+------+
2 rows in set (0.00 sec)
中文列名一直报错,我改英文了,下面是查询
mysql> SELECT cdate, '胜' AS result, win AS cnt FROM exercise WHERE win > 0
-> UNION
-> SELECT cdate, '负' AS result, lose AS cnt FROM exercise WHERE lose > 0
-> ;
+------------+--------+------+
| cdate | result | cnt |
+------------+--------+------+
| 2021-01-01 | 胜 | 2 |
| 2021-01-03 | 胜 | 1 |
| 2021-01-01 | 负 | 1 |
| 2021-01-03 | 负 | 2 |
+------------+--------+------+
4 rows in set (0.01 sec)
暂时没啥思路了,下一步大概是用存储过程展开吧
但是那要遍历上面这个结果表的每一行吧,暂时没找到方法
但是如果只针对本题的话,我们可以发现计数最多只是 2
那么就展开两次,每次将对应的 cnt-1,然后 union
最后去掉 cnt 小于 0 的即可,但是这不通用
3. 连续登录
建表
mysql> DROP TABLE IF EXISTS t_act_records;
Query OK, 0 rows affected (0.03 sec)
mysql>
mysql> CREATE TABLE t_act_records(
-> uid VARCHAR(20),
-> imp_date DATE
-> );
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO t_act_records VALUES
-> ('u1001', 20210101),
-> ('u1002', 20210101),
-> ('u1003', 20210101),
-> ('u1003', 20210102),
-> ('u1004', 20210101),
-> ('u1004', 20210102),
-> ('u1004', 20210103),
-> ('u1004', 20210104),
-> ('u1004', 20210105);
Query OK, 9 rows affected (0.01 sec)
Records: 9 Duplicates: 0 Warnings: 0
mysql>
mysql> SELECT * FROM t_act_records;
+-------+------------+
| uid | imp_date |
+-------+------------+
| u1001 | 2021-01-01 |
| u1002 | 2021-01-01 |
| u1003 | 2021-01-01 |
| u1003 | 2021-01-02 |
| u1004 | 2021-01-01 |
| u1004 | 2021-01-02 |
| u1004 | 2021-01-03 |
| u1004 | 2021-01-04 |
| u1004 | 2021-01-05 |
+-------+------------+
9 rows in set (0.00 sec)
查询
mysql> SELECT
-> o2.uid,
-> MAX(cnt) AS mx_cnt
-> FROM (
-> SELECT
-> ori.uid,
-> ROW_NUMBER() OVER ( PARTITION BY ori.uid ORDER BY imp_date) AS cnt
-> FROM (
-> SELECT
-> uid,
-> imp_date,
-> EXTRACT(YEAR FROM imp_date) AS year,
-> EXTRACT(MONTH FROM imp_date) AS month
-> FROM t_act_records
-> ) AS ori
-> WHERE
-> ori.year = 2021
-> AND
-> ori.month = 1
-> ) AS o2
-> GROUP BY o2.uid
-> ;
+-------+--------+
| uid | mx_cnt |
+-------+--------+
| u1001 | 1 |
| u1002 | 1 |
| u1003 | 2 |
| u1004 | 5 |
+-------+--------+
4 rows in set (0.01 sec)
就也挺神奇的,我们得按月分组,也得按用户分组去统计登录次数
想了想,把月份的筛选放进 where 吧,和年份一起,每次只判断一个月
然后我们当然是按照用户的 uid 进行分组,然后按照记录的日期编号
原本想直接按照用户的 uid 进行 group,再对编号求个最大值,后来发现不行
那只好再套一个查询,用 group 和 max
4. 用户购买商品推荐
建表
mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql>
mysql> CREATE TABLE IF NOT EXISTS exercise
-> ( index_id INTEGER PRIMARY KEY,
-> user_id INTEGER,
-> product_id INTEGER
-> );
Query OK, 0 rows affected (0.07 sec)
mysql>
mysql> INSERT INTO exercise VALUES
-> (1, 123, 1),
-> (2, 123, 2),
-> (3, 123, 3),
-> (4, 456, 1),
-> (5, 456, 2),
-> (6, 456, 4)
-> ;
Query OK, 6 rows affected (0.01 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql>
mysql> SELECT * FROM exercise;
+----------+---------+------------+
| index_id | user_id | product_id |
+----------+---------+------------+
| 1 | 123 | 1 |
| 2 | 123 | 2 |
| 3 | 123 | 3 |
| 4 | 456 | 1 |
| 5 | 456 | 2 |
| 6 | 456 | 4 |
+----------+---------+------------+
6 rows in set (0.00 sec)
不搞通用解了,对这个题目起效就行
mysql> SELECT
-> ori.user_id,
-> SUM(flag) AS sim_cnt
-> FROM(
-> SELECT
-> *,
-> (CASE WHEN user_id = 123
-> -- 对于用户123的历史
-> THEN CASE WHEN product_id
-> IN (SELECT product_id FROM exercise WHERE user_id = 456)
-> THEN 1
-> -- 如果出现在456的历史就是1
-> ELSE 0
-> -- 没有的话就是0
-> END
-> -- 对于用户456的历史
-> ELSE CASE WHEN product_id
-> IN (SELECT product_id FROM exercise WHERE user_id = 123)
-> THEN 1
-> -- 如果出现在123的历史就是1
-> ELSE 0
-> -- 没有的话就是0
-> END
-> END
-> )AS flag
-> FROM exercise
-> ) AS ori
-> GROUP BY ori.user_id
-> ;
+---------+---------+
| user_id | sim_cnt |
+---------+---------+
| 123 | 2 |
| 456 | 2 |
+---------+---------+
2 rows in set (0.00 sec)
啧,花里胡哨的
总之就是,叠两层的 case,第一个判断用户是 123 还是 456
第二个 case 就是判断这个记录,在不在对方的列表当中
然后外面套一层 sum,对出现过的记录求和
两个都是 2,则互为相似用户
那接下来就简单了,复用上面的结构
SELECT
(CASE WHEN ori.user_id = 123 THEN 456 ELSE 123 END) AS rec_for,
ori.product_id
FROM (
SELECT
*,
(CASE WHEN user_id = 123
-- 对于用户123的历史
THEN CASE WHEN product_id
IN (SELECT product_id FROM exercise WHERE user_id = 456)
THEN 1
-- 如果出现在456的历史就是1
ELSE 0
-- 没有的话就是0
END
-- 对于用户456的历史
ELSE CASE WHEN product_id
IN (SELECT product_id FROM exercise WHERE user_id = 123)
THEN 1
-- 如果出现在123的历史就是1
ELSE 0
-- 没有的话就是0
END
END
)AS flag
FROM exercise
) AS ori
WHERE ori.flag = 0
ORDER BY rec_for
;
对于没有出现在用户 123 历史当中的物品,我们将其推荐给 123
对于 456 亦然,其条件就是,123 的历史推给 456,反之亦然
所以当 user_id 为 123,我们的 rec_for 就是 456
5. hive 数据倾斜的产生原因及优化策略?
数据倾斜,那就是分布不均匀啊
一些产生原因如下:
1)key分布不均匀
2)业务数据本身的特性
3)建表考虑不周全
4)某些HQL语句本身就存在数据倾斜
容易引起数据倾斜的操作如下:
关键词 情形 后果 join 其中一个表较小,但是key集中 分发到某一个或几个Reduce上的数据远高于平均值 大表与大表,但是分桶的判断字段0值或空值过多 这些空值都由一个reduce处理,非常慢 group by 维度过小,某值的数量过多 处理某值的reduce非常耗时 count distinct 某特殊值过多 处理次特殊值的reduce耗时
以上源自博客:https://www.cnblogs.com/suixingc/p/hive-de-shu-ju-qing-xie.html
6. LEFT JOIN 是否可能会出现多出的行?为什么?
根据图中示例来看,显然是会的
如果连接时的关键词,即 on 的 key,在左右表都有 null
那他们就会被视为一个 key,所以结果是笛卡尔积
左表的每一个 null 均与右表的每个 null 配对