Task06:秋招秘籍 C

教程地址

https://github.com/datawhalechina/wonderful-sql
https://gitee.com/datawhalechina/wonderful-sql

1. 行转列

建表

mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> CREATE TABLE IF NOT EXISTS exercise
    -> ( index_id INTEGER PRIMARY KEY,
    ->   cdate VARCHAR(10),
    ->   result VARCHAR(5)
    -> );
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO exercise VALUES 
    ->  (1, '2021-01-01', '胜'),
    ->  (2, '2021-01-01', '胜'),
    ->  (3, '2021-01-01', '负'),
    ->  (4, '2021-01-03', '胜'),
    ->  (5, '2021-01-03', '负'),
    ->  (6, '2021-01-03', '负')
    -> ;
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM exercise;
+----------+------------+--------+
| index_id | cdate      | result |
+----------+------------+--------+
|        1 | 2021-01-01 ||
|        2 | 2021-01-01 ||
|        3 | 2021-01-01 ||
|        4 | 2021-01-03 ||
|        5 | 2021-01-03 ||
|        6 | 2021-01-03 ||
+----------+------------+--------+
6 rows in set (0.00 sec)

查询

mysql> SELECT
    ->  cdate,
    ->  SUM(CASE WHEN result = '胜' THEN 1 ELSE 0 END) AS '胜',
    ->  SUM(CASE WHEN result = '负' THEN 1 ELSE 0 END) AS '负'
    -> FROM
    ->  exercise
    -> GROUP BY cdate
    -> ;
+------------+------+------+
| cdate      |||
+------------+------+------+
| 2021-01-01 |    2 |    1 |
| 2021-01-03 |    1 |    2 |
+------------+------+------+
2 rows in set (0.00 sec)

行转列的一般思路,先group,再用 case 设置条件值,最后 sum

2. 列转行

mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected (0.03 sec)

mysql> CREATE TABLE IF NOT EXISTS exercise
    -> ( index_id INTEGER PRIMARY KEY,
    ->   cdate VARCHAR(10),
    ->   win INTEGER,
    ->   lose INTEGER
    -> );
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO exercise VALUES 
    ->  (1, '2021-01-01', 2, 1),
    ->  (2, '2021-01-03', 1, 2)
    -> ;
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> 
mysql> SELECT * FROM exercise;
+----------+------------+------+------+
| index_id | cdate      | win  | lose |
+----------+------------+------+------+
|        1 | 2021-01-01 |    2 |    1 |
|        2 | 2021-01-03 |    1 |    2 |
+----------+------------+------+------+
2 rows in set (0.00 sec)

中文列名一直报错,我改英文了,下面是查询

mysql> SELECT cdate, '胜' AS result, win AS cnt FROM exercise WHERE win > 0
    -> UNION
    -> SELECT cdate, '负' AS result, lose AS cnt FROM exercise WHERE lose > 0
    -> ;
+------------+--------+------+
| cdate      | result | cnt  |
+------------+--------+------+
| 2021-01-01 ||    2 |
| 2021-01-03 ||    1 |
| 2021-01-01 ||    1 |
| 2021-01-03 ||    2 |
+------------+--------+------+
4 rows in set (0.01 sec)

暂时没啥思路了,下一步大概是用存储过程展开吧

但是那要遍历上面这个结果表的每一行吧,暂时没找到方法

但是如果只针对本题的话,我们可以发现计数最多只是 2

那么就展开两次,每次将对应的 cnt-1,然后 union

最后去掉 cnt 小于 0 的即可,但是这不通用

3. 连续登录

建表

mysql> DROP TABLE IF EXISTS t_act_records;
Query OK, 0 rows affected (0.03 sec)

mysql> 
mysql> CREATE TABLE t_act_records(
    ->  uid  VARCHAR(20),
    ->  imp_date DATE
    -> );
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO t_act_records VALUES
    -> ('u1001', 20210101),
    -> ('u1002', 20210101),
    -> ('u1003', 20210101),
    -> ('u1003', 20210102),
    -> ('u1004', 20210101),
    -> ('u1004', 20210102),
    -> ('u1004', 20210103),
    -> ('u1004', 20210104),
    -> ('u1004', 20210105);
Query OK, 9 rows affected (0.01 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> 
mysql> SELECT * FROM t_act_records;
+-------+------------+
| uid   | imp_date   |
+-------+------------+
| u1001 | 2021-01-01 |
| u1002 | 2021-01-01 |
| u1003 | 2021-01-01 |
| u1003 | 2021-01-02 |
| u1004 | 2021-01-01 |
| u1004 | 2021-01-02 |
| u1004 | 2021-01-03 |
| u1004 | 2021-01-04 |
| u1004 | 2021-01-05 |
+-------+------------+
9 rows in set (0.00 sec)

查询

mysql> SELECT
    ->  o2.uid,
    ->  MAX(cnt) AS mx_cnt
    -> FROM (
    ->  SELECT
    ->   ori.uid,
    ->   ROW_NUMBER() OVER ( PARTITION BY ori.uid ORDER BY imp_date) AS cnt
    ->  FROM (
    ->   SELECT 
    ->    uid,
    ->    imp_date,
    ->    EXTRACT(YEAR   FROM imp_date) AS year,
    ->    EXTRACT(MONTH  FROM imp_date) AS month
    ->   FROM t_act_records
    ->  ) AS ori
    ->  WHERE
    ->   ori.year = 2021
    ->    AND
    ->   ori.month = 1
    -> ) AS o2
    -> GROUP BY o2.uid
    -> ;
+-------+--------+
| uid   | mx_cnt |
+-------+--------+
| u1001 |      1 |
| u1002 |      1 |
| u1003 |      2 |
| u1004 |      5 |
+-------+--------+
4 rows in set (0.01 sec)

就也挺神奇的,我们得按月分组,也得按用户分组去统计登录次数

想了想,把月份的筛选放进 where 吧,和年份一起,每次只判断一个月

然后我们当然是按照用户的 uid 进行分组,然后按照记录的日期编号

原本想直接按照用户的 uid 进行 group,再对编号求个最大值,后来发现不行

那只好再套一个查询,用 group 和 max

4. 用户购买商品推荐

建表

mysql> DROP TABLE IF EXISTS exercise;
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> 
mysql> CREATE TABLE IF NOT EXISTS exercise
    -> ( index_id INTEGER PRIMARY KEY,
    ->   user_id INTEGER,
    ->   product_id INTEGER
    -> );
Query OK, 0 rows affected (0.07 sec)

mysql> 
mysql> INSERT INTO exercise VALUES 
    ->  (1, 123, 1),
    ->  (2, 123, 2),
    ->  (3, 123, 3),
    ->  (4, 456, 1),
    ->  (5, 456, 2),
    ->  (6, 456, 4)
    -> ;
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> 
mysql> SELECT * FROM exercise;
+----------+---------+------------+
| index_id | user_id | product_id |
+----------+---------+------------+
|        1 |     123 |          1 |
|        2 |     123 |          2 |
|        3 |     123 |          3 |
|        4 |     456 |          1 |
|        5 |     456 |          2 |
|        6 |     456 |          4 |
+----------+---------+------------+
6 rows in set (0.00 sec)


不搞通用解了,对这个题目起效就行

mysql> SELECT
    ->   ori.user_id,
    ->   SUM(flag) AS sim_cnt
    -> FROM(
    ->   SELECT
    ->    *,
    ->    (CASE WHEN user_id = 123
    ->          -- 对于用户123的历史
    ->          THEN CASE WHEN product_id
    ->                      IN (SELECT product_id FROM exercise WHERE user_id = 456)
    ->                    THEN 1
    ->                    -- 如果出现在456的历史就是1
    ->                    ELSE 0
    ->                    -- 没有的话就是0
    ->                    END
    ->          -- 对于用户456的历史
    ->          ELSE CASE WHEN product_id
    ->                      IN (SELECT product_id FROM exercise WHERE user_id = 123)
    ->                    THEN 1
    ->                    -- 如果出现在123的历史就是1
    ->                    ELSE 0
    ->                    -- 没有的话就是0
    ->                    END
    ->          END
    ->    )AS flag
    ->   FROM exercise
    -> ) AS ori
    -> GROUP BY ori.user_id
    -> ;
+---------+---------+
| user_id | sim_cnt |
+---------+---------+
|     123 |       2 |
|     456 |       2 |
+---------+---------+
2 rows in set (0.00 sec)

啧,花里胡哨的

总之就是,叠两层的 case,第一个判断用户是 123 还是 456

第二个 case 就是判断这个记录,在不在对方的列表当中

然后外面套一层 sum,对出现过的记录求和

两个都是 2,则互为相似用户

那接下来就简单了,复用上面的结构

SELECT
 (CASE WHEN ori.user_id = 123 THEN 456 ELSE 123 END) AS rec_for,
 ori.product_id
FROM (
  SELECT
   *,
   (CASE WHEN user_id = 123
         -- 对于用户123的历史
         THEN CASE WHEN product_id
                     IN (SELECT product_id FROM exercise WHERE user_id = 456)
                   THEN 1
                   -- 如果出现在456的历史就是1
                   ELSE 0
                   -- 没有的话就是0
                   END
         -- 对于用户456的历史
         ELSE CASE WHEN product_id
                     IN (SELECT product_id FROM exercise WHERE user_id = 123)
                   THEN 1
                   -- 如果出现在123的历史就是1
                   ELSE 0
                   -- 没有的话就是0
                   END
         END
   )AS flag
  FROM exercise
) AS ori
WHERE ori.flag = 0
ORDER BY rec_for
;

对于没有出现在用户 123 历史当中的物品,我们将其推荐给 123

对于 456 亦然,其条件就是,123 的历史推给 456,反之亦然

所以当 user_id 为 123,我们的 rec_for 就是 456

5. hive 数据倾斜的产生原因及优化策略?

数据倾斜,那就是分布不均匀啊

一些产生原因如下:

1)key分布不均匀

2)业务数据本身的特性

3)建表考虑不周全

4)某些HQL语句本身就存在数据倾斜

容易引起数据倾斜的操作如下:

关键词情形后果
join其中一个表较小,但是key集中分发到某一个或几个Reduce上的数据远高于平均值
大表与大表,但是分桶的判断字段0值或空值过多这些空值都由一个reduce处理,非常慢
group by维度过小,某值的数量过多处理某值的reduce非常耗时
count distinct某特殊值过多处理次特殊值的reduce耗时

以上源自博客:https://www.cnblogs.com/suixingc/p/hive-de-shu-ju-qing-xie.html

6. LEFT JOIN 是否可能会出现多出的行?为什么?

根据图中示例来看,显然是会的

如果连接时的关键词,即 on 的 key,在左右表都有 null

那他们就会被视为一个 key,所以结果是笛卡尔积

左表的每一个 null 均与右表的每个 null 配对

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值