目录
排序类
1、有1亿个用户,存储在表users中,包含用户uid、用户年纪age、用户消费总金额total,其中以uid唯一标识1个用户,按照用户年龄从大到小排序,如果年龄相同则以消费总金额从小到大排序。
这是1个全排序问题,首先预估总内存消耗大小,1亿[用户数]*(8B[uid]+4B[age]+8B[total])约等于2G,在现有计算条件可以满足全部放入内存的需求,因此可以不必过多考虑优化问题。
-- 全局排序
SELECT *
FROM Users
ORDER BY age DESC,total ASC;
-- 局部(分桶)排序
SELECT *
FROM Users
DISTRIBUTE BY age
SORT BY age DESC,total ASC
分组类
1、有10万个店铺,每个顾客访问任意一个店铺时都会生成1条访问日志,表名未Visist,其中用户id字段未uid,访问的店铺字段未store,试统计每个店铺的uv。
SELECT store,COUNT(DISTINCT uid) uv
FROM Visit
GROUP BY store;
2、有1张表示人生阶段的表Lifestage,包含2个字段:用户唯一标识uid、人生阶段组合字段stage,其中stage由","分隔的字符串组成,如“计划买车,已经买房”,试统计每一个细分人生阶段的用户人数。
-- 列转行
SELECT stage_detail,COUNT(DISTINCT uid)
FROM Lifestage
LATERAL VIEW EXPLODE(SPLIT(stage,',')) Lifestage_tmp AS stage_detail
GROUP BY stage_detail
3、有1张表示人生阶段的表Lifestage,包含2个字段:用户唯一标识uid、人生阶段字段stage,每行存储一个用户的人生阶段数据,如一个用户43有2条记录:43,计划买车; 43,已经买房,试将同一个用户的所有人生阶段字段整合成一个用“,”分隔的组合字段,如“计划买车,已经买房”。
-- 行转列
SELECT uid,
CONCAT_WS(',', COLLECT_LIST(stage)) -- 如果一个用户stage 会有重复的话,则使用COLLECT_SET(stage)
FROM Lifestage
GROUP BY uid
4、1张学生成绩表course_t,包含学生sid、课程号course、成绩score几个字段,试得到语文成绩大于数学成绩的学生成绩数据。如
sid | course | score |
1 | yuwen | 43 |
1 | shuxue | 55 |
2 | yuwen | 77 |
2 | shuxue | 88 |
3 | yuwen | 98 |
3 | shuxue | 65 |
SELECT
*
FROM
(
SELECT sid,
MAX(CASE WHEN course='yuwen' THEN score
ELSE NULL
END) AS yuwen_score,
MAX(CASE WHEN course='shuxue' THEN score
ELSE NULL
END) AS shuxue_score
FROM mart_fsp_security_safetmp.course_t
GROUP BY sid
) course_tmp_t
WHERE yuwen_score>shuxue_score
;
-- 构造的数据
CREATE TABLE mart_fsp_security_safetmp.course_t AS
SELECT 1 AS id,1 AS sid,'yuwen' AS course,43 AS score
UNION ALL SELECT 2 AS id,1 AS sid,'shuxue' AS course,55 AS score
UNION ALL SELECT 3 AS id,2 AS sid,'yuwen' AS course,77 AS score
UNION ALL SELECT 4 AS id,2 AS sid,'shuxue' AS course,88 AS score
UNION ALL SELECT 5 AS id,3 AS sid,'yuwen' AS course,98 AS score
UNION ALL SELECT 6 AS id,3 AS sid,'shuxue' AS course,65 AS score
JOIN类
1、将下面的Address表,转成如后面所示的表
id | name | parent_id |