1. Hive DQL
语法:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
1.1 WHERE
过滤筛选。分区表必须加分区限制
1.2 ALL和DISTINCT
ALL和DISTINCT选项指定了是否返回重复值,默认是ALL,指定DISTINCT时会剔除重复的结果。注意,从Hive1.1.0开始,Hive支持SELECT DISTINCT *
。ALL and DISTINCT can also be used in a UNION clause – see Union Syntax for more information.
hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2
1.3 基于分区的查询
一个Select如果没有指定分区,执行时会扫描全表。如果加了分区条件,则会只扫描对应分区的数据。对于存在大量分区数据的表,加不加分区条件的差别是很大的。若不增加分区,可能会引起资源被全占用,引起业务延迟。
Example:
# 普通查询限制分区
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
# 关联时限制分区
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
1.4 HAVING
- where和having的区别:
- 使用场景:where可以用于select、update、delete和insert into values(select * from table where …)语句中。having只能用于select语句中。
- 执行顺序:where的搜索条件是在执行语句进行分组之前应用;having的搜索条件是在分组条件后执行的。即如果where和having一起用时,where会先执行,having后执行。
- 子句有区别:where子句中的条件表达式having都可以跟,而having子句中的有些表达式where不可以跟;having子句可以用集合函数(sum、count、avg、max和min),而where子句不可以。
- 总结:
1.WHERE 子句用来筛选 FROM 子句中指定的操作所产生的行。
2.GROUP BY 子句用来分组 WHERE 子句的输出。
3.HAVING 子句用来从分组的结果中筛选行
以下两个等价
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10;
1.5 聚合函数
聚合: 多进一出,select中出现的字段,如果没有出现在group by中,必须出现在聚合函数中
max
min
avg
count
sum
3.6 问题
- Hive SQL ,什么情况跑MR,什么情况不跑MR?
了解一个参数:hive.fetch.task.conversion
,默认值是more
.
以下面语句为例:
SQL1 : select * from live_info;
SQL2 : select * from live_info where traffic in ("1", "3");
SQL3 : select count(1) from live_info;
SQL4 : select user, count(1) as user_count from live_info group by user;
SQL5 : select user, count(1) as user_count from live_info group by user order by user desc;
值 | 官网解释 | 说明 |
---|---|---|
none | Disable hive.fetch.task.conversion | select语句中只有desc不走MR |
minimal | SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only | 仅SQL1 不走MR |
more | SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns) | 仅SQL1、SQL2不走MR |
- 区别:Sort By / Cluster By / Distribute By / Order By.