select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
select可以是union的一部分,也可以是其他查询的子查询。
table_reference可以是表名、视图、join结构、子查询。
最简单查询例子
SELECT * FROM t1
带where部分的例子
SELECT * FROM sales WHERE amount > 10 AND region = "US"
ALL 、DISTINCT,默认是ALL,显示重复数据,使用DISTINCT不显示重复数据。
例子
hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2
分区查询
通常情况下,select查询是扫描全表,如果表在创建的时候使用了分区,查询的时候可以基于分区查询,这样就只扫描指定分区那部分。
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
如果是有join其他table在on里面指定分区信息
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
GROUP BY | HAVING Clause
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10;
hive.map.aggr参数默认是false,如果设置成ture,hive会在map task中直接进行第一级聚合( first-level aggregation)。这样效率更高,但是需要更多的存储空间。
LIMIT Clause
#偏移量为2,取5个。即是从 3rd到7th
SELECT * FROM customers ORDER BY create_date LIMIT 5;
SELECT * FROM customers ORDER BY create_date LIMIT 2,5;
REGEX Column Specification
可以用正则表达式来查询列。
需要配置 hive.support.quoted.identifiers。
SELECT `(ds|hr)?+.+` FROM sales
ORDER BY 全局排序,只有一个Reduce任务
SORT BY 只在本机做排序