一、Group by
1.1 select 后面的字段必须在groupby后面出现,除了一些聚合函数
select
uname,
floor(age/10) flag
from program
group by uname;
会报错:
FAILED: SemanticException [Error 10025]: Line 2:6 Expression not in GROUP BY key '10'
原因:group by 后并没有age
解决:
select
uname,
floor(age/10) flag
from program
group by uname,age;
聚合函数则不用必须出现在groupby之后,如:
select
uname,
sum(age)
from program
group by uname;
1.2开窗和group by 的字段
- 如果有group,那么row_number中的order by项必须是group内的字段,
- sum()over(partition by store) 只有partition 的话,好像求得sum是每个店铺的销售总额,窗口范围并不会到当前行截止,min函数也是这样的,其他的应该也是。
1.3groupby、orderby 和别名
先groupBy,后select,后orderby ,orderby 的字段名,还得是select 中的字段的名字。
select
name ,substr(dt,0,4),max(times)
from access
group by name,substr(dt,0,4) ;
可以
select
name ,substr(dt,0,4) as y,max(times)
from access
group by name,substr(dt,0,4) ;
可以
select
name ,substr(dt,0,4) as y,max(times)
from access
group by name,y ;
不可以
FAILED: SemanticException [Error 10004]:
Line 4:14 Invalid table alias or column reference 'y':
(possible column names are: name, dt, times)
select
name ,substr(dt,0,4) as y,max(times)
from access
group by name,substr(dt,0,4)
order by substr(dt,0,4) desc;
不可以
FAILED: SemanticException [Error 10004]: Line 5:16 Invalid
table alias or column reference 'dt': (possible column names
are: name, y, _c2)
select
name ,substr(dt,0,4) as y,max(times)
from access
group by name,substr(dt,0,4)
order by y desc;
可以
ps:mysql是不一样的,select 后面的别名可以出现在groupby后面
sparksql也是可以的;
二、一些常用的功能、函数
2.1 between范围
int型包头包尾,
对于parquet格式、以及时间好像有特殊情况,以后再研究
2.2 coalesce(x,y,z.....)
可以返回第一个不为null的值。,好像可以是不同类型的。
2.3 nvl(x,y)
NVL函数是一个空值转换函数NVL(表达式1,表达式2)如果表达式1为空值,NVL返回值为表达式2的值,否则返回表达式1的值。
三、开窗函数里使用sum函数
select
uname,programid,
row_number() over(partition by uname order by sum(playtime) desc) as rn
from program
group by uname,programid;
//这样报错
//在orderby后面加sum函数报错,因为select后面没有加上sum(playtime)
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:0 Expression not in GROUP BY key 'playtime'
四、having
select a.sid,a.sname,avg(b.score) as `平均成绩`
from student2 a
join sc2 b on a.sid=b.sid
group by a.sid,a.sname
having `平均成绩`>=60;
select 后面的别名,可以放在having后面
select a.sid,a.sname,round(avg(b.score),2) as `平均成绩`
from student2 a
join sc2 b on a.sid=b.sid
group by a.sid,a.sname
having `平均成绩`>=60;
//平均成绩如果不加飘号,则编译不通过
select a.sid,a.sname,avg(b.score) as av
from student2 a
join sc2 b on a.sid=b.sid
group by a.sid,a.sname
having av>=60;
//编译通过
select a.sid,a.sname,round(avg(b.score),2) as av
from student2 a
join sc2 b on a.sid=b.sid
group by a.sid,a.sname
having av>=60;
//编译不通过,为什么???
ps:执行顺序
(8)SELECT(9)DISTINCT <select_list>
(1)FROM <left_table>
(3)<join_type> JOIN <right_table>
(2)ON <join_condition>
(4)WHERE <where_condition>
(5)GROUP BY <grout_by_list>
(6)WITH {CUTE|ROLLUP}
(7)HAVING <having_condition>
(10)ORDER BY <order_by_list>
(11)LIMIT <limit_number>
五、with as
with q1 as (select * from src where key= '5'),
q2 as (select * from src s2 where key = '4')
select * from q1 union all select * from q2;
with as 可以把数据缓存在内存中,避免多次计算重复的数据,with as就类似于一个视图或临时表,可以用来存储一部分的sql语句作为别名,不同的是with as 属于一次性的,而且必须要和其他sql一起使用才可以
六、一些时间函数
hive (default)> select to_date("2018-01-02 09:23:23");
OK
_c0
2018-01-02
select date_format("2018-01-02 09:23:23","yyyy-MM-dd");
OK
_c0
2018-01-02
hive (myhive2)> select months_between('2019-3-2','2019-2-1');
OK
_c0
1.03225806
Time taken: 0.228 seconds, Fetched: 1 row(s)
七、建表语句
1、load
LOAD DATA [ LOCAL ] INPATH '/path/to/local/files'
[ OVERWRITE ] INTO TABLE test
PARTITION (country='CHINA')
有LOCAL表示从本地文件系统加载(文件会被拷贝到HDFS中)
无LOCAL表示从HDFS中加载数据(注意:文件直接被移动!!!而不是拷贝!!! 并且。。文件名都不带改的。。)
OVERWRITE 表示是否覆盖表中数据(或指定分区的数据)(没有OVERWRITE 会直接APPEND,而不会滤重!)