常用hivesql记录

最新推荐文章于 2022-08-09 10:32:44 发布

zjliang2053

最新推荐文章于 2022-08-09 10:32:44 发布

阅读量975

点赞数

分类专栏：大数据文章标签： hive

本文链接：https://blog.csdn.net/weixin_42438913/article/details/117221374

版权

大数据专栏收录该内容

5 篇文章 0 订阅

订阅专栏

1、时间函数

1.1) current_date取得当前日期：

select current_date();
2021-05-24

1.2) current_timestamp取得当前日期和时间：

select current_timestamp();  
2021-05-24 14:03:54.266

1.3) unix_timestamp取得当前时间戳：

select unix_timestamp();
1621836402

1.4）from_unixtime 时间戳转日期

select from_unixtime(1617725479,'yyyy-MM-dd HH:mm:ss');
2021-04-06 16:06:19

1.4）to_unix_timestamp（unix_timestamp）日期转时间戳

select to_unix_timestamp('2021-01-01 12:12:12','yyyy-MM-dd HH:mm:ss');
1610452812

2、时间加减函数

2.1）date_add/date_sub 指定日期加减指定天数(当然可以直接传负数)

select date_add('2020-03-22',1);
2020-03-23
select date_sub('2020-03-22',1);
2020-03-21

2.2）datediff 取得两个日期之间差值（差值为天数

select datediff(from_unixtime(unix_timestamp(current_timestamp()),'yyyy-MM-dd'),date_add(from_unixtime(unix_timestamp(current_timestamp()),'yyyy-MM-dd'),-10));
10

2.3) next_day 获取下个指定星期几的日期

hive> select next_day('2021-05-24','sun');
OK
2021-05-30

#当前日期星期几的日期
hive>select date_add(next_day('2021-05-24','sun'),-7);
OK
2021-05-23

hive> select next_day('2021-05-24','mon');
OK
2021-05-31

2.4）查找某天对应周几
pmod为取余，对应周几可以直接 x/7取余（0-6），

select pmod(datediff('2021-05-024','1990-01-01'),7)+1;
OK
1

在这里插入图片描述
2.4）from_utc_timestamp 把UTC标准时间切换到北京时间

select from_utc_timestamp('1970-01-01 00:00:00','PRC') as bj_time;
1970-01-01 08:00:00

2.5）to_utc_timestamp把北京时间切换到UTC标准时间

select to_utc_timestamp('1970-01-01 08:00:00','PRC') as utc;
1970-01-01 00:00:00

2.6）to_date 字符串转时间（字符串必须为：yyyy-MM-dd格式）

select to_date('2018-02-03 12:12:12');
2018-02-03

2.7）date_format 日期、时间戳、字符串类型格式化输出标准时间格式

select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss');
2021-05-24 15:11:37
select date_format(current_date(),'yyyy-MM-dd');
2021-05-24

3. row_number()

row_number() over(partition by 分组列 order by 排序列 desc降序/asc升序)
在使用over等开窗函数时，over里头的分组及排序的执行晚于“where，group by，order by”的执行

1）创建测试表

create table test_sql(
       id int,
       name string,
       age int,
       salary int,
       department string,
       entrydate date      
);
insert into test_sql(id,name,age,salary,department,entrydate) values(1,'a',10,8000,'财务','2020-02-12');
insert into test_sql(id,name,age,salary,department,entrydate) values(1,'a2',11,6500,'技术','2019-12-15');
insert into test_sql(id,name,age,salary,department,entrydate) values(2,'b',12,13000,'技术','2016-04-23');
insert into test_sql(id,name,age,salary,department,entrydate) values(2,'b2',13,4500,'后勤','2019-03-22');
insert into test_sql(id,name,age,salary,department,entrydate) values(3,'c',14,3000,'后勤','2019-03-22');
insert into test_sql(id,name,age,salary,department,entrydate) values(3,'c2',15,20000,'技术','2017-04-23');
insert into test_sql(id,name,age,salary,department,entrydate) values(4,'d',16,30000,'技术','2015-04-23');
insert into test_sql(id,name,age,salary,department,entrydate) values(5,'d2',17,1800,'技术','2021-03-26');

2）根据薪资排序

select *,row_number() over(order by salary desc) num from test_sql;

id      name    age     salary  department      entrydate       num
4       d       16      30000   技术    2015-04-23      1
3       c2      15      20000   技术    2017-04-23      2
2       b       12      13000   技术    2016-04-23      3
1       a       10      8000    财务    2020-02-12      4
1       a2      11      6500    技术    2019-12-15      5
2       b2      13      4500    后勤    2019-03-22      6
3       c       14      3000    后勤    2019-03-22      7
5       d2      17      1800    技术    2021-03-26      8

3)根据id分组后对薪资进行排序

select *,row_number()over(partition by id order by salary desc) num from test_sql;

id      name    age     salary  department      entrydate       num
2       b       12      13000   技术    2016-04-23      1
2       b2      13      4500    后勤    2019-03-22      2
3       c2      15      20000   技术    2017-04-23      1
3       c       14      3000    后勤    2019-03-22      2
1       a       10      8000    财务    2020-02-12      1
1       a2      11      6500    技术    2019-12-15      2
4       d       16      30000   技术    2015-04-23      1
5       d2      17      1800    技术    2021-03-26      1

4）获取不同id的最高薪资

select * from (select *,row_number()over(partition by id order by salary  desc) rank from test_sql) t1  where t1.rank=1;

id      name    age     salary  department      entrydate       rank
2       b       12      13000   技术    2016-04-23      1
3       c2      15      20000   技术    2017-04-23      1
1       a       10      8000    财务    2020-02-12      1
4       d       16      30000   技术    2015-04-23      1
5       d2      17      1800    技术    2021-03-26      1

4、 rank()

和row_number()函数一样，只是当排序的时候会存在并列排名

insert into test_sql values(4,'d1',18,4500,'后勤','2020-05-21');
select *,rank() over(order by salary desc) rank from test_sql;

id      name    age     salary  department      entrydate       rank
4       d       16      30000   技术    2015-04-23      1
3       c2      15      20000   技术    2017-04-23      2
2       b       12      13000   技术    2016-04-23      3
1       a       10      8000    财务    2020-02-12      4
1       a2      11      6500    技术    2019-12-15      5
2       b2      13      4500    后勤    2019-03-22      6
4       d1      18      4500    后勤    2020-05-21      6
3       c       14      3000    后勤    2019-03-22      8
5       d2      17      1800    技术    2021-03-26      9

会有2个排序号为6，并且会直接跳过7序号

5、 dense_rank

和rank()函数一样，只是当排序的时候会连号排序

select *,dense_rank() over(order by salary desc) rank from test_sql;

4       d       16      30000   技术    2015-04-23      1
3       c2      15      20000   技术    2017-04-23      2
2       b       12      13000   技术    2016-04-23      3
1       a       10      8000    财务    2020-02-12      4
1       a2      11      6500    技术    2019-12-15      5
2       b2      13      4500    后勤    2019-03-22      6
4       d1      18      4500    后勤    2020-05-21      6
3       c       14      3000    后勤    2019-03-22      7
5       d2      17      1800    技术    2021-03-26      8

和rank使用一样，不过它的需要是连续的，不会像rank一样直接跳过7序号

6、 ntile

会数据根据指定的分区数分区

select *,ntile(1) over(partition by id order by salary desc) rank from test_sql;

id      name    age     salary  department      entrydate       rank
2       b       12      13000   技术    2016-04-23      1
2       b2      13      4500    后勤    2019-03-22      1
3       c2      15      20000   技术    2017-04-23      1
3       c       14      3000    后勤    2019-03-22      1
1       a       10      8000    财务    2020-02-12      1
1       a2      11      6500    技术    2019-12-15      1
4       d       16      30000   技术    2015-04-23      1
4       d1      18      4500    后勤    2020-05-21      1
5       d2      17      1800    技术    2021-03-26      1

select *,ntile(2) over(order by salary desc) rank from test_sql;
id      name    age     salary  department      entrydate       rank
4       d       16      30000   技术    2015-04-23      1
3       c2      15      20000   技术    2017-04-23      1
2       b       12      13000   技术    2016-04-23      1
1       a       10      8000    财务    2020-02-12      1
1       a2      11      6500    技术    2019-12-15      1
2       b2      13      4500    后勤    2019-03-22      2
4       d1      18      4500    后勤    2020-05-21      2
3       c       14      3000    后勤    2019-03-22      2
5       d2      17      1800    技术    2021-03-26      2

很像表的分区数，对排序后的数据进行分区

待测试:
聚合函数（sum、avg、count、max、min、first_value、last_value）
/** sum() over([partition by prov_name [ORDER BY val_cnt]]) */
有order by；按照排序连续累加；无order by，计算partition by后的和；over()中没有partition by，计算所有数据总和

7、NVL(E1, E2)

如果E1为NULL，则函数返回E2，否则返回E1本身

select nvl(null,1);
_c0
1
#第一个值不是null,则返回本身
select nvl(11,2);
_c0
11

8、coalesce(E1,E2,E3,…)

返回其参数中的第一个非null表达式，

select coalesce(null,2,null,1,null);
_c0
2

9、炸裂函数explode(col)

会把一个集合中的每个元数炸裂成一行

create table explodetest(id int , a array<int>);
--数组数据不能直接into插入需要查找数据插入
insert into explodetest select 1,array(2,3,4,5);
--这里用到侧lateral view侧写函数 LATERAL VIEW explode (a) 临时表名 as 临时字段名
select id,aid from explodetest LATERAL VIEW explode (a) t as aid;

hive优化问题

hive调优涉及到压缩和存储调优，参数调优，sql的调优，数据倾斜调优，小文件问题的调优等
1) 压缩和数据格式的选择
主流的压缩方式大多是lzo（一般需要做索引，但是可能对数据count（*）有数据差问题比如多1条数据，INPUTFORMAT设计有关系）和Snappy
数据格式一般选择是orc和parquet
一般选择orcfile/parquet + snappy 方式
2) hive优化参数

#select *,limit,order by等可以直接fetch数据
set hive.fetch.task.conversion=more; #默认是minimal，是查询都要经过mr的
#开启任务并行执行,当然需要资源充足的情况下 可参考 https://blog.csdn.net/qq_43193797/article/details/90720381
set hive.exec.parallel=true; #默认false 
set hive.exec.parallel.thread.number=8; #默认最大并行执行线程数8
#设置jvm重用,task比较多或者小文件比较多的时候能极大提高新能
set mapred.job.reuse.jvm.num.tasks=10; #默认1
#合理设置reduce的数目
#方法1：调整每个reduce所接受的数据量大小
set hive.exec.reducers.bytes.per.reducer=400000000; （400M）
#方法2：直接设置reduce数量
set mapred.reduce.tasks = 25
#map端预聚合，降低传给reduce的数据量
set hive.map.aggr=true  
#开启hive内置的数倾优化机制,生成的查询计划会有两个MR的Job,第一个job的map的输出结果集合会随机分布到reduce中，
#每个reduce做部分聚合操作,这样可能相同的key会分发到不同的reduce,从而实现负载均衡的目的
#然后进行第二个job,这个job会把 Group By Key 分布到相同的reduce,最终完成作业
set hive.groupby.skewindata=true

3) sql优化
1、where条件优化

优化前（关系数据库不用考虑会自动优化）

select m.cid,u.id from order m join customer u on( m.cid =u.id )where m.dt='20180808';

优化后(where条件在map端执行而不是在reduce端执行从而加快执行效率）

select m.cid,u.id from (select * from order where dt='20180818') m join customer u on( m.cid =u.id);

2、union优化
尽量不要使用union （union 去掉重复的记录）而是使用 union all 然后在用group by 去重

3、count distinct优化
不要使用count (distinct cloumn) ,使用子查询

select count(1) from (select id from tablename group by id) tmp;

4、用in 来代替join

如果需要根据一个表的字段来约束另为一个表，尽量用in来代替join . in 要比join 快

select id,name from tb1  a join tb2 b on(a.id = b.id);
select id,name from tb1 where id in(select id from tb2);

5、优化子查询

尽量去除子查询内的 group by 、 COUNT(DISTINCT)，MAX，MIN。可以减少job的数量。

6、join 优化

Common/shuffle/Reduce JOIN （hive默认join模式）连接发生的阶段，发生在reduce 阶段，适用于大表连接大表(默认的方式)
Map join ：连接发生在map阶段，适用于小表连接大表

set hive.auto.convert.join=true;

大表的数据从文件中读取
小表的数据存放在内存中（hive中已经自动进行了优化，自动判断小表，然后进行缓存）

SMB join (Sort -Merge -Bucket Join)

有一些成立的前提条件:

首先: 两张表是分桶的，在创建表的时候需要指定：
CREATETABLE(……) CLUSTERED BY (col_1) SORTED BY (col_1) INTO buckets_Nums BUCKETS
其次：两张表分桶的列必须是JOIN KEY
最后：需要设置一些bucket相关的参数

set hive.auto.convert.sortmerge.join=true;  
set hive.optimize.bucketmapjoin = true;  
set hive.optimize.bucketmapjoin.sortedmerge = true;  
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

对大表连接大表的优化，用桶表的概念来进行优化。在一个桶内发生笛卡尔积连接（需要是两个桶表进行join）

zjliang2053

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
常用hivesql记录

hive> select next_day(‘2021-05-24’,‘sun’);OK2021-05-30Time taken: 0.099 seconds, Fetched: 1 row(s)hive> select next_day(‘2021-05-24’,‘mon’);OK2021-05-311. row_numberrow_number() over(partition by 分组列 order by 排序列 desc)测试表create table test_
复制链接

扫一扫