hive 常用sql语句整理

小小小小小小小小小小码农

已于 2023-06-08 10:27:34 修改

阅读量7.9k

点赞数 33

分类专栏：大数据文章标签： hive sql

于 2021-09-08 12:01:11 首次发布

本文链接：https://blog.csdn.net/sinat_38314794/article/details/120175748

版权

大数据专栏收录该内容

4 篇文章 0 订阅

订阅专栏

本文详细列举了Hive中常见的数据库操作，包括查看建表语句、刷新表、重命名表、增删改字段、字段位置调整、查询优化等。还介绍了处理array类型数据、json数据的方法，以及窗口函数row_number()、rank()的应用。同时，分享了如何使用groupingsets进行多维度统计，并展示了Struct类型数据的使用及json数据的读取。此外，提到了Trino查询Hive的场景及其限制。

摘要由CSDN通过智能技术生成

1.查看建表语句

show create table table_name;

-- 或者

desc formatted table_name;

2.刷新表

refresh table_name;

3.表重命名

alter table table_name rename to new_table_name;

4.增加列可以增加多列，用‘,’隔开

alter table table_name add columns (col_de string comment 'department name');

5.修改字段

alter table table_name change col_old_name col_new_name string comment 'the new name is string' 

如果表有分区，要加上 cascade；

alter table table_name change col_old_name col_new_name string comment 'the new name is string'  cascade;

否则数据可能就查不了了

6.调整字段位置

alter table table_name change col_old_name col_new_name string comment 'the new name is string'  after col_1

7.删除字段表中有a b c 字段，删除 c

alter table table_name replace columns ( a int, b int);

一般新增一列于调整字段位置一起配合使用

7 查看hive表信息

desc formatted provision_digital.provision_digital_store_rank

8.array<string> 类型的数据查询问题

只取第一条,网上搜到的例子这样查询可以

select id,id[0] from table;

实际查询没有作用，不生效，查出来的结果还是一个数组

select id,id[0],size(id)  from table; 
-- 或者 
select id,id[0] from table where size(id) > 0;

可以实现想要的效果。

9.建表语句

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name      
 
 [(col_name data_type [COMMENT col_comment], ...)]
 [COMMENT table_comment]
 [PARTITIONED BY(col_name data_type [COMMENT col_comment], ...)]
 [CLUSTERED BY (col_name, col_name, ...)
 [SORTED BY(col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
 [ROW FORMAT row_format]
 [STORED AS file_format]
 [LOCATION hdfs_path]

10.查看hive表hdfs文件

hdfs dfs -ls hdfs://xxx/xxx/xxxx/xxxx/xxxx/xxxx

11.分组排序 row_number()

select name,count,row_number() over (partition by name order by count) rn from xx.t_xxx_id
-- 分组排序后取第一条
select * from (
select name,count,row_number() over (partition by name order by count) rn from xx.t_xxx_id
) t where rn = 1;

12.跳序排序 rank()

select name,count, rank() over (order by count desc) as rank from xxx.t_table_xxx;
-- 如果两个排名第2，位于第三排名的序号为4

13.一行转多行

-- array_id  为array<string> 字段类型，或者字符串用逗号拼接
select name,array_id from xxx.t_table_xxx;

-- array<string>
select select name,id from (
select name,array_id from xxx.t_table_xxx
) t LATERAL VIEW explode(t.array_id) array_id as id
;

-- 字符串，先用 split将字段分隔为数组，再用上面的方法
select name,split(array_id,',') from xxx.t_table_xxx;

14,字段类型转 cast

cast(count1 as int )

15.多列转多行

源数据
select id,mobile1,mobile2,mobile3 from xxxx.xxxxxxx

转换
select a.id,b.idList,b.value from (
select id,mobile1,mobile2,mobile3 from xxxx.xxxxxxx
) a lateral view explode (map('mobile1',mobile1
,'mobile2',mobile2
,'mobile3',mobile3)) b as idList, value

16.多行转多列

-- mobile1 为map中的key kv1[xx] 既取出 key为xx的 value值
-- eg: {1=2, 2=2, 3=3, 4=4} ； kv1[4] 查询结果为 4
-- map里面为字符串时，需要加引号
select brand, 
	kv1['mobile1'] mobile1,
	kv1['mobile2'] mobile2,
	kv1['mobile3'] mobile3
 from (
	select brand, str_to_map(concat_ws(',', collect_set(concat(idList, '-', value))),',','-') kv1
  	from (
		select brand, idList, value
		from xxxx.xxxxxxx 
	) tmp
   group by brand
 ) t

17. jsonarray jsonobject

不使用自定义udf函数，用hive collect_set、concat_ws，concat函数

collect_set 和collect_list 两个都可以用，set 会去重

concat 作为字符串拼接函数，如

concat("[",xxxxx,"]") xxxxx

输出的结果就是 [xxxxxx]

concat_ws 指定字符串直接拼接的连接符

concat_ws("-",xx)

输出结果，xx-xx-xx

示例代码：

-- 输出格式 ["xxxx","xxxx"]

select id
       , concat("[",concat_ws(",",collect_set(xxxx)),"]") xxxx
from (
    select  id
            , concat("\"",xxxx,"\"")    xxxx
    from xxxxx.xxxxxxx
) t group by id
;

-- 输出格式 ["{xxxx}","{xxxx}"]

select id
       , concat("[",concat_ws(",",collect_set(xxxx)),"]") xxxx
from (
    select  id
            , concat("{",xxxx,"}")    xxxx
    from xxxxx.xxxxxxx
) t group by id
;

18 to_json （）

可以将array类型数据直接转成 json串，注意低版本不支持

19 range between /rows between

可以用来统计类似计算每天前30天的总量

可以将hive表的日期转为时间戳，用时间戳进行range between 比较

如：

-- 表 test_table
-- dt 分区，gc 每天的总数，id 用户 ，dt_1 每天的时间戳
select id
     , sum(gc) over(partition by mid order by dt_1 range between 2592000 PRECEDING AND 0 FOLLOWING) as gc_last30days
     , dt
from test_table


-- 常用边界

-- unbounded 无边界
-- preceding 往前
-- following 往后
-- unbounded preceding 往前所有行，即初始行
-- n preceding 往前n行
-- unbounded following 往后所有行，即末尾行
-- n following 往后n行
-- current row 当前行

PRECEDING 当前减去

FOLLOWING 当前加上

rows between 类似用法，只是这个统计的是行数。

20。获取某月的第一天

select trunc(current_date(),'MM');

select trunc('2022-04-30','MM');

返回 2022-04-01

trunc()也可以用来取整如：

trunc(123.456) 得到123 也可以指定截取位数，默认0 ，即只取整数

还有其他方式获取：

select concat(SUBSTR(DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()),DAY(FROM_UNIXTIME(UNIX_TIMESTAMP()))),1,7) ,'-01')
，concat(substr(add_months(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),-1),1,7),'-01')

21 获取某月最后一天

select date_sub(trunc(current_date,'MM'),1),
       date_format(last_day(add_months(current_date(),-1)),'yyyyMMdd'),
       DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP()),DAY(FROM_UNIXTIME(UNIX_TIMESTAMP())))
;

trunc 获取某月第一天，再用date_sub 减去1天，即上某个月的最后一天

总结：获取上个月第一天和上个月最后一天，并且格式转成yyyyMMdd

select date_format(trunc(add_months(current_date,-1),'MM'),'yyyyMMdd'),date_format(date_sub(trunc(current_date,'MM'),1),'yyyyMMdd');

22 分组 grouping sets

同一张表，t1表，同一时间范围内，分别统计字段，(a,b)，(a,c)，(a,c,d) 不同维度情况下的count值。这种情况可以使用 grouping sets 方法，只扫描表一次，提高执行效率

select a
     , b
     , c
     , d
     , count(1) as cnt
from t1
group by a, b, c, d grouping sets((a,b) ,(a,c) ,(a,c,d))

数据结果类似下图

再通过简单的筛选条件就可以得到各个维度的统计值

23 Struct 类型数据

组装map 结果或者 json结构数据。

使用 named_struct()函数，可以得到数据结构为 struct<string,string...>

select named_struct('id','1','name','xx1');

select array(named_struct('id','1','name','xx1'),named_struct('id','2','name','xx2'));

如果要将结构类的数组数据拆分开可以使用 explode() 函数

select explode(array(named_struct('id','1','name','xx1'),named_struct('id','2','name','xx2'))) as json;

一般explode函数需要用 lateral view

select *
from ( select array(named_struct('id','1','name','xx1'),named_struct('id','2','name','xx2')) as arr
    ) t lateral view explode(arr) js as jsons
;

结构数据读取，

select * ,jsons.id,jsons.name
from ( select array(named_struct('id','1','name','xx1'),named_struct('id','2','name','xx2')) as arr
    ) t lateral view explode(arr) js as jsons
;

24 json_tuple json 数据读取

json_tuple 参数必须全部为string类型，所以使用named_struct 函数得到的json结构数据不可以直接用

select json_tuple('{"id":"1","name":"xx1"}' ,'id','name');

不能解析json_array数据，可以先用explode数据拆分多行，再用json_tuple解析获取字段值

25 trino 查询hive

使用trino 连接hive查询可以实现快速查询，但是有很多方便的函数不能使用，在这里整理记录一下可以使用的函数。

查询json

json_extract_scalar(josn,'$.json_path')