Hive 表常用查询语句-总结

GrowthDiary007

已于 2023-09-13 16:32:38 修改

阅读量8.5k

点赞数 4

分类专栏：并行计算 mapreduce 文章标签： python macos hive

于 2019-03-21 13:28:33 首次发布

本文链接：https://blog.csdn.net/XX_123_1_RJ/article/details/88712836

版权

并行计算同时被 2 个专栏收录

24 篇文章

订阅专栏

mapreduce

5 篇文章

订阅专栏

Hive之前不常用，每次都是现用现查，就是现在总结记下笔记，边学边记（下面都是一些简单的例子，由易到难吗）>_<。

1、基本的查询语句

现在假设有数据库 db，数据表table1，table2，

--1、查看表的创建信息：
show create table db.table1;

--2、查看表的分区信息：
show partitions db.table1;

--3、查看表的记录数：
select count(*) from db.table1 where dt = '2019-03-21';

--4、简单连接操作：
select t1.userid, t1.name, t2.score from 
(select userid, name from db.table1 where dt = '2019-03-21' ) t1
left join
(select userid, score from db.table2 where dt='2019-03-21') t2
on t1.userid=t2.userid;

--5、给字段起别名：
select userid as user_id from db.table1 where dt = '2019-03-19' ;

--6、求两个表的差集，出在表A中，但不能出现表B中，即 A-B：
select a.user_id from 
       (select user_id from db.table1) a
        left outer join
       (select user_id from db.table2) b
        on a.user_id = b.user_id 
where b.user_id is null ;

--7、条件查找 -- 模糊匹配 单一条件
例如 字段 text 查询包含 'BeiJing' 的列
where text like concat('%','BeiJing','%')
where text like '%BeiJing%'  -- 有待测试

--8、条件查找 -- 模糊匹配 多条件 查询
例如 字段 text 查询包含 'BeiJing' 'ShangHai' 的列
where text regexp 'BeiJing|ShangHai'

--9、字符串拼接 concat_ws()语句，一般有三个参数，第一个参数表示拼接的连接符
select item_id, CONCAT_WS('', "name:", name) from table.class;

2、hive的排序相关查询

1. 基本的排序操作

语句	功能	优点	缺点	备注
order by	与其他SQL一样，全局有序	保证全局有序	大数据下，会比较耗时	`------`
sort by	在reduce阶段进行排序	在reduce阶段进行排序，比较快	不能保证全局有序，除非只有一个reduce	使用sort by 之后，在进行全局排序会比较快
distribute by和sort by 联合使用	先聚合在排序，或者先分组在排序	`------`	`------`	`------`
cluster by	先聚合在排序，或者先分组在排序（同上）	`------`	`------`	`------`

注意： 关键字 asc 和 desc 表示升序和降序，其中 cluster by 指定的列只能降序
使用示例：

-- 1、 对单个字段，降序排序（如果是多个字段，就继续在后面追加即可）
   -- 按照年龄降序排序， sort by 使用方法与order by 一样。
select user_id, age from db.table order by age desc;


--2、 先按照班级class分组，在按照得分score、年龄age 升序排列
select class, age, score from db.table distribute by class sort by age asc, score asc;

--3、 先按照班级class分组，在按照年龄age 排列
select  class, age from db.table cluster by class sort by age;

2. 分组排序实现

一般有两种实现方式：

（1）row_number() over( partition by 分组字段 order by 排序字段) as rank(rank 可随起名，表示排序后标识)
（2）row_number() over( distribute by 分组字段 sort by 排序字段) as rank(rank 可随起名，表示排序后标识)

-- 注意： 
--1、 partition by  只与 order by 组合使用
--2、 distribute by 只与 sort by 组合使用
--3、 rank，可以随便起的名字，表示排序后的序号，例如，1,2,3,4,5... 
--4、 分组字段、排序字段，均可为多个字段。
--5、 分组字段设置为常量，例如为1，这时，仅可以获取按照排列字段，排序后的--序号。

使用示例：

--1、选取每个班级成绩前三名的同学:
select class, student, score from (
       select class, student, score, row_number() over (distribute by class sort by score desc) as rank from db.table1
       )as t1 
where t1.rank < 4;


--2、distribute by，后面可以跟常数，例如1，这样只是获取按照某一列排序后的标识：
select class, student, score, row_number() over (distribute by 1 sort by score desc) as rank from db.table1;

参考链接：1基本排序、2分组排序

3. 从全量表数据获取增量数据

select a.id from
(select distinct id from db.table1 where dt='2020-05-27') a
left outer join
(select distinct id from db.table1 where dt='2020-05-26') b
on a.id=b.id
where b.id is null

4. 获取表的最新分区

partition="show partitions db.table1;"
latest_info=$(hive -e  "$partition" | sort | tail -n 1)
latest_dt=${latest_info:3:13}
echo $latest_dt

5. 去重后统计行数

select count(*) from (select distinct id from ab.table where dt='2020-05-26') a

6. Hive SQL 使用 explode 获取 Array＜struct＞类型字段

# 例如字段 `ads_info` array<struct<ad_price:double, ad_id:bigint>>

select p_id, test.ad_price, test.ad_id from 表名 
		lateral view explode(ads_info) exploded_table as test
                where dt='2023-04-16' 
                and test.ad_price=1.5
                and test.ad_id=10086
                group by p_id, test.ad_price, test.ad_id;
# 如果是嵌套多层可以
select p_id, ads_info_table.ad_price, ads_info_table.ad_id, sub_info_table.sub_id from 表名 
		lateral view explode(ads_info) exploded_table as ads_info_table
		lateral view explode(ads_info_table.sub_info) exploded_table as sub_info_table
                where dt='2023-04-16' 
                and ads_info_table.ad_price=1.5
                and ads_info_table.ad_id=10086
                group by p_id, test.ad_price, test.ad_id;

7. Hive SQL 解析JSON类型字段

# get_json_object(ads_info, '$.image_url')
持续更新

声明： 总结学习，有问题或不当之处，可以批评指正哦，谢谢。