Big Data: Querying in Hive

JeitZz

Category: Hive. Tags: hive, big data, hadoop

Article link: https://blog.csdn.net/JeitZz/article/details/116463177

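Hive shell commands

1. hive> set;    list the current configuration variables

2. hive -e "command"    run a single command from the OS shell, then exit

3. hive -S -e "set" | grep cli.print    silent mode: print the settings and filter for the cli.print options

4. hive -f /path/cat.sql    run the statements in a script file

5. hive> source /path/cat.sql;    run a script file from inside the Hive CLI

6. hive> !pwd;    show the current working directory from inside Hive

7. hive> dfs -ls /;    run an HDFS command from inside Hive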
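
To make items 4 and 5 more concrete, a script file such as /path/cat.sql simply holds ordinary HiveQL statements. The contents below are a hypothetical sketch (the original does not show the file); the two `hive.cli.print.*` properties are real settings of the kind that the grep in item 3 would surface.

```sql
-- Hypothetical contents of /path/cat.sql; the real file is not shown in the original.

-- Print column headers above query results.
set hive.cli.print.header=true;

-- Show the current database name in the hive> prompt.
set hive.cli.print.current.db=true;

-- Any ordinary statements can follow.
show databases;
show tables;
```

Either `hive -f /path/cat.sql` from the OS shell or `source /path/cat.sql;` from inside the CLI would then run these statements in order.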

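Partitioning and bucketing:

Both partitioning and bucketing exist to make Hive queries more efficient.

Partitioning

Too many levels of nested partitions reduce efficiency.

partitioned by (first string, second string, ...)    multi-level partitioning

A partition column is a pseudo-column: it is not stored in the data files, and each partition is a physical directory on HDFS.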
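
A minimal sketch of a two-level partitioned table may help here. The table name `part_demo`, its columns, and the sample rows are assumptions made for illustration; they do not come from the original.

```sql
-- Hypothetical table: dt and region are partition columns,
-- declared outside the regular column list.
create table if not exists part_demo (
  id   int,
  name string
)
partitioned by (dt string, region string)
row format delimited fields terminated by ',';

-- Static insert into one (dt, region) partition.
insert into table part_demo partition (dt='2020-11-11', region='east')
values (1, 'tom'), (2, 'amy');

-- Each partition maps to a nested directory such as .../dt=2020-11-11/region=east.
show partitions part_demo;

-- Partition columns can be filtered like ordinary columns;
-- Hive then reads only the matching directories.
select id, name
from part_demo
where dt='2020-11-11' and region='east';
```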

alter table part add partition() partition()    add partitions; no comma between the specs

alter table part5 add partition(dt='2020-11-11') location
'/HDFSpath/dt=2020-12-12';                      add a partition together with its data

alter table part5 partition(dt='2019-03-21') set location
'hdfs://master:8020/path/dt=2019-11-13';        change the HDFS path of a partition

alter table part drop partition(), partition()   drop partitions; specs separated by commas
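
The difference in separators between adding and dropping is easy to miss, so here is a small sketch with the partition specs filled in. The table `part_ops_demo` and the dates are hypothetical.

```sql
-- Hypothetical table with a single partition column.
create table if not exists part_ops_demo (id int, name string)
partitioned by (dt string);

-- Add two partitions in one statement: the specs are separated by whitespace only.
alter table part_ops_demo add if not exists
  partition (dt='2020-11-11')
  partition (dt='2020-11-12');

show partitions part_ops_demo;

-- Drop both partitions: here the specs are separated by commas.
alter table part_ops_demo drop if exists
  partition (dt='2020-11-11'),
  partition (dt='2020-11-12');
```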

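set hive.exec.dynamic.partition.mode=nonstrict   switch to non-strict mode

Dynamic partitions are filled from a staging table (one that holds the partition value as an ordinary column) with insert into ... select, not with load.

Dynamic partitioning is flexible, but it puts more pressure on the NameNode and the ResourceManager, so prefer static partitions where possible.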
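
The dynamic-partition load described above can be sketched as follows. The table names `temp_part_demo` and `part_dyn_demo` and the sample rows are assumptions for illustration only.

```sql
-- Hypothetical staging table: dt is an ordinary column here.
create table if not exists temp_part_demo (id int, name string, dt string);

insert into table temp_part_demo values
  (1, 'tom', '2020-11-11'),
  (2, 'amy', '2020-11-12');

-- Hypothetical target table: dt is the partition column.
create table if not exists part_dyn_demo (id int, name string)
partitioned by (dt string);

-- Enable dynamic partitioning and allow all partition columns to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column must come last in the select list;
-- Hive creates one partition per distinct dt value.
insert into table part_dyn_demo partition (dt)
select id, name, dt from temp_part_demo;

show partitions part_dyn_demo;
```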

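Bucketing columns, like partition columns in MySQL, are columns of the table itself; Hive partition columns are declared outside the table's column list.

Bucketing

By default, bucketing follows the MapReduce partitioner, and the number of buckets matches the number of reducers.

set mapreduce.job.reduces = <number of buckets>

clustered by   defines the bucketing columns

sorted by      sorts rows within each bucket

A bucketing column is a real column of the table. Bucketing is a logical split (rows are assigned by hashing the column) applied on top of partitioning, for cases where partitions alone cannot divide the data finely enough.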
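
As a sketch of bucketing layered on top of partitioning, the definition below combines both clauses. The table name `buc_demo`, its columns, and the bucket count of 4 are assumptions, not values from the original.

```sql
-- Hypothetical table: partitioned by dt, then bucketed by the real column id.
create table if not exists buc_demo (
  id   int,
  name string,
  age  int
)
partitioned by (dt string)
clustered by (id) sorted by (id asc) into 4 buckets
row format delimited fields terminated by ',';

-- The output lists the bucket count, the bucket columns and the sort columns.
describe formatted buc_demo;
```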

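Local-mode queries (usable only when there is a single reducer)

set hive.exec.mode.local.auto=true

Fill a bucketed table with insert ... select rather than load; otherwise the data is not actually split into buckets.

Note: when the table definition already declares bucketing and sorting, and the insert also specifies distribution and sorting, the declaration on the table wins.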

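1. Set the number of reducers equal to the number of buckets
set mapreduce.job.reduces = <number of buckets>

2. Load the data into a staging table

3. Insert from the staging table
insert overwrite table buc6 
select id,name,age from temp_buc1 
distribute by (id) sort by (id asc);      custom distribution and sort order
This has the same effect as the statement below:
insert overwrite table buc8 
select id,name,age from temp_buc1 
cluster by (id);                             distributes and sorts in one clause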
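
Steps 1 and 2 are given without concrete statements, so here is one way the setup might look. The table names `temp_buc1` and `buc6` come from the statements above, but their definitions, the sample rows, and the bucket count of 4 are assumptions.

```sql
-- Step 2 sketch: a plain staging table with the same columns as the bucketed target.
create table if not exists temp_buc1 (id int, name string, age int)
row format delimited fields terminated by ',';

-- Hypothetical sample rows; a real job would more likely use load data here.
insert into table temp_buc1 values
  (1, 'tom', 20),
  (2, 'amy', 22),
  (3, 'bob', 19),
  (4, 'eve', 25);

-- Assumed definition of the bucketed target table (4 buckets is an arbitrary choice).
create table if not exists buc6 (id int, name string, age int)
clustered by (id) sorted by (id asc) into 4 buckets;

-- Step 1 sketch: one reducer per bucket.
set mapreduce.job.reduces = 4;
```

With these in place, the insert overwrite statement from step 3 can run against `temp_buc1` and `buc6`.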

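Defining buckets and querying by bucket

1. Definition
	clustered by (id)           bucket only
	sorted by (id asc|desc)     sort within each bucket (used together with clustered by)
2. Query
	cluster by (id)             distribute and sort
	distribute by (id)          distribute only
	sort by (name asc | desc)   sort within each reducer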

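Hive queries

Query types

Join queries

Inner join (inner join)
Outer join (outer join)
Left outer join (left join)
Right outer join (right join)
Full outer join (full join)
Semi join (left semi join)   keeps the left-table rows that have a match in the right table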
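
The join types listed above can be compared on a small example. The tables `emp_demo` and `dept_demo` and their rows are hypothetical; the result comments describe which rows qualify, not the order in which Hive returns them.

```sql
-- Hypothetical tables for illustration.
create table if not exists emp_demo (id int, name string, dept_id int);
create table if not exists dept_demo (dept_id int, dept_name string);

insert into table emp_demo values (1, 'tom', 10), (2, 'amy', 20), (3, 'bob', 30);
insert into table dept_demo values (10, 'dev'), (20, 'ops'), (40, 'hr');

-- Inner join: only employees with a matching department (tom, amy).
select e.name, d.dept_name
from emp_demo e join dept_demo d on e.dept_id = d.dept_id;

-- Left join: every employee; bob has no department, so dept_name is NULL.
select e.name, d.dept_name
from emp_demo e left join dept_demo d on e.dept_id = d.dept_id;

-- Right join: every department; hr has no employee, so name is NULL.
select e.name, d.dept_name
from emp_demo e right join dept_demo d on e.dept_id = d.dept_id;

-- Full join: all rows from both sides, with NULLs where there is no match.
select e.name, d.dept_name
from emp_demo e full join dept_demo d on e.dept_id = d.dept_id;

-- Left semi join: left rows that have a match (tom, amy).
-- Only left-table columns may appear in the select list.
select e.name
from emp_demo e left semi join dept_demo d on e.dept_id = d.dept_id;
```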

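Combining result sets

union: removes duplicates. The result often comes out sorted ascending by the first column as a side effect of deduplication, but that order is not guaranteed without order by.

union all: keeps duplicates and does not sort.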
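
A short sketch of the difference, using two hypothetical single-column tables. It assumes a Hive version that supports `union` without `all` (1.2.0 or later); when a particular order matters, an explicit order by is the safe choice.

```sql
-- Hypothetical tables for illustration.
create table if not exists u1_demo (id int);
create table if not exists u2_demo (id int);

insert into table u1_demo values (3), (1), (2);
insert into table u2_demo values (2), (3), (4);

-- union: duplicates removed, leaving the four distinct values 1, 2, 3, 4.
select id from u1_demo
union
select id from u2_demo;

-- union all: all six rows are kept, including the repeated 2 and 3.
select id from u1_demo
union all
select id from u2_demo;

-- To rely on a particular order, sort the combined result explicitly.
select t.id
from (
  select id from u1_demo
  union
  select id from u2_demo
) t
order by t.id;
```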

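Subqueries

Query clauses

order by: global sort, performed by a single reducer

sort by: sorts within each reducer's partition; with exactly one reduce task it behaves like order by

distribute by: chooses the column(s) used to partition rows across reducers; it is normally written before sort by

cluster by: combines distribute by with an ascending sort by on the same column(s).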
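
The four clauses can be tried side by side on a small table. The table `score_demo`, its rows, and the reducer count of 2 are assumptions chosen only to make the per-reducer behaviour visible.

```sql
-- Hypothetical table for illustration.
create table if not exists score_demo (id int, name string, score int);

insert into table score_demo values
  (1, 'tom', 90),
  (2, 'amy', 85),
  (3, 'bob', 70),
  (4, 'eve', 95);

-- Use more than one reducer so sort by and order by can differ.
set mapreduce.job.reduces = 2;

-- order by: one globally sorted result.
select * from score_demo order by score desc;

-- sort by: each reducer's output is sorted, but the overall result may not be.
select * from score_demo sort by score desc;

-- distribute by + sort by: rows with the same id go to the same reducer,
-- and each reducer sorts its rows by score.
select * from score_demo distribute by id sort by score desc;

-- cluster by: shorthand for distribute by id sort by id (ascending only).
select * from score_demo cluster by id;
```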
