An Easy-to-Understand Guide to Hive

zx_love

Article link: https://blog.csdn.net/zx_blog/article/details/106843532

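Hive SQL

Hive SQL can be run through the Hive CLI or through HiveServer2 (in essence, a JDBC connection).
Hive CLI:
hive -e "your sql"    runs the SQL and exits
hive -S -e "your sql"    silent mode; the output omits information such as elapsed time and row counts
hive -f /xx/your_sql.hql    runs the SQL in the given file (inside the Hive shell, use source to run a SQL file)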

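Hive external tables and managed (internal) tables

Managed table: Hive controls the data lifecycle (dropping the table deletes the data). The data is stored under the default Hive warehouse directory, which is configured by the hive.metastore.warehouse.dir parameter.
External table: the location keyword specifies the data directory, and Hive manages only the table structure (the table metadata). When renaming an external table, avoid using rename directly, since rename can change the data location; copying the table structure is a safer alternative.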
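
As a minimal sketch of the points above, the statements below create an external table over an existing directory and then copy its structure into a new table instead of renaming it. The table names, columns, field delimiter and HDFS path are hypothetical placeholders, not taken from the original article.

```sql
-- Hypothetical external table over an existing HDFS directory.
-- Dropping it removes only the metadata; the files under location stay in place.
create external table if not exists user_log_ext (
  uid string,
  user_name string,
  age string
)
row format delimited fields terminated by '\t'
location '/data/user_log';

-- Copy the table structure into a new external table that points at the same
-- directory, as an alternative to renaming user_log_ext.
create external table if not exists user_log_ext_v2
like user_log_ext
location '/data/user_log';
```

Once the new table is verified, the old definition can be dropped; because both tables are external, the data files are not affected.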

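Loading data into Hive

Loading file data into a managed table

load data [local] inpath '/your_path/'
[overwrite] into table your_table_name
[partition (par_key='par_value')]
(Parts in square brackets are optional. The overwrite keyword replaces the existing data in the partition; if no partition is specified, it replaces the data of the whole table.)

local means the path is on the local filesystem; otherwise it is an HDFS path.
A local directory (or file) is uploaded to the HDFS path under the Hive warehouse; for an HDFS source, the data is moved (not copied) from the original HDFS directory to the HDFS path under the Hive warehouse.

Loading data by adding a partition to a partitioned table:

alter table your_table_name add partition (par_key='par_value')
location 'hdfs://xxx'
This approach works for both external and managed tables (the location clause does not move the data to the Hive warehouse path).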
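
The two loading styles can be sketched side by side as follows. The table user_log, its partition column dt, and all file paths are hypothetical and only serve as an illustration.

```sql
-- Hypothetical table partitioned by dt.
create table if not exists user_log (
  uid string,
  user_name string
)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- Copy a local file into the dt='2020-06-01' partition, replacing its current contents.
load data local inpath '/tmp/user_log_20200601.txt'
overwrite into table user_log
partition (dt='2020-06-01');

-- Register an existing HDFS directory as a partition; no data is moved.
alter table user_log add if not exists partition (dt='2020-06-02')
location '/data/user_log/dt=2020-06-02';
```

After either statement, show partitions user_log; should list the new partition.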

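Inserting data with a query

1. Single-partition insert:
insert overwrite table your_table_name
[partition (par_key='par_value')]
select * from src_table
where xx='par_value';
(The overwrite keyword replaces the existing data in the partition; if no partition is specified, it replaces the data of the whole table.)

2. Multi-partition insert:
from src_table
insert overwrite table your_table_name
partition (par_key='par_value1')
select * where src_table.xx='par_value1'
insert overwrite table your_table_name
partition (par_key='par_value2')
select * where src_table.xx='par_value2'
insert overwrite table your_table_name
partition (par_key='par_value3')
select * where src_table.xx='par_value3';

3. Dynamic-partition insert (with hive.exec.dynamic.partition.mode=strict, at least one partition column must be given a static value, so a fully dynamic insert like this one needs nonstrict mode):
insert overwrite table your_table_name
partition (par_key)
select …, xx
from src_table;
Hive determines the partition from the last column of the select (or the last several columns, depending on the number of partition columns).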
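
A minimal sketch of a dynamic-partition insert, including the session settings it usually needs. The tables user_log (partitioned by dt) and src_user_log, and their columns, are hypothetical.

```sql
-- Enable dynamic partitioning and allow all partition columns to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last column of the select (dt) supplies the partition value for each row,
-- so one statement can write many partitions.
insert overwrite table user_log partition (dt)
select uid, user_name, dt
from src_user_log;
```

Only the partitions that actually receive rows are overwritten; partitions whose dt value does not appear in the query result are left untouched.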

Exporting data from Hive

The number of output files depends on the number of reducers (or of mappers, for a map-only query).
outputformat specifies the output format (org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat).

1. Export with a query:
insert overwrite [local] directory '/your_path/'
select * …;

2. Output to multiple paths:
from your_table_name
insert overwrite directory '/your_path/x1/'
select * where xx=x1
insert overwrite directory '/your_path/x2/'
select * where xx=x2
insert overwrite directory '/your_path/x3/'
select * where xx=x3;

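Kinds of sorting in Hive

order by: global sort (the output is fully ordered);
sort by: sort within each reducer (the data handled by each reducer is ordered, but the overall output may not be);
distribute by: controls how map output is divided among the reducers. Rows are repartitioned during the shuffle (mapper to reducer) by the specified columns, so rows with the same value go to the same reducer; it does not sort by itself;
distribute by + sort by: rows are distributed to reducers by the distribute by key, and each reducer then sorts its rows by the sort by columns (distribute by must come before sort by);
cluster by = distribute by + sort by (when the distribute by key and the sort by key are the same).

Summary: order by guarantees an ordered final result; sort by, distribute by and cluster by do not. With sort by, the result is ordered only when there is a single reducer.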
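
The four clauses can be compared on one query. This is a sketch that assumes a hypothetical table user_info with columns uid and age.

```sql
-- Global order: one fully sorted result.
select uid, age from user_info order by age;

-- Each reducer's output is sorted by age, but rows are spread over reducers arbitrarily.
select uid, age from user_info sort by age;

-- Rows with the same uid go to the same reducer; each reducer sorts by uid, then age.
select uid, age from user_info distribute by uid sort by uid, age;

-- Shorthand for: distribute by uid sort by uid
select uid, age from user_info cluster by uid;
```

In the last two queries every uid ends up in exactly one reducer's output, which is useful when downstream processing needs all rows of a key together but does not need a total order.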

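Hive bucketed tables

Like partitioning, bucketing divides a table's data by specified columns.
The differences: bucketing requires the number of buckets to be fixed in advance; rows are assigned to buckets by the hash of the specified key, so the buckets hold fairly even amounts of data, and the bucketing column is an ordinary column of the table. Partitioning divides data by the value of the partition column; the data files themselves do not contain the partition column, and partition sizes may be uneven.

Create the table:
create table your_table_name (uid string, user_name string, age string)
clustered by (uid) into 100 buckets;
Insert data, option 1: use the hive.enforce.bucketing setting
set hive.enforce.bucketing=true;
insert overwrite table your_table_name
select uid, user_name, age
from src_table;
Insert data, option 2: set the number of reducers manually and cluster by the bucketing key
set mapred.reduce.tasks=100;
insert overwrite table your_table_name
select uid, user_name, age
from src_table
cluster by uid;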

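Sampling queries in Hive

Sampling by bucket

select * from your_table_name tablesample(bucket n out of m on col)    hashes rows into m buckets by the col column and selects bucket n.
select * from your_table_name tablesample(bucket n out of m on rand())    assigns rows to m buckets at random and selects bucket n.

Block sampling

select * from your_table_name tablesample(0.1 percent)    samples by percentage of data blocks.
This kind of sampling depends on the storage format; the smallest unit that can be sampled is one HDFS block (128 MB by default).

Sampling a bucketed table

For example, your_table_bucked is a table bucketed by bucket_key into 100 buckets:

select * from your_table_bucked tablesample (bucket 2 out of 100 on bucket_key);    (selects bucket 2)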

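Hive views

Concepts:

1. A Hive view is essentially SQL (think of it as a saved query);
2. A view is read-only (commands such as insert and load cannot be used on it); its data and metadata cannot be changed;
3. Views are typically used to simplify query SQL and for access control (the view's SQL filters rows by condition).

Creating a view:

create view [if not exists] your_view_name as
select * from your_table_name where xx = 'xxx';

create table your_table_name1 like your_view_name;    (copies the view's structure into a new table)

Querying a view:

Same as querying a table.

Dropping a view:

drop view [if exists] your_view_name;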

Hive indexes

A Hive index is essentially a table.
An index can be built on the whole table or only on specified partitions.
create index your_index
on table your_table (your_index_col)
as 'BITMAP'
with deferred rebuild;
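
Because the index above is created with deferred rebuild, it starts out empty and has to be built explicitly. The following sketch reuses the placeholder names your_index and your_table; the partition column par_key is a hypothetical example. Index support was removed in Hive 3.0, so these statements apply to earlier versions only.

```sql
-- Build (or refresh) the whole index.
alter index your_index on your_table rebuild;

-- Rebuild the index for a single partition only.
alter index your_index on your_table partition (par_key='par_value') rebuild;

-- List the indexes defined on the table.
show formatted index on your_table;

-- Remove the index.
drop index if exists your_index on your_table;
```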

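Hive strict mode

hive.mapred.mode=strict
Strict mode prohibits three kinds of queries: scanning a partitioned table without a partition filter (no partition column in the where clause); order by without limit; and join without an on condition (a Cartesian product).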

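Hive functions

Mathematical functions; aggregate functions (many rows in, one row out); table-generating functions (one row in, many rows out).
They are not listed one by one here.

Hive window functions

window function + over(partition by col1 order by col2)    (partitions/groups by one column and/or orders by another)

count(1) over(partition by col1) counts rows per col1 group. Unlike count(1) with group by col1, it returns every input row with the count attached, whereas group by returns one aggregated row per group.

row_number() over(order by col1) sorts all rows by col1 and assigns an incrementing number;
rank() over(order by col1) sorts all rows by col1 and assigns a rank (unlike row_number, rows with the same col1 get the same rank).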
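
A short sketch showing the three functions in one query, using the placeholder columns col1 and col2 from the text and the placeholder table your_table_name.

```sql
select
  col1,
  col2,
  -- number of rows sharing this row's col1 value; every input row is kept
  count(1)     over (partition by col1)               as cnt_in_group,
  -- 1, 2, 3, ... within each col1 group, ordered by col2; ties get distinct numbers
  row_number() over (partition by col1 order by col2) as rn,
  -- like row_number, but tied col2 values share a rank and the next rank is skipped
  rank()       over (partition by col1 order by col2) as rk
from your_table_name;
```

For a group whose col2 values are 10, 10 and 20, row_number yields 1, 2, 3 while rank yields 1, 1, 3.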

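User-defined functions

Guideline for writing user-defined functions: reduce or avoid object creation and reuse objects by reference; generally avoid immutable types (this reduces GC).

UDF

A user-defined function takes one row as input and returns one row (compare Spark's map).
Extend the UDF class and implement the evaluate() method.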
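
Once such a class is compiled and packaged, it is registered and called from Hive SQL. A minimal sketch follows; the jar path, the class name com.example.hive.udf.MyLower and the function name my_lower are all hypothetical.

```sql
-- Make the jar containing the UDF class available to the session.
add jar /path/to/your_udf.jar;

-- Bind a function name to the class for the current session.
create temporary function my_lower as 'com.example.hive.udf.MyLower';

-- Call it like a built-in function: one value in, one value out per row.
select my_lower(user_name) from your_table_name;

-- Remove the session-level function when it is no longer needed.
drop temporary function if exists my_lower;
```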

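UDAF

A user-defined aggregate function takes a set of rows as input and returns one row (compare Spark's aggregateByKey and reduceByKey).
Extend the UDAF class and, in its evaluator (a class implementing UDAFEvaluator), implement these methods:
init(): initialization;
iterate(): the aggregation logic; its parameter types are the types of the data actually received;
terminatePartial(): returns the intermediate aggregation result;
merge(): combines intermediate results; its parameter type matches the type returned by terminatePartial;
terminate(): returns the final result.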

UDTF

A user-defined table-generating function takes one row as input and returns multiple rows (compare Spark's flatMap).
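
The built-in explode function behaves the same way as a custom UDTF and shows how such functions are called. The sketch below assumes a hypothetical table user_tags with a uid column and a comma-separated tags column.

```sql
-- split() turns the tags string into an array; explode() emits one row per element.
-- lateral view joins each generated row back to the source row it came from.
select uid, tag
from user_tags
lateral view explode(split(tags, ',')) t as tag;
```

A row with tags 'a,b,c' produces three output rows for the same uid. A custom UDTF is called through lateral view in the same way.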

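Query optimization in Hive

Join optimization

Put the large table on the right side of the join.
Give every join an on condition (without on, the join is a Cartesian product; Hive strict mode can be enabled to reject joins that lack an on condition).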
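
The reason for putting the large table last is that, in a reduce-side join, Hive buffers the rows of the earlier tables and streams the last one. A sketch with hypothetical tables small_dim and big_fact:

```sql
-- small_dim is buffered; big_fact, the last table, is streamed through the reducers.
select f.uid, d.city_name
from small_dim d
join big_fact f
  on f.city_id = d.city_id;

-- The streamtable hint names the table to stream regardless of its position.
select /*+ streamtable(f) */ f.uid, d.city_name
from big_fact f
join small_dim d
  on f.city_id = d.city_id;
```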

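MapReduce optimization

1. Set a reasonable number of map and reduce tasks

Too few map and reduce tasks hurt efficiency because there is not enough parallelism;
too many waste time on task initialization.
Number of map tasks: depends on the input files (their number and size). Merge excessive small files beforehand, or split overly large files (ideally, file sizes are already reasonable when the data warehouse is built);
Number of reduce tasks:
set the task count directly with set mapred.reduce.tasks;
hive.exec.reducers.bytes.per.reducer: the amount of data each reduce task handles (reducers = total data size / this parameter);
hive.exec.reducers.max: the maximum number of reducers per job;
split the reduce step (a logical optimization).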
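
As a worked example of the formula above: with 10 GB of input and 256 MB per reducer, Hive estimates 10240 MB / 256 MB = 40 reducers, capped by hive.exec.reducers.max. The values below are illustrative assumptions, not recommendations.

```sql
-- Let Hive estimate the reducer count from the input size:
-- roughly total input bytes / bytes.per.reducer, capped by reducers.max.
set hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB per reducer
set hive.exec.reducers.max=200;

-- Or fix the reducer count for the following queries.
set mapred.reduce.tasks=40;

-- Return to automatic estimation (-1 is the default).
set mapred.reduce.tasks=-1;
```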

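2. JVM reuse (use with caution when cluster resources are tight: the slots of finished tasks may stay occupied and not be released until the whole job ends)

Configured in mapred-site.xml:
mapred.job.reuse.jvm.num.tasks sets how many times a slot's JVM is reused

3. Parallel execution (jobs with no ordering dependency run concurrently)

hive.exec.parallel=true

4. Speculative execution (use with caution when cluster resources are tight)

Launches backup copies of slow tasks.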

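5. Data reuse

Scan the source data once (a single map pass) and write the rows matching each where condition to the specified table or path.
The SQL looks like this:
from your_table
insert overwrite [table your_table | directory 'your_path']
select * where col = 'value'
insert overwrite [table your_table | directory 'your_path']
select * where col = 'value'

The same idea can be applied to group by optimization;
it requires setting hive.multigroupby.singlemr=true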

Hive file archiving

Cold data can be archived as HDFS archive files to reduce pressure on the NameNode (archive files have the .har extension).
set hive.archive.enabled=true;
alter table your_table_name archive partition (par='par_value');
alter table your_table_name unarchive partition (par='par_value');