Big Data Essentials Series (5): Hive Summary

Shaun_Xi

Article link: https://blog.csdn.net/weixin_39793644/article/details/78952228

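Hive Summary

I. Essence

Hive provides a unified query and analysis layer that lets you query, aggregate, and analyze data stored on HDFS through SQL statements.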

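II. Four Key Characteristics

• Hive stores no data itself; it relies entirely on HDFS and MapReduce, which gives it scalable storage and compute capacity

• Hive workloads are read-heavy and write-light; rewriting or deleting individual rows is not supported on ordinary (non-transactional) tables

• Hive defines no dedicated data format; the format is specified by the user

• Hive is a SQL parsing engine that translates SQL statements into MR jobs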

Example: wordcount written in Hive
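The wordcount query itself is not reproduced in this text. As a rough illustration of the pipeline such a query describes (split each line into words, emit one row per word, then group and count), here is a minimal Python sketch. The function name `wordcount` and the sample lines are made up for illustration; this mirrors the logic only, not Hive's execution.

```python
from collections import Counter


def wordcount(lines):
    """Count words the way a split / explode / group-by query would."""
    counts = Counter()
    for line in lines:
        # Splitting a line and emitting one row per word is roughly what
        # split() followed by explode() does in HiveQL.
        for word in line.split():
            counts[word] += 1  # group by word, count(1)
    return dict(counts)


docs = ["hello hive", "hello hadoop hive"]
print(wordcount(docs))
```

In Hive the same three steps are expressed declaratively and compiled into map and reduce phases: the split happens on the map side and the counting on the reduce side.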

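III. HQL vs. SQL

IV. Hive Architecture

The Hive architecture can be divided into three layers, from top to bottom: user interface, statement translation, and data storage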

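V. Creating Tables in Hive

1. Decide whether to create an internal (managed) table or an external table:

– create table

When the table is dropped, Hive deletes both the table's metadata and its data

– create external table

The data stays at its own location instead of being moved into Hive's warehouse directory; dropping the table deletes only the metadata and leaves the data files in place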

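2. Partitions and Buckets

– A table can be split into partitions, much as a phone's photo album is grouped by date into smaller sets of photos. Partitioning narrows the range of data a query has to scan and speeds up retrieval

– A partition can be further divided into buckets with CLUSTERED BY, and the rows inside each bucket can be ordered with SORTED BY. Bucketing makes some queries more efficient (for example map-side joins) and is commonly used for sampling: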

select * from student tablesample(bucket 1 out of 2 on id);
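To make the sampling clause concrete, the sketch below models how "bucket x out of y on id" selects rows: each row is assigned to a bucket by the non-negative hash of the key modulo y, and bucket x (1-based) is kept. It assumes an integer `id` whose hash is the value itself; the helper names and the `students` data are hypothetical.

```python
def bucket_of(row_id, num_buckets):
    """0-based bucket index: non-negative hash of the key modulo the bucket count."""
    # Assumption: for an integer key the hash is the value itself.
    return (row_id & 0x7FFFFFFF) % num_buckets


def tablesample(rows, x, y, key="id"):
    """Rows kept by 'tablesample(bucket x out of y on key)'; x is 1-based."""
    return [row for row in rows if bucket_of(row[key], y) == x - 1]


students = [{"id": i, "name": f"s{i}"} for i in range(1, 7)]
print([row["id"] for row in tablesample(students, 1, 2)])  # [2, 4, 6]
print([row["id"] for row in tablesample(students, 2, 2)])  # [1, 3, 5]
```

Because the two buckets partition the rows, sampling "1 out of 2" reads about half of the data; when the table is already clustered on the same column, Hive can read just the matching bucket files instead of scanning everything.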

VI. Hive Optimization

1. Map-side optimization

• Increase the number of map tasks (the value is only a hint to the input format):

set mapred.map.tasks=10;

• Reduce the number of map tasks (by combining small files):

set mapred.max.split.size=100000000;

set mapred.min.split.size.per.node=100000000;

set mapred.min.split.size.per.rack=100000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
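As a rough picture of why these settings reduce the number of map tasks, the sketch below greedily packs small files into combined splits that stay under a maximum size. It is a deliberately simplified model and not Hive's actual algorithm: it ignores node and rack locality (the per.node and per.rack settings) and does not cut large files into pieces. The function name and file sizes are illustrative.

```python
def combine_splits(file_sizes, max_split_size):
    """Greedily pack file sizes (in bytes) into splits of at most max_split_size."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


small_files = [30_000_000] * 10            # ten files of 30 MB each
splits = combine_splits(small_files, 100_000_000)
print(len(small_files), "files ->", len(splits), "splits")  # 10 files -> 4 splits
```

With one map task per split, ten small files need four map tasks in this model instead of ten.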

• Map-side aggregation (combiner):

set hive.map.aggr=true;

2. Reduce-side optimization

• Set the number of reduce tasks:

set mapred.reduce.tasks=10;

• Amount of data handled by each reduce task:

set hive.exec.reducers.bytes.per.reducer=100000;
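When the reducer count is not fixed explicitly, Hive estimates it from the input size and the bytes-per-reducer setting, capped by an upper limit. The sketch below shows that arithmetic. The cap corresponds to `hive.exec.reducers.max`, which the article does not mention; the value 999 used in the example is an assumption, so check the default of your Hive version.

```python
def estimate_reducers(total_input_bytes, bytes_per_reducer, max_reducers):
    """ceil(input / bytes_per_reducer), clamped to the range [1, max_reducers]."""
    reducers = (total_input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, reducers))


# 1 GB of input with the 100000-byte setting shown above: the raw estimate
# is 10000 reducers, so the cap (assumed to be 999 here) is what applies.
print(estimate_reducers(10**9, 100_000, 999))       # 999
# The same input with 256 MB per reducer needs only a handful.
print(estimate_reducers(10**9, 256_000_000, 999))   # 4
```

A very small bytes-per-reducer value therefore drives the job to the maximum reducer count, while a larger value gives fewer, heavier reducers.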

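• Where possible, avoid query constructs that can launch extra MapReduce work:

1) group by

2) order by (it forces all rows through a single reducer to produce a global order; use distribute by and sort by instead)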
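The difference between the two approaches can be sketched as follows: order by sorts everything in one place, while distribute by hashes rows to several reducers and sort by orders each reducer's share independently. The function names and the `(user_id, score)` rows are hypothetical, and Python's built-in hash on integers stands in for Hive's partitioning.

```python
def order_by(rows, key):
    """Global ordering: every row passes through a single reducer."""
    return sorted(rows, key=key)


def distribute_sort_by(rows, dist_key, sort_key, num_reducers):
    """Hash rows to reducers by dist_key, then sort within each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[hash(dist_key(row)) % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


rows = [(1, 50), (2, 30), (1, 10), (2, 40), (3, 20)]  # (user_id, score)
print(order_by(rows, key=lambda r: r[1]))
print(distribute_sort_by(rows, lambda r: r[0], lambda r: r[1], 2))
```

The second result is sorted only within each reducer's output, not globally, which is the price paid for being able to use more than one reducer.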

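3. Join optimization

• Join conditions (ON): when every join uses the same column of one table in its ON clause (a.key below), Hive can execute the joins as a single MapReduce job: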

SELECT a.val, b.val, c.val

FROM a

JOIN b ON (a.key = b.key1)

JOIN c ON (a.key = c.key1)

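• Join order:

/*+ STREAMTABLE(a) */: a is treated as the large table and is streamed through the join

/*+ MAPJOIN(b) */: b is treated as the small table and is loaded into memory

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val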

FROM a

JOIN b ON (a.key = b.key1)

JOIN c ON (c.key = b.key1);
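The idea behind treating one table as small and the other as large can be sketched as a hash join: the small table is loaded into an in-memory lookup table, and the large table is streamed past it row by row. The function and the sample tables below are illustrative assumptions, not Hive internals.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Hash join: build a lookup from the small table, stream the big table."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    for row in big_rows:  # streamed one row at a time, never held as a whole
        for match in lookup.get(row[big_key], []):
            yield {**row, **match}


a = [{"key": 1, "a_val": "x"}, {"key": 2, "a_val": "y"}, {"key": 3, "a_val": "z"}]
b = [{"key1": 1, "b_val": "p"}, {"key1": 3, "b_val": "q"}]
for joined in map_join(a, b, "key", "key1"):
    print(joined)
```

Since only the small table has to fit in memory and no shuffle by join key is needed, such a join can be completed on the map side.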

4. Data skew optimization

• General-purpose fix:

set hive.groupby.skewindata=true;
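With this setting, a group by is executed in two stages: rows are first spread randomly across reducers for partial aggregation, and the partial results are then regrouped by key for the final aggregation. The sketch below imitates that for a simple count; the function name, the seed parameter, and the data are made up for illustration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Return (final counts, rows handled per stage-1 reducer)."""
    rng = random.Random(seed)
    # Stage 1: rows are distributed randomly, so a hot key is shared by
    # all reducers, each of which builds a partial count.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: partial counts are regrouped by key and merged.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return dict(final), [sum(p.values()) for p in partials]


keys = ["hot"] * 1000 + ["a", "b"]
counts, loads = two_stage_count(keys, num_reducers=4)
print(counts)
print(loads)
```

The second stage handles at most one partial record per key and reducer, so the hot key no longer overloads a single reducer, at the cost of an additional job.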

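• Joining a small table with a large table: put the small table first, since Hive buffers the earlier tables of a join and streams the last one:

Small_table join big_table

• The join key contains many '0', '-' or NULL values: replace them with a random key so that these rows are spread across reducers instead of piling up on one: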

on case when (x.uid = '-' or x.uid = '0' or x.uid is null)

then concat('dp_hive_search',rand()) else x.uid

end = f.user_id;
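The effect of that CASE expression can be simulated: special uids are replaced by a random key with the `dp_hive_search` prefix, which matches no real user_id and lands on an arbitrary reducer. In the sketch below, the CRC32-based partitioner and the helper names are assumptions chosen to keep the example deterministic; Hive uses its own hashing.

```python
import random
import zlib

SPECIAL_UIDS = {None, "-", "0"}


def join_key(uid, rng):
    """Mirror of the CASE expression: special uids get a random key."""
    if uid in SPECIAL_UIDS:
        return "dp_hive_search" + str(rng.random())
    return uid


def reducer_loads(uids, num_reducers, salted, seed=0):
    """Number of rows each reducer receives, with or without salting."""
    rng = random.Random(seed)
    loads = [0] * num_reducers
    for uid in uids:
        key = join_key(uid, rng) if salted else uid
        loads[zlib.crc32(repr(key).encode()) % num_reducers] += 1
    return loads


uids = [None] * 1000 + ["u1", "u2", "u3"]
print("unsalted:", reducer_loads(uids, 4, salted=False))
print("salted:  ", reducer_loads(uids, 4, salted=True))
```

Without salting, all NULL rows share one key and therefore one reducer; with salting they are spread out, and since the random keys match nothing in the other table, the join result for real uids is unchanged.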

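• Joining two large tables: first reduce one side to the keys that actually occur in the other, so that the intermediate results are small enough for a map join: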

Select/*+MAPJOIN(t12)*/ *

from dw_log t11

join (

select/*+MAPJOIN(t)*/ t1.*

from (

select user_id from dw_log group by user_id

) t

join dw_user t1

on t.user_id=t1.user_id

) t12

on t11.user_id=t12.user_id

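• count distinct with many special values (NULL or empty string): filter them out before counting, then add 1 to the result to account for the filtered value: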

select cast(count(distinct user_id)+1 as bigint) as user_cnt

from tab_a

where user_id is not null and user_id <> ''

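• Trading space for time: replace several count(distinct ...) expressions with a union all plus group by, which removes duplicates first and then uses plain counts: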

select day,

count(case when type='session' then 1 else null end) as session_cnt,

count(case when type='user' then 1 else null end) as user_cnt

from (

select day,session_id,type

from (

select day,session_id,'session' as type

from log

union all

select day,user_id as session_id,'user' as type

from log

) t0

group by day,session_id,type

) t1

group by day
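To see why this rewrite yields the distinct counts, the sketch below follows the same three steps on a tiny in-memory log: tag every id with its type (union all), drop duplicates (inner group by), then count rows per day and type (outer group by). The function name and the sample rows are hypothetical.

```python
def distinct_counts(log):
    """log: iterable of (day, session_id, user_id) rows."""
    # union all: one tagged row per (day, id) for sessions and for users
    tagged = [(day, sid, "session") for day, sid, _ in log]
    tagged += [(day, uid, "user") for day, _, uid in log]
    # inner group by day, id, type: removes duplicate rows
    deduped = set(tagged)
    # outer group by day: plain counts per type, no DISTINCT needed
    result = {}
    for day, _, kind in deduped:
        counts = result.setdefault(day, {"session_cnt": 0, "user_cnt": 0})
        counts[kind + "_cnt"] += 1
    return result


log = [
    ("d1", "s1", "u1"),
    ("d1", "s2", "u1"),
    ("d1", "s2", "u1"),
    ("d2", "s3", "u2"),
]
print(distinct_counts(log))
```

The intermediate data roughly doubles in size (the space), but the deduplication is spread over many reducers instead of being funneled through a single distinct aggregation (the time).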

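5. Other optimizations

• Partition pruning:

Partition predicates in the WHERE clause take effect early, so there is no need to wrap them in a subquery; write the JOIN and GROUP BY directly

• Cartesian product:

When a join has no ON condition, or an invalid one, Hive can use only one reducer to compute the Cartesian product

• Union all:

Doing the union all first and the join or group by afterwards effectively reduces the number of MR jobs; even with multiple SELECT branches, a single MR job is enough

• Multi-insert & multi-group by:

Produce several result sets along different dimensions from one base table in a single pass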

FROM from_statement

INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1

INSERT OVERWRITE TABLE table2 [PARTITION(partcol2=val2 )] select_statement2 group by key2

• Automatic merge：

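When output files are smaller than a threshold, Hive launches an additional MR job to merge them

hive.merge.mapfiles = true: whether to merge the output files of map-only jobs (default: true)

hive.merge.mapredfiles = false: whether to merge the output files of map-reduce jobs (default: false)

hive.merge.size.per.task = 256*1000*1000: target size of the merged files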

• Multi-Count Distinct：

Count several distinct columns of one table in a single query (requires: set hive.groupby.skewindata=true;)

select dt, count(distinct uniq_id), count(distinct ip)

from ods_log where dt=20170301 group by dt

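• Parallel execution:

Independent stages of a query can run concurrently; enable this with: set hive.exec.parallel=true;

VII. Hive Examples

1. Load data from the local file system and run simple statistics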

load data [local] inpath "" overwrite into table a1;

2. Joining two tables

select a.*, b.* from w_a a join w_b b on a.usrid=b.usrid;

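3. UDF

• A UDF can be used directly in a select statement to format the query results before they are output.

• Points to note when writing a UDF:

– A custom UDF must extend org.apache.hadoop.hive.ql.exec.UDF.

– It must implement an evaluate method.

– The evaluate method can be overloaded.

• The exported jar must be registered with add jar (and the function declared with create temporary function) before it can be used

4. Importing data with the INSERT command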

insert into table test1 partition(c) select * from test2;

5. Creating a table directly from a query

create table test2 as select * from test1;

6. Exporting to files

insert overwrite [local] directory '/home/badou/hive_test/1.txt'

select usrid,sex from w_a;

7. Using partitions

#1. Create the table

create TABLE p_t

( usrid string,

age string

)

partitioned by (dt string)

row format delimited fields terminated by '\t'

lines terminated by '\n';

#2. Load data

load data [local] inpath "" overwrite into table p_t partition(dt='20170302');

#3. Query the data

select * from p_t where dt='20170302';
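The three steps above can be mirrored with a small in-memory model in which each partition value has its own storage, much like one directory per dt value. It shows that a query filtering on dt touches only the matching partition. The class and method names are invented for illustration and do not correspond to any Hive API.

```python
class PartitionedTable:
    """Rows are stored per partition value, like one directory per dt=... value."""

    def __init__(self):
        self.partitions = {}

    def load(self, dt, rows, overwrite=True):
        """Load rows into one partition, replacing or appending to its contents."""
        if overwrite:
            self.partitions[dt] = list(rows)
        else:
            self.partitions.setdefault(dt, []).extend(rows)

    def select(self, dt=None):
        """Return (rows, partitions scanned); a dt filter prunes the scan."""
        scanned = [d for d in self.partitions if dt is None or d == dt]
        rows = [row + (d,) for d in scanned for row in self.partitions[d]]
        return rows, len(scanned)


p_t = PartitionedTable()
p_t.load("20170302", [("u1", "20")])
p_t.load("20170303", [("u2", "30")])
print(p_t.select(dt="20170302"))  # ([('u1', '20', '20170302')], 1)
print(p_t.select())
```

As in Hive, the partition column is not stored inside the data rows; it is derived from where the rows live and appended to the result.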

That's all.
