Hive Basics and Parameter Tuning

Latest recommended article published on 2023-04-27 18:40:19

_RyomaXu

Article link: https://blog.csdn.net/JAVA_Ryoma/article/details/81843697

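Hive Basics

Data Types

Primitive Data Types

Data type	Size or description
tinyint	1 byte
smallint	2 bytes
int	4 bytes
bigint	8 bytes
boolean	true or false
float	Single-precision floating point
double	Double-precision floating point
string	Character sequence, written in single or double quotes
timestamp	Integer, floating-point number, or string
binary	Byte array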

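Collection Data Types

Data type	Description
struct	A record with named fields; e.g. for a column declared as struct<first:string, last:string>, the first element is referenced with .first
map	Like a Java map; values are referenced by key. For map('first','xx','last','zz'), the first value is referenced as map['first']
array	Like a Java array. For array('xx','zz'), xx is referenced as array[0]

Advantage: higher data throughput, because fewer disk head seeks are needed
Disadvantage: breaks normal form and introduces data redundancy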
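
As a rough illustration of how the collection types above are declared and accessed, here is a minimal sketch. The table `employees` and its column names are assumptions made up for this example; they do not come from the original notes.

```sql
-- Hypothetical table; all names are illustrative only.
CREATE TABLE employees (
  name         STRUCT<first:STRING, last:STRING>,
  deductions   MAP<STRING, FLOAT>,
  subordinates ARRAY<STRING>
);

-- Struct fields use dot notation, map values are looked up by key,
-- and array elements use a zero-based index.
SELECT name.first,
       deductions['first'],
       subordinates[0]
FROM employees;
```

Keeping the name parts, deductions, and subordinates in one row is what avoids the extra lookups, at the cost of the redundancy noted above.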

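Data Encoding of Text Files

Hive's default record and field delimiters

Delimiter	Description
\n	Line feed; the default separator between rows, and the only one supported
^A	Separates fields (columns); written as octal \001 in create table
^B	Written as octal \002 in create table; separates the elements of an array or struct, and the key-value pairs of a map
^C	Written as octal \003 in create table; separates a map key from its value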
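
The defaults above can also be spelled out explicitly in a table definition, which makes it easier to see which delimiter applies at which level. This is a sketch only; the table `employees_text` and its columns are hypothetical.

```sql
-- Hypothetical table that states Hive's default delimiters explicitly.
CREATE TABLE employees_text (
  id           INT,
  name         STRUCT<first:STRING, last:STRING>,
  deductions   MAP<STRING, FLOAT>,
  subordinates ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- ^A between columns
  COLLECTION ITEMS TERMINATED BY '\002'  -- ^B between array/struct elements and map entries
  MAP KEYS TERMINATED BY '\003'          -- ^C between a map key and its value
  LINES TERMINATED BY '\n'               -- one record per line
STORED AS TEXTFILE;
```

To load comma-separated data instead, the same clauses would take other characters, for example FIELDS TERMINATED BY ','.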

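Sorting

Keyword	Description
order by	Global sort: all rows pass through a single reducer
sort by	Local sort: rows are sorted within each reducer only
distribute by	Controls how map output is partitioned across the reducers
cluster by	cluster by t.id is equivalent to distribute by t.id sort by t.id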
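
The four clauses are easiest to compare side by side. The queries below are a sketch against a hypothetical table `sales(id INT, amount DOUBLE)`; the table and column names are assumptions for illustration.

```sql
-- Global sort: a single, totally ordered result.
SELECT id, amount FROM sales ORDER BY amount DESC;

-- Local sort: each reducer's output is sorted, but the
-- combined result is not globally ordered.
SELECT id, amount FROM sales SORT BY amount DESC;

-- All rows with the same id go to the same reducer, where they
-- are sorted by id and then by amount.
SELECT id, amount FROM sales DISTRIBUTE BY id SORT BY id, amount DESC;

-- Shorthand for DISTRIBUTE BY id SORT BY id (ascending only).
SELECT id, amount FROM sales CLUSTER BY id;
```

With only one reducer, sort by and order by give the same result; the difference shows up once several reducers are in use.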

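Hive Parameter Tuning

1. Compression

To check whether compression is enabled, run the command below; if it returns a list of compression codecs, compression is enabled.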
set io.compression.codecs;
Intermediate compression (intermediate data is the output of the previous MapReduce job)
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
Compress the final output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

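File format	Size	Space saved	Notes
TEXT	585	None	If the data is delimited by configurable separators, use textfile
RCFILE	505	14%	To run data analysis while storing the data efficiently, use rcfile
Parquet	221	62%	If the data is stored in small files (smaller than the block size), use sequencefile
ORCFile	131	78%	If you need transaction support and want to reduce storage space and improve performance, use orcfile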
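
The file format is chosen per table when it is created. As a minimal sketch, assuming a hypothetical text-format source table `logs_text(id BIGINT, msg STRING)`, converting it to ORC with Snappy compression could look like this:

```sql
-- Hypothetical target table stored as ORC; names are illustrative only.
CREATE TABLE logs_orc (
  id  BIGINT,
  msg STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Rewrite the existing text data into the ORC table.
INSERT OVERWRITE TABLE logs_orc
SELECT id, msg FROM logs_text;
```

The sizes in the table above are the author's measurements; actual savings depend on the data and on the codec chosen.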

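2. Joins

-- (1) Auto join: when a large table is joined with a small table, the small table is automatically cached locally and joined with the large table in the map phase; this also avoids skewed joins in Hive queries
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.auto.convert.join.use.nonstaged=true;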
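-- (2) Skew join: when two large tables are joined, the rows are first sorted on the join key, and the mappers then send all rows with a given key to the same reducer.
-- Whether to create a separate execution plan for the skewed keys of a join
set hive.optimize.skewjoin=true;
-- A key with more rows than this threshold is treated as skewed
set hive.skewjoin.key=100000;
-- Number of map tasks for the follow-up map join job; use together with the next setting for fine-grained control
set hive.skewjoin.mapjoin.map.tasks=10000;
-- Minimum split size, used to control granularity
set hive.skewjoin.mapjoin.min.split=33554432;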
-- (3) Bucket join
-- Whether to attempt a bucket map join
set hive.optimize.bucketmapjoin=true;
-- Whether to attempt a sort-merge join within the map join
set hive.optimize.bucketmapjoin.sortedmerge=true;
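
To see whether these settings take effect, one option is to inspect the query plan. The sketch below assumes hypothetical tables `orders` (large) and `customers` (small); all table and column names, and the bucket count of 32, are made up for this example.

```sql
-- With auto conversion enabled, the plan should show a
-- "Map Join Operator" when the small table fits under the size limit.
EXPLAIN
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- For a bucket map join, both tables are bucketed (and here also sorted)
-- on the join key.
CREATE TABLE orders_bucketed (order_id BIGINT, customer_id BIGINT)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE customers_bucketed (customer_id BIGINT, name STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

SELECT o.order_id, c.name
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;
```

Whether Hive actually picks a map join or bucket map join depends on the Hive version and on table sizes, so checking the EXPLAIN output is safer than assuming it.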

3. Optimizing LIMIT

-- By default, a LIMIT query still executes the whole query and only then returns the limited number of rows
set hive.limit.optimize.enable=true;
set hive.limit.row.max.size=100000;
set hive.limit.optimize.limit.file=10;
set hive.limit.optimize.fetch.max=50000;