[hadoop hive] Hive summary

一饼团队

Article link: https://blog.csdn.net/seven_zhao/article/details/46520229

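1. Table operations

Create a table (make sure nothing, such as spaces or line breaks, precedes the statement, otherwise you may hit assorted errors)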
create table if not exists employees(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/data/';

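Show the create-table statement (in testing, Hive 0.9 does not support the statement below, while Hive 0.14 does)
show create table employees;

Show the table structure in formatted form
desc formatted employees;
For a brief listing of the columns only, you can also use desc employees;

Drop a table: drop table employees;
List tables: show tables;
List databases: show databases;
Switch to the default database: use default;

Table contents (the fields must be separated by tabs to match the row format above):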
wang 123 a1,a2,a3 k1:1,k2:2,k3:3 s1,s2,s3,4
li 235 a4,a5,a6 k4:4,k5:5,k6:6 s4,s5,s6,9
zhao 878 b1,b2,b3 q1:1,q2:2,q3:3 f1,f2,f3,4

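Load data (loading from the local file system requires the local keyword; without it the path is read from HDFS, so the data must first be uploaded to HDFS)
load data local inpath '/usr/local/opt/data/mydata' overwrite into table employees;

Query data (reading elements of an array, a map, and a struct)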
select name,subordinates[0],deductions["k1"],address.city from employees;
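
To make the delimiter rules concrete, here is a minimal Python sketch that parses one sample row by hand and picks out the same four values as the query above. It only illustrates how '\t', ',' and ':' split a line; it is not Hive's actual SerDe. The function name parse_row is hypothetical, and the sketch assumes the sample row is tab-separated.

```python
def parse_row(line):
    """Split one text line following the employees row format."""
    # fields terminated by '\t'
    name, salary, subs, deds, addr = line.rstrip("\n").split("\t")
    # collection items terminated by ','
    subordinates = subs.split(",")
    deductions = {}
    for item in deds.split(","):
        # map keys terminated by ':'
        key, value = item.split(":")
        deductions[key] = float(value)
    # struct members are also separated by the collection delimiter ','
    street, city, state, zip_code = addr.split(",")
    address = {"street": street, "city": city, "state": state, "zip": int(zip_code)}
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subordinates,
        "deductions": deductions,
        "address": address,
    }


row = parse_row("wang\t123\ta1,a2,a3\tk1:1,k2:2,k3:3\ts1,s2,s3,4")
# Mirrors: select name,subordinates[0],deductions["k1"],address.city
result = (row["name"], row["subordinates"][0], row["deductions"]["k1"], row["address"]["city"])
print(result)  # ('wang', 'a1', 1.0, 's2')
```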

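Create a table from another table (like copies only the schema; as select also copies the selected data)
create table test1 like employees;
create table test2 as select name,address from employees;

How to read the files of each Hive storage format
stored as textfile
① view the file directly in HDFS
② hadoop fs -text
stored as sequencefile
① hadoop fs -text
stored as rcfile
① hive --service rcfilecat path
stored as inputformat 'class' outputformat 'class'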

2. Loading custom jars in Hive

Method 1: copy the jar into Hive's lib directory, then restart the client;
Method 2: in the Hive client command line, run: add jar PATH;

3. Partitioned tables (remove every comment from the create table statement, otherwise it fails)

The partitioned by clause declares the partition columns.
create table if not exists employees_c(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,date string)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/data/';
Add a partition to an existing partitioned table
alter table employees_c add if not exists partition(country="cn",date="20150502");
Drop a partition from a table that already has it
alter table employees_c drop if exists partition(country="cn",date="20150502");
Show the partitions
show partitions employees_c;

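4. Bucketed tables

The table below is clustered by id, sorted by name, and stored in 4 buckets.
create table bucketed_user(
id int,
name string
)
clustered by (id) sorted by (name) into 4 buckets
stored as textfile;

For bucketing to take effect, run the following command first, or set it in the configuration file:
set hive.enforce.bucketing=true;

Import data (load data alone does not arrange the data into buckets, so insert from a query instead)
insert overwrite table bucketed_user select sname,saddr from test1;

Simple query
select * from bucketed_user where id = 1;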

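5. Commands listed by hive -help

hive -e "select * from user_table;"
This can be used in shell scripts to fetch Hive query results without entering the Hive CLI; hive -f sqlFilePath achieves
the same effect
hive -v -f a.txt > ./res.txt
With -v, the SQL statements are written to res.txt along with the results

The following commands can be used inside the Hive CLI:
List the jars and files added to the Hadoop cluster's distributed cache: list jar; list file;
Run the HQL statements of a file from within the CLI: source /usr/local/opt/sql/hive_sql;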

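6. Hive variables

Set a variable: set val='';
Read a variable: ${hiveconf:val}
Environment variables: ${env:HOME}, where env is the namespace that holds all environment variables
Usage in the Hive CLI:
set val="test";
select * from employees where name=${hiveconf:val};
select '${env:HOME}' from employees;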

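7. Loading data

Loading data into managed (internal) tables
① Load at table creation
create table newTable as select col1,col2 from oldTable
② Specify the data location at table creation
create table tableName(...) location ''
③ Load local data
load data local inpath 'localPath' [overwrite] into table tableName
Note: overwrite replaces the table contents, i.e. the existing data is deleted before the new data is loaded
④ Load HDFS data
load data inpath 'hdfsPath' [overwrite] into table tableName
Note: loading from HDFS moves the data rather than copying it
⑤ Copy data to the table location with hadoop commands (from either the Hive shell or the Linux shell)
From the Linux shell, run hadoop fs -copyFromLocal localPath hiveTablePath
A query in Hive then shows that the data has been loaded.

Hadoop commands can also be run inside the Hive CLI, for example dfs -ls /; so the command above can also be written as
hive > dfs -copyFromLocal localPath hiveTablePath
⑥ Load data from a query
insert [into|overwrite] table tableName
select col1,col2
from table
where ...
Alternative form:
from table
insert [into|overwrite] table tableName
select col1,col2
where ...

Note: unlike some relational databases, columns are matched by position, not by name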

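Loading data into external tables
① Specify the data location at table creation
create external table tableName(...) location '..'
② Loading from a query and loading with shell commands work the same as for managed tables

Loading data into partitioned tables:
Managed partitioned tables are loaded like managed tables
External partitioned tables are loaded like external tables
Note: the directory hierarchy of the data must match the table's partitions
If the partition has not been added to the table, the data cannot be queried even when it already exists under the target path
Difference: when naming the target table, you must also name the partition
by adding a partition clause to the load statement
eg: load data local inpath 'linuxPath' overwrite into table tableName partition (pn='')
insert into table tableName partition (pn='') select col1,col2 from tableName2

Things to watch when loading data into Hive
① Delimiters: by default a delimiter can only be a single character
② Data type mapping
With load, a field that cannot be converted to the column type is returned as NULL by queries
With insert ... select, a field that cannot be converted to the column type is inserted as NULL
③ With insert ... select, the order of the selected values must match the column order of the table; the names may differ
Hive does not validate data at load time, only at query time
④ An external partitioned table shows its data only after the partitions are added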

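8. Running hadoop commands and Linux shell commands from the Hive shell

hive> dfs -copyFromLocal /usr/local/opt/data1 /data;
dfs -ls /;
Other hadoop fs commands work the same way
Linux shell commands are run by prefixing them with !
eg: !ls /usr/local/;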

9. Exporting data from Hive

① hadoop commands
get eg: hadoop fs -get hdfsPath linuxPath (hadoop fs -get /data/* /usr/local/opt/my/data/)
text eg: hadoop fs -text hdfsPath > file
② insert ... directory
insert overwrite [local] directory '/linuxPath' [row format delimited fields terminated by '\t']
select name,salary,address from employees;
Without local, the row format ... clause is not supported
③ Shell commands with pipes: hive -f/e | sed/grep/awk > file
④ Third-party tools (sqoop)

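10. Dynamic partitioning in Hive

Parameters:
① set hive.exec.dynamic.partition=true; (enables dynamic partitioning)
② set hive.exec.dynamic.partition.mode=nonstrict; (the value is nonstrict or strict: nonstrict imposes no restriction,
while strict requires at least one static partition, placed first)
③ set hive.exec.max.dynamic.partitions.pernode=10000; (maximum number of dynamic partitions each
node may create)
④ set hive.exec.max.dynamic.partitions=100000; (maximum number of dynamic partitions in total)
⑤ set hive.exec.max.created.files=1500000; (maximum number of files one job may create)
⑥ dfs.datanode.max.xcievers=8192 (limits how many files can be open at once; this is an HDFS DataNode property configured in hdfs-site.xml, not a Hive setting)

It is best for one table not to produce more than 1000 partitions per day, to avoid problems with the MySQL database behind the metastore
hql: insert overwrite table dy_partition_table partition(splitName)
select name,addr as splitName from oldTable;
(splitName is the partition column)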

11. Table property operations

Rename a table
alter table tableName rename to newTableName;
Rename a column
alter table tableName change column c1 c2 int comment '..' after severity;
By default the column keeps its position; after moves it behind the named column, and first moves it to the front
eg: alter table employee change column type type string after address;
alter table employee change column type type string first;
Add columns (new columns are appended at the end)
alter table tableName add columns(c1 string comment '..',c2 bigint comment 'xx');
Modify tblproperties
alter table tableName set tblproperties(property_name=property_value,property_name=property_value,...);
Changing serde properties differs between unpartitioned and partitioned tables
Unpartitioned table (change the field delimiter)
alter table tableName set serdeproperties('field.delim'='\t');
Note: on a partitioned table, this statement does not apply the new property to partitions that already exist
Partitioned table (change the field delimiter)
alter table test1 partition(dt='xx') set serdeproperties('field.delim'='\t');
Change the location
alter table tableName [partition()] set location 'path'
Convert a managed table to an external table
alter table tableName set tblproperties('EXTERNAL'='TRUE');
Convert an external table to a managed table
alter table tableName set tblproperties('EXTERNAL'='FALSE');
See LanguageManual DDL on the Hive wiki for the full list of alter table operations
Dynamic partitions:
set hive.exec.dynamic.partition=true; (enables dynamic partitioning)
With set hive.exec.dynamic.partition.mode=nonstrict;
an insert into dynamic partitions does not need any static partition

eg:insert overwrite table test_part partition(dt,value)
select 'abc' as name,createDate as dt, addr as value from testext;
With set hive.exec.dynamic.partition.mode=strict;
an insert into dynamic partitions needs at least the first partition to be static
eg:insert overwrite table test_part partition(dt='20150505',value)
select 'abc' as name, addr as value from testext;

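12. Advanced queries in Hive

Aggregation
1) count
count(*) count(1) count(col)
count(*) and count(1) count every row, including rows in which all values are NULL
count(col) counts only the rows in which col is not NULL
2) sum
sum(value convertible to a number) returns bigint for integer input
sum(col)+cast(1 as bigint) (the total plus one)
3) avg
avg(value convertible to a number) returns double
4) distinct
count(distinct col)
Order by
select col1,col2 from test where condition order by col1,col2 [asc|desc]
Note: order by can sort on several columns; strings are compared in lexicographic order
order by is a global sort
order by needs a reduce phase and always uses a single reducer, regardless of configuration
group by
Groups rows by the values of the given columns, putting equal values together
select col1[,col2],count(1),sel_expr(aggregation) from table where condition
group by col1[,col2] [having]
Note: every non-aggregated column in the select list must appear in group by
everything other than those plain columns must be an aggregation
group by can also take an expression, such as substr(col)
Characteristics: uses a reduce phase and is bounded by the number of reducers, set with: set mapred.reduce.tasks=5;
The number of output files equals the number of reducers, and each file's size depends on the data volume its reducer handled
Problems: heavy network load
data skew, tuned with: set hive.groupby.skewindata=true;
join
Joins two tables m and n on the on condition: one record of m and one record of n are combined into a new record
join: equi-join; a key must exist in both m and n
left outer join: every row of the left table is output whether or not it has a match in the right table; right-table
values are output only for matching rows and are NULL otherwise
right outer join: the mirror image of left outer join
left semi join: similar to exists
mapjoin: completes the join on the map side without a reduce phase; it joins in memory and is an optimization
Explanation: the map side loads the small table into memory, then reads the big table and joins it against the in-memory small table
It relies on the distributed cache
Advantages: consumes no reduce resources of the cluster (reduce resources are relatively scarce)
removes the reduce phase, so the job runs faster
lowers the network load

Disadvantages: uses some memory, so the table loaded into memory must not be large, because every compute node loads its own copy
produces a larger number of small files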
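
The map-side join described above boils down to a hash join: build an in-memory lookup from the small table, then stream the big table past it. The Python sketch below illustrates only that idea and is not Hive's implementation; the function name map_join and the dict-per-row representation are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Equi-join two lists of dict rows, holding only the small table in memory."""
    # Build phase: what each mapper does once with the small table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the big table and emit matches, with no reduce step.
    joined = []
    for row in big_rows:
        for match in lookup.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined


m = [{"col": 1, "col2": "x"}, {"col": 2, "col2": "y"}, {"col": 3, "col2": "z"}]
n = [{"col": 1, "col3": "a"}, {"col": 3, "col3": "c"}]
joined = map_join(m, n, "col")
print(joined)
```

Because the lookup is rebuilt by every mapper, the memory cost grows with the small table's size times the number of mappers, which is why the small table has to stay small.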
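With the following parameter set, Hive chooses between common join and map join automatically
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25M
The second way is to request it manually with a hint:
select /*+mapjoin(n)*/ m.col,m.col2,n.col3 from m join n on m.col = n.col;
Neither the /* nor the + may be omitted
When to use mapjoin
1) one of the joined tables is very small
2) non-equi joins
If a join suffers from data skew, use the tuning parameter: set hive.optimize.skewjoin=true;
Bucketing
Partitioning is usually enough
Within each table or partition, Hive can further organize the data into buckets; buckets are a finer-grained division of the data
Hive buckets on a single column
Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record goes into
Benefits:
more efficient query processing
more efficient sampling
Using buckets
select * from bucketed_user tablesample(bucket 1 out of 2 on id)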
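
The hash-then-modulo rule can be sketched in a few lines of Python. Two things here are assumptions rather than statements from this article: that the hash of an int column is the integer itself, and that tablesample(bucket 1 out of 2 on id) keeps the rows whose hash modulo 2 is 0. The helper name bucket_for is hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_for(id_value, num_buckets):
    """Bucket index for an int key, assuming the hash of an int is the int itself."""
    return (id_value & INT_MAX) % num_buckets


rows = [(i, "user%d" % i) for i in range(8)]

# Layout of a table declared with: clustered by (id) into 4 buckets
buckets = {}
for id_value, name in rows:
    buckets.setdefault(bucket_for(id_value, 4), []).append(id_value)
print(buckets)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}

# Rough analogue of: tablesample(bucket 1 out of 2 on id)
sample = [id_value for id_value, _ in rows if bucket_for(id_value, 2) == 0]
print(sample)  # [0, 2, 4, 6]
```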
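bucket join
Set the following values before using bucket join
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
Two tables bucketed on the same columns (which include the join columns) can be joined efficiently with a map-side join.
When both tables of a join are bucketed on the shared join column, only the buckets that hold the same column values
need to be joined with each other, which greatly reduces the amount of data joined.
In a map-side join, both tables are bucketed in the same way. The mapper processing a bucket of the left table knows that the matching rows of the right table are in
the corresponding bucket, so it only needs to fetch that bucket (a small part of the right table's data) to perform the join. This optimization
does not require the two tables to have the same number of buckets; bucket counts that are multiples of each other also work.
distribute: spreading data
distribute by col (sends rows to different reducers according to col)
sort: sorting
sort by col (sorts rows by col)
select col1,col2 from m_table distribute by col1 sort by col1 asc,col2 desc;

distribute and sort are usually combined, which guarantees that the output of each reducer is sorted
Use cases:
map output files of uneven size
reduce output files of uneven size
too many small files
oversized files
Comparison:
distribute by vs group by
both partition the data by key
both use a reduce phase
the only difference: distribute by merely spreads the data, whereas group by gathers rows with the same key together
and must be followed by an aggregation
order by vs sort by
order by is a global sort
sort by only guarantees that each reducer's output is sorted; with a single reducer it behaves like order by
cluster by
gathers rows with the same value together and sorts them
cluster by col is equivalent to distribute by col sort by col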
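
The difference between a per-reducer sort and a global sort is easy to see in a small simulation. The Python sketch below is only an illustration: the function name distribute_and_sort is hypothetical, and taking the integer key modulo the reducer count stands in for the real hash partitioner.

```python
def distribute_and_sort(values, num_reducers):
    """Mimic distribute by col sort by col for integer keys."""
    reducers = [[] for _ in range(num_reducers)]
    for value in values:
        # distribute by: equal keys always land on the same reducer.
        reducers[value % num_reducers].append(value)
    # sort by: each reducer sorts only its own rows.
    return [sorted(part) for part in reducers]


values = [5, 2, 8, 1, 4, 7]
per_reducer = distribute_and_sort(values, 2)
concatenated = [v for part in per_reducer for v in part]
globally_sorted = sorted(values)  # what order by yields with its single reducer

print(per_reducer)                      # [[2, 4, 8], [1, 5, 7]]
print(concatenated == globally_sorted)  # False
```

With num_reducers set to 1 the two results coincide, which matches the remark that sort by behaves like order by when there is only one reducer.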
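union all
Merges the rows of several tables into one result; Hive versions before 1.2.0 support only union all, not plain union
select col from (select a as col from t1 union all select b as col from t2) tmp
Requirements:
the column names are the same
the column types are the same
the number of columns is the same
the subqueries must not have aliases
to query the merged result, the merged subquery as a whole needs an alias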

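13. Functions

1) List the functions available in the current session
show functions;
2) Show the description of a function
desc function concat;
3) Show the extended description of a function
desc function extended concat;
demo:
select cast(1.5 as int) from employee; (cast performs type conversion)
For the other built-in functions, see the Hive function manual

Of the two functions below, percent_rank always starts from 0 in each partition, while cume_dist is always greater than 0
cume_dist() over(partition by id order by money)
computed as (largest row number among rows with the same value)/(number of rows)
percent_rank() over(partition by id order by money)
computed as (smallest row number among rows with the same value - 1)/(number of rows - 1)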
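
The two formulas can be checked on a tiny partition. The Python sketch below applies them to one partition's money values, already restricted to a single id; the function name and the sample values are hypothetical, and the code is an illustration of the formulas rather than Hive's windowing code.

```python
def cume_dist_and_percent_rank(values):
    """Return (value, cume_dist, percent_rank) for one partition, ordered ascending."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for value in ordered:
        first = ordered.index(value) + 1        # smallest row number with this value
        last = n - ordered[::-1].index(value)   # largest row number with this value
        cume_dist = last / n
        percent_rank = (first - 1) / (n - 1) if n > 1 else 0.0
        result.append((value, cume_dist, percent_rank))
    return result


stats = cume_dist_and_percent_rank([10, 20, 20, 40])
for row in stats:
    print(row)
```

For the first row this gives a cume_dist of 0.25 and a percent_rank of 0.0, and for the last row both values are 1.0; the two tied rows share the same pair of values.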
Miscellaneous functions
These can call Java classes and methods
java_method(class,method[,arg1[,arg2..]])
reflect(class,method[,arg1[,arg2..]])
(java_method and reflect are the same)

eg:select java_method("java.lang.Math","sqrt",cast(id as double)) from employee;
UDTF
Table-generating functions
lateralView: lateral view udtf(expression) tableAlias as columnAlias (',' columnAlias)*
fromClause: from baseTable (lateralView)*
Example: the explode function splits one row into several rows
eg: select id ,adid from winfunc lateral view explode(split(type,'B')) tt as adid
Regular expressions
The two examples below contrast greedy and non-greedy matching; the index refers to the parenthesized groups, and 0 means the whole match
eg:select regexp_extract('979|7.10.80|8684','.*\\|(.*)',1) from employee limit 1;
Result: 8684
select regexp_extract('979|7.10.80|8684','(.*?)\\|(.*)',1) from employee limit 1;
Result: 979
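
The same greedy versus non-greedy behaviour can be reproduced outside Hive. The sketch below uses Python's re module as a stand-in for regexp_extract, on the assumption that the two regex engines treat these simple patterns alike; note that the pattern needs only a single backslash here because it is not nested inside an HQL string literal.

```python
import re

text = "979|7.10.80|8684"

# Greedy: the leading .* swallows as much as it can, so the \| that anchors
# group 1 is the LAST pipe in the string.
greedy = re.search(r".*\|(.*)", text).group(1)

# Non-greedy: (.*?) stops at the FIRST pipe it can.
lazy_match = re.search(r"(.*?)\|(.*)", text)
lazy = lazy_match.group(1)

print(greedy)               # 8684
print(lazy)                 # 979
print(lazy_match.group(2))  # 7.10.80|8684
print(lazy_match.group(0))  # 979|7.10.80|8684 (index 0 is the whole match)
```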

14. User-defined functions

UDF: user-defined function
Operates on a single record
Steps to create a function
1) write a Java class
2) extend the UDF class
3) implement the evaluate method
4) package it as a jar
5) in Hive, add the jar: add jar /usr/local/opt/jar.jar
6) in Hive, create a temporary function: create temporary function bigthan as 'com.johnson.hive.udf.UdfTest';
7) use it in HQL: select name1, bigthan(name1,500) from employee;
The test code is as follows:
package com.johnson.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UdfTest extends UDF {
    /**
     * Custom evaluate method. The method name is fixed; the parameters and
     * return type change with the project's requirements.
     * @return true if t1 > t2 numerically, otherwise false
     */
    public boolean evaluate(Text t1, Text t2) {
        // A NULL argument never compares as greater.
        if (t1 == null || t2 == null) {
            return false;
        }
        try {
            double d1 = Double.parseDouble(t1.toString());
            double d2 = Double.parseDouble(t2.toString());
            return d1 > d2;
        } catch (NumberFormatException e) {
            // Non-numeric input is treated as "not greater".
            return false;
        }
    }
}

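UDAF: user-defined aggregation function
Operates on a set of records
General development steps:
1) first write the resolver class; the resolver handles type checking and operator overloading
2) then write the evaluator class; the evaluator implements the actual UDAF logic
Typically the top-level UDAF class implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2
and contains a nested evaluator class that implements the UDAF logic
I. Implementing the resolver
The resolver usually implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, but
extending AbstractGenericUDAFResolver is recommended, as it insulates the code from future changes to the Hive interfaces. The difference between
GenericUDAFResolver and GenericUDAFResolver2 is that the latter lets the evaluator implementation access more information, such as
the distinct qualifier and the wildcard form function(*)
II. Implementing the evaluator
Every evaluator must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.
The subclass must implement its abstract methods to provide the UDAF logic.
Mode: this enum is important. It identifies the MapReduce stage the UDAF is running in; once you understand Mode, you understand the UDAF's
execution flow.
Source code:
public static enum Mode{
PARTIAL1,
PARTIAL2,
FINAL,
COMPLETE
}
PARTIAL1: the map stage of MapReduce. Goes from raw data to partial aggregation; calls iterate() and
terminatePartial()
PARTIAL2: the combiner stage on the map side of MapReduce, which merges the map output on the map side. Goes from
partial aggregation to partial aggregation; calls merge() and terminatePartial()
FINAL: the reduce stage of MapReduce. Goes from partial aggregation to full aggregation; calls merge() and terminate()
COMPLETE: this stage means the MapReduce job has only a map and no reduce, so the map side produces the
result directly. Goes from raw data straight to full aggregation; calls iterate() and terminate()
See the implementations of the sum/count aggregation functions in the source code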
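
To make the four modes less abstract, the following Python sketch walks an average through PARTIAL1 and FINAL. It is a toy model, not Hive code: the class AvgEvaluator, its snake_case method names, and run_avg are hypothetical stand-ins for the iterate(), terminatePartial(), merge() and terminate() calls described above.

```python
class AvgEvaluator:
    """Toy evaluator whose aggregation buffer holds a running (sum, count)."""

    def new_buffer(self):
        return {"sum": 0.0, "count": 0}

    def iterate(self, buf, value):
        # Consumes one raw row (PARTIAL1 and COMPLETE).
        if value is not None:
            buf["sum"] += value
            buf["count"] += 1

    def terminate_partial(self, buf):
        # Emits a partial aggregate (PARTIAL1 and PARTIAL2).
        return (buf["sum"], buf["count"])

    def merge(self, buf, partial):
        # Folds a partial aggregate into the buffer (PARTIAL2 and FINAL).
        buf["sum"] += partial[0]
        buf["count"] += partial[1]

    def terminate(self, buf):
        # Produces the final value (FINAL and COMPLETE).
        return buf["sum"] / buf["count"] if buf["count"] else None


def run_avg(splits):
    evaluator = AvgEvaluator()
    partials = []
    for split in splits:  # PARTIAL1: one mapper per input split
        buf = evaluator.new_buffer()
        for value in split:
            evaluator.iterate(buf, value)
        partials.append(evaluator.terminate_partial(buf))
    final_buf = evaluator.new_buffer()  # FINAL: a single reducer
    for partial in partials:
        evaluator.merge(final_buf, partial)
    return evaluator.terminate(final_buf)


average = run_avg([[1, 2, 3], [4, None], [5]])
print(average)  # 3.0
```

Passing (sum, count) pairs between stages, rather than per-split averages, is what keeps the merged result exact.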
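Permanent functions:
A jar added with add jar in the Hive shell is only valid in the current shell; once the shell is closed and reopened, the temporary UDF is gone.
The methods below make a function permanently available
1) To define a function in Hive that is permanently available, modify the source code to add the function class, then edit
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java and add the corresponding registration
code
registerUDF("parse_url",UDFParseUrl.class,false);
This method is generally used when the cluster is first being set up, since it requires modifying the Hive source code and recompiling and repackaging it
2) Create a hiverc file (this method is more common)
Put the jar in the installation directory or another chosen directory
$HOME/.hiverc (create the .hiverc file in the current user's $HOME directory, with vim .hiverc or touch .hiverc)
Put the initialization statements in the file
Demo of initialization statements in the file (a comment line, then adding the jar, then registering the alias):
-- add self functions
add jar /usr/local/opt/extenal_jar/jar.jar;
create temporary function bigthan as 'com.johnson.hive.udf.UdfTest';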

15. Hive SQL optimization

join optimization
set hive.optimize.skewjoin=true; set this to true if the join is skewed
set hive.skewjoin.key=100000; the number of records per join key above which the optimization is applied
mapjoin
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25M
select /*+mapjoin(A)*/ f.a,f.b from A t join B f on (f.a = t.a)
In short, mapjoin suits these cases
1) one of the joined tables is very small
2) non-equi joins
bucket join
Conditions:
both tables are bucketed in the same way
the bucket counts of the two tables are multiples of each other
create table order(cid int,price float) clustered by (cid) into 32 buckets;
create table customer(id int, first string) clustered by (id) into 32 buckets;
select price from order t join customer s on t.cid = s.id

join optimization example
Before: select m.cid,u.id from order m join customer u on m.cid = u.id where m.dt='2013-01-01';
After: select m.cid,u.id from (select cid from order where dt='2013-01-01')m join customer u on m.cid = u.id;
Reason: Hive performs the join first and the where afterwards, which differs from the execution order of SQL in relational databases, so
the query has to be written this way; the effect is most visible on partitioned tables
group by optimization:
set hive.groupby.skewindata=true; set this to true if the group by is skewed
set hive.groupby.mapaggr.checkinterval=100000; once the number of records for a group key exceeds this value,
the optimization is applied
count distinct
Before: select count(distinct id ) from tableName;
After: select count(1) from (select distinct id from tableName) tmp;
select count(1) from (select id from tableName group by id) tmp;
Hive SQL optimization
Before:
select a,sum(b),count(distinct c),count(distinct d) from test group by a;
After:
select a,sum(b) as b,count(c) as c,count(d) as d
from (
select a,0 as b,c,null as d from test group by a,c
union all select a,0 as b,null as c,d from test group by a,d
union all select a,b,null as c,null as d from test) tmp1
group by a;
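
A quick way to convince yourself that the rewritten query returns the same numbers is to run both forms on a handful of rows. The sketch below does this with Python's built-in sqlite3 module as a stand-in for Hive. It checks only that the two queries agree on results and says nothing about Hive's execution plan or performance; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test (a text, b integer, c text, d text)")
conn.executemany(
    "insert into test values (?,?,?,?)",
    [
        ("x", 1, "c1", "d1"),
        ("x", 2, "c1", "d2"),
        ("x", 3, "c2", "d2"),
        ("y", 4, "c3", None),
        ("y", 5, "c3", "d3"),
    ],
)

# Original form: two count(distinct ...) in one group by.
before = conn.execute(
    "select a,sum(b),count(distinct c),count(distinct d) "
    "from test group by a order by a"
).fetchall()

# Rewritten form: deduplicate with group by first, then count the survivors.
after = conn.execute(
    """
    select a,sum(b) as b,count(c) as c,count(d) as d
    from (
      select a,0 as b,c,null as d from test group by a,c
      union all select a,0 as b,null as c,d from test group by a,d
      union all select a,b,null as c,null as d from test) tmp1
    group by a order by a
    """
).fetchall()

print(before)
print(after)
```

The 0 as b and null placeholders keep each branch from disturbing the other aggregates, since count(col) skips NULL values and adding 0 leaves the sum unchanged.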

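16. Hive optimization

Goal:
improve execution efficiency with limited resources
Common issues:
data skew
number of maps
number of reduces
others
Hive execution order: HQL -> Job -> MapReduce
Execution plan:
View the execution plan: explain [extended] hql
demo:
select col,count(1) from test2 group by col;
explain select col,count(1) from test2 group by col;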

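17. Hive table optimization

Partitioning
static partitions
dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
Data
keep identical data together as much as possible (this lowers the network load)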

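18. Hive MapReduce optimization

Job optimization
Parallel execution
Hive turns each query into several stages; stages that do not depend on each other can run in parallel, shortening execution time
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
Local execution
set hive.exec.mode.local.auto=true;
A job actually runs in local mode only when it meets all of the following conditions:
1) the job's input size must be smaller than
hive.exec.mode.local.auto.inputbytes.max (default 128M)
2) the job's number of maps must be smaller than
hive.exec.mode.local.auto.tasks.max (default 4)
3) the job's number of reduces must be 0 or 1
Merging small input files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
How many files are merged is determined by the size limit mapred.max.split.size
Merging small output files
set hive.merge.smallfiles.avgsize=256000000; when the average output file size is below this value, a new job is started to merge the files
set hive.merge.size.per.task=64000000; the size of the merged files
JVM reuse
set mapred.job.reuse.jvm.num.tasks=20;
JVM reuse lets a job hold on to its slots until the job finishes, which is valuable for jobs with many tasks and many small files
because it shortens execution time. The value must not be set too high, though: some jobs have reduce tasks, and while the reduce tasks
have not finished, the slots held by map tasks cannot be released, so other jobs may have to wait.
Compressing data
Intermediate compression:
Intermediate compression applies to the data passed between the multiple jobs of a Hive query; for it, prefer a codec that is cheap
in CPU time.
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
The final output of a Hive query can be compressed as well
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;

Map optimization
set mapred.map.tasks=10; sometimes has no effect, because the number of maps is computed as follows:
1) default number of maps
default_num = total_size/block_size;
2) desired number
goal_num = mapred.map.tasks;
3) size of the data each map processes
split_size = max(mapred.min.split.size,block_size);
split_num = total_size/split_size;
4) computed number of maps
compute_map_num = min(split_num,max(default_num,goal_num))
From the analysis above, setting the number of maps can be summarized as:
1) to increase the number of maps, set mapred.map.tasks to a larger value.
2) to decrease the number of maps, set mapred.min.split.size to a larger value.
Case 1: the input files are huge, but they are not small files
increase mapred.min.split.size
Case 2: there are a huge number of input files and all are small, i.e. each file is smaller than blockSize. Increasing
mapred.min.split.size does not help here; use CombineFileInputFormat to combine multiple input paths
into one inputSplit per mapper, which reduces the number of mappers.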
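
The two tuning tips can be tried out numerically. Note that the sketch below does not implement the simplified formula above; it swaps in the split-size rule used by Hadoop's old-API FileInputFormat as I understand it, namely split_size = max(min_size, min(goal_size, block_size)) with goal_size = total_size / requested maps. It also ignores per-file boundaries and other details of the real implementation, so treat the numbers as rough estimates. The function name estimate_map_count is hypothetical.

```python
MB = 1024 * 1024


def estimate_map_count(total_size, block_size, requested_maps, min_split_size):
    """Rough map count for a single large input under the old-API split rule."""
    goal_size = total_size // max(requested_maps, 1)
    split_size = max(min_split_size, min(goal_size, block_size))
    return -(-total_size // split_size)  # ceiling division


total = 1024 * MB
block = 128 * MB

# Requesting fewer maps than there are blocks has no effect.
few = estimate_map_count(total, block, requested_maps=2, min_split_size=1)
# Requesting more maps shrinks the goal size and raises the count.
many = estimate_map_count(total, block, requested_maps=16, min_split_size=1)
# Raising the minimum split size lowers the count.
fewer = estimate_map_count(total, block, requested_maps=2, min_split_size=256 * MB)

print(few, many, fewer)  # 8 16 4
```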
Map-side aggregation
set hive.map.aggr=true; (equivalent to a combiner)
Speculative execution
mapred.map.tasks.speculative.execution

19. Hive shuffle optimization

Map side
io.sort.mb
io.sort.spill.percent
min.num.spill.for.combine
io.sort.factor
io.sort.record.percent
Reduce side
mapred.reduce.parallel.copies
mapred.reduce.copy.backoff
io.sort.factor
mapred.job.shuffle.input.buffer.percent
mapred.job.reduce.input.buffer.percent

20. Hive reduce optimization

Queries that need a reduce phase
Aggregate functions
sum/count/distinct/...
Advanced queries
group by, join, distribute by, cluster by ..
order by is a special case: it needs only one reducer
Speculative execution
1)mapred.reduce.tasks.speculative.execution
2)hive.mapred.reduce.tasks.speculative.execution
Either of the two settings can be used
Reduce tuning
set mapred.reduce.tasks=10; (sets the number directly)
hive.exec.reducers.max default: 999
hive.exec.reducers.bytes.per.reducer amount of data each reducer processes, default: 1G
Formula
numRTasks = min[maxReducers,input.size/perReducer] (the number of reducers actually used)
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer
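
As a worked example of the formula, the Python sketch below estimates the reducer count from the input size, using the defaults quoted above (1G per reducer, at most 999 reducers). Rounding up and using at least one reducer are my assumptions, not something stated in this article, and the function name estimate_reducers is hypothetical.

```python
import math


def estimate_reducers(input_size, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """numRTasks = min(maxReducers, input.size / perReducer), rounded up, at least 1."""
    wanted = math.ceil(input_size / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


ten_gb = estimate_reducers(10 * 10**9)   # ten reducers, one per 1G of input
five_tb = estimate_reducers(5 * 10**12)  # capped by hive.exec.reducers.max
tiny = estimate_reducers(1)              # never fewer than one reducer

print(ten_gb, five_tb, tiny)  # 10 999 1
```

Setting mapred.reduce.tasks explicitly bypasses this estimate altogether.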

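21. A data warehouse that aggregates data from different sources

Regarding the content:
1) process each data source separately
2) convert the different data formats to one unified format
3) unify the fields of data from different sources
4) store the non-unified fields in collection types
5) use partitions to separate data by source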
