Hive Usage Guide: Notes

Latest recommended article published on 2021-05-06 12:09:35

Just Jump


Article link: https://blog.csdn.net/eylier/article/details/95955082


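I recently organized the HQL I use most at work and combined it with the book Programming Hive (《HIVE编程指南》) into this summary. I hope it can serve as a reference manual for anyone about to enter big data analytics and help them get up to speed with HQL quickly. For commonly used commands, functions, and features, the notes are organized under four markers: #Meaning, #Usage, #Optimization, and #Note.

The mind map outlines the HiveQL topics covered here. Reading it first and then going into each topic in detail makes HQL easier to use and understand.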

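I. Running Hive queries

#Usage
hive -f /home/test.hql
or use the source command inside the CLI
hive> source /home/test.hql;
or
hive -e "use {database}; select ** from **;exit; " --this form is well suited to being called from Python or other scripts; used often
#Note: to run shell or hdfs commands without leaving hive, prefix the command with ! and end it with a semicolon (;), e.g. hive> !pwd;
Show column names in the output: hive> set hive.cli.print.header=true;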
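Since the hive -e form is the one most often embedded in scripts, a minimal Python sketch of such a wrapper might look like the following. The helper names build_hive_command and run_hive are hypothetical, the database gao and the query are placeholders, and the sketch assumes the hive CLI is on PATH.

```python
import shlex
import subprocess


def build_hive_command(database, query, silent=True):
    """Return the argv list for running one query through `hive -e`."""
    statement = f"use {database}; {query.rstrip(';')};"
    argv = ["hive"]
    if silent:
        argv.append("-S")  # silent mode: keep log messages out of the output
    argv += ["-e", statement]
    return argv


def run_hive(database, query):
    """Run the query and return its stdout as text (needs hive on PATH)."""
    result = subprocess.run(
        build_hive_command(database, query),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the command line here; call run_hive() on a machine with Hive.
    print(shlex.join(build_hive_command("gao", "select * from test limit 10")))
```

Passing the command as an argv list avoids shell quoting problems when the query itself contains quotes.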

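II. Creating tables

Suppose the data is tab-separated (\t), with the fields name, age, class rank and city rank, per-subject scores, and home address. Example:
gao 18 5,30 math:90,english:95,language:98 province:zhejiang,city:hangzhou,county:xihu,zip:310000

1. Creating tables, external tables, and partitioned tables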

create (EXTERNAL) table if not exists test ( --(EXTERNAL) is optional: include the keyword to create an external table
name string COMMENT 'user name', --column name, then an optional COMMENT
age INT COMMENT 'user age',
rank ARRAY<int>, --the two rankings fit in one array
scores MAP<string,float> COMMENT 'keys are subjects, values are scores', --scores use a map
address STRUCT<province:string, city:string, county:string, zip:int> COMMENT 'home address') --the address can use a struct
COMMENT 'add a table description here if needed'
PARTITIONED BY (country string, state string) --table partitions
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' --separator between columns
COLLECTION ITEMS TERMINATED BY ',' --separator between the items inside one column
MAP KEYS TERMINATED BY ':' --separator between key and value; be sure to specify it for map and struct data
LINES TERMINATED BY '\n' --separator between rows
STORED AS TEXTFILE --store as textfile, the default format
LOCATION '/user/hive/warehouse/gao.db/test' --default path, optional
TBLPROPERTIES ('creator'='gao','created_at'='Mon Jul 15 13:40:54 CST 2019');

#Query
Arrays are accessed by 0-based index; maps (key-value pairs) by ["key name"]; struct fields by dot notation.
hive> select name,age,rank[0] as class_rank, rank[1] as city_rank, scores['math'] as math_score, address.province,address.city from test limit 10;
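To make the three delimiters concrete, here is a small Python sketch that splits the sample row the same way the table definition describes: tabs between columns, commas between collection items, and colons between map keys and values. The function name parse_row is hypothetical, and the address column is deliberately left as raw items, since how those items map onto the STRUCT fields depends on the SerDe.

```python
def parse_row(line):
    """Split one tab-separated row into the columns of the example table."""
    name, age, rank, scores, address = line.rstrip("\n").split("\t")
    return {
        "name": name,
        "age": int(age),
        # ARRAY<int>: items separated by ','
        "rank": [int(item) for item in rank.split(",")],
        # MAP<string,float>: items separated by ',', key and value by ':'
        "scores": {
            key: float(value)
            for key, value in (item.split(":") for item in scores.split(","))
        },
        # Raw collection items of the address column (not mapped to a struct)
        "address_items": address.split(","),
    }


sample = ("gao\t18\t5,30\tmath:90,english:95,language:98\t"
          "province:zhejiang,city:hangzhou,county:xihu,zip:310000")
row = parse_row(sample)
print(row["rank"][0], row["rank"][1], row["scores"]["math"])
```

Here row["rank"][0] and row["scores"]["math"] play the role of rank[0] and scores['math'] in the query above.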

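#Creating a Hive table that points to HBase
create (external) table hbase_test(key int,name string,age int) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties ('hbase.columns.mapping'=':key,cf:name,cf:age') tblproperties('hbase.table.name'='stocks'); --HBase has no SQL dialect, so Hive is used to query it; HBase requires row keys to be unique. The mapping needs one entry per Hive column; the column family cf is a placeholder.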

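#Note: (1) Tables can be filtered with a wildcard pattern: show tables like 'test*';
(2) Strict mode can be enabled so that queries on partitioned tables must include a partition filter: set hive.mapred.mode=strict;
(3) Create a table from a query: create table test0 as select name,age from test;
(4) Other storage formats
        stored as ORC --improved hybrid row-columnar storage; data is split into blocks called stripes
        stored as RCFILE --hybrid row-columnar storage
        stored as SEQUENCEFILE --serialized storage
        stored as parquet --columnar storage
(5) Compression
        snappy: set parquet.compression=snappy;
        gzip:   set hive.exec.compress.output=true;
                set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;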

2. Viewing partitions and altering tables

#Usage

hive> show partitions test partition (country='US',state='AK'); --list partitions

hive> describe extended test partition (country='US',state='AK'); --show partition details

hive> describe formatted test partition (country='US',state='AK'); --show partition details

hive> alter table test add if not exists partition(country='US',state='NY') location 'hdfs://**'
partition(country='US',state='ST') location 'hdfs://**'
partition(country='US',state='WD') location 'hdfs://**' ; --add partitions

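3. Changing and adding columns

hive> alter table test change column age salary float COMMENT 'salary' AFTER address; --rename column age to salary, change its type to float, and move it after address; the other columns shift forward. #Note: only the table schema changes; the data stored in HDFS does not. AFTER can be used to fix columns whose names are misaligned with the data. There is no BEFORE.

hive> alter table test add columns(app_name string COMMENT 'application name', session_id bigint); --add two columns

#Note: columns can be selected with a regular expression written in backticks: hive> select name, `sala.*` from test; In Hive 0.13 and later this requires set hive.support.quoted.identifiers=none;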

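4. Loading and exporting data

(1) Loading data

#Load local data

hive> load data local inpath '/home/test.txt' overwrite into table test partition (country='US',state='CA'); --local data overwrites the partition.

#Load HDFS data

hive> load data inpath 'hdfs://path/test.txt' overwrite into table test partition (country='US',state='CA'); --HDFS data overwrites the partition.

#Note: to append instead of overwriting, drop overwrite. load data copies (local files) or moves (HDFS files) the data into the table's storage path. A partition created with its own location does not move any data.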

#Writing with a select query

hive> from test1 as t1
insert overwrite table test partition (country='US',state='OR') select * where t1.country='US' and t1.state='OR'
insert overwrite table test partition (country='US',state='CA') select * where t1.country='US' and t1.state='CA'
insert INTO table test partition (country='US',state='IL') select * where t1.country='US' and t1.state='IL';

--Table test1 is scanned only once, and each where clause independently decides which rows are written to which partition. This is the optimized way to write several selects over the same source.

#Note: INSERT INTO and INSERT OVERWRITE can be mixed here.

#Adding partitions dynamically

hive> set hive.exec.dynamic.partition=true; --enable dynamic partitioning

hive> set hive.exec.dynamic.partition.mode=nonstrict; --allow all partition columns to be dynamic

hive> set hive.exec.max.dynamic.partitions.pernode=1000; --maximum number of dynamic partitions each mapper/reducer may create

hive> insert overwrite table test partition(country,state) select * from test2;

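(2) Exporting data

#Export to the local filesystem

hive> insert overwrite local directory '/home/gao/' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' select * from test; --overwrites or creates the local path, with '|' as the delimiter. The original note says that replacing overwrite with into appends to the files, in which case the path must already exist; not every Hive version accepts that form.

or

hive -S -e "select * from test limit 100;" > /home/test.txt

#Note: -S (silent mode) suppresses the log output; here the result is redirected to a local file.

To include column names in the output, set hive.cli.print.header=true first.

#Export from HDFS

hadoop fs -get hdfs://**/text.txt /home/gao/test.txt --copy an HDFS file to a local file; if no local file name is given, the HDFS file name is used

hadoop fs -getmerge hdfs://**/text*.txt /home/gao/test.txt --merge the HDFS files and download them as one file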

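III. HiveQL queries

1. Functions

(1) Math functions

round()\floor()\ceil()\rand()\exp()\log10()\log2()\log()\pow()\sqrt()\abs()\sin()\cos()\asin()\acos()\greatest()\least(), etc.

(2) Aggregate functions and windowed aggregate functions

count()\sum()\avg()\min()\max()\var_pop()\stddev_pop()\covar_pop()\corr()\first_value()\last_value()\cume_dist()

#Note: an ordinary aggregate function aggregates over a group, while a window function aggregates over a window. With group by, rows are grouped by one or more columns, the aggregate runs on each group, and each group returns one value. A window function instead returns a value for every row in the window (partition by).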

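#Usage
select studentId,math,departmentId,classId,
-- window: all rows that satisfy the where clause (this is also how comments are written in Hive)
max(math) over() as max1,
-- window: all rows with the same classId
max(math) over(partition by classId) as max2,
-- window: within each classId, ordered by math, all rows up to and including the current row
max(math) over(partition by classId order by math) as max3,
-- window: within each classId, ordered by math, the current row plus 1 row before and 2 rows after
max(math) over(partition by classId order by math rows between 1 preceding and 2 following) as max4
from student_scores where departmentId='department1';

#Note on windows: partition by makes each partition the window; order by makes the window the rows of the partition sorted up to the current value; rows between 1 preceding and 2 following makes the window the 4 rows formed by the previous row, the current row, and the next two rows in sort order.
over(partition by classId order by math rows between 1 preceding and 2 following)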
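The frame rows between 1 preceding and 2 following can be pictured as a slice that slides over the sorted partition. The Python sketch below simulates max4 for one partition under that reading; the function name and the score values are made up for illustration.

```python
def max_over_rows_between(sorted_values, preceding, following):
    """Max over a ROWS frame for every position of one sorted partition."""
    result = []
    for i in range(len(sorted_values)):
        start = max(0, i - preceding)                     # frame is clipped at the partition start
        end = min(len(sorted_values), i + following + 1)  # and at the partition end
        result.append(max(sorted_values[start:end]))
    return result


# One classId partition, already ordered by math (hypothetical scores)
math_sorted = [60, 70, 85, 90, 95]
print(max_over_rows_between(math_sorted, preceding=1, following=2))
```

Near the edges of the partition the frame simply contains fewer than 4 rows.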

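(3) Ranking functions

rank() --ties share a rank, and gaps are left after ties
dense_rank() --ties share a rank, with no gaps
row_number() --sequential row number within the ordering
ntile() --splits the ordered rows into n buckets and returns the bucket number
percent_rank() --percentile rank of the row, (rank of the current row - 1)/(total rows in the partition - 1); useful for working out what percentage of people a score beats

#Usage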

select studentid,departmentid,classid,math,

rank() over(partition by departmentId order by math desc) as rank1,
dense_rank() over(partition by departmentId order by math desc) as rank2,
row_number() over(partition by departmentId order by math desc) as rank3,
percent_rank() over(partition by departmentid,classid order by math) as rank4
from student_scores;
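The difference between the four functions is easiest to see on a partition with a tie. The following Python sketch reproduces the definitions given above for a single partition sorted in descending order; rank_functions and the scores are hypothetical, and all four values are computed over the same ordering for simplicity.

```python
def rank_functions(scores):
    """Return (value, rank, dense_rank, row_number, percent_rank) per row,
    for one partition ordered by value descending."""
    ordered = sorted(scores, reverse=True)
    total = len(ordered)
    rows = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real score
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number   # rank jumps to the row number, leaving gaps after ties
            dense += 1          # dense_rank only ever grows by one
            previous = value
        percent = (rank - 1) / (total - 1) if total > 1 else 0.0
        rows.append((value, rank, dense, row_number, percent))
    return rows


for row in rank_functions([90, 95, 80, 90]):
    print(row)
```

The two rows with 90 share rank 2 and dense_rank 2; the next row gets rank 4 but dense_rank 3.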

(4) Conditional functions

#if(condition, result1, result2)

#Meaning: returns result1 when the condition is TRUE, otherwise result2

#Usage:

hive> select count(if(math>90,studentid,NULL)),departmentid,classid from student_scores group by departmentid,classid;

#CASE

CASE has two forms: simple CASE and searched CASE.
Syntax:
--simple CASE
CASE sex
WHEN '1' THEN 'male'
WHEN '2' THEN 'female'
ELSE 'other' END

--searched CASE
CASE WHEN sex = '1' THEN 'male'
WHEN sex = '2' THEN 'female'
ELSE 'other' END

#Usage
select count(studentid), case when math>90 then 'excellent'
when math>80 and math <=90 then 'good'
else 'average' end as level
from student_scores
group by case when math>90 then 'excellent'
when math>80 and math <=90 then 'good'
else 'average' end; --note: a column alias cannot be used in the where clause, nor in group by here, so the case expression is repeated

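Or combined with aggregate functions:

select departmentid,count(case when math>90 then 1 else NULL end) ,
sum(cast(case when instr(name,'gao')>0 then language else '0' end as int)) as gaos_lag from student_scores group by departmentid;

#Note: CASE returns the value of the first branch whose condition matches; the remaining branches are ignored.
Simple CASE does the same job as searched CASE and is shorter to write, but the conditions it can express are limited to equality tests.

Other conditional functions: isnull(), isnotnull()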

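(5) Table-generating functions

These expand a single column into multiple columns or rows: explode() as \ json_tuple() \ get_json_object() \ parse_url_tuple() \ lateral view explode() as. You can also write your own UDFs.

#Usage

hive> select explode(split(app_names,',')) as app from test; --explode turns array data into multiple rows, one row per array element. Other columns cannot be selected alongside it; for example, select user,explode(split(app_names,',')) as app from test; raises an error.

hive> select user, app from test lateral view explode(split(app_names,',')) subview as app; --a lateral view solves the problem of selecting other columns at the same time, but it needs an alias for the view and for the generated column, here subview and app.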
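What lateral view explode produces can be mimicked in a few lines of Python: every input row is repeated once per element of the split array. The function name and the sample rows below are hypothetical; the sketch only illustrates the row multiplication, not Hive's handling of NULL or empty arrays.

```python
def lateral_view_explode(rows, separator=","):
    """Pair each user with every app in its comma-separated app_names string."""
    output = []
    for user, app_names in rows:
        for app in app_names.split(separator):  # split() builds the array, the loop "explodes" it
            output.append((user, app))
    return output


rows = [("u1", "wechat,alipay"), ("u2", "taobao")]
for user, app in lateral_view_explode(rows):
    print(user, app)
```

Two input rows become three output rows, each still carrying its user column.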

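(6) Collection functions

size(Map<K,V>) --number of entries in the map
size(Array<T>) --number of elements in the array
array_contains(Array<T>, value) --returns true if the array contains value, otherwise false
sort_array(Array<T>) --sorts the array in natural order and returns it
find_in_set() --position of a string in a comma-separated list of strings
map_keys(Map<K,V>) --returns all keys of the map
map_values(Map<K,V>) --returns all values of the map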

(7) Date functions

select from_unixtime(1553052601,"yyyy-MM-dd"), unix_timestamp('2019-03-20 11:30:01') ; --converting between a Unix timestamp and a date string | 2019-03-20 | 1553052601 |
select unix_timestamp(current_timestamp),from_unixtime(unix_timestamp(current_timestamp),'yyyy-MM-dd'),from_unixtime(unix_timestamp(current_timestamp),'yyyyMMdd'),to_date(current_timestamp); --converting between time formats | 1563195378 | 2019-07-15 | 20190715 | 2019-07-15 |
select current_date,current_timestamp; --current date and current timestamp | 2019-07-15 | 2019-07-15 20:59:38.153 |
select year(current_date),year(current_timestamp),month("1970-11-01 00:00:00"), month("1970-11-01") ,day(current_timestamp),day(current_date),hour(current_timestamp),minute(current_timestamp),second(current_timestamp),weekofyear(current_date); --get year, month, day, hour, minute, second, and week of year | 2019 | 2019 | 11 | 11 | 15 | 15 | 20 | 29
select date_format(current_date,'y'), date_format(current_date,'yy'),date_format(current_date,'yyyy'), date_format(date_sub(current_date, 10),'M'),date_format(date_sub(current_date, 10),'MM'),date_format(date_sub(current_date, 10),'d'),date_format(date_sub(current_date, 10),'dd'); --| 2019 | 19 | 2019 | 7 | 07 | 5 | 05 | get year, month, day
select trunc(current_date,'YYYY'); --first day of the current year
select trunc(current_date,'MM'); --first day of the current month
select last_day(current_date); --last day of the current month, 2019-07-31
select date_add(current_date,10); --the date 10 days after the current date, 2019-07-25

select add_months(current_date,1); --the same day next month, 2019-08-15
select date_sub(trunc(current_date,'MM'),1); --the day before the first day of this month, 2019-06-30
select date_sub(current_date,10); --10 days before the current date, 2019-07-05
select date_sub(date_add(current_date,10),10);
select datediff(current_date,date_add(current_date,10)); --difference in days between two dates; the first argument is the end date and the second the start date, end date - start date = -10
select datediff(current_date,date_sub(current_date,10)); -- 10
select next_day(current_date, 'MON'); --date of the next given weekday after the current date; the weekday may be given as its first two letters, first three letters, or full English name, in upper or lower case
select next_day(current_date,'WE'),next_day(current_date,'FRIDAY'); --| 2019-07-17 | 2019-07-19 |
select trunc(current_date,'YY'),trunc(current_date,'MM'); --| 2019-01-01 | 2019-07-01 |, truncates the date to the start of the year or month; the supported formats are MONTH/MON/MM and YEAR/YYYY/YY
select months_between(current_date,add_months(current_date,3)), months_between(current_date,add_months(current_date,-3)), months_between(current_date,next_day(current_date,'FR')); --| -3.0 | 3.0 | -0.12903226 | --number of months between date1 and date2: positive if date1>date2, negative if date1<date2, otherwise 0.0

(8) String functions
concat()\concat_ws()\decode()\encode\find_in_set\format_number\get_json_object\in_file\instr\length\locate\lower\lcase\lpad\ltrim\parse_url\printf()\regexp_extract\repeat()\reverse()\rpad()\rtrim()\split()\substr()\translate(input,from,to)\trim\upper\ucase

#Usage
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
regexp_replace("foobar", "oo|ar", "") = 'fb' --take care with predefined character classes: if the second argument is '\s' it matches the letter s; '\\s' is what matches whitespace
substring_index('www.apache.org', '.', 2) = 'www.apache' --returns the part of the string before the count-th delimiter; a positive count counts from the left, a negative count from the right
translate(input, from, to) replaces the characters of input that appear in from with the corresponding characters of to, e.g. translate("MOBIN","BIN","M")="MOM"
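The three examples above can be checked against rough Python equivalents. These are sketches of the behaviour described in the notes, not Hive's implementation; the helper names are hypothetical, and returning an empty string for a count of 0 is an assumption.

```python
import re


def substring_index(text, delimiter, count):
    """Part of text before the count-th delimiter (from the right if count < 0)."""
    parts = text.split(delimiter)
    if count > 0:
        return delimiter.join(parts[:count])
    if count < 0:
        return delimiter.join(parts[count:])
    return ""  # assumption: count == 0 yields an empty string


def translate(text, from_chars, to_chars):
    """Map each character of from_chars to the character at the same position
    in to_chars; characters with no counterpart in to_chars are removed."""
    table = {}
    for position, char in enumerate(from_chars):
        if ord(char) in table:
            continue  # the first occurrence in from_chars wins
        table[ord(char)] = to_chars[position] if position < len(to_chars) else None
    return text.translate(table)


print(re.sub(r"oo|ar", "", "foobar"))
print(substring_index("www.apache.org", ".", 2))
print(translate("MOBIN", "BIN", "M"))
```

In the translate example B maps to M, while I and N have no counterpart in the third argument and are dropped, which is why the result is MOM.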

#get_json_object()

Suppose a table user has a JSON column holding home address information (the query below reads it from the column line, under the key data). It can be parsed with get_json_object:

hive> select user,get_json_object(ads,'$.province') as province, get_json_object(ads,'$.city') as city, get_json_object(ads,'$.code') as code from (select user,get_json_object(line,'$.data') as ads from user where get_json_object(line,'$.data') is not null) tmp; --take the nested JSON object under data as ads, then split that object into the appropriate columns.

#Note: concat_ws(',',collect_set(cast(price_close as string))) is commonly used to join the values of an array into one string. Related: collect\collect_list\collect_set.

(9) Type conversion functions

binary(string|binary)\cast(expr as <type>)
select cast("1" as BIGINT);

#Note: see also the Hive 2.0 function reference at https://www.cnblogs.com/MOBIN/p/5618747.html

describe function shows a short description of a function, and describe function extended shows the detailed one. For example: describe function sum;
describe function extended sum;

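2. JOIN statements

(1) inner join

hive> select a.ymd,a.price_close,b.price_close,c.price_close from
stocks a join stocks b on a.ymd=b.ymd
join stocks c on a.ymd=c.ymd
where a.symbol='APPLE' and b.symbol='IBM' and c.symbol='GE'; --joins over several tables run from left to right. In general each join starts one MR job: one job joins a and b, then another joins that output with c. When every on clause uses the same join key, as with a.ymd here, Hive can perform the joins in a single MR job.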

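#Optimization 1: put small tables first.

During a join Hive buffers the earlier tables and then scans the later table to compute the result, so table sizes should increase from left to right.

hive> select /*+streamtable(a)*/ a.ymd,a.price_close,b.dividend from stocks a join dividend b on a.ymd=b.ymd and a.symbol=b.symbol and a.symbol='APPLE'; --alternatively, the streamtable hint marks which table is the large one.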

#Optimization 2: use a map-side join

The small table is loaded entirely into memory and the join runs on the map side, skipping the reduce phase of the join.

hive> set hive.auto.convert.join=true; --enable the optimization

hive> set hive.mapjoin.smalltable.filesize=25000000; --size threshold for the small table, in bytes; this is the default.

hive> select a.ymd,a.price_close,b.dividend from stocks a join dividend b on a.ymd=b.ymd and a.symbol=b.symbol and a.symbol='APPLE'; --with the property enabled, the query is written the same way as an ordinary inner join

#Note: the map-side join optimizes inner join and left outer join; right outer join and full outer join are not supported.

hive> select /*+mapjoin(b)*/ a.ymd,a.price_close,b.dividend from stocks a join dividend b on a.ymd=b.ymd and a.symbol=b.symbol and a.symbol='APPLE'; --the old mapjoin hint syntax

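#Optimization 3: use left semi join

A semi join is usually more efficient than an inner join: for a given row of the left table, scanning stops as soon as a matching row is found in the right table. It behaves like an in query. That is also its drawback: columns of the right table cannot be referenced in the select or where clauses.

hive> select a.ymd,a.price_close,a.symbol from stocks a left semi join dividend b on a.ymd=b.ymd and a.symbol=b.symbol;

#Note: with an inner join, partition filters can be placed in the on clause. inner join does not support non-equality conditions such as <= in on, nor OR in on.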

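(2) left outer join \ (full) outer join \ right outer join

#Note: with an outer join, partition filters placed in the on clause are ignored, so do not put them there.

#Note: the where clause runs after the join, so it should only be used for conditions on columns whose values are not NULL.

(3) union all

#Note: when combining several tables, every union subquery must have the same number of columns, and the types of corresponding columns must match.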

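3. order by \ sort by \ distribute by

order by sorts the result globally; all data goes through a single reducer.

sort by sorts locally: data is sorted only within each reducer, so the output of each reducer is ordered but the overall result is not.

distribute by controls how map output is divided among reducers. It ensures that the rows that belong together reach the same reducer; combined with sort by, the data can then be sorted as required. This is useful with streaming and with user-defined aggregate functions.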

hive> select ymd,symbol,price_close from stocks distribute by symbol sort by symbol asc, ymd asc;

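#Note: order by and sort by sort during the reduce phase, while distribute by decides how map output is partitioned among reducers (it does not sort by itself). With a single reducer, order by and sort by give the same result.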
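The combination distribute by symbol sort by symbol, ymd can be simulated in Python to show what "locally sorted" means. In this sketch the reducers are plain lists, the CRC32-based partitioner is an assumption standing in for Hive's own hash partitioning, and the stock rows are made up.

```python
import zlib


def distribute_and_sort(rows, num_reducers, distribute_key, sort_key):
    """Send each row to a reducer chosen by its distribute key, then sort
    the rows inside each reducer independently."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        key_bytes = str(distribute_key(row)).encode("utf-8")
        reducers[zlib.crc32(key_bytes) % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


# (ymd, symbol, price_close): hypothetical rows
rows = [
    ("2019-07-02", "IBM", 140.2),
    ("2019-07-01", "GE", 10.5),
    ("2019-07-01", "IBM", 139.9),
    ("2019-07-02", "GE", 10.4),
    ("2019-07-01", "APPLE", 201.5),
]
parts = distribute_and_sort(rows, 2, lambda r: r[1], lambda r: (r[1], r[0]))
for index, part in enumerate(parts):
    print("reducer", index, part)
```

All rows of one symbol end up in the same list and are ordered there, but concatenating the lists does not give a globally sorted result, which is the difference from order by.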

4. Other notes

(1) Column aliases cannot be used in the where clause, and only a having clause can filter the groups produced by group by.

Use the approach shown for case..when.., or the following:

hive> select a.* from (select year(ymd) as stock_year, avg(price_close) as avg_price from stocks where exchange='NASDAQ' and symbol='APPL' group by year(ymd)) a where a.avg_price>50.0; --avg cannot be used in where, so one level of nesting is used to filter; take care when comparing floating-point values (round, float, cast())

Or with a having clause:

hive> select year(ymd), avg(price_close) from stocks where exchange='NASDAQ' and symbol='APPL' group by year(ymd) having avg(price_close)>50.0; --only HAVING can filter the groups produced by group by.

#Note: string conversion functions and like can be used in the where clause.

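(2) When a query is long and complex, a view can reduce the complexity. The view's query only runs when the view is actually queried.

hive> create view test1 as select * from test join stock on (test.id=stock.id) where stock.symbol='APPLE';

hive> select * from test1 where id=2; --a regular table or optimized query code is usually the better choice

(3) like/rlike

(4) Creating indexes on a table

Indexes can be built on a table's partitions or on certain columns. They reduce the amount of data MR has to read, so in theory they improve efficiency.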

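IV. Streaming (used often)

#Syntax: map() \ reduce() \ transform(); transform() is recommended

#System programs that ship with Unix, such as cat and sed, can be used

hive> select transform(col1,col2) USING '/bin/cat' AS (newA INT, newB double) from test; --columns returned by transform are string by default; other types can be specified to convert them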

#Calling a Python script through transform

Suppose there is a Python script test.py:

hive> add file /home/gao/test.py; --the same approach works for shell scripts
hive> add file /home/gao/city.txt;
hive> select transform(user,city,apps) USING 'python test.py city.txt' as user,app from test; --a single query cannot run more than one transform
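The notes do not show the contents of test.py, so the following is only a guess at what such a script could look like: it reads the tab-separated columns user, city, apps from standard input, keeps rows whose city is listed in city.txt, and writes one user and app pair per line to standard output. The filtering rule, the comma-separated apps format, and the function names are all assumptions.

```python
import sys


def load_cities(path):
    """Read one city name per line from the file shipped with `add file`."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def process_line(line, cities, app_separator=","):
    """Turn one input row into zero or more 'user<TAB>app' output rows."""
    user, city, apps = line.rstrip("\n").split("\t")  # Hive sends columns tab-separated
    if city not in cities:
        return []
    return [f"{user}\t{app}" for app in apps.split(app_separator) if app]


def main(argv):
    cities = load_cities(argv[1])      # e.g. city.txt
    for line in sys.stdin:             # rows arrive on standard input
        for output in process_line(line, cities):
            print(output)              # results go back through standard output


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

The script can be tried without Hive by piping a few tab-separated lines into it, for example printf 'u1\thangzhou\ta,b\n' | python test.py city.txt.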

#Aggregation with streaming --in my experience a single script cannot do it, but the approach below works.

#Use distribute by \ sort by \ cluster by to parallelize the work

Suppose there are two Python scripts, mapper.py and reducer.py, where reducer.py aggregates the output of mapper.py.

hive> create table test3 as select transform(a.word,a.count) USING 'reducer.py' as word,count from (select transform(line) USING 'mapper.py' as word,count from {source_table} cluster by word) a; --cluster by can be replaced with distribute by/sort by; {source_table} stands for the input table, which the original statement leaves out

hive> create table test3 as select transform(a.word,a.count) USING 'reducer.py' as word,count from (select transform(line) USING 'mapper.py' as word,count from {source_table} distribute by word sort by word desc) a;

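#Note: cluster by, or distribute by together with sort by, sends rows with the same key to the same reducer and keeps them sorted there. It also avoids sending all of the data to a single reducer, so the job runs in parallel and more efficiently.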
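The notes do not include mapper.py and reducer.py. A word-count pair in the usual streaming style might look like the sketch below, shown as two functions in one file for brevity; the function names are hypothetical, and the reducer relies on its input being grouped by word, which is what cluster by provides.

```python
from itertools import groupby


def map_lines(lines):
    """mapper.py logic: emit 'word<TAB>1' for every word of every line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """reducer.py logic: sum the counts of consecutive rows with the same word.
    Rows for one word must be adjacent, as they are after cluster by word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the shuffle done by cluster by
mapped = list(map_lines(["hive makes sql", "sql on hadoop"]))
for row in reduce_lines(sorted(mapped)):
    print(row)
```

In the real scripts each function would read sys.stdin and print its rows, one script per function.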

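#Note: the streaming API opens an I/O pipe to an external process. Data is passed to that process, which reads it from standard input and writes its results to standard output, and the results are returned to the streaming API job. This is less efficient than writing a UDF, because of the extra step of serializing and deserializing data through the pipe.

With UDFs, add file loads data files into the distributed cache, which is very efficient for small files. add jar loads Java code into the distributed cache and the classpath. One of Hadoop's design principles is to move computation rather than data, running the computation where the data lives whenever possible.

Creating a temporary function:
hive> add file /app/gaoll/pkg_name.txt; --put a small file in the distributed cache
hive> add jar hdfs://gaoll/lib-hive-api-func.jar; --load the jar
hive> create temporary function set_contain as 'com.hive.udf.functions.collection.UDFSetContain'; --create the temporary function
hive> select pkg_name,set_contain(pkg_name,'pkg_name.txt') from test_table; --call the function

#Note: transform can be combined with custom functions, such as the custom function and Python script above:

hive> select transform(md5(gid),pkg_name,set_contain(pkg_name,'pkg_name.txt')) USING 'python test_filter.py' as gid,pkg_name,state from hive_test_table;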

V. Tuning

1. explain

Prints the syntax tree and shows how a query is translated into MR jobs.

#Usage: hive> explain select avg(price_close) from stock;

hive> explain extended select avg(price_close) from stock; --explain extended prints more information

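2. Local mode

(1) Queries that do not need MapReduce can run in local mode, which is very efficient for queries and computations on small data.

hive> set hive.exec.mode.local.auto=true; --this setting is more flexible and preferable to (2)

hive> select * from test where country='US' and state='CA' limit 100;

#Note: apart from cases such as partition queries, a limit clause usually still runs the whole query before returning results, so avoid relying on it.

(2) Temporarily enable local mode during a session, processing the job on a single machine or in a single process

hive> set oldjobtracker=${hiveconf:mapred.job.tracker};

hive> set mapred.job.tracker=local;

hive> set mapred.tmp.dir=/home/gao/tmp;

hive> select * from stock where symbol='APPLE'; --this has to be set again for every query, because the next query starts a new MR job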

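3. Parallel execution

hive> set hive.exec.parallel=true; --enable parallel execution

4. Strict mode

hive> set hive.mapred.mode=strict; --blocks three kinds of queries, which fail with an error: queries on a partitioned table without a partition filter in where, order by without limit, and Cartesian product joins.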

5. Adjusting the number of mappers and reducers

hive> set hive.exec.reducers.bytes.per.reducer=750000000; --in bytes; the default noted here is 1 GB (Hive 0.14 and later default to 256 MB).

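#Note: Hive determines the number of reducers from the data size, with 1 GB per reducer as the default noted above. When the estimated map-stage data is very large, lower the per-reducer value to get more reducers and more parallelism; conversely, raise it to get fewer reducers and less startup overhead.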
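The relationship between data size and reducer count can be approximated with a one-line formula: input size divided by bytes per reducer, rounded up and capped. The Python sketch below is only that approximation; the function name is hypothetical, and the cap of 1009 is an assumed value for hive.exec.reducers.max that should be checked against your own configuration.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=1009):
    """Approximate reducer count: ceil(input / bytes per reducer), within [1, max]."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


ten_gb = 10_000_000_000
print(estimate_reducers(ten_gb))                                 # with 1 GB per reducer
print(estimate_reducers(ten_gb, bytes_per_reducer=750_000_000))  # with the setting shown above
```

Lowering bytes_per_reducer from 1 GB to 750 MB raises the estimate for 10 GB of input from 10 reducers to 14.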

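6. Adjusting mapper and reducer memory

set mapreduce.job.name='gao'; --set your own name for the MR job
set mapreduce.map.memory.mb=4096; --memory for the map stage
set mapreduce.reduce.memory.mb=4096; --memory for the reduce stage
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.java.opts=-Xmx3276m;

#Note: as a rule of thumb, setting the reduce memory to about 75% of the map memory is reasonable. In the settings above, the JVM heap (-Xmx3276m) is about 80% of the 4096 MB container memory.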

7. Compression

#Intermediate compression

hive> set hive.exec.compress.intermediate=true; --compressing intermediate data reduces the amount of data transferred between the map and reduce tasks of a job

hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; --gzip can also be chosen here

#Compressing the final output

hive> set hive.exec.compress.output=true;

hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; --gzip is chosen here