Hive Setup and Notes

Article link: https://blog.csdn.net/hjs19981227/article/details/107634823

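Hive setup

The three metastore modes: differences and setup

Hive can store its metastore (metadata) in three ways:
a) Embedded Derby
b) Local
c) Remote

1. Local mode (embedded Derby)

This is the simplest storage option; it only needs the following configuration in hive-site.xml.

hive-site.xml configuration

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>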

Note

When Derby is used, running hive creates a derby.log file and a metastore_db directory in the current working directory. The drawback of this mode is that only one Hive client at a time can use the database from a given directory; a second client fails with the following error:
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db', see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

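2. Single-user mode (MySQL)

This mode requires a MySQL server running locally, configured as follows. (For this and the other MySQL-based mode below, copy the MySQL JDBC driver jar into $HIVE_HOME/lib.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive_remote/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>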

Appendix:

Installing MySQL

yum install mysql-server -y
Change the MySQL privileges:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123' WITH GRANT OPTION;
flush privileges;
Delete any redundant entries that would interfere with these privileges, then flush privileges again.

[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
Cause of the error: the jline version shipped with Hadoop does not match the jline version shipped with Hive.

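3. Multi-user mode

1. Remote, server and client on one machine

This mode requires a MySQL server running on a remote machine, and the metastore service must be started on the Hive server.
Here a test MySQL server is used, with IP 192.168.198.31; create a new hive_remote database with the latin1 character set.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.198.33:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.1.188:9083</value>
  </property>
</configuration>

Note: here the Hive server and client sit on the same machine. They can also be separated.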

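2. Remote, server and client separated

Split the hive-site.xml configuration into the following two parts.

1) Server-side configuration file

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.57.6:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
</configuration>

2) Client-side configuration file

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.57.5:9083</value>
  </property>
</configuration>

Start the Hive server-side (metastore) service:
hive --service metastore

On the client, simply run the hive command:
root@my188:~$ hive
Hive history file=/tmp/root/hive_job_log_root_201301301416_955801255.txt
hive> show tables;
OK
test_hive
Time taken: 0.736 seconds
hive>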

When starting the client, watch for this error:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:101)
Cause of the error: the jline version shipped with Hadoop does not match the jline version shipped with Hive.

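Introduction to Hive

Why Hive exists:
	It lets people who do not program in Java run MapReduce operations on data in HDFS.
Hive is a data warehouse.
	Hive provides an interpreter, a compiler, an optimizer, and so on.
	At run time, Hive keeps its metadata in a relational database.

Hive framework diagram
![Hive framework diagram](https://img-blog.csdnimg.cn/20200728142912403.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2hqczE5OTgxMjI3,size_16,color_FFFFFF,t_70)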

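Hive architecture

(1) There are three main user interfaces: CLI, Client, and WUI. The CLI is the most commonly used; starting the CLI also starts a Hive instance. Client is the Hive client, through which users connect to Hive Server. When starting in Client mode, you must specify the node where Hive Server runs and start Hive Server on that node. WUI accesses Hive through a browser.
(2) Hive stores its metadata in a database such as MySQL or Derby. The metadata includes table names, the columns and partitions of each table and their properties, table properties (such as whether a table is external), and the directory where each table's data lives.
(3) The interpreter, compiler, and optimizer take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and later executed by MapReduce.
(4) Hive's data is stored in HDFS, and most queries and computations are carried out by MapReduce (a plain select-star query such as select * from tbl does not generate a MapReduce job).
The compiler converts a Hive SQL statement into operators.
An operator is Hive's smallest processing unit.
Each operator represents either one HDFS operation or one MapReduce job.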

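Hive deployment

Installing MySQL

Start the Hadoop cluster.
Install a relational database (MySQL): yum install mysql-server

Installing Hive

Configure the environment variables:
HADOOP_HOME=
HIVE_HOME=
Replace the jline-*.jar under $HADOOP_HOME/lib with the jline-2.12.jar from $HIVE_HOME/lib.
Copy the MySQL driver jar into $HIVE_HOME/lib.

Edit hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://node1/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>

Start Hive

bin/hive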

Hive SQL

The complete Hive DDL syntax for creating a table

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)    -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]    -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]    -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];    -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

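Hive managed (internal) tables

CREATE TABLE [IF NOT EXISTS] table_name
Dropping the table deletes both the metadata and the data.

Hive external tables

CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
Dropping an external table deletes only the metadata in the metastore; the table data in HDFS is left in place.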
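
The difference between dropping a managed table and dropping an external table can be pictured with a small Python toy model. This is only an illustration: the dictionary standing in for the metastore, the set standing in for HDFS, and the table names and paths are all made up for the example, and nothing here is Hive's actual implementation.

```python
# Toy model of DROP TABLE semantics; not Hive's implementation.
def drop_table(metastore, hdfs, name):
    """Remove a table's metadata; delete its data only if the table is managed."""
    table = metastore.pop(name)          # metadata always goes away
    if not table["external"]:
        hdfs.discard(table["location"])  # managed table: data directory is removed too
    return table


# Hypothetical tables and paths, chosen only for this example.
metastore = {
    "managed_t": {"external": False, "location": "/user/hive/warehouse/managed_t"},
    "external_t": {"external": True, "location": "/data/external_t"},
}
hdfs = {"/user/hive/warehouse/managed_t", "/data/external_t"}

drop_table(metastore, hdfs, "managed_t")
drop_table(metastore, hdfs, "external_t")
print(metastore)  # both tables are gone from the metadata
print(hdfs)       # only the external table's directory is left
```

After both drops the simulated metastore is empty, while the external table's directory is still present.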

Hive: creating a table

CREATE TABLE person(
id INT,
name STRING,
age INT,
likes ARRAY<STRING>,
address MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
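
To see what the three delimiters mean for one line of a text file, here is a rough Python sketch that splits a line the way the person table describes it. The sample line and the function name parse_person are invented for illustration; Hive's real parsing is done by its SerDe, not by code like this.

```python
def parse_person(line, field=",", item="-", map_key=":"):
    """Split one text line using the person table's delimiters."""
    id_, name, age, likes, address = line.rstrip("\n").split(field)
    return {
        "id": int(id_),
        "name": name,
        "age": int(age),
        # COLLECTION ITEMS TERMINATED BY '-' separates array elements
        "likes": likes.split(item),
        # map entries are also separated by '-', key and value by ':'
        "address": dict(pair.split(map_key, 1) for pair in address.split(item)),
    }


# Hypothetical input line, not taken from the original data set.
row = parse_person("1,Tom,18,music-book,home:beijing-work:shanghai\n")
print(row)
```

Note that the collection delimiter is shared by arrays and maps, which is why the map column uses '-' between entries and ':' inside each entry.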
Default values of Hive fields

Hive: viewing a table description

DESCRIBE [EXTENDED|FORMATTED] table_name

Hive: creating a table from an existing one

Create Table Like:
CREATE TABLE empty_key_value_store LIKE key_value_store;
Create Table As Select (CTAS):
CREATE TABLE new_key_value_store
AS
SELECT columA, columB FROM key_value_store;

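Hive partitions

The partition columns must be declared when the table is defined.
a. Single-partition table:
create table day_table (id int, content string) partitioned by (dt string);
A single-partition table, partitioned by day; the table has three columns: id, content, and dt.
Data is separated into one folder per dt value.
b. Two-level partition table:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
A two-level partition table, partitioned by day and hour; dt and hour are added to the table as two extra columns.
Data is separated first into a dt folder, then into an hour subfolder.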
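
The folder layout described above can be sketched in a few lines of Python. The helper name partition_path is hypothetical, and the table directory assumes the /user/hive/warehouse warehouse path used in the configuration earlier in these notes.

```python
def partition_path(table_dir, partition_spec):
    """Build a partition directory: one key=value level per partition column."""
    levels = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + levels)


path = partition_path(
    "/user/hive/warehouse/day_hour_table",
    [("dt", "2008-08-08"), ("hour", "08")],
)
print(path)
```

The order of the levels follows the order of the columns in the partitioned by clause, so dt comes before hour.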

Adding partitions to a table

(The table already exists; partitions are added to it.)
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] PARTITION partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Example:
ALTER TABLE day_hour_table ADD PARTITION (dt='2008-08-08', hour='08');

Dropping partitions:

ALTER TABLE table_name DROP PARTITION partition_spec, PARTITION partition_spec, ...
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
Partitions are dropped with ALTER TABLE DROP PARTITION.
For a managed table, both the metadata and the data of the partition are deleted.
Example:
ALTER TABLE day_hour_table DROP PARTITION (dt='2008-08-08', hour='09');

Loading data into a specific partition:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Examples:
LOAD DATA INPATH '/user/pv.txt' INTO TABLE day_hour_table PARTITION(dt='2008-08-08', hour='08');
LOAD DATA local INPATH '/user/hua/*' INTO TABLE day_hour partition(dt='2010-07-07');
When data is loaded into a table, it is not transformed in any way. LOAD only copies the files (or, for an HDFS source path without LOCAL, moves them) into the location that belongs to the Hive table. When loading into a partition, the partition directory is created under the table directory automatically.

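Querying with partitions

SELECT day_table.* FROM day_table WHERE day_table.dt >= '2008-08-08';
The point of partitioning is to speed up queries. Filter on the partition columns whenever possible; a query that does not use them scans the whole table.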
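
Why a filter on the partition column saves work can be shown with a simplified Python sketch: only the directories whose dt value satisfies the predicate need to be read at all. The directory list and the function prune_partitions are assumptions made for the example; Hive's planner does this on metastore information, not on path strings.

```python
def prune_partitions(partition_dirs, min_dt):
    """Keep only the partition directories whose dt value is >= min_dt."""
    kept = []
    for path in partition_dirs:
        dt = path.rsplit("dt=", 1)[1]   # value encoded in the directory name
        if dt >= min_dt:                # ISO dates compare correctly as strings
            kept.append(path)
    return kept


# Hypothetical partition directories of day_table.
dirs = [
    "/user/hive/warehouse/day_table/dt=2008-08-07",
    "/user/hive/warehouse/day_table/dt=2008-08-08",
    "/user/hive/warehouse/day_table/dt=2008-08-09",
]
selected = prune_partitions(dirs, "2008-08-08")
print(selected)
```

Without the dt condition every directory in dirs would have to be scanned.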

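Listing a table's partitions:
SHOW PARTITIONS day_hour_table;

What if partition data was placed in HDFS ahead of time but Hive does not recognize it?
MSCK REPAIR TABLE tablename;
Or add the partitions directly (ALTER TABLE ... ADD PARTITION).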

Hive DML

Loading data

Import data:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Insert data:
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

Hive SerDe

Hive SerDe - Serializer and Deserializer
A SerDe performs serialization and deserialization.
It sits between the data storage and the execution engine and decouples the two.
Hive reads and writes row content through ROW FORMAT DELIMITED or a SERDE.
row_format
: DELIMITED
[FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

Hive regular-expression matching
CREATE TABLE logtbl (
host STRING,
identity STRING,
t_user STRING,
time STRING,
request STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
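
The mapping from capture groups to columns can be tried out with Python's re module. This sketch assumes the intended pattern has seven groups, one per column of logtbl, with the * quantifiers restored; the sample log line is invented. Java and Python regex syntax agree for this particular pattern, but the code only imitates RegexSerDe, including its behaviour of returning NULL for every column when a line does not match.

```python
import re

# Assumed reconstruction of the input.regex value, written as a Python raw string.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
)
COLUMNS = ["host", "identity", "t_user", "time", "request", "referer", "agent"]


def parse_log_line(line):
    """Map the capture groups to the logtbl columns, in order."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return dict.fromkeys(COLUMNS)  # every column is None, like NULL in Hive
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical access-log line.
sample = '192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -'
row = parse_log_line(sample)
print(row)
```

Each pair of parentheses in the pattern fills exactly one column, so the number of groups has to equal the number of columns in the table.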

Hive functions

Hive table creation examples

create table psn
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
load data local inpath '/root/data/data' into table psn;
create table psn2
(
id int,
name string,
likes array<string>,
address map<string,string>
);
create table psn3
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003';
External table:
create external table psn4
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';

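Differences between managed and external tables:

1. At creation time, a managed table is stored under the default HDFS warehouse path, while an external table needs an explicitly specified path.
2. At drop time, a managed table loses both its data and its metadata, while an external table loses only its metadata; the data is kept.
Note: Hive checks the schema on read (this decouples storage from the schema and makes loading data faster).
Relational databases check the schema on write.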
Partitions:
Single partition:
create table psn5
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Two-level partition:
create table psn6
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
create external table psn7
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';
DML:
create table psn10
(
id int,
name string
)
row format delimited
fields terminated by ',';
create table psn11
(
id int,
likes array<string>
)
row format delimited
fields terminated by ','
collection items terminated by '-';
FROM psn
INSERT OVERWRITE TABLE psn10
SELECT id,name
INSERT INTO TABLE psn11
SELECT id,likes;
insert overwrite local directory '/root/result'
select * from psn;

Hive case study

Requirement: find the 10 base stations with the highest call drop rate.
Data:
record_time: call time
imei: base station ID
cell: phone ID
drop_num: seconds of dropped calls
duration: total call duration in seconds

1. Create the tables

create table cell_monitor(
record_time string,
imei string,
cell string,
ph_num int,
call_num int,
drop_num int,
duration int,
drop_rate DOUBLE,
net_type string,
erl string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Result table

create table cell_drop_monitor(
imei string,
total_call_num int,
total_drop_num int,
d_rate DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

2. Load the data

LOAD DATA LOCAL INPATH '/opt/data/cdr_summ_imei_cell_info.csv' OVERWRITE INTO TABLE cell_monitor;

3. Find the base stations with the highest drop rate

from cell_monitor cm
insert overwrite table cell_drop_monitor
select cm.imei, sum(cm.duration), sum(cm.drop_num), sum(cm.drop_num)/sum(cm.duration) d_rate
group by cm.imei
sort by d_rate desc;
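
The aggregation behind this query can be checked on a handful of rows with a short Python sketch. It groups by imei, sums drop_num and duration, computes the ratio, and keeps the highest n rates. The function name top_drop_rates and the three-column sample records are made up for the example; the real input is the cell_monitor table.

```python
from collections import defaultdict


def top_drop_rates(records, n=10):
    """Return the n (imei, total_drop, total_duration, rate) rows with the highest rate."""
    totals = defaultdict(lambda: [0, 0])  # imei -> [sum(drop_num), sum(duration)]
    for imei, drop_num, duration in records:
        totals[imei][0] += drop_num
        totals[imei][1] += duration
    rates = [
        (imei, drop, dur, drop / dur)
        for imei, (drop, dur) in totals.items()
        if dur > 0  # skip stations with no call time to avoid dividing by zero
    ]
    rates.sort(key=lambda r: r[3], reverse=True)
    return rates[:n]


# Hypothetical (imei, drop_num, duration) rows.
records = [("a", 1, 10), ("a", 1, 10), ("b", 5, 10), ("c", 0, 20)]
top2 = top_drop_rates(records, n=2)
print(top2)
```

The sketch sorts all groups globally before cutting off at n, which is what a top-10 list needs.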

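Hive parameters

	Hive parameters and variables
Every parameter and variable in Hive lives in a namespace, which is written as a prefix.

Variables are referenced with ${}; variables in the system and env namespaces must always be written with their prefix.

Ways to set Hive parameters

1. Edit the configuration file ${HIVE_HOME}/conf/hive-site.xml
2. Pass --hiveconf key=value when starting the Hive CLI
Example: hive --hiveconf hive.cli.print.header=true
3. Use the set command after entering the CLI
The set command
In the Hive CLI console, set can be used both to query and to change Hive parameters.
Setting a value:
set hive.cli.print.header=true;
Viewing a value:
set hive.cli.print.header;
Initialization file
The .hiverc file in the current user's home directory
For example: ~/.hiverc
If it does not exist, create it and write the parameters you want into it; Hive loads the settings in this file when it starts.
Command history
~/.hivehistory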

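Hive dynamic partitions

Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
Default: true
set hive.exec.dynamic.partition.mode=nonstrict;
Default: strict (at least one partition column must be static)
Related parameters:
set hive.exec.max.dynamic.partitions.pernode;
Maximum number of dynamic partitions that may be created on each node running MR tasks (100)
set hive.exec.max.dynamic.partitions;
Maximum total number of dynamic partitions that may be created across all nodes running MR tasks (1000)
set hive.exec.max.created.files;
Maximum number of files that all MR jobs together may create
Loading data (the dynamic partition columns must come last in the select list, in the same order as in the partition clause):
from psn21
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address, age, sex distribute by age, sex;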
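
What a dynamic-partition insert does with each row can be imitated in Python: the partition is derived from the row's own values, and the insert fails once it would create more partitions than allowed. The function name, the sample rows, and the use of RuntimeError are assumptions for the sketch; the limit of 1000 mirrors the value quoted above for hive.exec.max.dynamic.partitions.

```python
def dynamic_partition_insert(rows, partition_cols, max_partitions=1000):
    """Group rows by the partition directory derived from their own values."""
    partitions = {}
    for row in rows:
        directory = "/".join(f"{col}={row[col]}" for col in partition_cols)
        if directory not in partitions and len(partitions) >= max_partitions:
            raise RuntimeError("too many dynamic partitions")
        # partition columns live in the directory name, not in the data file
        data = {k: v for k, v in row.items() if k not in partition_cols}
        partitions.setdefault(directory, []).append(data)
    return partitions


# Hypothetical rows standing in for psn21.
rows = [
    {"id": 1, "name": "a", "age": 18, "sex": "man"},
    {"id": 2, "name": "b", "age": 18, "sex": "woman"},
    {"id": 3, "name": "c", "age": 18, "sex": "man"},
]
result = dynamic_partition_insert(rows, ["age", "sex"])
print(sorted(result))
```

With static partitions the directory would be fixed in the statement; here it is computed per row, which is why a cap on the number of partitions is useful.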

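Hive Lateral View

Lateral View is used together with UDTFs such as explode (often applied to an array produced by split).
The UDTF first expands one row into multiple rows, and the lateral view then combines those rows into a virtual table that can be given an alias.
It mainly solves the restriction that a select using a UDTF may contain only that single UDTF, with no other columns and no further UDTFs.

Syntax:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
Example: how many distinct hobbies and how many distinct cities are there in the people table?

select count(distinct(myCol1)), count(distinct(myCol2)) from psn2
LATERAL VIEW explode(likes) myTable1 AS myCol1
LATERAL VIEW explode(address) myTable2 AS myCol2, myCol3;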
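
The effect of the two lateral views can be reproduced in Python with nested loops: every input row is combined with each element of its likes array and each entry of its address map, and the distinct values are then counted. The sample people and the function names are invented; the sketch assumes, as the query does, that the map key holds the city.

```python
def lateral_view_explode(rows):
    """Yield one (like, address_key, address_value) tuple per combination in each row."""
    for row in rows:
        for like in row["likes"]:                          # explode(likes) AS myCol1
            for key, value in row["address"].items():      # explode(address) AS myCol2, myCol3
                yield like, key, value


def count_distinct(rows):
    exploded = list(lateral_view_explode(rows))
    distinct_likes = {like for like, _, _ in exploded}
    distinct_keys = {key for _, key, _ in exploded}
    return len(distinct_likes), len(distinct_keys)


# Hypothetical contents of psn2.
people = [
    {"likes": ["music", "book"], "address": {"beijing": "haidian", "shanghai": "pudong"}},
    {"likes": ["book", "movie"], "address": {"beijing": "chaoyang"}},
]
counts = count_distinct(people)
print(counts)
```

A row with an empty array or map produces no output rows in this sketch, which matches LATERAL VIEW without the OUTER keyword.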

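Hive views

Like ordinary views in a relational database, Hive also supports views.
Characteristics:
Materialized views are not supported
Views can only be queried; data cannot be loaded into them
Creating a view only stores metadata; the underlying subquery runs only when the view is queried
If the view definition contains ORDER BY/LIMIT and the query on the view also uses ORDER BY/LIMIT, the clauses in the view definition take precedence
Views can be nested (a view may be defined on top of another view)

View syntax

Create a view:
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name
[(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ... ;
Query a view:
select columns from view;
Drop a view:
DROP VIEW [IF EXISTS] [db_name.]view_name;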

Hive indexes

Purpose: speed up queries and lookups.
Create an index:
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild
in table t1_index_table;
as: specifies the index handler;
in table: specifies the index table; if omitted, the index is stored by default in the table default__psn2_t1_index__

create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

Show indexes:
show index on psn2;

Rebuild the index (after creation, an index must be rebuilt before it takes effect):
ALTER INDEX t1_index ON psn2 REBUILD;

Drop the index:
DROP INDEX IF EXISTS t1_index ON psn2;

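Ways to run Hive

Command-line interface (CLI): console mode
Script mode (the most common in real production environments)
JDBC: hiveserver2
Web GUI (hwi, hue, and so on)

In CLI mode, Hive can
interact with HDFS
by running dfs commands
Example: dfs -ls /;
interact with Linux
by prefixing the command with !
Example: !pwd;
Running Hive as scripts:
hive -e ""
hive -e "" > aaa
hive -S -e "" > aaa
hive -f file
hive -i /home/my/hive-init.sql
hive> source file (run inside the Hive CLI)
Hive GUI
Installing the web interface:
Download the source package apache-hive-*-src.tar.gz
Put the hwi war file in $HIVE_HOME/lib/. To build it, pack all files under hwi/web/ into a war:
cd apache-hive-1.2.1-src/hwi/web
jar -cvf hive-hwi.war *
Copy tools.jar (from the JDK's lib directory) into $HIVE_HOME/lib
Edit hive-site.xml
Start the hwi service (port 9999):
hive --service hwi
Open the following URL in a browser:
http://node3:9999/hwi/

Add the following to the Hive configuration file hive-site.xml:

<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi.war</value>
</property>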

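Hive authorization

Three authorization models:
1. Storage Based Authorization in the Metastore Server
Storage-based authorization: protects the metadata in the metastore, but does not provide finer-grained access control (for example at column or row level).
2. SQL Standards Based Authorization in HiveServer2
SQL-standard-based authorization: an authorization model fully compatible with SQL; this is the recommended mode.
3. Default Hive Authorization (Legacy Mode)
Default Hive authorization: designed only to keep users from making mistakes by accident, not to stop malicious users from accessing data they are not authorized for.
Hive - SQL Standards Based Authorization in HiveServer2
An authorization model fully compatible with SQL
Besides authorizing individual users, it also supports authorization through roles
A role can be seen as a set of privileges; users are granted privileges through roles
A user can have one or more roles
Two roles exist by default: public and admin

Restrictions:
1. Once this authorization mode is enabled, the dfs, add, delete, compile, and reset commands are disabled.
2. Setting Hive configuration with the set command is restricted to certain users.
(This can be adjusted through hive.security.authorization.sqlstd.confwhitelist in hive-site.xml.)
3. Adding and dropping functions and macros is open only to users with the admin role.
4. User-defined functions (permanent functions are supported) can be created by users with the admin role and used by all other users.

On the Hive server, add the following to hive-site.xml:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
<property>
  <name>hive.users.in.admin.role</name>
  <value>root</value>
</property>
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>

Start hiveserver2 on the server; clients connect through beeline.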

Role management

Creating, dropping, viewing, and setting roles:
CREATE ROLE role_name; -- create a role
DROP ROLE role_name; -- drop a role
SET ROLE (role_name|ALL|NONE); -- set the current role
SHOW CURRENT ROLES; -- show the roles the current user holds
SHOW ROLES; -- show all existing roles
Privileges:
SELECT privilege: gives read access to an object.
INSERT privilege: gives ability to add data to an object (table).
UPDATE privilege: gives ability to run update queries on an object (table).
DELETE privilege: gives ability to delete data in an object (table).
ALL PRIVILEGES: gives all privileges (gets translated into all the above privileges).

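Hive optimization

Core idea: optimize Hive SQL as if it were a MapReduce program.
The following kinds of SQL are not converted into MapReduce jobs:
a select that only reads columns of the table itself
a where that only filters on columns of the table itself

Explain: show the execution plan

EXPLAIN [EXTENDED] query

Hive fetch strategy:
Some queries in Hive do not need a MapReduce computation at all.

Fetch strategy setting
set hive.fetch.task.conversion=more;  (or none to turn it off)

Hive execution modes:
Local mode
Cluster mode

Local mode
Enable local mode:
set hive.exec.mode.local.auto=true;
Note:
hive.exec.mode.local.auto.inputbytes.max defaults to 128M.
It is the maximum input size for local mode; if the input is larger than this value, the job still runs on the cluster.

Parallel execution
Enable parallel mode with:
set hive.exec.parallel=true;

Note: hive.exec.parallel.thread.number
(the maximum number of jobs allowed to run in parallel within one SQL statement)

Strict mode:
Enable strict mode with:
set hive.mapred.mode=strict;
(default: nonstrict)

Query restrictions:
1. For a partitioned table, the where clause must filter on a partition column;
2. An order by must be accompanied by a limit;
3. Queries that produce a Cartesian product are restricted.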

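Hive sorting
Order By - totally orders the query result; only a single reducer is allowed to process it
(use with care on large data sets; in strict mode it must be combined with limit)
Sort By - sorts the data within each individual reducer
Distribute By - decides which reducer each row goes to; it is usually combined with Sort By
Cluster By - equivalent to Distribute By + Sort By on the same columns
(Cluster By cannot specify asc or desc;
use distribute by column sort by column asc|desc instead)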
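
The difference between these clauses shows up clearly in a small Python simulation. Rows are first assigned to reducers (Distribute By), then sorted inside each reducer (Sort By); the concatenated output is not globally ordered, unlike Order By. The modulo used to pick a reducer is a stand-in chosen for readability, not Hive's hash function, and the values are arbitrary.

```python
def distribute_by(values, num_reducers):
    """Assign each value to a reducer; modulo is a toy replacement for hashing."""
    reducers = [[] for _ in range(num_reducers)]
    for value in values:
        reducers[value % num_reducers].append(value)
    return reducers


def sort_by(reducers, descending=False):
    """Sort the rows inside each reducer independently."""
    return [sorted(r, reverse=descending) for r in reducers]


values = [5, 2, 8, 1, 4, 7]
per_reducer = sort_by(distribute_by(values, 2))
flattened = [v for reducer in per_reducer for v in reducer]
order_by = sorted(values)  # ORDER BY: one reducer, total order

print(per_reducer)
print(flattened)
print(order_by)
```

Each reducer's output is sorted, but the combined output is only sorted if there is a single reducer, which is exactly the cost of Order By.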

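Hive Join
When joining, put the small table (the driving table) on the left side of the join.
Map Join: the join is completed on the map side.
Two ways to get it:
1. In SQL, add a MapJoin marker (mapjoin hint)
Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;
2. Enable automatic MapJoin conversion (set hive.auto.convert.join=true;)

Use the same join key across joins whenever possible (the joins are then converted into a single MapReduce job).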

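Joining two large tables

Filtering empty keys: sometimes a join times out because certain keys have too many rows; rows with the same key are all sent to the same reducer, which then runs out of memory. Analyze these abnormal keys carefully: in many cases the rows behind them are bad data and should be filtered out in the SQL statement.
Converting empty keys: sometimes an empty key has many rows, but the rows are valid and must appear in the join result. In that case, assign a random value to the empty key field in table a so that the rows are spread randomly and evenly across different reducers.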
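
The empty-key conversion can be illustrated with a Python sketch that compares where rows land before and after salting. The reducer choice via crc32, the null_ prefix, the number of reducers, and the row counts are all assumptions for the example; in a real query the salted value must be one that cannot match any genuine key in the other table.

```python
import random
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Pick a reducer from the key; crc32 is a deterministic stand-in for Hive's hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def salt_null_key(key, rng):
    """Replace an empty key with a random value so such rows spread across reducers."""
    return key if key is not None else f"null_{rng.random()}"


null_keys = [None] * 1000
rng = random.Random(0)  # fixed seed so the sketch is repeatable

before = Counter(reducer_for(k, 4) for k in null_keys)
after = Counter(reducer_for(salt_null_key(k, rng), 4) for k in null_keys)

print(before)  # every empty key hits the same reducer
print(after)   # salted keys are spread over several reducers
```

Because the salted rows never find a partner on the other side of the join, they pass through as unmatched rows, which is the intended result for valid rows with an empty key.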

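Map-side aggregation
Enable aggregation on the map side with:
set hive.map.aggr=true;

Related parameters:
hive.groupby.mapaggr.checkinterval:
number of rows processed before the map-side group by aggregation is checked (default: 100000)
hive.map.aggr.hash.min.reduction:
minimum reduction ratio for aggregating (the first 100000 rows are aggregated as a trial; if the row count after aggregation divided by 100000 is greater than this value, 0.5 by default, map-side aggregation is turned off)
hive.map.aggr.hash.percentmemory:
maximum share of memory the map-side aggregation may use
hive.map.aggr.hash.force.flush.memory.threshold:
maximum memory available to the hash table during map-side aggregation; exceeding it triggers a flush
hive.groupby.skewindata
whether to optimize for data skew produced by GroupBy; default false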
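
The check controlled by checkinterval and min.reduction can be written out as a small Python function. It follows the description above literally: aggregate a sample of the first rows and keep map-side aggregation only if the number of groups divided by the number of sampled rows does not exceed the threshold. The function name and the tiny inputs are hypothetical, and Hive's internal bookkeeping is more involved than this.

```python
def keep_map_side_aggregation(group_keys, check_interval=100000, min_reduction=0.5):
    """Decide from a sample of group keys whether map-side aggregation pays off."""
    sample = group_keys[:check_interval]
    if not sample:
        return True  # nothing to judge yet
    ratio = len(set(sample)) / len(sample)  # rows after aggregation / rows before
    return ratio <= min_reduction


print(keep_map_side_aggregation([1, 1, 1, 2]))  # few distinct keys: aggregation helps
print(keep_map_side_aggregation([1, 2, 3, 4]))  # all keys distinct: no reduction
```

When almost every key is distinct, the hash table on the map side only costs memory without shrinking the data, so skipping it is the cheaper choice.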

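Controlling the number of map and reduce tasks in Hive

Parameters that affect the number of maps
mapred.max.split.size
maximum size of one split, that is, the maximum amount of file data each map processes
mapred.min.split.size.per.node
minimum split size on one node
mapred.min.split.size.per.rack
minimum split size on one rack

Parameters that affect the number of reduces
mapred.reduce.tasks
forces a specific number of reduce tasks
hive.exec.reducers.bytes.per.reducer
amount of data each reduce task processes
hive.exec.reducers.max
maximum number of reduces per job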
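
How the three reduce-side parameters interact can be approximated with a short Python function: a forced value wins, otherwise the input size is divided by the bytes-per-reducer setting and capped at the maximum. The default values used here (256,000,000 bytes and 1009 reducers) are assumptions based on the defaults of Hive 0.14 and later; older versions use different numbers, and the real estimate also depends on the query plan.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=256_000_000,
                      max_reducers=1009, forced=None):
    """Rough estimate of the reducer count for one job."""
    if forced is not None and forced > 0:
        return forced  # mapred.reduce.tasks set explicitly
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


print(estimate_reducers(1_000_000_000))            # about 1 GB of input
print(estimate_reducers(10**15))                   # capped by hive.exec.reducers.max
print(estimate_reducers(1_000_000_000, forced=5))  # forced by mapred.reduce.tasks
```

Lowering hive.exec.reducers.bytes.per.reducer therefore raises the reducer count for the same input, up to the configured maximum.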