Hive Series (4): Advanced Hive

程序员劝退师丶

Last modified: 2022-06-07 17:07:05


First published: 2022-01-09 22:11:10

Article link: https://blog.csdn.net/qq_38130094/article/details/122394314


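1. Hive parameter configuration

Hive parameters and variables are all prefixed with a namespace, as shown in the table below:

Namespace	Access	Meaning	Example
hiveconf	read/write	Configuration properties from hive-site.xml	hive --hiveconf hive.cli.print.header=true
system	read/write	System properties, including JVM runtime parameters	system:user.name=root
env	read-only	Environment variables	env:JAVA_HOME
hivevar	read/write	User-defined variables	hive -d val=key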

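1.1 Add the parameter to ${HIVE_HOME}/conf/hive-site.xml

Note: this is permanent; every Hive session loads the settings in this file.

1.2 Pass --hiveconf key=value when starting the Hive CLI

For example: hive --hiveconf hive.cli.print.header=true

1.3 Use the set command after entering the CLI

-- Inside the Hive CLI, set can both query and change Hive parameters
-- set a value
	set hive.cli.print.header=true;
-- show a value
	set hive.cli.print.header;
-- list properties (set -v; also prints the Hadoop and Hive defaults)
	set;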
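The three places a setting can come from can be pictured as stacked layers. The following is a minimal Python sketch, assuming the usual precedence in which a session-level set overrides --hiveconf, which in turn overrides hive-site.xml. The function name effective_config and the sample values are illustrative only; this is not a Hive API.

```python
from collections import ChainMap


def effective_config(site_xml, hiveconf_args, session_sets):
    """Return a view where session `set` beats --hiveconf, which beats hive-site.xml."""
    # ChainMap looks keys up left to right, so the first mapping has the highest priority.
    return ChainMap(session_sets, hiveconf_args, site_xml)


site_xml = {"hive.cli.print.header": "false"}      # hypothetical hive-site.xml content
hiveconf_args = {"hive.cli.print.header": "true"}  # hive --hiveconf hive.cli.print.header=true
session_sets = {}                                  # nothing set inside the CLI yet

conf = effective_config(site_xml, hiveconf_args, session_sets)
before_set = conf["hive.cli.print.header"]         # value comes from --hiveconf

session_sets["hive.cli.print.header"] = "false"    # set hive.cli.print.header=false;
after_set = conf["hive.cli.print.header"]          # value now comes from the session

print(before_set, after_set)
```

The layering also explains why a value placed in hive-site.xml acts as a default for every session, while the other two methods only affect the session in which they are used.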

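2. Connecting to Hive

2.1 Running Hive locally

1. Command-line (CLI) mode: an interactive console

HDFS commands can also be run from inside the Hive CLI, and they return faster than the standalone hdfs command because the CLI session is already running and no new connection has to be established.

-- run an HDFS command from inside the Hive CLI
dfs -ls /;
-- a leading exclamation mark runs a Linux shell command
!ls /;

2. Script mode (the most common choice in production)

Run hive --service cli --help to see the available options.

# define a variable and assign it a value
hive -d key=value
# run the quoted SQL, then exit
hive -e "<sql>"
# silent mode: suppresses messages such as "OK" and "Time taken" after the SQL runs
hive -S -e "<sql>"
# read and run SQL from the given file
hive -f <file.sql>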

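2.2 Running Hive remotely

1. JDBC: hiveserver2

2. Web GUI (hwi, Hue, etc.)

3. Hive dynamic partitions and bucketing

Dynamic partitions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DynamicPartitionsd

Example for comparison:

# Sample data (the data file must be on the host where the Hive service runs)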
1,小明1,12,man,lol-movie-book,beijin:haidian-shanghai:yangtou
2,小明2,13,woman,lol-movie-book,beijin:haidian-shanghai:yangtou
3,小明3,12,woman,lol-movie-book,beijin:haidian-shanghai:yangtou
4,小明4,13,man,lol-movie-book,beijin:haidian-shanghai:yangtou
5,小明5,12,woman,lol-movie-book,beijin:haidian-shanghai:yangtou
6,小明7,13,man,lol-movie,beijin:haidian-shanghai:yangtou

create table psn21(
id int,
name string,
age int,
gender string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

load data local inpath '/data/data2' into table psn21;

Example: filling the partitions with insert:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries

create table psn22(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

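-- The default mode is strict (at least one partition column must be static);
-- switch to nonstrict first so that both partition columns can be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table psn22 partition(age,gender) select id,name,likes,address,age,gender from psn21;

Hive dynamic partitions
    – Enabling dynamic partitions
        ▪ set hive.exec.dynamic.partition=true;
            – Default: true
        ▪ set hive.exec.dynamic.partition.mode=nonstrict;
            – Default: strict (at least one partition column must be static)
Related parameters:
    – set hive.exec.max.dynamic.partitions.pernode;
        ▪ Maximum number of dynamic partitions that may be created on each node running MR (100)
    – set hive.exec.max.dynamic.partitions;
        ▪ Maximum total number of dynamic partitions that may be created across all nodes running MR (1000)
    – set hive.exec.max.created.files;
        ▪ Maximum number of files that all MR jobs together may create (100000)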
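To make the behaviour of a dynamic-partition insert concrete, here is a small Python sketch that models it under simplifying assumptions: each row is routed to a directory named after its partition values, and strict mode rejects an insert with no static partition column. The function names check_partition_mode and route_rows and the sample rows are hypothetical; this only models the idea and is not how Hive is implemented.

```python
from collections import defaultdict


def check_partition_mode(mode, static_values):
    """In strict mode at least one partition column needs a static value."""
    has_static = any(value is not None for value in static_values.values())
    if mode == "strict" and not has_static:
        raise ValueError("strict mode requires at least one static partition column")


def route_rows(rows, partition_cols):
    """Group rows by partition values, keyed by the partition's relative directory."""
    partitions = defaultdict(list)
    for row in rows:
        path = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # Partition columns live in the directory name, not in the data file itself.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        partitions[path].append(data)
    return dict(partitions)


rows = [
    {"id": 1, "name": "a", "age": 12, "gender": "man"},
    {"id": 2, "name": "b", "age": 13, "gender": "woman"},
    {"id": 3, "name": "c", "age": 12, "gender": "woman"},
    {"id": 4, "name": "d", "age": 13, "gender": "man"},
]

# partition(age, gender) with no static value is only accepted in nonstrict mode
check_partition_mode("nonstrict", {"age": None, "gender": None})
routed = route_rows(rows, ["age", "gender"])
for path, data in sorted(routed.items()):
    print(path, [r["id"] for r in data])
```

Four distinct (age, gender) combinations produce four partition directories, which is why the limits on the number of dynamic partitions listed above matter when a column has many distinct values.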

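3. Bucketing

Purposes of bucketing:

1. Sampling: when working with a large data set, queries can be developed and revised against a sample of the data rather than the whole set, which makes development more efficient.

2. Map-side join: more efficient query processing. Buckets give a table extra structure that Hive can exploit for some queries. Specifically, a join of two tables that are bucketed on the same columns (which include the join columns) can be implemented efficiently as a map-side join. A typical case is a JOIN on a column that both tables share: if both tables are bucketed on that column, only the buckets holding the same column values need to be joined with each other.

-- Enable bucketing support
set hive.enforce.bucketing=true;
-- Default: false. When set to true, the MR job automatically allocates as many reduce tasks as there are buckets. (Users can also set the number of reduce tasks themselves through mapred.reduce.tasks, but this is not recommended for bucketing.)
-- In Hive 2.0 and later, bucketing is always enforced and this setting no longer exists.
-- Note: the number of buckets (files) produced by one job equals the number of reduce tasks
-- Load data into a bucketed table
insert into table bucket_table select columns from tbl;
insert overwrite table bucket_table select columns from tbl;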

Demo:

CREATE TABLE demo31( 
id INT,
name STRING,
age INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

# Test data:
1,tom,11
2,cat,22
3,dog,33
4,hive,44
5,hbase,55
6,mr,66
7,alice,77
8,scala,88

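load data local inpath '/data/bucketdata' into table demo31;
-- Create the bucketed table. The CLUSTERED BY clause is what makes the table bucketed;
-- the column (age) and the bucket count (4) are illustrative choices.
CREATE TABLE bucketpsn31( 
id INT,
name STRING,
age INT) 
CLUSTERED BY (age) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table bucketpsn31 select id, name, age from demo31;

-- Check the data in the bucket files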
dfs -cat /user/hive_remote/warhouse/
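The way rows end up in bucket files can be sketched in a few lines of Python. This is a model under stated assumptions, not Hive's code: it assumes the table is clustered by age into 4 buckets, that the hash of an int column is the int itself, and that the bucket is the non-negative hash modulo the bucket count. Newer Hive versions may use a different hash function, so the actual file contents can differ. The function name bucket_for is hypothetical.

```python
from collections import defaultdict


def bucket_for(value, num_buckets):
    """Bucket index for an int column: non-negative hash modulo the bucket count."""
    # Assumption: the hash of an int is the int itself.
    return (value & 0x7FFFFFFF) % num_buckets


# (id, name, age) rows taken from the test data of the demo
rows = [
    (1, "tom", 11), (2, "cat", 22), (3, "dog", 33), (4, "hive", 44),
    (5, "hbase", 55), (6, "mr", 66), (7, "alice", 77), (8, "scala", 88),
]

buckets = defaultdict(list)
for _id, name, age in rows:
    buckets[bucket_for(age, 4)].append(name)

for index in sorted(buckets):
    print(index, buckets[index])
```

Under these assumptions each of the four buckets receives two rows. Because every bucket is a separate file, a query that samples a single bucket (for instance with a TABLESAMPLE clause on the bucketing column) only has to read roughly a quarter of the data.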

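4. Hive views

Views are a common feature of relational databases.

When a query spans several tables and becomes complex, a view can simplify it.

For example, the result of a multi-table join can be defined as a view.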

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
  [COMMENT view_comment]
  [TBLPROPERTIES (property_name = property_value, ...)]
  AS SELECT ...;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView

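-- List the Hive tables
show tables;
-- Alternatively, query the metastore database in MySQL (hive_remote); the TBL_TYPE column of the TBLS table shows each table's type (managed table / external table / view)
select * from TBLS;
-- Create a view
create view view_psn as select * from psn where id<5;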
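The idea that a view is only a stored query can be tried out without a Hive cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it illustrates the general SQL behaviour of views rather than Hive itself; the psn table here has only two made-up columns, and Hive-specific features are not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psn (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO psn VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 9)],  # ids 1..8, names are made up
)

# Same statement as in the article: the view keeps only the rows with id < 5
conn.execute("CREATE VIEW view_psn AS SELECT * FROM psn WHERE id < 5")
view_ids = [row[0] for row in conn.execute("SELECT id FROM view_psn ORDER BY id")]
print(view_ids)

# The view stores the query, not the data, so new base-table rows show up in it
conn.execute("INSERT INTO psn VALUES (0, 'user0')")
count_after_insert = conn.execute("SELECT COUNT(*) FROM view_psn").fetchone()[0]
print(count_after_insert)
```

Querying the view reruns its defining SELECT against the base table each time, which is why the row inserted afterwards is visible through the view.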

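4.2 Hive indexes

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterIndex

Purpose: speed up query lookups.

A Hive index stores the directory of the data file and the offsets within it.

Note: index support was removed in Hive 3.0, so the statements below apply to earlier versions.

Create index syntax: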

CREATE INDEX index_name
  ON TABLE base_table_name (col_name, ...)
  AS index_type
  [WITH DEFERRED REBUILD]
  [IDXPROPERTIES (property_name=property_value, ...)]
  [IN TABLE index_table_name]
  [
     [ ROW FORMAT ...] STORED AS ...
     | STORED BY ...
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (...)]
  [COMMENT "index comment"];

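-- Create an index on the name column of table psn2; the index data is stored in t1_index_table
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild
in table t1_index_table;

-- Rebuild the index (an index created WITH DEFERRED REBUILD only takes effect after a rebuild)
ALTER INDEX t1_index ON psn2 REBUILD;
-- After new rows are inserted into psn2, the rebuild must be run again so that the new index data is added

-- Drop the index
DROP INDEX IF EXISTS t1_index ON psn2;

Right after the index is created, the index table is empty; it has to be filled manually with the rebuild statement.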
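The deferred-rebuild behaviour can be modelled with a toy index that maps a column value to the file and byte offsets where it occurs. This is a rough Python sketch of the idea only: the class name CompactIndexSketch, the file names, and the sample rows are hypothetical, and the real index table layout is more involved.

```python
from collections import defaultdict


class CompactIndexSketch:
    """Toy index: value of the first column -> list of (file name, byte offset)."""

    def __init__(self):
        self.entries = {}  # empty until rebuild() is called, like a deferred index

    def rebuild(self, files):
        """Rescan every file and record where each value occurs."""
        entries = defaultdict(list)
        for file_name, lines in files.items():
            offset = 0
            for line in lines:
                key = line.split(",")[0]
                entries[key].append((file_name, offset))
                offset += len(line.encode("utf-8")) + 1  # +1 for the newline
        self.entries = dict(entries)

    def lookup(self, key):
        return self.entries.get(key, [])


# Hypothetical data files of the base table: name,age
files = {"000000_0": ["tom,11", "cat,22"], "000001_0": ["tom,33"]}

index = CompactIndexSketch()
before_rebuild = index.lookup("tom")   # nothing indexed yet

index.rebuild(files)
after_rebuild = index.lookup("tom")

files["000001_0"].append("dog,44")     # new row inserted into the base table
stale = index.lookup("dog")            # not visible until the next rebuild
index.rebuild(files)
fresh = index.lookup("dog")

print(before_rebuild, after_rebuild, stale, fresh)
```

The lookups before each rebuild return nothing, which mirrors the two points above: a freshly created index is empty, and rows added later are not covered until the index is rebuilt again.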

4.3 Hive joins

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Joins in Hive work much as they do in MySQL.

Join Syntax

join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
 
table_reference:
    table_factor
  | join_table
 
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )
 
join_condition:
    ON expression
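Of the join types in the grammar above, LEFT SEMI JOIN is the one that tends to be least familiar. The following Python sketch contrasts it with inner and left outer joins on small in-memory tables. The function names and the sample rows are made up for illustration, and the functions only model the result sets, not how Hive executes the joins.

```python
def inner_join(left, right, key):
    """One output row per matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_outer_join(left, right, key):
    """Like inner join, but unmatched left rows are kept with None on the right side."""
    right_cols = {col for r in right for col in r if col != key}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{col: None for col in right_cols}})
    return out


def left_semi_join(left, right, key):
    """Each left row at most once if any right row matches; only left columns."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


people = [{"id": 1, "name": "tom"}, {"id": 2, "name": "cat"}, {"id": 3, "name": "dog"}]
addresses = [
    {"id": 1, "city": "beijing"},
    {"id": 1, "city": "shanghai"},
    {"id": 3, "city": "beijing"},
]

inner = inner_join(people, addresses, "id")
outer = left_outer_join(people, addresses, "id")
semi = left_semi_join(people, addresses, "id")
print(len(inner), len(outer), len(semi))
```

The semi join returns tom only once even though two address rows match, and it exposes no columns from the right-hand table, which is what distinguishes it from an inner join.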

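4.4 Hive authorization

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization#LanguageManualAuthorization-1StorageBasedAuthorizationintheMetastoreServer

According to the official documentation there are four authorization modes:

Storage Based Authorization in the Metastore Server
SQL Standards Based Authorization in HiveServer2 (the recommended mode)
Authorization using Apache Ranger & Sentry
Old default Hive Authorization (Legacy Mode)

The first, storage based authorization, protects the metadata in the Metastore but offers no finer-grained access control (for example at column or row level). The third relies on the third-party components Ranger and Sentry. The fourth, Hive's old default authorization model, was designed only to keep users from making mistakes, not to stop malicious users from reaching data they are not authorized to access.

The second mode is recommended: SQL standards based authorization, an authorization model that is fully compatible with SQL.

Hive itself only manages roles and has no user management of its own; the default roles are admin and public.

Restrictions that apply once this authorization mode is enabled:

1. The dfs, add, delete, compile, and reset commands are disabled.
2. Changing Hive configuration with the set command is restricted – (the permitted parameters can be configured through hive.security.authorization.sqlstd.confwhitelist in hive-site.xml)
3. Adding and dropping functions and macros is only open to users with the admin role.
4. User-defined functions (permanent UDFs are supported) can be created by a user with the admin role and then used by all other users.
5. The Transform feature is disabled.

Edit the hive-site.xml configuration file of the Hive service on node4: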

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
</property>
<property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
</property>
<property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
</property>
<!-- Important: the root and hadoop accounts are given the admin role -->
<property>
    <name>hive.users.in.admin.role</name>
    <value>root,hadoop</value>
</property>
<property>
    <name>hive.security.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
    <name>hive.security.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>

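After the change, start the Hive service:

hiveserver2

Connect from node5 with beeline:

beeline -u jdbc:hive2://node4:10000/default

-- This is where the problem shows: table names can be listed, but table data cannot be queried, because the default user has no privilege to query the table
show tables;
select * from psn;

show current roles;

# Reconnect as a user listed in hive.users.in.admin.role in order to create roles
beeline
!connect jdbc:hive2://node4:10000/default hadoop 123

-- A user can hold several roles; switch to the admin role first
set role admin;

-- Create a role
create role test;

-- Grant the admin role to role test (a grant to a role)
grant admin to role test with admin option;

-- Revoke the admin role from role test
revoke admin from role test;

-- Show the roles granted to a role
show role grant role [roleName];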
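The sequence above (connect, switch to admin, create a role, grant, revoke) can be summarised in a toy Python model. It is a deliberately simplified sketch: the class RoleSketch and its methods are hypothetical, a new session is assumed to start with only the public role active, and membership in custom roles is not tracked. It models the order of operations, not Hive's real authorization code.

```python
class RoleSketch:
    """Toy model of role switching and role-to-role grants."""

    def __init__(self, admin_users):
        self.admin_users = set(admin_users)  # plays the part of hive.users.in.admin.role
        self.role_grants = {"admin": set(), "public": set()}  # role -> roles granted to it
        self.active = {}                     # user -> currently active roles

    def connect(self, user):
        # Simplification: a fresh session has only the public role active.
        self.active[user] = {"public"}

    def set_role(self, user, role):
        if role not in self.role_grants:
            raise ValueError(f"role {role} does not exist")
        if role == "admin" and user not in self.admin_users:
            raise PermissionError(f"{user} does not belong to the admin role")
        self.active[user] = {role}

    def _require_admin(self, user):
        if "admin" not in self.active[user]:
            raise PermissionError("run 'set role admin' first")

    def create_role(self, user, role):
        self._require_admin(user)
        self.role_grants.setdefault(role, set())

    def grant_role(self, user, granted, to_role):
        self._require_admin(user)
        self.role_grants[to_role].add(granted)

    def revoke_role(self, user, granted, from_role):
        self._require_admin(user)
        self.role_grants[from_role].discard(granted)


hive = RoleSketch(admin_users=["root", "hadoop"])
hive.connect("hadoop")

denied = False
try:
    hive.create_role("hadoop", "test")       # fails: admin role is not active yet
except PermissionError:
    denied = True

hive.set_role("hadoop", "admin")             # set role admin;
hive.create_role("hadoop", "test")           # create role test;
hive.grant_role("hadoop", "admin", "test")   # grant admin to role test ...
granted = set(hive.role_grants["test"])
hive.revoke_role("hadoop", "admin", "test")  # revoke admin from role test;
print(denied, granted, hive.role_grants["test"])
```

The first create_role call is rejected because being listed as an admin user is not enough on its own; the admin role has to be activated in the session before administrative statements are accepted.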
