Hive Basics: Key Points Summary

Latest recommended article published on 2021-11-09 15:28:33

Author: 仙女的崽儿

Category: 大数据的学习 (Big Data Learning)

Article link: https://blog.csdn.net/xhzxhz12/article/details/116209736

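I. Table types

Internal table: also called a managed table. By default its table directory is created on HDFS under /user/hive/warehouse/, inside the directory of the database it belongs to.
External table: its directory is created at the path given by LOCATION in the create statement. If LOCATION is not specified, the location is the same as for an internal table. External tables are typically used for data provided by third parties or shared data.

Differences between internal and external tables

1. Difference at creation time: only two keywords differ, EXTERNAL and LOCATION. For example: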

Internal table

create table t_inner(id int);

External table

create external table t_outer(id int) location 'hdfs:///AA/BB/XX';

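2. Creating a Hive table does two things:

Creates the table directory on HDFS
Creates the table's descriptive data (metadata) in the metastore database (MySQL)

3. Behavior on drop differs

On drop, the metadata is removed for both kinds of table
On drop, the table directory of an internal table is deleted, but the table directory of an external table is kept

4. Use cases

Internal table: for testing or small amounts of data that you are free to modify or delete at any time
External table: for data that must not be deleted when the table is dropped (recommended). For this reason, the lowest-layer tables of a data warehouse use external tables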

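5. Converting between internal and external tables

alter table a1 set tblproperties('EXTERNAL'='TRUE');
-- internal to external; TRUE must be written in uppercase!

alter table a1 set tblproperties('EXTERNAL'='false');
-- external to internal; false works in either case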

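II. Ways to get data into Hive
2.1 Loading data (two ways; run inside the Hive shell):

One loads from the local Linux filesystem into Hive

load data local inpath 'file_path' into table table_name;

The other loads from HDFS into Hive

load data inpath 'hdfs_file_path' into table table_name;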

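What loading actually does:

If the data is local, loading copies it into the table directory on HDFS
If the data is already on HDFS, loading moves it into the table directory on HDFS
Loading the same data again does not overwrite the existing data; the file is added again, so the rows are duplicated (use overwrite into table to replace the existing data)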
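
The copy versus move behavior described above can be modeled with ordinary files. The following is a minimal Python sketch, not Hive code: the function name load_into_table, the temporary directories, and the _copy_N naming for repeated loads are illustrative assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def load_into_table(src, table_dir, local):
    """Model of LOAD DATA: copy a local file, move a file already on the cluster."""
    table_dir.mkdir(parents=True, exist_ok=True)
    dest = table_dir / src.name
    # A repeated load must not overwrite: pick a new file name instead.
    # (The "_copy_N" suffix is an illustrative naming choice.)
    counter = 1
    while dest.exists():
        dest = table_dir / f"{src.stem}_copy_{counter}{src.suffix}"
        counter += 1
    if local:
        shutil.copy(src, dest)            # LOAD DATA LOCAL INPATH: source stays
    else:
        shutil.move(str(src), str(dest))  # LOAD DATA INPATH: source is moved away
    return dest


# Hypothetical stand-ins for the local disk and the table directory.
root = Path(tempfile.mkdtemp())
table_dir = root / "warehouse" / "t_inner"
local_file = root / "user.txt"
local_file.write_text("1,tom\n")

first = load_into_table(local_file, table_dir, local=True)
second = load_into_table(local_file, table_dir, local=True)
print(sorted(p.name for p in table_dir.iterdir()))
```

After both loads the table directory holds two files with the same rows, which is why a query on the table would see the data twice.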

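2.2 Filling a table with insert into

First create a new table with the same structure as the old table, then fill the new table from a query on the old one

insert into new_table
select * from old_table;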

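2.3 Cloning a table (three ways)

2.3.1 Clone only the table structure, without data

create table if not exists new_table like old_table;

2.3.2 Clone a table with data (using like)

create table if not exists new_table like old_table location 'hdfs_path_of_the_old_table_data';

2.3.3 Clone a table with data (using as)

create table new_table
as
select * from old_table;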

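III. Hive shell tips (*)
1. List all Hive parameters

-- just type set at the Hive prompt
hive> set;

2. Run a single Hive command
The shell option -e runs one command and then exits

[root@master hive]# hive -e "select * from cat"

Tip 2: an external command can quickly look up a variable's value: hive -S -e "set" | grep cli.print
-S is silent mode, which suppresses the extra output

3. Run a SQL file on its own
Use the option -f followed by the file name. This is common for running standalone SQL scripts and for data import jobs

[root@master hive]# hive -f /path/cat.sql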

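4. Run Linux commands
In the Hive shell, prefix the command with ! and end it with a semicolon (;) to run a Linux command

hive> ! pwd ;

5. Run HDFS commands
HDFS dfs commands can be run from the Hive shell without typing the hdfs or hadoop prefix

hive> dfs -ls /tmp;

6. Command history and auto-completion
In the Hive shell, the up and down arrow keys browse the command history
If you forget a command, press Tab to complete it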

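7. Show the current database:
This is configured in hive-site.xml. The property below is shown with its default value, false; set the value to true to show the current database in the prompt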

<property>
	 <name>hive.cli.print.current.db</name>
	 <value>false</value>
	 <description>Whether to include the current database in the Hive prompt.</description>
</property>

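8. Set the parameter for the current session:

hive> set hive.cli.print.current.db=true;

9. Check the current value of a parameter:

Tip 1: typing set followed by a parameter name in the shell prints the value Hive currently has for it

hive> set hive.cli.print.current.db;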

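IV. Partitions in Hive

1. Understanding partitions (why partition)

As a system runs longer, its tables grow larger and larger. A Hive query normally scans the whole table, so it reads a lot of data it does not need, which greatly reduces query efficiency. Partitioning lets a query read only the partitions it asks for.

2. Partition syntax (partitioned by)

-- add the following partition clause when creating the Hive table
[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]

Notes on partitions:

Hive partition names are case-insensitive
A Hive partition column is a pseudo-column, but it can still be used in operations such as filtering
A table can have one or more partitions, and a partition can itself contain one or more sub-partitions.

3. What a partition really is

A directory created under the table directory, or under another partition's directory. The partition directory is named column=value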

4. Creating partitions
4.1 Single-level partition: the table has only one partition column, so specifying that one partition is enough when inserting data.

create table if not exists part1(id int,name string)
partitioned by (dt string)
row format delimited fields terminated by ',';

4.2 Two-level and three-level partitions are created the same way as a single-level one. Only the columns in partitioned by differ: two columns for two levels, three columns for three levels

partitioned by (year string,month string)
partitioned by (year string,month string,day string)

5. Loading data

Single-level load:
load data local inpath "/opt/data/user.txt" into table part1 partition(dt="2019-08-08");
Two-level load:
load data local inpath '/opt/soft/data/user.txt' into table part2 partition(year='2018',month='03');
Three-level load:
load data local inpath '/opt/soft/data/user.txt' into table part3 partition(year='2018',month='03',day='21');
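
To make the column=value directory naming concrete, here is a small Python sketch that builds the directory each of the three loads would write into. The helper partition_path is hypothetical, and the warehouse root assumes the default location with tables in the default database.

```python
def partition_path(table_dir, spec):
    """Build a partition directory: one `column=value` level per partition column."""
    levels = [f"{col}={val}" for col, val in spec.items()]
    return "/".join([table_dir.rstrip("/")] + levels)


# Assumption: default warehouse location, tables in the default database.
WAREHOUSE = "/user/hive/warehouse"

print(partition_path(f"{WAREHOUSE}/part1", {"dt": "2019-08-08"}))
print(partition_path(f"{WAREHOUSE}/part2", {"year": "2018", "month": "03"}))
print(partition_path(f"{WAREHOUSE}/part3", {"year": "2018", "month": "03", "day": "21"}))
```

Each additional partition column adds one more directory level, in the order the columns appear in partitioned by.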

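V. Partition types

Static partitioning: data is loaded into a partition whose value is given explicitly; the partition name is specified when adding a partition or loading into it.
Dynamic partitioning: the partition values are not known in advance; the partitions to create are determined from the values in the data.
Mixed partitioning: both static and dynamic partition columns are used.

VI. Dynamic partitioning

Configuration properties for dynamic partitioning
Whether dynamic partitioning is enabled
hive.exec.dynamic.partition=true
Set non-strict mode
hive.exec.dynamic.partition.mode=nonstrict
Maximum number of dynamic partitions
hive.exec.max.dynamic.partitions=1000
Maximum number of dynamic partitions per node
hive.exec.max.dynamic.partitions.pernode=100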

Example:

create table dy_part1( id int, name string )
partitioned by (dt string)
row format delimited fields terminated by ',';

Load data into the dynamically partitioned table with insert into (note: load cannot be used here)
First create a staging table:
create table temp_part(
id int,
name string,
dt string )
row format delimited fields terminated by ',';

Load the data:
load data local inpath '/hivedata/student.txt' into table temp_part;

-- in strict mode the insert below is rejected; run this first
set hive.exec.dynamic.partition.mode=nonstrict;

insert into dy_part1 partition(dt)
select id,name,dt from temp_part;
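
The dynamic insert above takes the partition value from the last column of the select, by position. The Python sketch below models that routing under stated assumptions: the function route_dynamic and the sample rows are made up for illustration and are not taken from student.txt.

```python
from collections import defaultdict


def route_dynamic(rows, num_partition_cols=1):
    """Group rows by their trailing partition columns, as a dynamic insert does."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]   # columns stored in the data files
        keys = row[-num_partition_cols:]   # columns that pick the partition
        partitions[tuple(keys)].append(tuple(data))
    return dict(partitions)


# Hypothetical rows shaped like temp_part: (id, name, dt).
temp_part = [
    (1, "tom", "2019-08-08"),
    (2, "amy", "2019-08-09"),
    (3, "bob", "2019-08-08"),
]

for keys, data_rows in route_dynamic(temp_part).items():
    print("dt=" + keys[0], data_rows)
```

One partition is created per distinct dt value, which is why the number of distinct values matters for the partition limits configured earlier.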

VII. Mixed partitioning example

Create a table for mixed partitioning
create table dy_part2(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';

Create a staging table and load data
-- create the staging table
create table temp_part2(
id int,
name string,
year string,
month string,
day string
)
row format delimited fields terminated by ',';

-- load data (sample data: data/student.txt)
load data local inpath '/opt/data/student.txt' into table temp_part2;

Insert data into the partitioned table
Note: select * cannot be used here. The query must list the data columns followed by the dynamic partition columns, in order; the static column year is not selected
Wrong:
insert into dy_part2 partition (year='2018',month,day) select * from temp_part2;
Correct:
insert into dy_part2 partition (year='2018',month,day)
select id,name,month,day from temp_part2;

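VIII. Notes on partitioned tables

Hive partitions use columns outside the table's data columns. A partition column is a pseudo-column, but it can be used to filter queries.
Using Chinese for partition columns is not recommended
Dynamic partitioning is generally not recommended. It runs a MapReduce job to query the data, and too many partitions can turn the NameNode and ResourceManager into performance bottlenecks. Estimate the number of partitions as far as possible before using dynamic partitioning.
Partition properties can be changed by modifying the metadata and the data on HDFS.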

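IX. Bucketing

1. The concept of bucketing (why bucket)

When a single partition or table holds too much data and partitioning cannot divide it any more finely, bucketing splits the data at a finer granularity.

2. Bucketing syntax (clustered by, sorted by)

[CLUSTERED BY (COLUMNNAME, ...)
[SORTED BY (COLUMNNAME [ASC|DESC], ...)] INTO NUM_BUCKETS BUCKETS]

3. Keyword and principle: bucket

Bucketing works on the same principle as HashPartitioner in MR: the hash of the key modulo a count

In MR: the hash of the key modulo the number of reduce tasks
In Hive: the hash of the bucketing column modulo the number of buckets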
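
The hash-modulo rule can be tried out with a few integers. This is a minimal Python sketch under one assumption: an integer key hashes to its own value, as Java's Integer.hashCode does. Other column types, and Hive versions that use a different bucketing hash, would place rows differently.

```python
def bucket_for(key, num_buckets):
    """Return the 0-based bucket index for an integer bucketing key."""
    # Assumption: the hash of an int is the int itself. The mask keeps the
    # value non-negative, mirroring HashPartitioner's
    # (hash & Integer.MAX_VALUE) % numReduceTasks.
    return (key & 0x7FFFFFFF) % num_buckets


buckets = {}
for key in range(1, 9):  # hypothetical ids 1..8
    buckets.setdefault(bucket_for(key, 4), []).append(key)

for index in sorted(buckets):
    print(f"bucket file {index}: ids {buckets[index]}")
```

With 4 buckets, ids 4 and 8 share a bucket, as do 1 and 5, and so on: rows with equal keys always land in the same bucket, which is what sampling and bucketed joins rely on.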

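4. Why bucketing matters

It stores the data in a bucketed layout: rows are hashed on the bucketing column and saved bucket by bucket
Sampling and JOIN on bucketed tables can be more efficient; bucketing is mostly used for sampling queries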

5. Bucketing example:

1. Create the bucketed table
create table if not exists buc13(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';

2. Create a staging table
create table if not exists temp_buc1(
id int,
name string,
age int
)
row format delimited fields terminated by ',';

3. Load data into the staging table
load data local inpath '/hivedata/buc1.txt' into table temp_buc1;

4. Insert the data into the bucketed table with a bucketing query
insert overwrite table buc13
select id,name,age from temp_buc1
cluster by (id);

5. Enable enforced bucketing (set this before running the insert in step 4)
-- bucketing requires enforced bucketing mode to be turned on
set hive.enforce.bucketing=true;

6. If the number of reducers differs from the total number of buckets, set it manually
-- -1 lets Hive decide the number of reducers as needed
set mapreduce.job.reduces=-1;
Here it is set to 4
set mapreduce.job.reduces=4;

7. Create a bucketed table with a sort column
create table if not exists buc8(
id int,
name string,
age int
)
clustered by (id)
sorted by (id) into 4 buckets
row format delimited fields terminated by ',';

8. Insert the data
insert overwrite table buc8
select id,name,age from temp_buc1
cluster by (id);

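X. Query examples on bucketed tables

Assume the table has 4 buckets
Query the first bucket: select * from buc3 tablesample(bucket 1 out of 4 on sno);
Query the first and third buckets: select * from buc3 tablesample(bucket 1 out of 2 on sno);
Query the first half of the first bucket: select * from buc3 tablesample(bucket 1 out of 8 on sno);

tablesample(bucket x out of y on id)

Syntax:
tablesample(bucket x out of y on sno)
Note: tablesample must come immediately after the table name
x: the bucket to start from
y: the total bucket count used for the sample; y can be a multiple or a factor of the table's bucket count, and x cannot be greater than y

x is the bucket at which sampling starts. For example, if the table has 32 buckets,
tablesample(bucket 3 out of 16) samples (32/16 =) 2 buckets of data,
namely bucket 3 and bucket (3+16 =) 19.

When y < bucket count, y is called a factor. Example: (bucket 1 out of 2 on sno) returns the first and third buckets
When y > bucket count, y is called a multiple. Example: (bucket 1 out of 8 on sno); a row is selected by the remainder m of sno modulo y
Explanation: with x=1, 4 buckets, and y=8, any row whose sno modulo y equals 0 is treated as being in bucket 1 of the sample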
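
The x and y arithmetic above can be checked with a short Python sketch. Both helpers are hypothetical models, not Hive APIs, and they assume an integer sampling column that hashes to its own value.

```python
def sampled_buckets(x, y, num_buckets):
    """1-based bucket numbers read when y is a factor of the bucket count."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y != 0:
        raise ValueError("y must be a factor of num_buckets for this helper")
    # Start at bucket x and step by y until the buckets run out.
    return list(range(x, num_buckets + 1, y))


def row_is_sampled(key, x, y):
    """Row-level rule: keep rows whose hash modulo y equals x - 1."""
    # Assumption: integer keys hash to themselves.
    return (key & 0x7FFFFFFF) % y == x - 1


print(sampled_buckets(1, 4, 4))    # only the first bucket
print(sampled_buckets(1, 2, 4))    # first and third buckets
print(sampled_buckets(3, 16, 32))  # buckets 3 and 19
print([k for k in range(1, 17) if row_is_sampled(k, 1, 8)])
```

The row-level rule also covers the case where y is larger than the bucket count: with y=8 only keys divisible by 8 are kept, which is half of the rows stored in the first of 4 buckets.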

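XI. Bucket sampling queries

Note: y does not have to be a multiple of the total bucket count, but in that case Hive re-buckets the data and queries it again.

Query rows whose sno is odd
select * from buc3 tablesample(bucket 2 out of 2 on sno);
Query people whose sno is even and whose age is over 30
select * from buc3 tablesample(bucket 1 out of 2 on sno) where age>30;

Note: the following raises an error, because tablesample must come immediately after the table name
select * from buc3 where age>30 tablesample(bucket 1 out of 2 on sno);

Note: watch out for encoding problems when the data contains Chinese; with the wrong encoding the query returns no results

Other query notes:
select * from buc3 limit 3; returns three rows

select * from buc3 tablesample(3 rows); returns three rows

select * from buc3 tablesample(13 percent);
Returns about 13% of the data. If the percentage amounts to less than one row, at least one row is returned; if the percentage is 0, the first bucket is returned

select * from buc3 tablesample(68B);
Returns the data contained in 68 bytes; size units such as B, K, M, and G can be used. If the size is 0B, the first bucket is returned by default

To pick 3 rows at random:
select * from t_stu order by rand() limit 3; returns 3 random rows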

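XII. Summary of bucketed tables

Definition
clustered by (id): specifies the bucketing column
sorted by (id asc|desc): specifies the sort order, that is, the order in which the data is expected to be stored
Loading data
cluster by (id): specifies the column that getPartition hashes on; the same column is also the sort column, sorted asc
distribute by (id): specifies the column that getPartition hashes on
sort by (name asc | desc): specifies the sort column
When loading data:
insert overwrite table buc3
select id,name,age from temp_buc1
distribute by (id) sort by (id asc);
has the same effect as the statement below
insert overwrite table buc4
select id,name,age from temp_buc1
cluster by (id) ;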
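
The equivalence between distribute by plus sort by and cluster by can be pictured as two steps: route each row to a reducer by hash, then sort inside each reducer. The Python sketch below is an illustrative model only; the function names, the sample rows, and the assumption that integer keys hash to themselves are not from the original.

```python
def distribute_sort(rows, dist_col, sort_col, num_reducers):
    """Model `distribute by` + `sort by`: hash-route rows, then sort per reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys hash to themselves.
        reducers[(row[dist_col] & 0x7FFFFFFF) % num_reducers].append(row)
    # `sort by` orders rows within each reducer, not across the whole result.
    return [sorted(part, key=lambda r: r[sort_col]) for part in reducers]


def cluster(rows, col, num_reducers):
    """Model `cluster by`: distribute and sort ascending on the same column."""
    return distribute_sort(rows, col, col, num_reducers)


# Hypothetical rows shaped like temp_buc1: (id, name, age).
rows = [(7, "g", 30), (2, "b", 25), (5, "e", 41), (4, "d", 19),
        (1, "a", 33), (8, "h", 52), (3, "c", 28), (6, "f", 37)]

for index, part in enumerate(cluster(rows, 0, 4)):
    print(index, [r[0] for r in part])
```

Each reducer's output is sorted, but the overall result is not globally ordered; that is the difference from order by.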

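Notes

Partitioning uses columns outside the table's data columns; bucketing uses columns inside the table.
Bucketing manages data at a finer granularity and is mostly used for sampling and joins.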
