Managed Tables
Managed tables, also called internal tables, are the default: any table created without the external keyword is a managed table.
Example:
hive> create table if not exists stocks (exchange1 string, symbol string, ymd string) row format delimited fields terminated by ',';
hive> load data inpath '/user/hadoop/stocks/stocks.csv' into table stocks;  -- load data
hive> drop table stocks;
Characteristics:
1) Within the same filesystem (HDFS), loading data actually moves the files into Hive's warehouse directory.
2) Dropping a managed table deletes its data along with it; the original files no longer exist.
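To check which kind of table you are dealing with, a quick sketch using the standard describe command (run against the stocks table from the example above):

```sql
-- Show full table metadata; the "Table Type" field reads
-- MANAGED_TABLE for managed tables and EXTERNAL_TABLE for external ones.
describe formatted stocks;
```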
External Tables
Example:
hive> create external table if not exists stocks (exchange1 string, symbol string) row format delimited fields terminated by ',';
hive> load data inpath '/user/hadoop/stocks/stocks.csv' into table stocks;  -- load data
hive> drop table stocks;
Characteristics:
1) Within the same filesystem (HDFS), loading data still moves the files into Hive's warehouse directory.
2) Dropping an external table leaves the data in Hive's warehouse directory; what is actually deleted is the metadata (the schema) stored in MySQL (the metastore).
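External tables are often created pointing at data that already lives outside the warehouse, using the standard location clause. A minimal sketch (the path here is hypothetical):

```sql
-- Point the table at an existing HDFS directory; drop table removes
-- only the metastore entry, never the files under this path.
create external table if not exists stocks_ext (exchange1 string, symbol string)
row format delimited fields terminated by ','
location '/user/hadoop/stocks';
```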
Partitioned Tables
Hive can organize a table into partitions, a mechanism for coarsely dividing a table according to the values of its partition columns (partition column).
Partitions speed up queries over slices of the data; partition columns are declared with the partitioned by keyword.
-
Static Partitions
Example:
1) Create table travel, partitioned by date dt and city city, using ',' as the field delimiter:
hive (default)> create table travel (id bigint, intro string) partitioned by (dt string, city string) row format delimited fields terminated by ',';
2) Create three files: travel_beijing, travel_changchun, travel_shanghai

travel_beijing:
1,changcheng 1day
2,gugong 2days
3,tiantan 1day
4,niaochao 1day
5,shuilifang 2day
6,wangfujing 3 days

travel_changchun:
1,diaosu 2days
2,changying 3days
3,jingyuetan 1day
4,weimanhuanggong 2days

travel_shanghai:
1,waitan 1day
2,mingzhuta 2 days
3,dishini 3days
4,gouwu 4days
3) Load the data from the local filesystem into travel:
load data local inpath '/home/dragon/travel_beijing' into table travel partition (dt='2018-08-09', city='beijing');
load data local inpath '/home/dragon/travel_changchun' into table travel partition (dt='2018-08-10', city='changchun');
load data local inpath '/home/dragon/travel_shanghai' into table travel partition (dt='2018-08-10', city='shanghai');
4) Inspect how the data is organized (note the nesting: one directory level per partition column):
hive (default)>dfs -lsr /user/hive/warehouse/
/user/hive/warehouse/travel
/user/hive/warehouse/travel/dt=2018-08-09
/user/hive/warehouse/travel/dt=2018-08-09/city=beijing
/user/hive/warehouse/travel/dt=2018-08-09/city=beijing/travel_beijing
/user/hive/warehouse/travel/dt=2018-08-10
/user/hive/warehouse/travel/dt=2018-08-10/city=changchun
/user/hive/warehouse/travel/dt=2018-08-10/city=changchun/travel_changchun
/user/hive/warehouse/travel/dt=2018-08-10/city=shanghai
/user/hive/warehouse/travel/dt=2018-08-10/city=shanghai/travel_shanghai
5) Query the data
hive (default)> select * from travel;
Notice two extra columns, travel.dt and travel.city. They do not exist in the actual data files; we call them partition columns.
travel.id	travel.intro	travel.dt	travel.city
1	changcheng 1day	2018-08-09	beijing
2	gugong 2days	2018-08-09	beijing
3	tiantan 1day	2018-08-09	beijing
4	niaochao 1day	2018-08-09	beijing
5	shuilifang 2day	2018-08-09	beijing
6	wangfujing 3 days	2018-08-09	beijing
1	diaosu 2days	2018-08-10	changchun
2	changying 3days	2018-08-10	changchun
3	jingyuetan 1day	2018-08-10	changchun
4	weimanhuanggong 2days	2018-08-10	changchun
1	waitan 1day	2018-08-10	shanghai
2	mingzhuta 2 days	2018-08-10	shanghai
3	dishini 3days	2018-08-10	shanghai
4	gouwu 4days	2018-08-10	shanghai
hive (default)> select * from travel where city='shanghai' and id=2;
travel.id	travel.intro	travel.dt	travel.city
2	mingzhuta 2 days	2018-08-10	shanghai
6) View the partitions
hive (default)> show partitions travel;
partition
dt=2018-08-09/city=beijing
dt=2018-08-10/city=changchun
dt=2018-08-10/city=shanghai
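Partitions can also be managed directly with standard alter table statements; a sketch (the dt/city values below are made up for illustration):

```sql
-- Add an empty partition ahead of loading data into it.
alter table travel add partition (dt='2018-08-11', city='shenzhen');
-- Drop a partition; on a managed table this also deletes its data directory.
alter table travel drop partition (dt='2018-08-11', city='shenzhen');
```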
7) Drawback of static partitioning:
The partition column values must be set by hand for every load; with many partitions this becomes tedious.
-
Dynamic Partitions
Dynamic partitioning workflow:
First create an unpartitioned managed table, then partition it dynamically on the desired columns (create a partitioned table and insert the data into it). A stock-data example:
1) Set the dynamic-partition properties:
set hiveconf:hive.exec.dynamic.partition=true;
set hiveconf:hive.exec.dynamic.partition.mode=nonstrict;
set hiveconf:hive.exec.max.dynamic.partitions=100000;
set hiveconf:hive.exec.max.dynamic.partitions.pernode=100000;
-- Alternatively, put these statements in ~/.hiverc; the client loads that file on startup.
2) Create a stock table stocks:
hive (hive2)> create table stocks (exchange1 string, ymd string, price_open float) row format delimited fields terminated by ',';
3) Load some data into stocks:
$ hadoop fs -put /mnt/hgfs/share/dataTest/stocks_17.csv /user/hive/warehouse/hive2.db/stocks/
4) Create a partitioned stock table stocks_partitions (same columns and types as stocks), partitioned by year and month:
hive (hive2)> create table stocks_partitions (exchange1 string, ymd string, price_open float) partitioned by (year string, month string) row format delimited fields terminated by ',';
5) Select from stocks and write the rows into stocks_partitions:
insert overwrite table stocks_partitions partition (year, month)
select exchange1, ymd, price_open, substr(ymd,0,4) as year, substr(ymd,6,2) as month from stocks;
-- substr() extracts a substring: the first four characters of ymd become year,
-- and the two characters starting at position 6 become month.
Note that the select list must match the columns of stocks_partitions, followed by the dynamic partition columns in order.
6) After the job finishes, check the partition layout:
hive (hive2)> dfs -lsr /user/hive/warehouse/hive2.db/;
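With the partitions in place, filtering on the partition columns lets Hive scan only the matching directories (partition pruning). A sketch; the year/month values assume the 2017 stock data loaded above:

```sql
-- Only the year=2017/month=01 directory is read; all other partitions are skipped.
select exchange1, ymd, price_open
from stocks_partitions
where year = '2017' and month = '01';
```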
7) If a task hits its physical memory limit, give the VM more memory and raise the relevant YARN/MapReduce memory parameters.
[Which parameter to raise depends on the job you are running; you can also raise both.]
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
  <description>Amount of memory to request from the scheduler for each map task</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
  <description>Amount of memory to request from the scheduler for each reduce task</description>
</property>
Bucketing
Meaning: bucketing essentially splits a table's data into separate files. The number of buckets in Hive equals the number of reduce tasks in Hadoop.
-
First enable bucketing (the default is false):
hive> set hive.enforce.bucketing=true;
-
Create a bucketed table
hive> create table buckets_logs (id int, line string) clustered by (id) into 3 buckets;
Syntax: clustered by (column) into N buckets
[buckets the rows into N files by the given column]
-
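Rows are assigned to buckets by hashing the clustering column modulo the bucket count. The target bucket can be previewed with Hive's built-in hash and pmod functions; a sketch against the logs source table used below:

```sql
-- Each row lands in bucket pmod(hash(id), 3),
-- corresponding to the files 000000_0 .. 000002_0.
select id, pmod(hash(id), 3) as bucket from logs limit 5;
```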
Write data
hive> insert overwrite table buckets_logs select id, line from logs;
(logs has the same columns as buckets_logs and already contains data.)
-
View the bucket files
hive> dfs -lsr /user/hive/warehouse/hive.db;  -- database path
/user/hive/warehouse/hive.db/buckets_logs
/user/hive/warehouse/hive.db/buckets_logs/000000_0
/user/hive/warehouse/hive.db/buckets_logs/000001_0
/user/hive/warehouse/hive.db/buckets_logs/000002_0
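Bucketed tables also support efficient sampling via the standard tablesample clause; a sketch reading just one of the three buckets:

```sql
-- Read only bucket 1 of 3, i.e. roughly a third of the rows, from a single file.
select * from buckets_logs tablesample (bucket 1 out of 3 on id);
```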