Hive_Hive Data Model_Summary

心影_
Hive Data Model_Data Storage

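View the HDFS file system through the web management UI: http://<IP>:50070/ (the default NameNode web port in Hadoop 2.x; Hadoop 3.x defaults to 9870)

Built on HDFS
No dedicated data storage format; for text tables the default column delimiter is Ctrl-A (\001), not the tab character
Storage structures mainly include: databases, files, tables, and views
Text files can be loaded directly
When creating a table, you can specify the column delimiter and the row delimiter of the Hive data.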
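
As an illustration of the last two points, the sketch below creates a tab-delimited text table and loads a local file into it. The table name delimited_demo and the path /tmp/delimited_demo.txt are placeholders chosen for this example; they are not part of the original walkthrough.

```sql
-- Hypothetical table: columns separated by tabs, rows separated by newlines
create table delimited_demo
(tid int, tname string, age int)
row format delimited
  fields terminated by '\t'
  lines terminated by '\n'
stored as textfile;

-- Load a local text file as-is (placeholder path); the file is copied into
-- the table directory without any format conversion
load data local inpath '/tmp/delimited_demo.txt' into table delimited_demo;
```

Note that the column delimiter can be chosen freely, while Hive text tables in practice accept only '\n' as the row delimiter.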

Hive data model
Tables:
-Table: internal (managed) table
-Partition: partitioned table
-External Table: external table
-Bucket Table: bucket table
Views:

=============================================================================================
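Hive Data Model_Internal Tables
- Conceptually similar to a Table in a relational database.
- Each Table has a corresponding directory in Hive that stores its data.
- All Table data (External Tables excluded) is kept in this directory.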

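-- Default: the data goes under the Hive warehouse directory
create table t1
(tid int, tname string, age int);

-- Store the table data at an explicit HDFS location
create table t2
(tid int, tname string, age int)
location '/mytable/hive/t2';

-- Use a comma as the column delimiter
create table t3
(tid int, tname string, age int)
row format delimited fields terminated by ',';

-- Create a table from a query result
create table t4
as
select * from t1;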

hdfs dfs -cat /usr/hive/warehouse/tablename/000000_0

alter table t1 add columns(english int);
desc t1;
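
To check which directory actually backs an internal table, a more detailed variant of desc can be used. This is a small sketch that assumes t1 still exists at this point:

```sql
-- Prints detailed metadata, including the Location of the table directory
-- and the Table Type (MANAGED_TABLE for an internal table)
describe formatted t1;
```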

drop table t1;
If the HDFS trash (recycle bin) feature is enabled, the table's files are not deleted immediately. They are moved to another directory (the trash) and can be restored from there.


=============================================================================================
Hive Data Model_Partitioned Tables

Prepare the source table:
create table sampledata
(sid int, sname string, gender string, language int, math int, english int)
row format delimited fields terminated by ',' stored as textfile;

Prepare the text data:
sampledata.txt
1,Tom,M,60,80,96
2,Mary,F,11,22,33
3,Jerry,M,90,11,23
4,Rose,M,78,77,76
5,Mike,F,99,98,98

Load the text data into the table:
hive> load data local inpath '/root/pl62716/hive/sampledata.txt' into table sampledata;

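-A Partition corresponds to a dense index on the partition column in a database.
-In Hive, each Partition of a table corresponds to a directory under the table's directory, and all of that Partition's data is stored in that directory.

Create the partitioned table: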
create table partition_table
(sid int, sname string)
partitioned by (gender string)
row format delimited fields terminated by ',';
Insert data into the partitioned table:
hive> insert into table partition_table partition(gender='M') select sid, sname from sampledata where gender='M';
hive> insert into table partition_table partition(gender='F') select sid, sname from sampledata where gender='F';
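
The partition-to-directory mapping can be inspected directly, and the two inserts above can also be written as a single statement. The following is a sketch for the Hive CLI; the warehouse path follows the /usr/hive/warehouse prefix used earlier in this article and is an assumption about the local setup (a stock installation uses /user/hive/warehouse).

```sql
-- List the partitions Hive knows about; each one maps to a sub-directory
show partitions partition_table;

-- Assumed warehouse path; adjust it to your hive.metastore.warehouse.dir
dfs -ls /usr/hive/warehouse/partition_table;

-- Alternative to one insert per value: dynamic partitioning takes the
-- partition value from the last selected column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- "overwrite" replaces the contents of every partition that is written,
-- so rows inserted earlier are not duplicated
insert overwrite table partition_table partition(gender)
select sid, sname, gender from sampledata;
```

With the sample data, the directory listing is expected to contain one sub-directory per gender value, named in the form gender=M and gender=F.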

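Querying the non-partitioned internal table is less efficient than querying the partitioned table. In the plans below, the query on sampledata needs a Filter Operator that tests gender on every row, while the query on partition_table has no Filter Operator because only the gender=M partition directory is read:

Internal table: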
hive> explain select * from sampledata where gender='M';
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: sampledata
          Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (gender = 'M') (type: boolean)
            Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sid (type: int), sname (type: string), 'M' (type: string), language (type: int), math (type: int), english (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1 Data size: 90 Basic stats: COMPLETE Column stats: NONE
              ListSink

Time taken: 0.046 seconds, Fetched: 20 row(s)

Partitioned table:
hive> explain select * from partition_table where gender='M';
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: partition_table
          Statistics: Num rows: 2 Data size: 13 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: sid (type: int), sname (type: string), 'M' (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 2 Data size: 13 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.187 seconds, Fetched: 17 row(s)
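
Another way to see the effect of the partition, offered here as a sketch rather than as part of the original comparison, is to ask Hive which inputs a query depends on. This assumes a Hive version that supports explain dependency.

```sql
-- Prints a JSON document listing the input tables and input partitions
-- that the query reads
explain dependency select * from partition_table where gender='M';
```

If partition pruning applies, only the gender=M partition should be listed among the input partitions.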

=============================================================================================
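Hive Data Model_External Tables

External Table
-Points to data that already exists in HDFS; Partitions can be created on it
-Its metadata is organized in the same way as an internal table's, but the actual data is stored quite differently.
-An external table involves only one step: loading the data and creating the table happen at the same time. The data is not moved into the warehouse directory; Hive only establishes a link to the external data. Dropping an external table removes only that link.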

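1. Prepare several txt data files with the same structure and put them in the /input directory on HDFS.
2. In Hive, create an external table external_student with the same structure and set its location to the HDFS /input directory. external_student is then automatically associated with the files under /input.
3. Query the external table.
4. Delete some of the files under /input.
5. Query the external table. The data from the deleted files is gone.
6. Put the deleted files back into /input.
7. Query the external table. The data from the restored files reappears.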

(1) Prepare the data:
student1.txt
1,Tom,M,60,80,96
2,Mary,F,11,22,33
student2.txt
3,Jerry,M,90,11,23
student3.txt
4,Rose,M,78,77,76
5,Mike,F,99,98,98

# hdfs dfs -ls /
# hdfs dfs -mkdir /input

Put the files into the HDFS file system
hdfs dfs -put localFileName hdfsFileDir
# hdfs dfs -put student1.txt /input
# hdfs dfs -put student2.txt /input
# hdfs dfs -put student3.txt /input

(2) Create the external table
create external table external_student
(sid int, sname string, gender string, language int, math int, english int)
row format delimited fields terminated by ','
location '/input';

(3) Query the external table
select * from external_student;

(4) Delete student1.txt from HDFS
# hdfs dfs -rm /input/student1.txt

(5) Query the external table
select * from external_student;

(6) Put student1.txt back into the HDFS /input directory
# hdfs dfs -put student1.txt /input

(7) Query the external table
select * from external_student;
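
The statement that dropping an external table removes only the link can be checked in the same way. The sketch below uses a second, hypothetical external table named external_student_copy over the same /input directory, so that it does not interfere with the walkthrough above; the dfs command is meant for the Hive CLI.

```sql
-- Hypothetical second external table over the same directory
create external table external_student_copy
(sid int, sname string, gender string, language int, math int, english int)
row format delimited fields terminated by ','
location '/input';

-- The Table Type field should read EXTERNAL_TABLE
describe formatted external_student_copy;

-- Removes only the metadata entry
drop table external_student_copy;

-- The data files under /input should still be listed
dfs -ls /input;
```

The external keyword is what makes this safe: a table created without it but with a location clause is a managed table, and dropping it would delete the files in that directory.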

=============================================================================================
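Hive Data Model_Bucket Tables

Rows are hashed and written to different files according to the hash value, which reduces hot blocks and speeds up queries.
For example: hash on sname and store the rows in 5 buckets.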

create table bucket_table
(sid int, sname string, age int)
clustered by (sname) into 5 buckets;
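
A minimal sketch of how such a table might be filled and sampled follows. It uses a hypothetical table bucket_student fed from the sampledata table of the partition section, because bucket_table has an age column that sampledata lacks.

```sql
-- Hypothetical bucketed table with columns that exist in sampledata
create table bucket_student
(sid int, sname string)
clustered by (sname) into 5 buckets;

-- Only on Hive versions before 2.0: make inserts honor the bucket definition.
-- From 2.0 on, bucketing is always enforced and the property no longer exists.
-- set hive.enforce.bucketing=true;

-- Rows are distributed over the 5 bucket files by the hash of sname
insert overwrite table bucket_student
select sid, sname from sampledata;

-- Read only the rows that hash into the first of the 5 buckets
select * from bucket_student tablesample(bucket 1 out of 5 on sname);
```

Filling the table with insert ... select rather than load data matters here, since load data only copies files and does not redistribute rows into buckets.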

=============================================================================================
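Hive Data Model_Views
-A view is a virtual table, a purely logical concept; it can span multiple tables
-A view is built on top of existing tables; the tables a view is built on are called base tables.
-Views can simplify complex queries.

Create a view (template; replace columnList, tableName and condition)
create view viewName
as
select columnList from tableName where condition;

Show the structure of the view
desc viewName;

Query the view
select * from viewName;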
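
As a concrete instance of the template, here is a sketch based on the sampledata table from the partition section. The view name male_students is made up for this example.

```sql
-- Hypothetical view: male students with their math score
create view male_students
as
select sid, sname, math from sampledata where gender='M';

-- Inspect and query the view like a table
desc male_students;
select * from male_students;

-- Dropping a view removes only its definition; the base table is untouched
drop view male_students;
```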