Tables in Hive and How to Create Them

Latest recommended article published on 2021-11-03 10:39:52

不会秃头的小白菜

Article link: https://blog.csdn.net/qq_45536740/article/details/103947174

Managed Tables

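An internal table is also called a MANAGED_TABLE. By default its data is stored under /user/hive/warehouse, although a different path can be given with LOCATION. Dropping the table deletes both the table data and the metadata.

create table if not exists t_user(
    id int,
    name string,
    sex boolean,
    age int,
    salary double,
    hobbies array<string>,                       -- element type assumed to be string
    card map<string,string>,
    address struct<country:string,city:string>
)
row format delimited
fields terminated by ','               -- separates columns
collection items terminated by '|'     -- separates array elements, map entries and struct fields
map keys terminated by '>'             -- separates a map key from its value
lines terminated by '\n'
stored as textfile;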
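
To make the delimiter settings concrete, here is a minimal sketch of how the complex columns could be read back. The sample data line in the comment is hypothetical (it is not from the original article) and only illustrates how one row might look under the delimiters declared above.

```sql
-- Hypothetical line in the table's text file (one row, 8 comma-separated fields):
-- 1,zhangsan,true,18,15000,TV|GAME,001>CCB|002>ICBC,china|bj

-- Check the table type and storage location recorded in the metastore.
DESCRIBE FORMATTED t_user;

-- Access the complex columns: array by index, map by key, struct by field name.
SELECT name,
       hobbies[0]    AS first_hobby,   -- first array element
       size(hobbies) AS hobby_count,   -- number of array elements
       card['001']   AS card_001,      -- map value for key '001'
       address.city  AS city           -- struct field
FROM t_user;
```

Array elements, map entries and struct fields all share the "collection items" delimiter, which is why the hypothetical address field is written as china|bj.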

External Tables

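An external table is called an EXTERNAL_TABLE. Its directory can be specified at creation time with LOCATION. Dropping the table deletes only the metadata; the table data is left in place.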

create external table if not exists t_user1(
    id int,
    name string,
    sex boolean,
    age int,
    salary double,
    hobbies array<string>,                       -- element type assumed to be string
    card map<string,string>,
    address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile
location 'hdfs:///hive_db/t_user';     -- data directory managed outside Hive
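
The difference from a managed table shows up when the table is dropped. The following is a rough sketch of how that could be checked, assuming t_user1 was created as above and that some data files already sit in its location.

```sql
-- The "Table Type" field in the output should read EXTERNAL_TABLE.
DESCRIBE FORMATTED t_user1;

-- Removes only the metastore entry for the table.
DROP TABLE t_user1;

-- The data files should still be present afterwards; from a shell this can be
-- checked with: hdfs dfs -ls /hive_db/t_user
```

Re-running the create external table statement over the same location should make the existing files queryable again, since nothing on HDFS was deleted.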

Partitioned Tables

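A Hive table corresponds to a directory on HDFS. By default a query scans the whole table, which is costly in both time and resources. A partition is a subdirectory of the table directory, and rows are stored in the subdirectory of the partition they belong to. If the WHERE clause of a query contains a condition on the partition column, Hive reads only the matching partition directories instead of scanning the entire table directory, so a sensible partition design can greatly improve query speed.

Partitioned tables are not unique to Hive; the concept is very common. In Oracle, for example, queries slow down as a table keeps growing, and the table can be partitioned in response. After partitioning, the table is still logically one complete table, but its data is spread across multiple tablespaces (separate physical files), so a query no longer has to scan the whole table every time.

In Hive, a partitioned table is created with the PARTITIONED BY clause. A table can have one or more partition columns, and Hive creates a separate data directory for each distinct combination of partition column values.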

CREATE EXTERNAL TABLE t_employee(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LOCATION '/hive/t_employee';

7369	SMITH	CLERK	7902	1980-12-17 00:00:00	800.00
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00	1600.00	300.00
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00	1250.00	500.00
7566	JONES	MANAGER	7839	1981-04-02 00:00:00	2975.00
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00	1250.00	1400.00
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00	2850.00
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00	2450.00
7788	SCOTT	ANALYST	7566	1987-04-19 00:00:00	1500.00
7839	KING	PRESIDENT		1981-11-17 00:00:00	5000.00
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00	1500.00	0.00
7876	ADAMS	CLERK	7788	1987-05-23 00:00:00	1100.00
7900	JAMES	CLERK	7698	1981-12-03 00:00:00	950.00
7902	FORD	ANALYST	7566	1981-12-03 00:00:00	3000.00
7934	MILLER	CLERK	7782	1982-01-23 00:00:00	1300.00

0: jdbc:hive2://CentOS:10000> load data local inpath '/root/baizhi/t_employee' overwrite into table t_employee partition(deptno=10);
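
After the load, the partition should be visible both in the metastore and as a filter in queries. A minimal sketch, assuming the t_employee table and the deptno=10 partition loaded above; the directory name in the comment follows Hive's usual column=value naming.

```sql
-- List the partitions Hive knows about for this table.
SHOW PARTITIONS t_employee;

-- Filtering on the partition column lets Hive read only
-- /hive/t_employee/deptno=10 rather than every partition directory.
SELECT id, name, job, salary
FROM t_employee
WHERE deptno = 10;

-- Print the query plan to inspect how the deptno filter is handled.
EXPLAIN SELECT id, name FROM t_employee WHERE deptno = 10;
```

Note that deptno does not appear in the data file itself. Its value comes from the partition(deptno=10) clause of the load statement.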

Bucketed Tables

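Partitioning isolates files coarsely, with one directory per partition. Bucketing is finer grained: Hive hashes a chosen column to decide which bucket each row belongs to, and can then sort the rows inside each bucket.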
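
The hash-then-modulo idea can be previewed with Hive's built-in hash and pmod functions. This is only an approximation, assuming 6 buckets keyed on id and the t_employee table from the previous section; the hash function Hive uses internally for bucketing depends on the Hive version, so the real bucket assignment may differ.

```sql
-- Approximate bucket number for each row: hash the key, then take it modulo
-- the number of buckets. pmod keeps the result non-negative.
SELECT id,
       name,
       pmod(hash(id), 6) AS bucket_no
FROM t_employee
WHERE deptno = 10;
```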

CREATE EXTERNAL TABLE t_employee_bucket(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT)
    CLUSTERED BY(id) SORTED BY(salary ASC) INTO 6 BUCKETS  
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LOCATION '/hive/employee_bucket';

0: jdbc:hive2://CentOS:10000> set hive.enforce.bucketing = true;
No rows affected (0.024 seconds)
0: jdbc:hive2://CentOS:10000> INSERT INTO TABLE t_employee_bucket(id,name,job,manager,hiredate,salary,deptno) SELECT id,name,job,manager,hiredate,salary,deptno  FROM t_employee where deptno=10;
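
Once the table is populated, the bucketing can be put to use for sampling. A minimal sketch, assuming the insert above succeeded. In Hive 2.x and later the hive.enforce.bucketing setting should no longer be needed, since bucketing is enforced by default there.

```sql
-- Read only the rows that were hashed into the first of the 6 buckets.
SELECT id, name, salary
FROM t_employee_bucket TABLESAMPLE(BUCKET 1 OUT OF 6 ON id);

-- Confirm the bucket count, bucket column and sort column stored in the metastore.
DESCRIBE FORMATTED t_employee_bucket;
```

Because the sample is taken on the same column the table is clustered by, Hive can in principle read a single bucket file instead of scanning the whole table.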
