Hive Database and Table Operations (Part 2)


蜗牛杨哥


Article link: https://blog.csdn.net/u014635374/article/details/105055391


Creating an external table

create external table testdb.emp2(

id int,

name string)

row format delimited fields

terminated by '\t' location '/input/hive';

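When an external table is created with the LOCATION keyword, the table is created at the specified HDFS location.

Next, create emp.txt in the local directory /home/hadoop and load it into table emp2. The contents of emp.txt are as follows (fields are separated by tabs):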

1 xiaoming

2 zhangsan

3 wangqiang

4 laogao

5 yanghong

hive (testdb)> load data local inpath '/home/hadoop/emp.txt' into table testdb.emp2;

Loading data to table testdb.emp2

After loading, check the data:

hive (testdb)> select * from emp2;

emp2.id emp2.name

1 xiaoming

2 zhangsan

3 wangqiang

4 laogao

5 yanghong

Dropping an external table does not delete the actual data; only the metadata is removed.

hive (testdb)> drop table testdb.emp2;

Then inspect the HDFS directory /input/hive. The data is still there.

[hadoop@centoshadoop1 ~]$ hdfs dfs -ls -R /input/hive

-rwxr-xr-x 3 hadoop supergroup 54 2020-03-23 13:59 /input/hive/emp.txt

The LOCATION keyword can also be used when creating an external table to associate the table with data that already exists in HDFS.

Run the following command to create table emp3 in database testdb, with its HDFS storage directory set to /input/hive (this directory already contains the data file emp.txt):

create external table testdb.emp3(

id int,

name string)

row format delimited fields

terminated by '\t' location '/input/hive';

Run the following command to query all rows of emp3. The table is now associated with the data file emp.txt:

hive (testdb)> select * from emp3;

emp3.id emp3.name

1 xiaoming

2 zhangsan

3 wangqiang

4 laogao

5 yanghong

Differences between Hive internal and external tables

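Operation	Managed table (internal table)	External table
CREATE/LOAD	Data is copied or moved into the data warehouse directory	The table is associated with external data at creation time, or its data is stored in an external directory (it can also be stored in the data warehouse directory, but this is uncommon)
DROP	Metadata and actual data are deleted together	Only the metadata is deleted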

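Note: in practice, external tables are usually created outside the data warehouse path, so the LOCATION keyword is commonly specified when creating them. In most cases internal and external tables behave much the same (except when the table is dropped). As a rule of thumb, use an internal table if all data processing is done by Hive, and use an external table if the same dataset is processed both by Hive and by other tools.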
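The DROP row of the table above can be illustrated with a small Python sketch. This is only a toy model, not Hive's implementation: the metastore is a plain dict, HDFS is a set of paths, and the /warehouse/testdb.db/emp location is a hypothetical path chosen for the example.

```python
def drop_table(metastore, files, name):
    """Toy model of DROP TABLE for managed vs. external tables."""
    table = metastore.pop(name)  # the metadata is always removed
    if not table["external"]:
        # Managed table: the data files under its location are removed too.
        prefix = table["location"] + "/"
        for path in [p for p in files if p.startswith(prefix)]:
            files.remove(path)
    return files

# Hypothetical state: one managed table and one external table.
metastore = {
    "emp": {"external": False, "location": "/warehouse/testdb.db/emp"},
    "emp2": {"external": True, "location": "/input/hive"},
}
files = {"/warehouse/testdb.db/emp/emp.txt", "/input/hive/emp.txt"}

drop_table(metastore, files, "emp")
drop_table(metastore, files, "emp2")
print(sorted(files))
```

After both drops the metastore is empty, but the file under /input/hive survives, which matches what the hdfs dfs -ls output showed for emp2.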

Partitioned tables

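Hive can partition a table with the PARTITIONED BY keyword. A table can be split into multiple partitions according to the values of a column, and each partition corresponds to one directory in the data warehouse (in effect, the table data is stored in separate directories by column value). At query time, Hive uses the WHERE condition to read only the specified partitions instead of scanning the whole table, which speeds up the query. In HDFS, a partition is really just a subdirectory nested under the table directory.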

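Partitions in Hive are comparable to indexes in a relational database. Take a very large student table, student, from which all rows with age equal to 18 must be retrieved. In a relational database you would build an index on the age column; at query time the database first finds the matching entries in the index and then fetches the full rows they point to. Without an index, it scans every row and compares that row's age field. In Hive, you can declare age as a partition column when creating the table. When data is then added or loaded into student, a partition value (the partition the data belongs to) must be specified. For example, with the partition value age=18, a subdirectory named age=18 is created under the table directory and all rows for age 18 are stored there. A query for age=18 then reads only that partition.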

The directory structure of student after partitioning is shown in the figure below:
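As a rough illustration of that layout, the following Python sketch builds Hive-style partition paths and answers a query by reading a single directory. The rows and the names lisi and wangwu are made up for the example, and the in-memory dict only stands in for HDFS.

```python
from collections import defaultdict

def partition_dir(table_dir, spec):
    """Build a Hive-style partition path, e.g. student/age=18."""
    return "/".join([table_dir] + [f"{col}={val}" for col, val in spec])

# Made-up rows: (id, name, age)
rows = [(1, "zhangsan", 18), (2, "lisi", 20), (3, "wangwu", 18)]

# "Storage": one list of (id, name) tuples per partition directory.
layout = defaultdict(list)
for sid, name, age in rows:
    layout[partition_dir("student", [("age", age)])].append((sid, name))

def query_by_age(age):
    """Read only the directory for the requested age; other partitions are not touched."""
    return layout.get(partition_dir("student", [("age", age)]), [])

print(sorted(layout))
print(query_by_age(18))
```

Note that the age value is stored only in the directory name, not in the rows themselves.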

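Hive also supports specifying several partition columns when a table is created. For example, to make both the age column and the gender column of the student table partition columns, run: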

create table testdb.student(

id int,

name string)

partitioned by (age int,gender string)

row format delimited fields terminated by '\t';

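Note that the regular column list in the CREATE TABLE statement must not include the partition columns; they are declared separately afterwards with the partitioned by keyword. Hive places the partition columns after the regular columns.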

The table structure looks like this:

hive (testdb)> desc student;

col_name data_type comment

id int

name string

age int

gender string

Create a test dataset

student_data.txt

1 zhangsan male

2 yanghong male

3 xiaoming female

4 xiaomei female

5 dawei male

hive (testdb)> load data local inpath '/home/hadoop/student_data.txt' into table testdb.student partition(age=17,gender='male');

Loading data to table testdb.student partition (age=17, gender=male)

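Note:

The columns in the data file must appear in the same order as the columns declared when the table was created, and the file does not need to contain values for age and gender; Hive fills in the partition columns itself. The partition values must be specified when loading. If the data file contains extra columns beyond the table's regular columns (as student_data.txt does with its third column), their values are ignored whatever they are, and the partition columns take the specified partition values, in this example 17 and male.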
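The behaviour described in this note can be sketched in Python as follows. This is a simplified model, not Hive's reader: it keeps all fields as strings, and the function name read_partition is made up for the example.

```python
def read_partition(lines, n_columns, partition_values):
    """Model how rows of a statically loaded partition are presented to a query."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fields = fields[:n_columns]                    # extra fields are ignored
        fields += [None] * (n_columns - len(fields))   # missing fields read as NULL
        # Partition columns come from the partition spec, not from the file.
        rows.append(tuple(fields) + tuple(partition_values))
    return rows

data = ["1\tzhangsan\tmale", "3\txiaoming\tfemale"]
rows = read_partition(data, n_columns=2, partition_values=(17, "male"))
for row in rows:
    print(row)
```

The second row comes back with gender male even though the file says female, because the table has only two regular columns and the partition was loaded as gender='male'.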

Check how the data is stored:

[hadoop@centoshadoop1 ~]$ hadoop fs -ls -R /home/hadoop/hive/data

/home/hadoop/hive/data/testdb.db/student

/home/hadoop/hive/data/testdb.db/student/age=17

/home/hadoop/hive/data/testdb.db/student/age=17/gender=male

/home/hadoop/hive/data/testdb.db/student/age=17/gender=male/student_data.txt

Likewise, put the data for students aged 20 into the file student_data2.txt and load it into the student table:

7 zhangsan2 male

8 yanghong4 male

9 xiaoming12 female

10 xiaomei1 female

11 dawei12 male

12 laoban4 female

hive (testdb)> load data local inpath '/home/hadoop/student_data2.txt' into table testdb.student partition(age=20,gender='female');

Loading data to table testdb.student partition (age=20, gender=female)

Querying the data loaded into the partitioned table:

hive (testdb)> select id,name,age from testdb.student where age = 17;

id name age

1 zhangsan 17

2 yanghong 17

3 xiaoming 17

4 xiaomei 17

5 dawei 17

6 laoban 17

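With this query, Hive does not scan the files in partition age=20. The partition column values in the result are read from the partition directory names, so the partition columns do not exist in the data files.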

The union keyword can also be used to query several partitions together:

select * from student where age =17

union

select * from student where age =20;

After this command is run, Hive starts a MapReduce job to read and merge the data:

Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 10.18 sec HDFS Read: 17031 HDFS Write: 470 SUCCESS

Total MapReduce CPU Time Spent: 10 seconds 180 msec

_u1.id _u1.name _u1.age _u1.gender

1 zhangsan 17 male

2 yanghong 17 male

3 xiaoming 17 male

4 xiaomei 17 male

5 dawei 17 male

6 laoban 17 male

7 zhangsan2 20 female

8 yanghong4 20 female

9 xiaoming12 20 female

10 xiaomei1 20 female

11 dawei12 20 female

12 laoban4 20 female

Adding partitions

The alter table statement can add a partition directory under the existing partition columns. For example, to add a partition with age=21 to the student table, run:

hive (testdb)> alter table testdb.student add partition(age=21,gender='male');

[hadoop@centoshadoop1 ~]$ hadoop fs -ls -R /home/hadoop/hive/data

/home/hadoop/hive/data/testdb.db/student/age=21

/home/hadoop/hive/data/testdb.db/student/age=21/gender=male

To add several partitions at once, run:

alter table testdb.student add partition(age=30,gender='male') partition(age=50,gender='male');

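Note that the commands above only add one or more partition directories under the existing partition columns; they do not add new partition columns.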

Dropping a partition

alter table testdb.student drop partition (age=21);

Dropping several partitions at once

hive (testdb)> alter table testdb.student drop partition (age=30) ,partition (age = 50);

Dropped the partition age=30/gender=male

Dropped the partition age=50/gender=male

Viewing partitions:

To list all partitions of a table, use show partitions testdb.student;

hive (testdb)> show partitions testdb.student;

partition

age=17/gender=male

age=20/gender=female

Dynamic partitioning

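When data is loaded into a table with the LOAD keyword, static partitioning is used by default (that is, the partition values must be specified, and Hive creates the partition directory from them).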
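To make the contrast concrete, here is a small Python sketch of the two modes. It is a conceptual model only: with static partitioning the caller names one partition for the whole batch, while with dynamic partitioning the partition is derived from each row's own age and gender values. The function names and the sample rows are made up for the example.

```python
from collections import defaultdict

def static_load(rows, age, gender):
    """Static: the caller fixes the partition; every row lands in it."""
    return {f"age={age}/gender={gender}": [(sid, name) for sid, name in rows]}

def dynamic_insert(rows):
    """Dynamic: the partition is derived from each row's trailing (age, gender) values."""
    partitions = defaultdict(list)
    for sid, name, age, gender in rows:
        partitions[f"age={age}/gender={gender}"].append((sid, name))
    return dict(partitions)

static = static_load([(1, "zhangsan"), (3, "xiaoming")], age=17, gender="male")
dynamic = dynamic_insert([
    (1, "zhangsan", 17, "male"),
    (3, "xiaoming", 17, "female"),
    (7, "zhangsan2", 20, "male"),
])
print(static)
print(dynamic)
```

In the static case one directory receives everything; in the dynamic case three directories are created, one per distinct (age, gender) pair in the data.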

Bucketed tables

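In Hive, a table or a partition can be further divided into buckets. Buckets split the data at a finer granularity to make queries more efficient. In terms of storage, buckets differ from partitions: a partition is stored as a directory that contains data files, whereas a bucket is stored as a single file that contains the data.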

Buckets can be created directly on an ordinary table or inside a partitioned table. The storage model is shown in the figure below:

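When creating the table you specify the bucketing column and the number of buckets. When data is added, Hive hashes the value of the bucketing column, divides the result by the number of buckets, and takes the remainder. Rows are then assigned to buckets by that remainder (buckets are numbered from 0: rows with remainder 0 go to bucket 0, rows with remainder 1 go to bucket 1, and so on). This spreads the data across the buckets as evenly as possible. It also means that rows with the same value in the bucketing column always land in the same bucket. The scheme is similar to the way MapReduce sends records with the same key to the same reducer.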
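The hash-and-remainder rule can be sketched in a few lines of Python. One simplifying assumption is made here: the hash of an integer is taken to be the integer itself. The real hash function depends on the Hive version and the column type, so treat this as an illustration of the idea rather than a way to predict actual bucket files.

```python
def bucket_id(uid, num_buckets):
    """Assign a row to a bucket: hash the bucketing value, then take the remainder.

    Assumption: the hash of an int is the int itself (a simplification).
    The mask keeps the hash non-negative before the modulo.
    """
    return (uid & 0x7FFFFFFF) % num_buckets

num_buckets = 3
uids = [1, 2, 3, 4, 5, 6, 7]

# Bucket number -> values that land in it.
buckets = {b: [] for b in range(num_buckets)}
for uid in uids:
    buckets[bucket_id(uid, num_buckets)].append(uid)

for b, members in buckets.items():
    print(b, members)
```

Equal values always produce the same remainder, which is why rows with the same bucketing value end up in the same bucket.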

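So how does bucketing actually improve query efficiency? Consider two Hive tables, a user table and an order table, that need to be joined. If both tables are very large, the join takes a long time, and this is where bucketing the join column helps. As the figure below shows, the uid column of the order table is bucketed into 3 buckets, so all rows with the same uid are placed in the same bucket. When the tables are joined again, Hive can use uid to locate the bucket that holds the data and search only that bucket instead of scanning the whole table, which greatly improves query efficiency.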
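The idea can be sketched in Python as a bucket-by-bucket join. The sketch assumes that both tables are bucketed on uid with the same number of buckets and that the bucket number is simply uid modulo the bucket count; the order rows are made up for the example, and this is not how Hive executes the join internally.

```python
def bucketize(rows, key_index, num_buckets):
    """Split rows into buckets by (key % num_buckets); the key is assumed to be a non-negative int."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key_index] % num_buckets].append(row)
    return buckets

def bucket_join(users, orders, num_buckets=3):
    """Join users (uid, name) with orders (order_id, uid, amount) one bucket pair at a time."""
    user_buckets = bucketize(users, 0, num_buckets)
    order_buckets = bucketize(orders, 1, num_buckets)
    result = []
    for user_bucket, order_bucket in zip(user_buckets, order_buckets):
        # Only users from the matching bucket are considered for these orders.
        names = {uid: name for uid, name in user_bucket}
        for order_id, uid, amount in order_bucket:
            if uid in names:
                result.append((order_id, uid, names[uid], amount))
    return result

users = [(1, "xiaoming"), (2, "zhangsan"), (3, "wangqiang")]
orders = [(101, 1, 30), (102, 3, 15), (103, 1, 20), (104, 2, 50)]
joined = sorted(bucket_join(users, orders))
for row in joined:
    print(row)
```

Because matching uid values always fall in buckets with the same number, each order bucket only has to be compared against one user bucket rather than the entire user table.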