Hive Database and Table Operations

LBJ_小松鼠

Last modified: 2022-08-21 20:58:42

First published: 2020-12-12 10:55:20

Article link: https://blog.csdn.net/m0_49834705/article/details/110880805

1. Hive database operations

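1: Create a database:
	create database if not exists myhive;
	Adding "if not exists" is recommended: the statement then succeeds even if the database already exists, which makes it safe to rerun from a shell script.
Explanation:
	1: Each time a database is created, Hive automatically creates a directory for it on HDFS:
	/user/hive/warehouse/myhive.db    (the database name plus .db)
	
	Note: the warehouse location is set by the following hive-site.xml property. The value shown is the built-in default, so it applies even when the property does not appear in the file.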
	<name>hive.metastore.warehouse.dir</name>
	<value>/user/hive/warehouse</value>
	
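2: View database information
	desc database myhive;
	Explanation: the output is metadata. In this setup the metastore lives in MySQL, and this metadata is stored in the DBS table of the hive database.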

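3: Drop a database:
	drop database mytest3 cascade;
	cascade forces the drop: the tables inside the database are dropped with it. Without cascade, dropping a database that still contains tables fails.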

4: Create a database at a specified HDFS location:
create database myhive2 location '/myhive2';  -- directly under the HDFS root directory
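The rule for where a database ends up on HDFS can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not Hive code; the function name `database_dir` and its parameters are made up for the example.

```python
import posixpath

# Default value of hive.metastore.warehouse.dir, as quoted above.
WAREHOUSE_DIR = "/user/hive/warehouse"

def database_dir(db_name, location=None, warehouse_dir=WAREHOUSE_DIR):
    """Return the HDFS directory that backs a database (illustrative helper)."""
    if location is not None:
        # create database ... location '...': the given path is used as-is
        return location
    # no location clause: <warehouse dir>/<database name>.db
    return posixpath.join(warehouse_dir, db_name + ".db")

print(database_dir("myhive"))               # /user/hive/warehouse/myhive.db
print(database_dir("myhive2", "/myhive2"))  # /myhive2
```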

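2. Hive table operations

2.1 Table types

Internal table (managed table)
	create table stu();            -- private: Hive owns the table's data
External table
	create external table stu();   -- shared: the data can also be used outside Hive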

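2.2 Internal table operations

1. Create a table
	create table if not exists stu(id int, name string, address string);
	Explanation: each time a table is created, Hive automatically creates a directory for it under the directory of the database it belongs to.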

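2. Insert a row
	insert into stu values(1,'toy','Hubei');
	Explanation: each insert runs as a MapReduce job and writes one small file, so inserting row by row is slow.
	Note: by default, Hive separates fields with '\001' (Ctrl-A).
3. Create a table with a specified delimiter
	create table stu4(id int, name string) row format delimited fields terminated by '\t';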
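To make the role of the field delimiter concrete, here is a minimal Python sketch of how one line of a text-format table file maps to column values. It only models the splitting step; `split_fields` is a hypothetical helper, and the sample rows are invented.

```python
def split_fields(line, field_delim="\x01"):
    """Split one text line into column values (default delimiter is '\\001')."""
    return line.rstrip("\n").split(field_delim)

# A row as it would be stored with the default '\001' delimiter
print(split_fields("1\x01toy\x01Hubei\n"))   # ['1', 'toy', 'Hubei']

# The same idea for a table declared with: fields terminated by '\t'
print(split_fields("1\tzhangsan\n", "\t"))   # ['1', 'zhangsan']
```

If a file is read with the wrong delimiter, the whole line lands in the first column, which is why the delimiter in the table definition has to match the data files.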

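4. Create a table with a specified storage format and location
	create table stu5(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/stu5';

Explanation:
	stored as textfile   stores the table's files as plain text; this is the default and can be omitted
	location '/stu5'     sets the HDFS directory that holds the table's files; logically the table still belongs to the database it was created in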

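5: Create a table that copies both the structure and the data of another table
 	create table stu1 as select * from stu;

6: Create a table that copies only the structure
	create table stu4 like stu2;

7: View detailed table information
	desc formatted table_name;

8: Drop a table
	drop table table_name;

Summary:
Dropping an internal table deletes both its metadata and its data.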

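2.3 External table operations

Adding the external keyword when creating a table makes it an external table. Its files are stored in the HDFS directory given by location. Any new file added to that directory is picked up by the table as well, provided its format matches the table definition. Dropping an external table removes only the metadata; the data stays on HDFS and is not deleted.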
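The difference between dropping a managed table and dropping an external table can be modelled with a toy metastore. The sketch below uses a local temporary directory in place of HDFS, and the class `ToyMetastore` is invented for the illustration; it is not how Hive is implemented.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Toy model: table name -> (data directory, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # metadata is always removed
        if not external:
            shutil.rmtree(location)                 # managed table: data removed too

root = tempfile.mkdtemp()          # stands in for HDFS
managed_dir = os.path.join(root, "stu")
external_dir = os.path.join(root, "teacher")

metastore = ToyMetastore()
metastore.create_table("stu", managed_dir)
metastore.create_table("teacher", external_dir, external=True)
metastore.drop_table("stu")
metastore.drop_table("teacher")

print(os.path.exists(managed_dir))   # False: data deleted with the managed table
print(os.path.exists(external_dir))  # True: external data left in place
```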

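1. The load command for loading data:
load data [local] inpath '/export/data/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

Explanation:
	  load data          loads data into a table
	  with local         the path refers to the local Linux file system
	  without local      the path refers to HDFS (the more common case in practice)

	  inpath             the path of the data to load

	  with overwrite     replaces the data already in the table
	  without overwrite  appends to the table

	  into table name    the table to load into
	  partition          loads into the specified partition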

2.3.1 External table examples

1. Create the teacher table
create external table teacher(t_id string, t_name string) row format delimited fields terminated by '\t';

2. Create the student table
create external table student(s_id string, s_name string, s_birth string, s_sex string) row format delimited fields terminated by '\t';

3. Load data into a table from the local file system
load data local inpath '/export/data/hivedatas/student.txt' into table student;  -- a copy: the local file stays in place

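Load data into a table from HDFS:
	This is really just a file move.
	The data has to be uploaded to HDFS first:
	hadoop fs -mkdir -p /hivedatas              # create the directory on HDFS
	cd /export/data/hivedatas
	hadoop fs -put teacher.csv /hivedatas/      # upload the file from local Linux to HDFS
Finally, load the data from within Hive (the path is a file path on HDFS):
load data inpath '/hivedatas/teacher.csv' into table teacher;  -- effectively a cut (move); the more common case in practice

4. Load data and overwrite the existing data
load data local inpath '/export/data/hivedatas/student.txt' overwrite into table student;  -- the path is a local file path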
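The copy, move and overwrite behaviour of load can be summarised in a small Python model. It is a sketch under simplifying assumptions: a temporary local directory plays the part of both file systems, the table directory holds only plain files, and name collisions are ignored. The function `load_data` and all file names are made up for the example.

```python
import os
import shutil
import tempfile

def load_data(src, table_dir, local=False, overwrite=False):
    """Toy model of: load data [local] inpath src [overwrite] into table ..."""
    os.makedirs(table_dir, exist_ok=True)
    if overwrite:
        # overwrite: remove the files already in the table directory
        for name in os.listdir(table_dir):
            os.remove(os.path.join(table_dir, name))
    dest = os.path.join(table_dir, os.path.basename(src))
    if local:
        shutil.copy(src, dest)   # local source: copied, the original stays
    else:
        shutil.move(src, dest)   # HDFS source: moved, the original is gone
    return dest

root = tempfile.mkdtemp()
table_dir = os.path.join(root, "warehouse", "student")
local_file = os.path.join(root, "student.txt")
hdfs_file = os.path.join(root, "student_on_hdfs.txt")
for path in (local_file, hdfs_file):
    with open(path, "w") as f:
        f.write("01\tzhangsan\n")

load_data(local_file, table_dir, local=True)   # append by copying
load_data(hdfs_file, table_dir)                # append by moving
print(os.path.exists(local_file), os.path.exists(hdfs_file))  # True False
print(sorted(os.listdir(table_dir)))  # ['student.txt', 'student_on_hdfs.txt']
```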

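2.4 Complex type operations

1: array type; an array holds elements of the same type

Source data:
Note: name and work_city are separated by a tab (\t); the elements of work_city are separated by commas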
zhangsan	  beijing,shanghai,tianjin,hangzhou
wangwu   	changchun,chengdu,wuhan,beijing

Create the table:
create external table hive_array(name string, work_city array<string>)
row format delimited fields terminated by '\t'
collection items terminated by  ',';

Load the data (from HDFS)
load data  inpath '/hivedatas/work_city.txt' overwrite into table hive_array;

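Common queries:
-- select all rows
select * from hive_array;
-- select the first element of the work_city array
select name, work_city[0] as city from hive_array;
-- select the number of elements in the work_city array
select name, size(work_city) as city_size from hive_array;
-- select the rows whose work_city array contains tianjin
select * from hive_array where array_contains(work_city,'tianjin');

-- array_contains is a built-in function that tests whether an array contains a given value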
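As a rough mental model, the two delimiters in the table definition split a line first into fields and then into array elements. The Python sketch below mirrors that and shows the list operations that correspond to the queries above; `parse_array_row` is a hypothetical helper, not part of Hive.

```python
def parse_array_row(line, field_delim="\t", item_delim=","):
    """Split a line into (name, work_city list) using the table's two delimiters."""
    name, cities = line.rstrip("\n").split(field_delim)
    return name, cities.split(item_delim)

name, work_city = parse_array_row("zhangsan\tbeijing,shanghai,tianjin,hangzhou")

print(work_city[0])             # beijing  (like work_city[0])
print(len(work_city))           # 4        (like size(work_city))
print("tianjin" in work_city)   # True     (like array_contains(work_city, 'tianjin'))
```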

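2: map type; a map holds key-value data

Source data:
Note: fields are separated by ","; entries within the map field are separated by "#"; each key and its value are separated by ":"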
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

Create the table
create table hive_map(
id int, name string, members map<string,string>, age int
)
row format delimited
fields terminated by ','
collection items terminated by  '#' 
map keys terminated by  ':'; 

Load the data from HDFS into Hive:
load data inpath '/hivedatas/hive_map.txt' overwrite into table hive_map;

Common queries
select * from hive_map;
-- look up values by key
select id, name, members['father'] as father, members['mother'] as mother, age from hive_map;
-- get all keys
select id, name, map_keys(members) as relation from hive_map;
-- get all values
select id, name, map_values(members) as relation from hive_map;
-- get the number of key-value pairs
select id, name, size(members) as num from hive_map;
-- select the rows whose map contains a given key
select * from hive_map where array_contains(map_keys(members), 'brother');
-- select the rows that contain the key brother, together with the value stored under it
select id, name, members['brother'] as brother from hive_map where array_contains(map_keys(members), 'brother');
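The three delimiters of the map table can be pictured as three nested splits. The following Python sketch parses two of the sample rows into a dict and shows rough equivalents of the map queries; `parse_map_row` is an illustrative helper and not part of Hive.

```python
def parse_map_row(line, field_delim=",", item_delim="#", key_delim=":"):
    """Parse one line into (id, name, members dict, age)."""
    id_, name, members, age = line.rstrip("\n").split(field_delim)
    pairs = (item.split(key_delim, 1) for item in members.split(item_delim))
    return int(id_), name, dict(pairs), int(age)

row1 = parse_map_row("1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28")
row4 = parse_map_row("4,mayun,father:mayongzhen#mother:angelababy,26")

members = row1[2]
print(members["father"])        # xiaoming  (like members['father'])
print(list(members))            # ['father', 'mother', 'brother']  (like map_keys)
print(len(members))             # 3  (like size(members))
print("brother" in row4[2])     # False: this row would be filtered out
print(row4[2].get("brother"))   # None, similar to the NULL Hive returns for a missing key
```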

3: struct type

Source data:
Note: fields are separated by "#"; the members of the second field are separated by ":"
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

Create the table
create table hive_struct(
ip string, info struct<name:string, age:int>
)
row format delimited
fields terminated by '#'
collection items terminated by ':';

load data inpath '/hivedatas/hive_struct.txt' overwrite into table hive_struct;  -- the path is on HDFS

Common queries
select * from hive_struct;
-- get the value of a specific struct member
select ip, info.name from hive_struct;
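A struct differs from a map in that its member names and types are fixed by the table definition, so the data file carries only the values in order. A minimal Python sketch of that idea, using a named tuple (the names `Info` and `parse_struct_row` are invented for the example):

```python
from collections import namedtuple

# Mirrors struct<name:string, age:int>; member names come from the schema, not the data.
Info = namedtuple("Info", ["name", "age"])

def parse_struct_row(line, field_delim="#", item_delim=":"):
    """Parse one line into (ip, Info(name, age))."""
    ip, info = line.rstrip("\n").split(field_delim)
    name, age = info.split(item_delim)
    return ip, Info(name, int(age))

ip, info = parse_struct_row("192.168.1.1#zhangsan:40")
print(ip, info.name, info.age)   # 192.168.1.1 zhangsan 40
```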

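3 Converting between internal and external tables (tblproperties)

1. Check the table type
desc formatted student;    -- detailed view
	Table Type:            EXTERNAL_TABLE
desc table_name;           -- brief view

2. Change the external table student to an internal table
alter table student set tblproperties('EXTERNAL'='FALSE');

3. Change the internal table student to an external table
alter table student set tblproperties('EXTERNAL'='TRUE');
Note: 'EXTERNAL', 'TRUE' and 'FALSE' must be written entirely in upper case.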

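4 Partitioned tables (partitioned, partition)

Partitioning is not a separate table model; it is combined with an internal or an external table:
internal partitioned table
external partitioned table

In Hive, a partition is simply a subdirectory.

1. Create a table (single partition column)
create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Explanation: partitioned by is the fixed keyword that declares the partitioning
	month string is the partition column. Strictly speaking it is not a regular column, because its value is not stored in the data files; the name month is arbitrary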

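2. Create a table with multiple partition columns
create table score2 (s_id string, c_id string, s_score int) partitioned by (year string, month string, day string) row format delimited fields terminated by '\t';
Note: with multiple partition columns, the partitions appear on HDFS as nested directory levels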
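The nested directory layout follows a simple pattern: one column=value directory level per partition column, in the order the columns were declared. The Python sketch below builds such a path; `partition_dir` is a hypothetical helper, and the table directory and partition values are example assumptions.

```python
import posixpath

def partition_dir(table_dir, partition_spec):
    """Build the directory of one partition from ordered (column, value) pairs."""
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir, *levels)

# Single partition column, as in the score table
print(partition_dir("/user/hive/warehouse/score", [("month", "202008")]))
# /user/hive/warehouse/score/month=202008

# Three partition columns, as in the score2 table
print(partition_dir("/user/hive/warehouse/score2",
                    [("year", "2020"), ("month", "08"), ("day", "15")]))
# /user/hive/warehouse/score2/year=2020/month=08/day=15
```

A query that filters on a partition column only needs to read the matching directories, which is the main benefit of partitioning.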


View the partitions
show partitions score;
Add a partition
alter table score add partition(month='202008');

Add several partitions at once
alter table score add partition(month='202009') partition(month = '202010');

Drop a partition
alter table score drop partition(month = '202010');

Query several partitions together with union all
select * from score where month = '202006' union all select * from score where month = '202007';

Empty a table; truncate works only on managed (internal) tables
truncate table score4;

5 Loading data into Hive tables

1. Insert rows directly into a partitioned table
Load data with insert into
create table score3 like score;
insert into table score3 partition(month ='202007') values ('001','002',100);

2. Load data from a query
create table score4 like score;
insert overwrite table score4 partition(month = '202006') select s_id,c_id,s_score from score;   -- overwrite replaces whatever the partition already holds; write insert into instead to append

3. Load data with load
create table score5 like score;
load data local inpath '/export/data/hivedatas/score.csv' overwrite into table score5 partition(month='202006');

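6 Bucketed tables (clustered by(c_id) into 3 buckets)

Bucketing splits a table's data into separate files, in the same way MapReduce partitions its output. Data cannot be loaded into a bucketed table directly; it has to go through an intermediate table.

1. Enable bucketing in Hive (if this command reports an error, this Hive version already enforces bucketing automatically, so go straight to the next step)
set hive.enforce.bucketing=true;

2. Set the number of reducers to match the number of buckets
set mapreduce.job.reduces=3;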

3. Create the bucketed table
create table course (c_id string, c_name string, t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

Loading data into the bucketed table: neither hadoop fs -put nor load data distributes the rows into buckets, so the data has to be loaded with insert overwrite.

Create an ordinary table, then use insert overwrite with a query to load its data into the bucketed table.

Create the ordinary table:
create table course_common (c_id string, c_name string, t_id string) row format delimited fields terminated by '\t';

Load data into the ordinary table
load data local inpath '/export/data/hivedatas/course.csv' into table course_common;

Load data into the bucketed table with insert overwrite
insert overwrite table course select * from course_common cluster by(c_id);
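Conceptually, a row is assigned to a bucket by hashing the clustered by column and taking the result modulo the number of buckets. The Python sketch below illustrates that idea with a Java-style string hash as a stand-in; the hash Hive actually uses depends on the Hive version and column type, so treat the function names, the hash and the sample rows as assumptions made for the illustration.

```python
def java_style_hash(text):
    """Stand-in hash (same recurrence as Java's String.hashCode, kept to 32 bits)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_for(key, num_buckets=3):
    """Bucket index = non-negative hash of the key, modulo the bucket count."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_buckets

def split_into_buckets(rows, key_index=0, num_buckets=3):
    """Group rows by bucket, the way each bucket ends up in its own file."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets

# Invented sample rows in the shape (c_id, c_name, t_id)
rows = [("01", "chinese", "02"), ("02", "math", "01"), ("03", "english", "03")]
for index, bucket in enumerate(split_into_buckets(rows)):
    print(index, bucket)
```

Because the bucket depends only on the key, rows with the same c_id always land in the same file, which is what makes bucketed joins and sampling possible.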
