Basic Hive Operations

gzj_1101

Category: Big Data

Article link: https://blog.csdn.net/gzj_1101/article/details/101000724

title: Basic Hive Operations
date: 2019-09-17 21:57:11
tags: [Hive, basic operations]
categories:
- Big Data
- Hive programming

Future posts will mainly be published at **https://geroge-gao.github.io/**; you are welcome to visit.

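Create, drop, alter, and query operations on tables

Creating a table

With if not exists, the statement is skipped when the table already exists. comment attaches a description to a column or to the table itself.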

create table if not exists mydb.employees(
	name string comment 'Employee name',
	salary float comment 'Employee salary',
	subordinates array<string> comment 'Names of subordinates',
	deduction map<string, float>,
	address struct<street:string, city:string, state:string, zip:int> comment 'Home address')
comment 'descriptions of table'
location '/user/hive/warehouse/mydb.db/employees';
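Since this table mixes array, map, and struct columns, a short query may help show how each one is read. This is a minimal sketch against the mydb.employees table defined above; the map key 'Federal Taxes' is a hypothetical value chosen only for illustration.

```sql
-- subordinates[0]            : array element by index (indexes start at 0)
-- deduction['Federal Taxes'] : map value by key (hypothetical key)
-- address.city               : struct field with dot notation
select
    name,
    subordinates[0]            as first_subordinate,
    deduction['Federal Taxes'] as federal_tax,
    address.city               as city
from mydb.employees;
```

An index or key that does not exist yields NULL rather than an error.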

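Describing a table

Use desc (or describe) to show a table's columns and structure.

-- show the table's columns and their data types
desc table_name;
-- show the data type of a single column
-- (Hive 2.0 and later separate the column with a space: desc table_name column_name;)
desc table_name.column_name;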
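When more than the column list is needed, describe also accepts the extended and formatted keywords. The sketch below assumes the mydb.employees table created earlier in this post.

```sql
-- extended prints the table metadata as one dense block after the column list
describe extended mydb.employees;

-- formatted prints the same metadata in a more readable layout,
-- including the table location and the table type
describe formatted mydb.employees;
```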

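Managed tables and external tables

A managed table is a table whose lifecycle is controlled by Hive. By default Hive stores its data under the directory set by hive.metastore.warehouse.dir in the configuration file. When a managed table is dropped through Hive, its data is deleted as well, that is, the files on HDFS are removed. The drawback of managed tables is that they are inconvenient for sharing data: for data produced by tools such as Pig, Hive has no ownership of the files. To query such data from Hive, create an external table that points to it, so that Hive does not need to own the data. An external table is declared with the external keyword.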

-- fields are separated by commas
-- exchange is quoted with backticks because it is a reserved keyword in newer Hive versions
create external table if not exists stocks(
	`exchange`		string,
	symbol			string,
	ymd				string,
	price_open		string,
	price_high		string,
	price_low		string,
	price_close		float,
	volume			int,
	price_adj_close	float)
row format delimited fields terminated by ','
location '/data/stocks';

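With the external keyword, dropping the table does not delete the data, although the metadata describing the table is removed. Metadata here means the description of the table, not the data stored in it.

Note that when a new table is created by copying the schema of an existing table (create table ... like ...): if the statement omits the external keyword and the source table is external, the new table is also external; if the source table is managed, the new table is also managed. When external is included, the new table is external regardless of whether the source table is managed or external.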
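A schema copy of this kind can be sketched as follows. The table name stocks_copy and its location are hypothetical. Because the resulting table type may depend on the Hive version, checking it with describe formatted is a reasonable precaution rather than relying on the rule alone.

```sql
-- stocks_copy and '/data/stocks_copy' are illustrative names, not from the original post
create external table if not exists stocks_copy
like stocks
location '/data/stocks_copy';

-- the "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE
describe formatted stocks_copy;
```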

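Partitioned tables

For a partitioned table, Hive creates a directory for each partition according to the partition columns. The advantage is hierarchical storage: queries that filter on the partition columns only need to scan the matching directories, which speeds them up. The drawback is that some data sits under the file directories while Hive can only read data through the table, which can waste resources. Creating a partitioned table: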

create table employees(
	name			string,
	salary			float,
	subordinates	array<string>,
	deduction		map<string,float>,
	address			struct<street:string, city:string, state:string, zip:int>
)
partitioned by (country string, state string);

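The partition directories that Hive creates on HDFS have the form …/employees/country=<value>/state=<value>, for example …/employees/country=US/state=CA.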
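The benefit of this layout shows up when a query filters on the partition columns. A minimal sketch against the employees table above:

```sql
-- country and state are partition columns, so only the directory
-- …/employees/country=US/state=CA has to be read
select name, salary
from employees
where country = 'US' and state = 'CA';
```

Partition columns can be used in where and select like ordinary columns, even though their values come from the directory names.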

List all partitions of a table

show partitions employees;

List only the partitions that match a given partition value

show partitions employees partition(country='CHINA');

Dropping a table

drop table if exists table_name;

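For a managed table, both the table metadata and the table data are deleted. For an external table, only the metadata is removed.

Altering a table

Renaming a table

Rename table a to b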

alter table a rename to b;

Adding partitions

-- general form: alter table table_name add partition (...)
-- add several partitions in a single statement
alter table table_name add if not exists
partition(...) location '/user/hive/warehouse/a'
partition(...) location '/user/hive/warehouse/b'
partition(...) location '/user/hive/warehouse/c';
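To make the elided partition specs concrete, here is a sketch with a hypothetical table log_messages partitioned by (year int, month int, day int). The table name, partition values, and paths are assumptions for illustration only.

```sql
-- add two partitions in one statement, each with its own location
alter table log_messages add if not exists
partition (year = 2011, month = 1, day = 1) location '/logs/2011/01/01'
partition (year = 2011, month = 1, day = 2) location '/logs/2011/01/02';

-- remove a partition again; for a managed table this also deletes its data
alter table log_messages drop if exists partition (year = 2011, month = 1, day = 2);
```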

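Changing column information

Rename column a to b and move it after the severity column.

-- type_name is the column's data type; it must be given even when it does not change
alter table table_name change column a b type_name
comment 'xxx'
after severity;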
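A filled-in version of the template may be easier to read. The table log_messages and its columns hms and severity are hypothetical names used only for this sketch.

```sql
-- rename hms to hours_minutes_seconds, keep the type int,
-- and place the column right after severity
alter table log_messages
change column hms hours_minutes_seconds int
comment 'The hours, minutes, and seconds part of the timestamp'
after severity;

-- use first instead of after ... to move a column to the front
alter table log_messages
change column severity severity string first;
```

These statements change only the table metadata; the underlying files are not rewritten, so the data must already match the new column order.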

Adding new columns

-- new columns are appended after the existing columns
alter table table_name add columns (
    app_name string comment 'Application name',
    session_id bigint comment 'the current session id'
);

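Dropping or replacing columns

alter table table_name replace columns (
	hour_mins_secs int comment 'xxx',
	severity string comment 'xxx'
);

This removes all previous columns from the table definition and keeps only the columns listed in replace columns.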

Changing table properties

alter table table_name set tblproperties('notes'='xxx');
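To confirm that a property was stored, the properties can be listed afterwards. A small sketch using the same placeholder names as above:

```sql
-- list all properties of the table
show tblproperties table_name;

-- or look up a single property by key
show tblproperties table_name('notes');
```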

Changing storage properties

alter table table_name partition(a=xxx,b=xxx,c=xxx) set fileformat sequencefile;

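This selects the given partition of the table and resets its file format. Only the metadata changes; existing files are not converted.

Loading and exporting data

Loading data from the local file system

load data local inpath '/home/hadoop/data.txt'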
overwrite into table employees
partition(country='US',state='CA');

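Note that a partitioned table is created with partitioned by, whereas loading data uses partition. If the partition directory does not exist, Hive creates it first.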
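For comparison, the same statement without local reads from HDFS instead of the local file system. This is a sketch; the path '/staging/employees/data.txt' is a hypothetical HDFS path.

```sql
-- without local, the path refers to HDFS and the file is moved into the partition directory
-- without overwrite, the file is added to whatever the partition already contains
load data inpath '/staging/employees/data.txt'
into table employees
partition(country='US',state='CA');
```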

Loading data from a query

insert overwrite table employees
partition(country='US',state='CA')
select * from table_name where xxx=xxx;
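The statement above fills one fixed partition. When the source rows cover many partitions, a dynamic-partition insert is one possible alternative. The sketch below assumes a hypothetical source table staged_employees with the same five data columns plus cnty and st columns holding the country and state.

```sql
-- allow all partition columns to be determined dynamically
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- the partition columns must come last in the select list,
-- in the same order as in the partition clause
insert overwrite table employees
partition(country, state)
select name, salary, subordinates, deduction, address, cnty, st
from staged_employees;
```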

Creating a table from a query

create table if not exists table_name as 
select * from table_name_b;

Exporting data

-- Method 1: use the tools provided by hadoop
hadoop fs -cp source_path target_path
-- Method 2: write the query result to a local directory
insert overwrite local directory '/home/hadoop/employees'
select * from employees;
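By default the exported files use Hive's default field delimiter, which is awkward to read in other tools. Assuming Hive 0.11 or later, a row format can be given for the output; this sketch exports only two simple columns as comma-separated text.

```sql
-- overwrite replaces any existing contents of the target directory
insert overwrite local directory '/home/hadoop/employees'
row format delimited fields terminated by ','
select name, salary from employees;
```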

Hive join operations

table stu

1	chenli	21
2	xuzeng	22
3	xiaodan	23
4	hua		24

table sub

1	chinese
2	english
3	science
5	nature
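To reproduce the join examples that follow, the two sample tables could be created roughly like this. The column name id comes from the join conditions used below; name, age, and subject are assumed names, and insert ... values assumes Hive 0.14 or later.

```sql
create table if not exists stu (id int, name string, age int);
create table if not exists sub (id int, subject string);

insert into table stu values
    (1, 'chenli', 21), (2, 'xuzeng', 22), (3, 'xiaodan', 23), (4, 'hua', 24);

insert into table sub values
    (1, 'chinese'), (2, 'english'), (3, 'science'), (5, 'nature');
```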

Inner join

An inner join uses the keywords join ... on. It returns only the rows of the two tables that satisfy the join condition.

select a.*,b.* from stu a join sub b on a.id=b.id;
-- result
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science

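join is followed by the table to join, and on specifies the join condition.

Left and right outer joins

A left outer join returns every row of the left table. Where the right table has a matching row its columns are shown; otherwise they are shown as NULL.

select a.*,b.* from stu a left outer join sub b on a.id=b.id;
-- result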
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science
4	hua		24	NULL	NULL

A right outer join is the mirror image of a left outer join; its keywords are right outer join xxx on xxx.
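Applied to the sample tables stu and sub, a right outer join keeps every row of sub. A minimal sketch, with the rows it should return listed as comments:

```sql
select a.*,b.* from stu a right outer join sub b on a.id=b.id;
-- expected rows (row order is not guaranteed):
-- 1	chenli	21	1	chinese
-- 2	xuzeng	22	2	english
-- 3	xiaodan	23	3	science
-- NULL	NULL	NULL	5	nature
```

Row 4 of stu has no partner in sub and is dropped, while row 5 of sub is kept with NULL on the left side.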

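The standard evaluation order of query clauses is from -> where -> group by -> having -> order by, with the select list evaluated after having and before order by.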
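A query that uses every one of these clauses can make the order easier to follow. This sketch runs against the sample table stu and assumes its third column is named age.

```sql
-- 1. from     : read rows from stu
-- 2. where    : keep rows with id > 1 (filters rows before grouping)
-- 3. group by : form one group per age
-- 4. having   : keep groups with at least one row (filters after aggregation)
-- 5. select   : compute the output columns
-- 6. order by : sort the final result
select age, count(*) as n
from stu
where id > 1
group by age
having count(*) >= 1
order by n desc;
```

This order is why an aggregate such as count(*) can appear in having but not in where.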

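Full outer join

A full outer join returns all rows of both the left and the right table. Where one side has no matching row, its columns are shown as NULL.

select a.*,b.* from stu a full outer join sub b on a.id=b.id;
-- result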
1		chenli	21		1		chinese
2		xuzeng	22		2		english
3		xiaodan	23		3		science
4		hua		24		NULL	NULL
NULL	NULL	NULL	5		nature

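Left semi join

A left semi join (left semi join) differs from a left outer join: only columns of the left table can be selected, and only the left-table rows that satisfy the condition after on are returned.

select a.* from stu a left semi join sub b on a.id=b.id;
-- result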
1	chenli	21
2	xuzeng	22
3	xiaodan	23
-- the following statement fails, because columns of the right table cannot be selected
select a.*,b.* from stu a left semi join sub b on a.id=b.id;
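A left semi join expresses the same idea as an in or exists subquery. Assuming Hive 0.13 or later, where such subqueries are supported in the where clause, the query above could also be sketched as:

```sql
-- rows of stu whose id appears in sub
select a.* from stu a
where a.id in (select b.id from sub b);

-- the same filter written with a correlated exists
select a.* from stu a
where exists (select 1 from sub b where a.id = b.id);
```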

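Cartesian product join

A join without an on condition pairs every row of the left table with every row of the right table.

select a.*,b.* from stu a join sub b;
-- result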
1	chenli	21	1	chinese
1	chenli	21	2	english
1	chenli	21	3	science
1	chenli	21	5	nature
2	xuzeng	22	1	chinese
2	xuzeng	22	2	english
2	xuzeng	22	3	science
2	xuzeng	22	5	nature
3	xiaodan	23	1	chinese
3	xiaodan	23	2	english
3	xiaodan	23	3	science
3	xiaodan	23	5	nature
4	hua		24	1	chinese
4	hua		24	2	english
4	hua		24	3	science
4	hua		24	5	nature
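The same product can be written with an explicit cross join, which makes the intent clearer than a join with a missing on clause. A small sketch on the sample tables, assuming a Hive version that supports the cross join keyword:

```sql
select a.*,b.* from stu a cross join sub b;

-- 4 rows in stu x 4 rows in sub = 16 rows
select count(*) from stu a cross join sub b;
```

Because the result grows with the product of the table sizes, some Hive configurations reject Cartesian products in strict mode, so this form is best kept to small tables.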