Hive表SQL操作

最新推荐文章于 2024-05-10 07:38:43 发布

摇篮里的小樱桃

最新推荐文章于 2024-05-10 07:38:43 发布

阅读量117

点赞数

分类专栏： Hive 文章标签： hive

本文链接：https://blog.csdn.net/m0_57479841/article/details/134845538

版权

Hive 专栏收录该内容

10 篇文章 1 订阅

订阅专栏

1. 通过 select 数据集创建表语法格式 ( 只能是内部表，不支持分区，分桶 )

create table table_name [stored as orc]
as
select ...

--如果通过select一个分区表中的数据创建表，则创建的新表的字段没有分区字段，只是把select的表的分区字段看成正常字段保存数据，因此新表不是分区表

2. 复制一个空表

--只复制表结构，不复制表数据
--如果复制的表是分区表，则新创建的表也是分区表，表结构统一
create table table_name like table_name;

3. 显示表列表

--显示所有表
show tables;
--最通用的写法
show tables like '*user*';
--显示函数
show functions like '*count*';
--以user开头的
show tables like 'user*';
--以other结尾的
show tables like '*other';

4. 删除表

drop table [if exists] table_name [purge];
--对于内部表来说，删除表的操作本质上是先删除表的元数据，然后再把表对应的HDFS上的数据放到回收站(hadoop fs -rm )，hadoop是配了回收站，加了purge相当于表对应在HDFS上的数据直接删除，不进回收站，且不能恢复，一般不用它。
--对于外部表来说，删除表操作只删除元数据，不会删除在HDFS上的存储数据，加purge也不会删除。

--回收站路径
/user/hadoop/.Trash/Current

5. 清除表

truncate table table_name [partition partition_spec];
--可以删除表和删除分区数据，和drop的区别是不删除元数据(表结构)，只删除数据，外部表是不能truncate操作的。

6. 修改表名

--改表名
alter table table_name rename to new_table_name;
--说明：
内部表修改了表名后，表对应的存储文件地址也跟着改变，相当于做了HDFS的目录重命名
外部表不会改对应的location地址

7. 添加表分区

内部表添加表分区，自动创建目录。

外部表添加表分区，自动创建目录。

--创建分区表外部表
create external table ext_task(
word string,
num int
)partitioned by (taskname string)
row format delimited fields terminated by '\t'
location '/user/cz/ext_task';
--创建分区表内部表
create table inner_task(
word string,
num int
)partitioned by (taskname string)
row format delimited fields terminated by '\t';

--分区表外部表添加分区
alter table ext_task add if not exists partition(taskname='wordcount') location 'wordcount';
alter table ext_task add if not exists partition(taskname='maxword') location 'maxword';
alter table ext_task add if not exists partition(taskname='sortword') location 'sortword';
--分区表内部表添加分区
alter table inner_task add if not exists partition(taskname='wordcount') location 'wordcount';
alter table inner_task add if not exists partition(taskname='maxword') location 'maxword';
alter table inner_task add if not exists partition(taskname='sortword') location 'sortword';

8. 删除表分区

--删除表分区的语法
alter table table_name drop if exists partition partition_spec[,partition partition_spec,...]
--说明:
内部表删除分区，同时删除分区对应的目录
外部表删除分区不删除分区对应的目录

--内部表删除分区
alter table inner_task drop if exists partition(taskname='sortword');
--外部表删除分区
alter table ext_task drop if exists partition(taskname='sortword');

9. 修改分区路径

--修改表或分区路径的语法
alter table table_name [partition partition_spec] set location "new location";
--说明
内部表/外部表 修改分区路径，元数据修改，但hdfs目录没有创建，等导入数据时创建，或者自己主动创建
insert into table inner_task partition(taskname='wordcount') select word,num from word_avro;

10. 分区重命名

--分区重命名语法:
alter table table_name partition partition_spec rename to partition partition_spec;
--说明:
如果是内部表，分区重命名，分区对应的地址也会跟着改变，外部表不会。

--内部表分区重命名会改变分区对应的hdfs的目录
alter table inner_task partition (taskname='maxword') rename to partition (taskname='maxword01');
--外部表分区重命名不会改变分区对应的hdfs的目录
alter table ext_task partition (taskname='maxword') rename to partition
(taskname='maxword01');

11. 添加和修改字段

--增加表字段，使用新列集合替换现有数据列的语法
alter table table_name add | replace columns (col_name data_type [comment col_comment],...)
说明:
add columns 可以在表列的最后和分区字段前面增加字段
示例:
alter table ext_test add columns(test_col string);

--修改表字段
--语法:
alter table table_name change [column] col_old_name col_new_name column_type [comment col_comment] [first | after column_name]
--示例:
alter table ext_test_c change column test_col test_col_new string;

12. 数据加载

（1）load加载数据

向表中添加数据除了可以使用insert语法 (不推荐)，还可以用hadoop fs -put的方式向表中添加数据。还有一种比较简单的用法就是可以直接通过load的方式加载数据。

--load数据加载语法格式
load data [local] inpath 'filepath' [overwrite] into table tablename [partition(partcol1=val1, partcol2=val2 ...)]

说明:

（1）Hive的数据加载不会对本地数据文件做任何处理，只是将文件或目录中的所有文件拷贝到表定义的数据目录，分桶表使用load加载数据会生成mapreduce任务，将数据分到多个桶文件进行存放

（2）指定local 本地文件上传，如果没有指定local，则是从HDFS上传数据。

（3）文件加载Hive没有做严格校验，文件格式和压缩选项等匹配需要用户自己保证。

（4）分区表要指定具体加载数据分区

（5）如果指定overwrite会覆盖相应表数据或分区数据，相当于rm原有目录数据，然后上传新数据文件

示例一：将HDFS中的文件load到表中

示例二：将本地文件load到表中

-- 分区表外部表
create external table ext_task1(
word string,
num int
) partitioned by (taskname string)
row format delimited fields terminated by '\t'
location '/user/cz/ext_task1';

--创建分区wordcount1
alter table ext_task1 add if not exists partition(taskname='wordcount') location 'wordcount';
--再load数据到表分区wordcount中
--①将数据放到hdfs中
hadoop fs -put word /user/cz
hadoop fs -ls /user/cz|grep word
load data inpath '/user/cz/word' into table ext_task1 partition(taskname='wordcount');
--②linux本地上传
scp word hadoop@nn2:/home/hadoop
load data local inpath '/home/hadoop/word' into table ext_task1 partition (taskname='wordcount');

--使用overwrite，通过linux本地覆盖上传数据，之前的分区就失效了，以新的分区为主，即taskname=word
load data local inpath '/home/hadoop/word' overwrite into table ext_task1 partition(taskname='wordcount')
--因此推荐使用从hdfs方式导入数据

--上传数据的时候如果没有分区，这个分区会自动创建
load data local inpath '/home/hadoop/word' into table ext_task1 partition(taskname='wordcount01')

--Hive新特性,3.0会对load更新会生成一个mapreduce，以分桶的方式把数据分到六个文件中
--旧版本是没有办法通过load直接往分桶表中加载数据的
create table teacher(id int,name string) clustered by (name) into 6 buckets;
load data local inpath '/home/hadoop/teacher' into table teacher;

（2）select 加载数据到hive表

--通过select，将select数据覆盖表或分区的语法
insert overwrite table tablename1 [partition (partcol1=val1,partcol2=val2 ... )] [if not exists]] 
select_statement1 from from_statement;
--通过select，将select数据追加到表或分区的语法
insert into table tablename1 [partition (partcol=val1,partcol2=val2...)]
select_statement1 from from_statement;

--示例:
--添加分区
--alter table ext_task1 add if not exists partition(taskname='wordcount02') location 'wordcount02';
--加载数据
insert into table ext_task1 partition(taskname='wordcount02') select word,num from word_avro;

13. 动态分区

如果有这样一个需求，从一张不是分区表中查询数据导入到分区表中。如果分区的个数比较多的时候，就需要多次查询导入，比如：

--学生表
create table student(
id int,
name string,
age int
)
row format delimited fields terminated by '\t';
--学生表数据
1    name1    12
2    name2    12
3    name3    13
4    name4    13
5    name5    14
6    name6    14
7    name7    15
8    name8    15
load data local inpath '/home/hadoop/student' into table student;
--学生分区表
create table student_dyna(
id int,
name string
)partitioned by (age int)
row format delimited fields terminated by '\t';
--把学生表里的数据按照年龄导入到学生分区表里
--需要动态分区不用执行太多次
--1. 开启动态分区(hive默认是不开启的，因此正常是静态分区)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--2.直接一条insert创建四个分区
insert overwrite table student_dyna partition(age) select id,name,age from student;

Hive默认是静态分区，在插入数据的时候要手动设置分区，如果源数据量很大的时候，那么针对一个分区就要写一个insert，比如有很多日志数据，要按日期作为分区字段，在插入数据的时候手动去添加分区太麻烦。因此，hive提供了动态分区，动态分区简化了插入数据时的繁琐操作。

14. Hive表数据导出

查询表数据导出到某个文件( linux本地/hdfs )

--语法(只能用overwrite，不加local会将select数据写入到hdfs文件)
insert overwrite [local] directory directory1
[row format row_format] [stored as file_format] select ... from ...
--示例
--将select的数据写入到linux本地文件中，多层目录会自动创建
select * from ext_task1 where taskname='wordcount01';
insert overwrite local directory '/home/hadoop/hive_test/output1' row format delimited fields terminated by '|' stored as orc select * from ext_task1 where taskname='workcount01';
--注: 指定存储格式为orc，则指定的分隔符会失效，因为orc有自己的分隔符

15. Hive表多文件导出数据（普通表和分区表都可以）

--语法:
from from_statement
insert overwrite [local] directory directory1 row_format
select_statement1 where
insert overwrite [local] directory directory1 row_format
select_statement2 where 

--示例: 从ext_task1分区表查询不同分区中的数据，分别导入到不同的文件系统，一个是linux本地，一个是hdfs
from ext_task1
insert overwrite directory 'hdfs://ns1/user/cz/output_avro5' stored as orc
select word,num where taskname='wordcount01'
insert overwrite local directory '/home/hadoop/hive_test/output_avro4' stored as orc
select word,num where taskname='wordcount02';

说明:

（1）如果不指定文件存储格式，则导出到文件系统的数据都序列化成默认的textfile，普通字段会进行正常的读取，非原始类型字段（复杂字段map、array）会序列化成json，导出文件以^A(\001)分隔 \n结尾的文本数据。

（2）insert overwrite到hdfs目录，可以通过MR Job实现并行写入，这样在集群上抽取数据不仅速度块，而且还很方便。

（3）批量导入多个文件，需要导出文件的类型一致，如果一个是avro，一个是textfile，则会报错

16. Hive -e -f 参数使用

在单独执行hive脚本的时候，会启动hive的一个客户端，在hive客户端里边执行hql语句。但有时候不想打开hive的客户端执行，想在linux的bash环境下执行相关的一些hql语法，就需要通过hive -e或者hive -f这种方式执行

--本质上启动了hive客户端，然后执行多条sql语句，把结果放到文件中
hive -e "use yae; select * from ext_task1 where taskname='wordcount01'" > ext_task.out

--想在后台运行，正确结果放到output.log文件，标准错误输出放大err.log，&代表后台运行
nohup hive -e "use yae; select * from ext_task1 where taskname='wordcount01'" 1> output.log 2> err.log &
--查看后台，是否有程序在运行
jobs -l

--如果hql语句较多的情况下，在字符串中放不下或者看起来乱，就可以把它写入在文件中
vi hql.log
use yae;
select * from student where age=15;

hive -f hql.log >student.out
--本质相同，启动了hive客户端，然后执行文件中多条sql语句，将标准输出写入到student.out

摇篮里的小樱桃

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive表SQL操作

Hive默认是静态分区，在插入数据的时候要手动设置分区，如果源数据量很大的时候，那么针对一个分区就要写一个insert，比如有很多日志数据，要按日期作为分区字段，在插入数据的时候手动去添加分区太麻烦。（1）Hive的数据加载不会对本地数据文件做任何处理，只是将文件或目录中的所有文件拷贝到表定义的数据目录，分桶表使用load加载数据会生成mapreduce任务，将数据分到多个桶文件进行存放。（5）如果指定overwrite会覆盖相应表数据或分区数据，相当于rm原有目录数据，然后上传新数据文件。
复制链接

扫一扫