HiveQL

孙喔喔的gorilla

Article link: https://blog.csdn.net/qq_37113621/article/details/84035264

create table student(
id string,
name string
) row format delimited fields terminated by '\t';

Load a local file into the table:
load data local inpath '/usr/local/data.txt' into table student;
Load a file that is already on HDFS into the table:
load data inpath '/haha.log' into table student;

Start the service by running hiveserver2 from the hive/bin directory.
Then start beeline and connect:
!connect jdbc:hive2://localhost:10000

Drop a table:
drop table users;
Remove all rows from a table:
truncate table tablename;
Create an external table:
create external table exuser(id int, name string)
row format delimited fields terminated by '\t';
Show the database currently in use:
select current_database();

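Differences between internal and external tables

Create an internal (managed) table (user is a reserved word in recent Hive versions, so it is quoted with backticks):
create table `user`(id int, name string)
row format delimited fields terminated by '\t';
Create an external table:
create external table exuser(id int, name string)
row format delimited fields terminated by '\t'
location '/dbdata/';
(The location is a directory path on HDFS.)

1. The external keyword is what distinguishes the two.
2. An external table usually references data owned by something else, and the original data should not be damaged, so it is combined with location to specify where the table reads its data.
3. Dropping or truncating an internal table really deletes the data. Dropping an external table deletes only the table's metadata and leaves the files at the specified location untouched.
An internal table can also use location.
The load data inpath command (loading from HDFS) moves the source file into the directory the table uses.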
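
The drop behaviour described in point 3 can be checked by hand. The statements below are a minimal sketch that assumes the exuser table and the /dbdata/ directory from the example above already exist and contain data.

```sql
-- Inspect the table metadata; the "Table Type" row shows
-- MANAGED_TABLE for an internal table or EXTERNAL_TABLE for an external one.
describe formatted exuser;

-- Dropping an external table removes only the metastore entry;
-- the files under /dbdata/ stay on HDFS.
drop table exuser;

-- Re-creating the table over the same location makes the data queryable again.
create external table exuser(id int, name string)
row format delimited fields terminated by '\t'
location '/dbdata/';

select * from exuser limit 10;
```

Running the same sequence against an internal table would leave the re-created table empty, because the drop also deletes the files.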

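Partitioned tables

A partitioned table is like a table split into smaller tables.
When a query filters on the partition columns, Hive reads less data, so the query runs faster.
Partitions can be nested in several levels, i.e. small tables inside small tables.
create table gmjTime(id int, name string)
partitioned by(year string,month string)
row format delimited fields terminated by '\t';

After creating the partitioned table, specify the target partition when loading data; otherwise Hive does not know where to put the data file.
With two partition columns, both must be specified.
load data inpath '/dbdata/women.log' [overwrite] into table gmjTime partition (year='2018',month='12');
The square brackets mark overwrite as optional. Make sure every row in the file belongs to this partition.

insert into birthdays partition(month='01') select id,name from persons where month ='01';
With a static partition value, the select list contains only the non-partition columns. If there are many partition values, for example when partitioning by day, handling them one partition at a time is tedious; use dynamic partitioning instead.
View the partitions of a table:
show partitions table_name;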
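
To illustrate how partitions cut down the data a query reads, here is a short sketch against the gmjTime table defined above. It assumes the 2018/12 partition has already been loaded; the 2018/11 partition added here is only an example value.

```sql
-- List the partitions that currently exist for the table.
show partitions gmjTime;

-- Filtering on the partition columns lets Hive read only the matching
-- partition directory instead of scanning the whole table.
select id, name
from gmjTime
where year = '2018' and month = '12';

-- An empty partition can also be registered explicitly before loading data into it.
alter table gmjTime add if not exists partition (year='2018', month='11');
```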

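Dynamic partitioning

When to use it
The data you receive is not necessarily split into one file per partition, so it cannot simply be loaded into a partition with load.
Dynamic partitioning is the usual way to handle this.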
+-------------+---------------+----------------+-----------------+
| persons.id  | persons.name  | persons.month  | persons.gender  |
+-------------+---------------+----------------+-----------------+
| 1           | caonima       | 02             | man             |
| 2           | ruizhi        | 03             | man             |
| 3           | nmsl          | 04             | man             |
| 4           | lcktienaocan  | 12             | man             |
| 5           | guerlck       | 12             | man             |
| 7           | nvshen        | 03             | women           |
| 8           | nvshengjing   | 09             | women           |
| 9           | guigui        | 12             | women           |
| 10          | caonima       | 07             | women           |
| 11          | gunbiduzi     | 112            | women           |
+-------------+---------------+----------------+-----------------+
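Given data like the above, load it into a table partitioned by month:
1. Load the unsorted data into an ordinary table (persons here).
2. Create the corresponding partitioned table:
create table birthdays(id int,name string) partitioned by(month string) row format delimited fields terminated by '\t';
3. Insert into the partitioned table using dynamic partitioning:
insert into birthdays partition(month) select id,name,month from persons;

3.1 When every partition column is dynamic, strict mode must be turned off:
set hive.exec.dynamic.partition.mode=nonstrict;
3.2 Do not assign a fixed value in partition(month), otherwise it is no longer a dynamic partition.
3.3 Dynamic partitioning takes the last column(s) of the select result as the partition value, so order the select list accordingly.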
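
Put together, the three steps could look like the sketch below. The persons column list is taken from the sample output above; the input path /tmp/persons.txt is a hypothetical placeholder, not a path from the original post.

```sql
-- Step 1: staging table holding the unpartitioned rows.
create table if not exists persons(id int, name string, month string, gender string)
row format delimited fields terminated by '\t';

-- Hypothetical input file; replace with the real path.
load data local inpath '/tmp/persons.txt' into table persons;

-- Step 2: target table partitioned by month.
create table if not exists birthdays(id int, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- Step 3: allow fully dynamic partitions, then insert.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (month) must come last in the select list.
insert into table birthdays partition (month)
select id, name, month from persons;

-- One partition per distinct month value should now be listed.
show partitions birthdays;
```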

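Bucketed tables

Enable bucketing enforcement (needed on Hive 1.x; the property was removed in Hive 2.0, where bucketing is always enforced):
set hive.enforce.bucketing=true;
Create a bucketed table:
create table mytable(
id int,
name string)
clustered by(id) sorted by(id DESC) into 2 buckets
row format delimited fields terminated by '\t' stored as textfile;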
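
A bucketed table is normally filled with insert ... select, so that Hive can hash each row into the right bucket. The following is a rough sketch; mytable_src is a hypothetical staging table introduced only for this example.

```sql
-- Hypothetical staging table with the same columns as mytable.
create table if not exists mytable_src(id int, name string)
row format delimited fields terminated by '\t';

-- Only needed on Hive 1.x; skip this line on Hive 2.0 and later,
-- where the property no longer exists.
set hive.enforce.bucketing=true;

-- Rows are hashed on id into the 2 buckets declared on mytable.
insert overwrite table mytable
select id, name from mytable_src;

-- Read only the first of the two buckets.
select * from mytable tablesample(bucket 1 out of 2 on id);
```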

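Grouping (group by)

1. With group by, the query no longer runs row by row; it runs per group, keyed on the grouping columns.
The data is handled one group at a time, so the select list cannot reference individual columns other than the grouping keys and aggregate functions.
2. The default column name of an aggregate result is awkward to use; give it an alias so that it can be used in conditions.
Within the same query level, use it in having; when the query is used as a subquery, the outer where can use it.
3. where filters rows before aggregation, while having applies its condition after aggregation has finished.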
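
The difference between where and having, and the use of an alias from an outer query, can be sketched with the persons table shown earlier. The thresholds are arbitrary example values.

```sql
-- where removes rows before grouping; having removes groups after aggregation.
select gender, count(*) as cnt
from persons
where month <> '12'
group by gender
having count(*) >= 2;

-- Wrapped as a subquery, the alias cnt is an ordinary column,
-- so the outer query can filter on it with where.
select t.gender, t.cnt
from (
  select gender, count(*) as cnt
  from persons
  group by gender
) t
where t.cnt >= 2;
```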

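Differences between order by, sort by, distribute by and cluster by

order by: sorts the input globally, so there is only one reducer and a single reduce task output (for example a file named 000000_0). With large inputs this takes a long time.
sort by: not a global sort; the data is sorted before it enters each reducer. If mapred.reduce.tasks>1, sort by only guarantees that each reducer's output is sorted, not that the overall result is.
distribute by: sends rows to different reducers according to the specified column(s), using hashing.
cluster by: does what distribute by does and also sorts by the same column. So when the distribution column and the sort column are the same, cluster by = distribute by + sort by. When they differ, cluster by cannot be used.
Purpose of bucketed tables: the main benefit is more efficient joins.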
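
The four clauses side by side, as a small sketch against the persons table from the dynamic partitioning section. The reducer count of 2 is an arbitrary example value.

```sql
-- Global order: a single reducer produces one fully sorted output.
select id, name, month from persons order by id;

-- Sorted within each reducer only; with several reducers there is no global order.
set mapred.reduce.tasks=2;
select id, name, month from persons sort by id;

-- Rows with the same month go to the same reducer,
-- then are sorted by id (descending) inside that reducer.
select id, name, month from persons distribute by month sort by id desc;

-- Shorthand for "distribute by month sort by month"; the sort is ascending only.
select id, name, month from persons cluster by month;
```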

Data types

Primitive data types

tinyint (byte in Java)
smallint (short in Java)
int
bigint (long in Java)
float
double
boolean
string

Complex data types

Create a table with an array column:
create table movies(name string,actors array<string>,times date)
row format delimited fields terminated by ','
collection items terminated by ':';
Access elements by index, e.g. actors[0].
Create a table with a map column:
create table maps(id int,name string,family map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
Access values with ['key'].
Create a table with a struct column:
create table `struct` (id int,name string,info struct<age:int,gender:string,city:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Access fields with a dot (.), e.g. info.gender.
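
The three access styles can be tried with queries like the ones below. This is a sketch only: the sample input lines in the comments and the map key 'father' are hypothetical values chosen to match the delimiters declared above.

```sql
-- Hypothetical input line for movies:  movieA,actor1:actor2,2018-01-01
-- Array: zero-based index; size() returns the number of elements.
select name, actors[0] as first_actor, size(actors) as actor_count
from movies;

-- Hypothetical input line for maps:  1,tom,father:bob#mother:amy,20
-- Map: look up a value by key; a key that is absent yields NULL.
select name, family['father'] as father
from maps;

-- Hypothetical input line for struct:  1,tom,20:man:beijing
-- Struct: dot notation on the column name.
select name, info.gender, info.city
from `struct`;
```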