Hive笔记

最新推荐文章于 2024-09-13 20:54:13 发布

失败的Cc

最新推荐文章于 2024-09-13 20:54:13 发布

阅读量177

点赞数

分类专栏： Hive 文章标签： hive big data hadoop

本文链接：https://blog.csdn.net/DreamMaker_DI/article/details/120401033

版权

Hive 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

Hive

表和库的管理

创建库
```
create database db_hive;
```

创建表

# 格式
create [external] table [if not exists] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[comment table_comment] 
[partitioned by (col_name data_type [comment col_comment], ...)] 
[clustered by (col_name, col_name, ...) 
[sorted by (col_name [ASC|DESC], ...)] into num_buckets BUCKETS] 
[row format row_format] 
[stored as file_format] 
[location hdfs_path]

# 创建普通表
create table if not exists student2(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';

# 创建外部表
create external table if not exists default.dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
# 导入数据到外部表
load data local inpath '/opt/module/datas/dept.txt' into table default.dept;

# 创建一级分区表,每个分区在HDFS是分目录存储
create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';
# 导入外部数据到分区表
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition（month='201707');

# 创建二级分区表
create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
# 导入外部数据到分区表
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');

create table创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常；用户可以用 IF NOT EXISTS 选项来忽略这个异常。
external关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION），Hive创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。
comment：为表和列添加注释。
partitioned by创建分区表
clustered by创建分桶表
sorted by不常用
row format

用户在建表的时候可以自定义SerDe或者使用自带的SerDe。如果没有指定ROW FORMAT 或者ROW FORMAT DELIMITED，将会使用自带的SerDe。在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的SerDe，Hive通过SerDe确定表的具体的列的数据。

SerDe是Serialize/Deserilize的简称，目的是用于序列化和反序列化。

stored as指定存储文件类型

常用的存储文件类型：SEQUENCEFILE（二进制序列文件）、TEXTFILE（文本）、RCFILE（列式存储格式文件）

如果文件数据是纯文本，可以使用STORED AS TEXTFILE。如果数据需要压缩，使用 STORED AS SEQUENCEFILE。

location：指定表在HDFS上的存储位置。
like允许用户复制现有的表结构，但是不复制数据。

查询库和表

show databases like 'db_hive*'
show tables;

desc database db_hive;
desc database extended db_hive;

# 查询表时管理表还是外部表
desc formatted student2;

修改数据库和表的属性

# 修改数据库的属性
alter database db_hive set dbproperties('createtime'='20170830');

# 将管理表转换为外部表
alter table student2 set tblproperties('EXTERNAL'='TRUE')

# 将外部表转换为内部表
alter table student2 set tblproperties('EXTERNAL'='FALSE')

删除表和库

drop database db_hive;

# 如果数据库不为空，采用cascade
drop database db_hive cascade;

# 删除管理表
truncate table student;

查询表中的数据

# 查询单个分区的内容
select * from dept_partition where month = '201709';

# 查询多个分区的内容
select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708'
union
select * from dept_partition where month='201707';

# 查询二级分区表中的内容
select * from dept_partition2 where month = '201709' and day = '13';

增加分区

# 增加单个分区
alter table dept_partition add partition(month = '201706');

# 同时创建多个分区
alter table dept_partition add partition(month = '201706') partition(month = '201704')

删除分区

# 删除单个分区
alter table dept_partition drop partition(month = '201704');

# 删除多个分区
alter table dept_partition drop partition(month = '201705') partition(month = '201706');

修改表名

alter table table_name rename to new_table_name;

导入数据
```
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)]
```
- load data:表示加载数据
- local:表示从本地加载数据到hive表；否则从HDFS加载数据到hive表
- inpath:表示加载数据的路径
- overwrite:表示覆盖表中已有数据，否则表示追加
- into table:表示加载到哪张表
- student:表示具体的表
- partition:表示上传到指定分区

自定义函数

UDF（User-Defined-Function）：一进一出

创建一个类，继承UDF类，并实现evaluate方法

import org.apache.hadoop.hive.ql.exec.UDF;

public class Lower extends UDF {
	public String evaluate (final String s) {
		if (s == null) {
			return null;
		}
		return s.toLowerCase();
	}
}

打成jar包上传到HDFS
将jar包添加到Hive的classpath，hive (default)> add jar /opt/module/datas/udf.jar;
创建临时函数，并与开发好的class关联，hive (default)> create temporary function mylower as "com.atguigu.hive.Lower";
在Hive中使用创建好的函数，hive (default)> select ename, mylower(ename) lowername from emp;

UDAF（User-Defined Aggregation Function）：聚集函数，多进一出

UDTF（User-Defined Table-Generating Functions）：一进多出

窗口函数和排名函数

格式

# 函数的格式
函数 over( partition by 字段 ,order by 字段  window_clause )

# window_clause的格式
(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following

# 窗口函数
rank: 允许并列，一旦有并列跳号！ 
row_number: 行号！ 连续的，每个号之间差1！
dense_rank： 允许并列，一旦有并列不跳号！
cume_dist：  从排序后的第一行到当前值之间数据 占整个数据集的百分比！
precent_rank：  rank-1/ 总数据量-1   
ntile(x):  将数据集均分到X个组中，返回每条记录所在的组号

数据集

name    orderdate       cost
jack    2017-01-01      10
tony    2017-01-02      15
jack    2017-02-03      23
tony    2017-01-04      29
jack    2017-01-05      46
jack    2017-04-06      42
tony    2017-01-07      50
jack    2017-01-08      55
mart    2017-04-08      62
mart    2017-04-09      68
neil    2017-05-10      12
mart    2017-04-11      75
neil    2017-06-12      80
mart    2017-04-13      94

小知识

排名函数和LAG,LEAD不支持指定窗口大小
在over()中既没有出现windows_clause，也没有出现order by，窗口默认为rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING,即从上边界到下边界
在over()中(没有出现windows_clause)，指定了order by，窗口默认为rows between UNBOUNDED PRECEDING and CURRENT ROW，从上边界到当前记录
窗口函数和分组的区别
- 如果是分组操作，select后只能写分组后的字段
- 如果是窗口函数，窗口函数是在指定的窗口内，对每条记录都执行一次函数
- 如果是分组操作，有去重效果，而partition不去重！

聚合过程

聚合函数

# 输出每个人的信息以及总共有多少个人
select name,orderdate,cost, count(*) over()
from business

数据集

name    orderdate       cost
jack    2017-01-01      10
tony    2017-01-02      15
jack    2017-02-03      23
tony    2017-01-04      29
jack    2017-01-05      46
jack    2017-04-06      42
tony    2017-01-07      50
jack    2017-01-08      55
mart    2017-04-08      62
mart    2017-04-09      68
neil    2017-05-10      12
mart    2017-04-11      75
neil    2017-06-12      80
mart    2017-04-13      94

例题

（1）查询在2017年4月份购买过的顾客及该顾客购买的次数

select name, count(*) over()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

# 等价于

select name, count(*) over(rows between unbounded preceding and unbounded following)
from business
where substring(orderdate, 1, 7)='2017-04'
group by name;

（2）查询顾客的购买明细及每个月的购买总额

select name, orderdate, cost, sum(cost) over(partition by name, substring(orderdate, 1, 7))
from business;
如果只需要查询每个人每个月的购买总额

select name, substring(orderdate, 1, 7), sum(cost)
from business
group by name, substring(orderdate, 1, 7);

（3）查询顾客的购买明细要将cost按照日期进行累加(over中有order by时默认是从窗口顶部到当前行)

select name, orderdate, cost, sum(cost) over(partition by name order by orderdate)
from business;

（4）查询顾客的购买明细及顾客上次的购买时间

select name, orderdate, cost, lag(orderdate, 1, '无数据') over(partition by name order by orderdate)
from business;

(5) 查询顾客的购买明细及顾客下次的购买时间

select name, orderdate, cost, lead(orderdate, 1, '无数据') over(partition by name order by orderdate)
from business;

(6) 查询顾客的购买明细及顾客本月第一次购买的时间

select name, orderdate, cost, first_value(orderdate, true) over(partition by name, substring(orderdate, 1, 7) order by orderdate)
from business;

(7) 查询顾客的购买明细及顾客本月最后一次购买的时间

select name, orderdate, cost, last_value(orderdate, true) over(partition by name, substring(orderdate, 1, 7) order by orderdate rows between current ROW and unbounded following)
from business;

(8)查询顾客的购买明细及顾客最近三次cost花费

当前行的和前两次的和
select name, orderdate, cost, sum(cost) over(partition by name order by orderdate rows between 2 preceding and current row)
from business;

当前行和前一行和后一行的和
select name, orderdate, cost, sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following)
from business;

(9) 查询前20%时间的订单信息

select *
from
(
   select name, orderdate, cost, cume_dist() over(order by orderdate 
) cdnum
from business) tmp
where cdnum <= 0.2;

数据集

name    subject score
孙悟空  数学    95
宋宋    数学    86
婷婷    数学    85
大海    数学    56
宋宋    英语    84
大海    英语    84
婷婷    英语    78
孙悟空  英语    68
大海    语文    94
孙悟空  语文    87
婷婷    语文    65
宋宋    语文    64

例题

（1）将每个版本进行排名

select name, subject, score, rank() over(partition by subject order by score desc)
from score;

（2）给每个学生计算总分后进行排名

select name, sumscore, rank() over(order by sumscore desc)
from
(
  select name, sum(score) sumscore
  from score
  group by name
) t;

（3）求每个学生的成绩明细及对应学生的总分和总分排名

select name, subject, score, dense_rank() over(order by tmp.sumscore desc)
from
(
  select name, subject, score, sum(score) over(partition by name) sumscore
  from score
) tmp;

（4）查询每个科目的前2名

select name, subject, score
from 
(
  select name, subject, score, rank() over(partition by subject order by score desc) rn
  from score
) tmp
where rn <= 2;

（5）查询每个学生成绩明细，并显示当前科目的最高分

select name, subject, score, max(score) over(partition by subject)
from score;



select name, subject, score, first_value(score) over(partition by subject order by score desc)
from score;

（6）查询学生成绩，并显示当前科目最低分

select name, subject, score, min(score) over(partition by subject)
from score


select name, subject, score, first_value(score) over(partition by subject order by score )
from score