Hive简介及常用操作

最新推荐文章于 2023-05-04 16:17:38 发布

SeafyLiang

最新推荐文章于 2023-05-04 16:17:38 发布

阅读量742

点赞数 2

分类专栏：大数据文章标签：数据分析 big data hive 数据挖掘大数据

本文链接：https://blog.csdn.net/SeafyLiang/article/details/121424579

版权

大数据专栏收录该内容

6 篇文章 2 订阅

订阅专栏

一、Hive简介

1.1 什么是hive

facebook开源用于解决海量结构化日志的数据统计工具
基于hadoop的一个数据仓库工具，可以将结构化数据文件映射成为一张表，并提供类SQL查询功能。
hive的本质是把HQL(hive sql)解析为MapReduce作业。还支持以Tez和Spark作为执行引擎。

1.2 为什么要使用hive

hive的出现的主要原因是因为hadoop的MapReduce代码通常都很复杂，而sql语法相对于MapReduce代码语言要更加简洁，学习成本更低。

1.3 hive的优缺点

优点：

采用sql作为开发语言，学习成本低，开发简洁。
支持大量分析函数，如开窗函数等，所以适合作为olap工具。
基于hdfs分布式存储和yarn分布式计算，有很好的横向扩展性。
支持海量数据的分析计算，在小数据量的情况下并无优势。
支持自定义函数，在系统默认函数无法解决时，可以采用自定义函数处理。

缺点

hive基于MapReduce作为默认执行引擎，执行效率低。
sql表达能力有限，某些复杂功能，需要使用代码实现。

二、hive和greenplum

1.1 区别

hive	greenplum
查询速度	慢
存储	hdfs
计算	默认MR
事务	不支持
数据量	超大规模数据量

1.2 常用数据类型对比

greenplum	hive
int2	SMALLINT
int4	INT
int	BIGINT
float4	FLOAT
float8	DOUBLE
numeric	DECIMAL
varchar	STRING
bpchar	STRING
text	STRING
boolean	BOOLEAN
date	date
timestamp	timestamp

1.3 hive复杂数据类型介绍

数据类型	描述	示例
STRUCT	一种嵌套结构，内部数据可以通过.来获取。	structstreet:string,city:string
MAP	是一个key-value类型的。通过[‘key’]获取元素	map<string,int>
ARRAY	数组类型。通过[index]获取元素	array

三、常见用法

3.1 DDL

3.1.1 创建数据库

CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

3.1.2 创建一个数据库

# 默认存储路径是hdfs上的/user/hive/warehouse/*.db
create database db;
# 指定hdfs存储路径/data/warehouse/db/
create database db location '/data/warehouse/db/';

3.1.3 查看数据库

hive (liker)> desc database db;
OK
db_name	comment	location	owner_name	owner_type	parameters
db		hdfs://hadoop001:9820/data/warehouse/db	liker	USER

3.1.4 显示数据库详细信息

hive (liker)> desc database extended db;
OK
db_name	comment	location	owner_name	owner_type	parameters
db		hdfs://hadoop001:9820/data/warehouse/db	liker	USER	{crtdate=2021-11-11}

3.1.5 切换数据库

# 切换到db数据库
hive (liker)> use db;
OK
Time taken: 0.013 seconds
hive (db)>

3.1.6 常用建表语句

CREATE EXTERNAL TABLE dms_table (
    `id` STRING COMMENT 'ID',
    `wnumber` STRING COMMENT 'SN',
    `itemcode` STRING COMMENT '整车物料编码',
    `itemname` STRING COMMENT '整车物料描述',
    `series_code` STRING COMMENT '产品系列编码'
) COMMENT '电摩主部件'
PARTITIONED BY (`dt` STRING COMMENT '时间分区',`rgn` STRING COMMENT '大区') 
ROW FORMAT DELIMITED FIELDS TERMINATED by '\001' 
STORED AS parquet 
LOCATION '/data/hive/warehouse/db/dws/dms_table' TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

3.2 常用关键字/语法

关键字/语法	描述
EXTERNAL	指定表为外部表，默认为内部表,通常指定为外部表
PARTITIONED BY	指定分区字段
ROW FORMAT DELIMITED FIELDS TERMINATED by	指定字段分隔符
STORED AS	指定存储格式，默认为text文本。通常使用orc/parquet列式存储
LOCATION	指定表的数据在HDFS上的存储路径
TBLPROPERTIES	设置表的配置，例如配置压缩格式、内部表/外部表。

3.3 内部表/外部表

hive的数据是存储在hdfs上的，内部表和外部表的差别就是，当删除表的时候，hdfs的文件会不会被删除。
由于删除内部表的时候，数据也会被删除。为了数据安全通常正式建表，都是创建外部表，并且指定外部表存储路径。
内部表通常用于测试使用，用完删了即可，数据也会被删除，不会占用集群存储空间。例如，在数据校验时，作为对比的表，就可以创建内部表，当校验结束后，把表删除即可。

# 创建内部表
create table table_name ...
# 创建外部表
create external table table_name ...
# 内部表和外部表转换
alter table table_name set tblproperties('EXTERNAL'='TRUE');
# 通过desc查看
desc formatted table_name;
......
	EXTERNAL            	TRUE   
......

3.4 分区

分区的作用是把数据分开存储，在hdfs上将一个表的不同分区的数据分开存储。优点就是在查询的时候，可以通过指定分区来减少查询时底层需要加载的数据量。
例如增量存储的数据中，可以把数据按照天来存储。

0B dt=20211101
0B dt=20211102
0B dt=20211103

3.4.1 创建分区表

# 创建按天分区
create external table partition_table(
...)
partitioned by (day string,rgn string)

3.4.2 分区表使用

① 添加测试数据

create table test_partition (id int, name string)
partitioned by(dt string);
insert overwrite table test_partition partition(dt='20211101') values(3,'hehe');
insert overwrite table test_partition partition(dt='20211102') values(3,'hehe');

② 查看分区

show partitions test_partition;
dt=20211101
dt=20211102

③ 手动添加分区

# 添加分区
alter table table_name if not exists add partition (dt='20211103')
# 添加后，查看分区
show partitions test_partition;
dt=20211101
dt=20211102
dt=20211103

④ 手动删除分区

# 删除分区
alter table table_name drop partition (dt='20211101')
# 删除后查看分区
show partitions test_partition;
dt=20211102
dt=20211103

# 查询分区的数据
# 当只需要某一天或某几天的数据时，可以在where后添加分区的过滤条件
 select * from test_partition where day='20210101'
# or
 select * from test_partition where day='20210101' or day='20210102'

3.4.3 insert into & insert overwrite

insert into：向指定表增量添加数据。通常不会使用
insert overwrite：向指定表添加数据，覆盖之前的数据，通常结合分区使用。

# insert into 示例
create table test_tab (id int, name string);
insert into test_tab values(1,'kwang');
insert into test_tab values(2,'fwang');
select * from test_tab;
1       kwang
2       fwang

# insert overwrite 示例
insert overwrite table test_tab values(4,'zz');
select * from test_tab;
4        hehe

向分区表添加数据，向分区表添加数据，必须指定分区，否则报错

create table test_partition (id int, name string)
partitioned by(dt string);
insert into test_partition partition(dt='20211101') values(1,'kwang');
insert into test_partition partition(dt='20211101') values(2,'fwang');
select * from test_partition ;
1       kwang
2       fwang

# insert overwrite
insert overwrite table test_partition partition(dt='20211101') values(3,'hehe');
select * from test_partition ;
3    hehe

3.4.4 动态分区

以上介绍的分区操作都是在添加的时候指定分区，如果在向某个表添加数据时，需要根据表中的具体数据来添加到指定分区。

添加测试数据

create table test_data (id int, name string,day_date string);
insert into table test_data values(1,'mapreduce','20211101'),(2,'hadoop','20211102'),(4,'hive','20211102'),(3,'spark','20211103');
select * from test_data;
1    mapreduce    20211101
2    hadoop       20211102
4    hive         20211102
3    spark        20211103

添加数据到分区表

# 创建分区表
create table test_partition (id int, name string)
partitioned by(dt string);
# 在hive中动态分区默认是关闭的，使用前需要手动打开，并且关闭严格模式。需要设置如下两个参数
SET hive.exec.dynamici.partition = TRUE;
SET hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table test_partition partition(dt)
select 
id,
name,
day_date as dt
from test_data;
# 验证分区的数据
select * from test_data where dt='20211101';
1    mapreduce    20211101
select * from test_data where dt='20211102';
2    hadoop       20211102
4    hive         20211102
select * from test_data where dt='20211103';
3    spark        20211103

3.4.5 去重操作，避免使用distinct，使用group by

# 避免使用distinct的方式
select distinct name,age,tel from user
# 使用group by
select name,age,tel from user group by name,age,tel
# 同样的count(distinct) 改为group by
select distinct name,age,tel from user
select name,age,tel,count(1) cnt from user group by name,age,tel

3.4.6 常用函数

nvl：接受两个参数，如果第一个为null，取第二个值

select nvl(null,1)
1
select nvl(1,2,)
1

coalesce: 返回第一个不为null的值

select coalesce(null,'1',3)
1

3.4.7 数组操作

# 添加测试数据
create table test_array_data(id int,name string,hoby string);
insert into test_array_data values(1,'hive','hiveserver2,beenline,hadoop'),(2,'hdfs','namenode,hadoop,datanode'),(3,'spark','hadoop,rdd,sql,dataframe,dataset')
drop table test_arr;
select id,name,split(hoby,',') hoby from test_array_data;
create table test_arr(id int,name string,hoby array<string>);
insert into table test_arr select id,name,split(hoby,',') hoby from test_array_data;
select * from test_arr;
1 hive  ["hiveserver2","beenline","hadoop"]
2 hdfs  ["namenode","hadoop","datanode"]
3 spark ["hadoop","rdd","sql","dataframe","dataset"]

select id,name,split(hoby,',') hoby from test_array_data;
# explode
select id,name,h from test_arr lateral view explode(hoby) test_arr as h 
1 hive  hiveserver2
1 hive  beenline
1 hive  hadoop
2 hdfs  namenode
2 hdfs  hadoop
2 hdfs  datanode
3 spark hadoop
3 spark rdd
3 spark sql
3 spark dataframe
3 spark dataset
# size
select id,name,size(hoby) len from test_arr;
id  name len
1 hive  3
2 hdfs  3
3 spark 5
# array_contains
select id,name,array_contains(hoby,'beenline') from test_arr;1 hive  true
2 hdfs  false
3 spark false
# 根据下标获取
select hoby[1] from test_arr;
beenline
hadoop
rdd
# concat_ws 
select id,name,concat_ws('-',hoby) from test_arr;
1 hive  hiveserver2-beenline-hadoop-hadoop
2 hdfs  namenode-hadoop-datanode
3 spark hadoop-rdd-sql-dataframe-dataset

3.4.8 字符串转数组

-- 测试数据 
insert overwrite table test_array_data values(1,'hive','hiveserver2,beenline,hadoop,hadoop'),(2,'hdfs','namenode,hadoop,datanode'),(3,'spark','hadoop,rdd,sql,dataframe,dataset')
insert overwrite table test_arr select id,name,split(hoby,',') hoby from test_array_data;
# 查询数据
with temp as (
select id,name,h from test_arr lateral view explode(hoby) test_arr as h
)select id,name,collect_list(h) h_list,collect_set(h) h_set from temp group by id,name

# collect_list & ,collect_set
id  name  h_list  h_set
1 hive  ["hiveserver2","beenline","hadoop","hadoop"]  ["hiveserver2","beenline","hadoop"]
2 hdfs  ["namenode","hadoop","datanode"]  ["namenode","hadoop","datanode"]
3 spark ["hadoop","rdd","sql","dataframe","dataset"]  ["hadoop","rdd","sql","dataframe","dataset"]
# split 
select id,name,split(hoby,',') hoby from test_array_data;
1 hive  ["hiveserver2","beenline","hadoop","hadoop"]
2 hdfs  ["namenode","hadoop","datanode"]
3 spark ["hadoop","rdd","sql","dataframe","dataset"]

3.4.9 常用日期函数

date_add: 用来为给定的日期增加指定的天数

# 格式 string date_add(string startdate, int days)
# 增加2天
select date_add('2021-10-10',2) # 2021-10-12
# 减去1天
select date_add('2021-10-10',-1) # 2021-10-09

datediff：计算两个日期相差的天数计算时会忽略时分秒

# 格式 int datediff(string enddate, string startdate) 
# 计算2021-10-10 和2021-10-14相差几天
select datediff('2021-10-10','2021-10-14') # -4
select datediff('2021-10-14 00:00:00','2021-10-10 23:59:59') # 4

date_format：指定格式转换格式

# 将yyyy-MM-dd HH:mm:ss 转换为yyyy/MM/dd HH-mm-dd
select date_format('2021-10-10 11:12:11','yyyy/MM/dd HH-mm-ss')
# 2021/10/10 11-12-11

current_date：获取当前日期，年月日

# 获取当前日期
select current_date(); #2021-11-17

last_day：获取指定时间所在月的最后一天

# 获取当前时间所在月的最后一天
 select last_day(current_date()) # 2021-11-30

from_unixstamp：根据时间戳获取指定格式的时间

# 从时间戳
 select from_unixtime(unix_timestamp('2021-11-17 10:11:12'),'yyyy/MM/dd HH-mm-ss') # 2021/11/17 10-11-12

unix_timestamp

# 获取指定时间的时间戳
select unix_timestamp('2021-11-17 10:10-10','yyyy-MM-dd HH:mm-ss') ; # 1637143810

weekofyear

select weekofyear('2021-10-10') # 40

SeafyLiang

关注

2
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录