扁担学习hive

最新推荐文章于 2024-11-05 15:26:40 发布

扁担来喽

最新推荐文章于 2024-11-05 15:26:40 发布

阅读量353

点赞数 5

文章标签：学习 hive

本文链接：https://blog.csdn.net/hxhxhzjz/article/details/135691710

版权

扁担学hive

hive

hive 并不是数据库，是一种 用于解决 海量的 结构化的 数据统计工具
是一种基于hadoop的一个数仓工具，可以将结构化数据文件映射为一张表，并提供类SQL查询功能
本质上是
	1、将hive SQL 转为 mr或spark程序(看底层用的是那种引擎) 
	2、将程序提交到yarn上 
	3、对存储在hdfs上的文件进行处理

hive的执行流程

1、客户端通过driver连接到hive服务，将SQL提交
2、hive访问元数据库mate store(元数据库中存储着元数据信息：表名、表所属的数据库、表的拥有者、列/分区字段、表的类型(内部表还是外部表)、表的数据所在的目录等)
元数据信息默认存储在hive自带的Derby数据库中，一般使用mysql进行存储
3、hive服务将SQL进行处理
	1) 通过SQL解析器，解析SQL,将SQL字符串抽象为语法树ast,对ast进行语法分析
		eg：表、字段是否存在，SQL语法是否有误
	2) 通过SQL编译器 对ast编译生成逻辑执行计划
		from -> join on -> where -> group by ->聚合函数 -> having -> select -> distinct ->
			order by -> limit
	3) 通过SQL优化器 对执行计划进行优化
	4) 通过SQL执行器 将逻辑执行计划转为可执行的物理计划，如hive SQL 转为可行行的 mr或spark 
4、将生成的可执行计划提交到yarn上
5、yarn通过调度资源执行计划

hive的数据类型

基本数据类型

hive数据类型			java数据类型		
tinyint                  byte
smalint				     sort
int						 int
bigint					 long
float      				 float
double					 double
string					 String
boolean       			 boolean
timestamp                时间戳类型
date                     日期类型
binary                   字节数组

隐式类型转换
tinyint/smalint/int/bigint/string可以隐式转换为 double 类型 
强制类型转换
cast(数据/字段 as 数据类型)

复杂数据类型

数据类型    comment               	
array    有序的同类型集合          
map       k-v,k必须为基本类型v可以是任意类型 
struct     字段集合，类型可以不同

集合数据类型在创建表时：
	create table if not exists table_name(
		arr array<string>, # 数组格式
		map map<string,string>, # map 格式
		struct struct<列1:string,列2:string> # struck格式
	)row format delimited fields terminated by ','
	collection items terminated by '_'
	# 集合(数组、集合、struct) 多个元素之间以 '_'分割，所有的该类型的数据格式都必须以此
	map keys terminated by ':'
	# 指明 map 中的k v 是以':' 进行分割
	lines terminated by '\n'
	# 行数是以换行符进行分割

获取复杂类型中属性的方法
	数组: 使用索引 字段名[index]
	map: 使用key值 字段名[key]
	struct: 使用 字段名.属性值 和c++的结构体很像

DDL

创建数据库 location 是在hdfs上的位置
	create database if not exists 数据库名 location '';
删除数据库 
	drop database if exists 数据库名;

# 创建表
	create [external] table if not exists 表名(
		字段名 字段类型 comment ''
	)
	partition by # 分区字段
	clustered by # 分桶字段
	sorted by    # 不常用,对桶中的一个或多个列另外排序
	row format delimited fileds terminted by '\t' # 行的分隔符
	stored by # 文件在hdfs上的存储格式
	location hdfs_path # 指定表在hdfs上的存储位置 默认当前库下
	tblproperties (property_name=property_value, ...) 
	as select_statement # 建表是加载数据通过查询语句

# 查看表的具体信息
desc formatted 表名;

external
	该字段修饰的表为 外部表
		内部表：管理表 
			hive认为内部表归hive管理，拥有表对应数据文件的管理权限
			当删除管理表时，hdfs中的数据和mysql中的元数据都被删除 # 可以控制表的生命周期
			
		外部表 
			hive认为外部表归hdfs，hive不拥有表对应数据文件的管理权限
			在删除管理表时，hdfs中的数据不会被删除，mysql中的元数据会被删除 # 不能控制表的生命周期
内部表外部表相互转换
	alter table student2 set tblproperties('EXTERNAL'='TRUE');
											# TRUE 外部表
											# FALSE 内部表

DML

数据导入

1、load data [local] inpath '数据路径' [overwrite] into table 表名 [partiiton 分区字段=v1...]

local: 使用 写linux本地路径
	   为使用 写 hdfs路径
overwrite: 覆盖写

eg:
	本地linux路径
	load data local inpath '/opt/hzy/hive' into table stu;
	hdfs路径
	load data inpath '/hzy/stu1.txt' into table stu;

通过查询语句向表添加数据

1、 insert into table 表名 select ... from 表名; # 追加 不覆盖 原数据不会丢失
2、 insert overwrite table 表名 select ... from 表名; # 覆盖
3、 insert into table 表名 select ... from 表名 partition (分区字段=v); # 分区插入

创建表时

建表是通过查询语句加载数据
	建表语句 + as + 查询语句
	
建表是使用 location的方式
	建表语句 + location + 'hdfs路径'
	只能是hdfs路径，路径为目录时，表示加载该文件下的所有文件

import方式

使用前提：必须使用export的方式导出以后(导出了元数据和真实数据)， 在使用import进行导入
	eg:
	import table 表名 form '/user/hive/warehouse/export/表名'

数据导出的方式

insert 
	将查询结果导出到本地 会覆盖本地路径下所有的文件
	insert overwrite local directory 'local_path' select ... from 表名;
	将查询的结果结构化导出到本地
	insert overwrite local directory 'local_path' 
	row format delimited fields terminated ','
	select ... from 表名;
    将查询的结果导出到hdfs上(去掉local 路径改为 hdfs的路径 没有路径会创建)
    insert overwrite directory 'hdfs_path'
    row formated delimited fields terminated by ','
    select ... from 表名;

hadoop 命令导出到本地
	hdfs dfs -get /user/hive/warehouse/study.db/student/stu1.txt /opt/hzy/hive/

hive 的shell命令
	hive -e/-f "查询语句(库.表);..."/(sql脚本或文件名) > 重定向文件
	eg：
		hive -e "select * from stydy.stu;select * from study.people1;..." > /opt/hzy/hive

export导出到hdfs(只能是hdfs路径) 会产生两个文件 data 数据包 _metadata 元数据包
	export table 库名.表名 to '/hzy/hive'
export 和 import 用于不同hadoop集群的数据迁移
在对export 导出的数据进行 import的时候 ，可以只导入 data数据包 可以重新指定分隔符

查询

like 和 Rlike

1） like
% : 代表零个或者是多个字符（即时任意字符）	
_ : 代表一个字符；
\ :  转义字符；

2） Rlike ：后面紧跟随正则表达式
 \ : 转义字符，即屏蔽特殊字符的含义：\$;
 ^ : 从头开始匹配，如：name  rlike ^a  : 表示以a开头的name
 $ : 匹配结尾 ，如  name Rlike t$ ：匹配以t结尾的name
 * ： 0-n 个 ，如 name rlike a* : 匹配 0-n a的name
[] : 表示范围，如 [0-9,a-z]:匹配0-9或者是a-z都可以。

排序

hive中有四种排序方式
1、order by
	全局排序，只能有一个reduce
	desc 降序  asc 升序默认
2、sort by
	类似在reduce中进行排序，所以一般需要多个reduce ，在每个reduce中进行排序，属于局部排序，而不是全局排序(局部有序) 
	场景： 当数据量很大时，不需要进行全局排序，只需进行局部排序
	一般不单独使用，因为无法控制什么样的数据进入到同一个reduce中
	一般配合distribute by 使用，分区排序就是指定什么样的数据会进入同一个reduce中
	单独使用时，进入同一个 reduce 任务中的数据是随机的 就是每次计算的结果是一样的，但是进入每一个reduce 中的数据是随机的
# 设置reducer的个数
set mapreduce.job.reduces1=3;
3、distribute by
类似在 MapReduce 中的自定义分区(partition)
	一般就是配合 sort by 使用
	在使用的时候，不能是一个reduce，需要多个reduce
 ###什么样的数据会进行同一个reduce 呢 	
首先，这个分区不是很智能，使用的方式是：分区的字段的 （ hashcode  % reduce的个数 ），计算值相等的，则进入同一个reduce；不会使用toString方式进行分区。
  	distribute by 必须写在sort by 的前面	
4、cluster by
	当distribute by 和 sort by 的字段相同时，可以使用Cluster by 进行替代

分区

hive的分区就是分目录
分区表对应的是一个hdfs文件系统的独立文件
实际上是把一个打的数据集根据业务的需求分割成多个小的数据集
在查询时，通过where语句根据条件筛选，指定数据在那个分区内，提高查询的效率
同时用于解决数据倾斜

静态分区

创建分区表

create table if not exists 表名(
	name string, 
	gender string
)partition by (id int,...) # 分区字段不能在 () 中出现，但可以作为查询字段
row format delimited fields terminated by ',';

加载方式

1、load 方式
load data local inpath '本地路径' into table 表名 partition (分区字段=分区值,...)


2、上传数据后修复 -- 因为单独上传数据到指定的目录下，hive是不能自动读取，需要进行数据的修复
第一步： 根据分区字段的信息，创建文件夹，此文件夹与表的路径相同
第二步： 本地的数据上传到指定的目录下，使用 【 hdfs dfs -put 本地数据路径 hdfs文件路径 】
第三步： 进行数据的修复 ，使用语句 【msck repair table 表名】

3、 上传数据后添加分区的方式 -- 该方式使用的情况最多
第一步和第二步与方式二完全相同；
第三步： 执行添加分区的方式
alter table 表名 add partition (字段1='***',字段2='***')

-- 实例：
第一步：hdfs dfs -mkdir -p  /user/hive/warehouse/dept_partition/month=2020-04/day=04 ;
第二步：hdfs dfs -put /opt/module/hive/datas/2020-04-04.logs /user/hive/warehouse/dept_partition/month=2020-04/day=04
第三步：
方式二： msck repair table dept_partition;
方式二： alter table dept_partition add partition (month='2020-04',day='04');

动态分区

在实际情况中，我们的数据来源一般是通过采集等一些方式收集过来之后，一把存储在hdfs上，或者落库，
我们只需要通过将源表数据通过一些SQL手段进行处理后导入到指定数据表，并指定按照某个字段进行分区

设置
# 将hive设置为非严格模式 在严格模式下，分区必须至少有一个是静态字段
set hive.exec.dynamic.partition.mode=nonstrict

eg：
	create table if not exists emp(
		name string,
		gender string
	)partition by(dept int)
	row format delimited fields terminated by ',';

insert into table emp partition (dept)
select 
	name,
	gender,
	dept
from emp1;

分桶

### 分桶表针对的数据文件，将同桶的数据写入一个文件
### 分区表针对的是数据路径，将同区的数据写入一个文件夹下
为什么分桶？
1、提供一个数据隔离和优化查询的方式，当一个表的量集特别大的时候，可以通过分桶将数据分成多个模块，在进行查询，可以提高查询的效率
2、在进行join的时候，数据量特别大，对两表的数据进行分桶，那这两个表进行join的次数会大大减少，提高两表的join的效率
3、分桶的规则
	分桶字段.hashcode % 桶的个数，取模相同的进入统一个桶中 (相当于 mr的reduce阶段的分区，将数据分别写入对应的桶中，因此不是用于spark(TEZ引擎))

eg：
	create table if not exists emp(
		name string,
		gender string,
		dept int
	)clustered by(dept,...) into 3 buckets # dept 等字段必须来自表中 3 buckets 代表数据分为 3桶
	row format delimited fields terminated by ',';

抽样查询

当数据量特别庞大的时候，想要了解数据，不需要通过查询全表数据来获取数据的情况
可以通过抽样查看
eg:
	select id,name,gender from emp tablesample (bucket x out of y on id);
	on:表示依据那个字段进行抽样
	y：表示将根据按照on后面的字段分为y份儿
	x：表示只要第x份
注：
	必须满足 x <= y
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

函数

常用函数

日期类

1、unix_timestap #返回当前或指定时间的时间戳
2、from_unixtime # 将时间戳转为日期格式
3、current_date # 当前日期
4、current_timestamp # 当前日期 + 时间
5、to_date # 获取日期部分
6、year/month/day/hour/minute/second # 获取年、月、日、小时、分钟、秒
7、weekofyear # 当前时间是一年中的第几周
8、dayofmonth # 当前时间是一个月中的第几天
9、months_between #连个日期间的月份
10、datediff # 两个日期相减的天数 前面减   后面的
11、add_months # 日期加减月
12、date_add # 日期加天
13、date_sub # 日期减天
14、last_day # 日期的当月的最后一天
15、date_format # 日期格式化  yyyyMMdd HH:mm:ss

取整函数

1、round # 四舍五入
2、ceil # 向上取整
3、floor # 向下取整

字符串函数

1、regexp_replace(str1,regex,str2) # 字符串替换函数 使用正则
	str1 需要操作的字符
	regex 正则表达式
	str2 需要被替换成的字符
2、length  # 字符串长度

判空函数

nvl(字段,默认值)  字段值为null 时用默认值替换

集合数据类型函数

1、size # 集合中元素个数
2、map_keys # 返回map中的key  返回值类型是数组  
3、map_values # 返回map中的value 返回值类型是数组
4、array_contains # 判断array中是否包含某个元素
5、sort_array # 将array中的元素排序

炸裂函数

explode 将集合中的元素炸开成多行数据
explode() 小括号中的字段必须是map或者array类型(split函数对字段进行切割之后的数据类型是数组)
需要和 lateral view (侧窗口函数)结合使用
eg:
lateral view
explode(students) student_view as student_name
[注]student_view 为 lateral view 形成的虚拟表的表名
	student_name 为 explode 炸裂出来的列的字段名


eg1: 308 doit46 ["hzy","wxc"]
	结果集 308 doit46 hzy
		  308 doit46 wxc
select 
	id,
	name,
	student_name
from student
lateral view
explode(students) student_view as student_name


eg2: 速度与激情 动作,惊悚,犯罪,悬疑
结果集： 速度与激情 动作
  	    速度与激情 惊悚
  	    速度与激情 犯罪
  	    速度与激情 悬疑
	
select name,
       c_name
from movie
lateral view
explode(split(category,',')) cate as c_name

行转列

将分组后的某个字段的所有数据 搜集到一行 ，是一个数组
collect_list  对收集后的数据不去重
collect_set   对收集后的数据去重
通常结合 concat_ws函数使用，对收集到的数据进行自定义拼接
eg：
	select constellation,
       blood_type,
       concat_ws('|',collect_list(name))
    from dep
    group by
        constellation,
        blood_type;

json类型的数据

解析json类型的数据 每个字段的顺序不能改变
json_tuple 将JSON按照需要的字段名匹配，存放到一个里面 搭配 lateral view 侧窗口函数型成一张虚拟表
lateral view
json_tuple(
	josn字段名,
	'需要的字段',
	...
) 别名
as 
	'需要字段的名字',
	... 和上面拆解顺序保持一致
	
		
获取单个json中的字段
get_json_object(string json_string, string path) 
eg: get_json_object(json,"$.想要获取的字段名")
	get_json_object(json,"$.id")
	
get_json_object支持的string_path
$ :root object
. :Child operator
[] : subscript operator for array
* : wildcard for []

不支持的string_path
空字符作为key
.. :
@ 
()
?()
[,]
[start.step]

josn_tuple
eg：有如下json

{"store":
  {"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
select get_json_object(json,'$.owner') from json;
		--> 'amy'

select get_json_object(json,'$.store.fruit\[0]') from json;
		--> {"weight":8,"type":"apple"}

 
hive> select get_json_object(json, '$.non_exist_key') from json;
		--> NULL

解析url

parse_url(string urlString, string partToExtract [, string keyToExtract])
返回值：string
返回url中指定的部分 partToExtract的有效值为
							HOST,# facebook.com
							PATH,# /path1/p.php
							QUERY, # k1=v1&k2=v2 
							REF, # Ref
							PROROCOL,
							AUTHORITY, # facebook.com
							FILE, # /path1/p.php?k1=v1&k2=v2
							USERINFO
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST');
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');
--> v1
##### parse_url_tuple
与json_tuple类似

SELECT b.*
FROM src 
LATERAL VIEW 
parse_url_tuple
(
	fullurl, 
	'HOST', 
	'PATH', 
	'QUERY', 
	'QUERY:id'
) b 
as 
	host, 
	path, 
	query, 
	query_id 
LIMIT 1;

usage example

explode(array)

select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;

炸裂array类型的数据，会将其炸为一列数据
eg:
A
B
C

explode(map)


select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;

炸裂map集合，会将map炸成两列 分别为key和对应的value
结果如下
key	value
A	10
B	20
c	30

posexplode(array)

select posexplode(array('A','B','C'));
select posexplode(array('A','B','C')) as (pos,val);
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;

posexplode(array) 会将array炸成两列 第一列是行编号从0开始即值在数组中的索引值，第二列式array的每个值
eg：
pos val
0	A
1	B
2	c

nline (array of structs)

拆解结构体数据

select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;

将结构体中的每个字段拆解到一行
eg：
col1	col2	col3
A		10		2015-01-01
B		20		2016-02-02