Hive Study Notes

Article link: https://blog.csdn.net/weixin_48412167/article/details/129135720

HIVE

Components

User interfaces: CLI, JDBC/ODBC, Web GUI
Metastore: stores the metadata
Driver: parser, compiler, optimizer, executor

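Hive characteristics

Hive uses HQL, a SQL-like query language.
Hive does not support record-level updates, inserts, or deletes on ordinary tables, because HDFS files cannot be modified in place.
Hive translates queries into MapReduce jobs, so even a very small query is comparatively slow because of the job start-up overhead.
Hive does not support transactions, so it is not suited to online transaction processing (OLTP).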

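Hive execution flow

Execution: every command and query goes to the Driver, which parses and compiles the input, optimizes the required computation, and turns it into MapReduce jobs. Hive drives its built-in, generic Mapper and Reducer modules through an XML file.
Communication: Hive talks to the JobTracker to launch MapReduce jobs.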

Hive and HBase

HBase provides features that Hive lacks, such as row-level updates and fast query response times.

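Hive metadata storage

Metastore: local mode usually uses the default embedded Derby database, while cluster setups usually use MySQL. The JDBC driver is not on Hive's classpath by default, so the driver jar has to be copied in.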

Hive command line

One-shot use: the CLI exits once the query finishes

$ hive -e "select * from mytable limit 3";
$ hive -S -e "select * from mytable limit 3";	# -S (silent) suppresses "OK" and similar messages

Run HQL from a file

$ hive -f /path/to/file/query.hql
# or, from inside the CLI
hive>source /path/to/file/query.hql;

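Using shell and hadoop commands inside the Hive CLI

hive>! pwd;	-- prefix shell commands with !
hive>dfs -ls;	-- dfs commands run directly, without the "hadoop" prefix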

Show column names in query output

set hive.cli.print.header=true;

Data types

Primitive types: tinyint, smallint, int, bigint, boolean, float, double, string, timestamp, binary.
Collection types: struct, map, array.

create table employee (
	name string,
	salary  float,
	subordinates array<string>,
	deductions map<string,float>,
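	address struct<street:string,city:string,state:string,zip:int>
)
row format delimited	-- format settings (everything before "stored as")
fields terminated by '\001'	-- column delimiter
collection items terminated by '\002'	-- delimiter between collection elements (map, struct, array)
map keys terminated by '\003'	-- delimiter between a map key and its value
lines terminated by '\n'	-- only \n is currently supported as the line terminator
stored as textfile;	-- textfile is the default storage format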
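
To make the three delimiter levels concrete, here is a minimal plain-Python sketch (not Hive code) that splits one text row of the employee table the way the DDL above describes. The sample row and the helper name parse_employee are made up for illustration.

```python
# Default Hive text-file delimiters, as named in the DDL above.
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"

def parse_employee(line):
    """Split one text row into the five columns of the employee table."""
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD)
    subordinates = subs.split(ITEM) if subs else []
    deductions = {}
    for pair in (deds.split(ITEM) if deds else []):
        key, value = pair.split(KEY)          # map key and value
        deductions[key] = float(value)
    street, city, state, zip_code = addr.split(ITEM)  # struct fields use the item delimiter
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subordinates,
        "deductions": deductions,
        "address": {"street": street, "city": city, "state": state, "zip": int(zip_code)},
    }

# Hypothetical sample row written with the same delimiters.
row = FIELD.join([
    "John Doe",
    "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + "0.2", "State Taxes" + KEY + "0.05"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
]) + "\n"
employee = parse_employee(row)
print(employee["subordinates"], employee["deductions"]["State Taxes"])
```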

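Schema

Traditional databases are usually schema on write: data is checked against the schema as it is written into the database.
Hive is schema on read: data is validated only when it is queried. If a field's length or format does not match the declared type, the query returns NULL for that field.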
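
The schema-on-read behaviour can be modelled in a few lines of plain Python. This is only a sketch of the idea, with None standing in for NULL. The column types and sample rows are assumptions, and real Hive type conversion has more rules than this.

```python
def read_as(raw, hive_type):
    """Convert a raw text field to the declared type; None stands in for NULL."""
    converters = {"int": int, "float": float, "string": str}
    try:
        return converters[hive_type](raw)
    except ValueError:
        return None  # a mismatch shows up as NULL at query time, not as a load error

declared = ["string", "float", "int"]                        # hypothetical column types
raw_rows = [["alice", "10.5", "42"], ["bob", "n/a", "4x"]]   # second row does not fit
rows = [[read_as(v, t) for v, t in zip(r, declared)] for r in raw_rows]
print(rows)
```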

Hive databases

Working with databases

hive>create database financials;	-- create a database
hive>create database if not exists financials;
hive>show databases;	-- list databases
hive>show databases like 'h.*';

hive>create database if not exists financials	-- store the database at a given location
	>location '/my/prefer/directory';
hive>create database financials 	-- add a description
	>comment 'holds all financial tables';
hive>describe database financials;	-- show the description

hive>create database financials	-- add properties
	>with dbproperties('creator'='Bob','date'='2012-01-02');
hive>describe database extended financials;	-- extended also shows the properties

hive>set hive.cli.print.current.db=true;	-- show the current database

hive>alter database financials set dbproperties('edited-by'='Maple');

-- Drop a database (a database that still contains tables cannot be dropped directly; add cascade to drop the tables as well)
hive>drop database if exists financials cascade;

Working with tables

hive>create table if not exists mydb.employee (
		name string comment 'employee name',
		salary  float comment 'employee salary',
		subordinates array<string> comment 'names of subordinates',
		deductions map<string,float>,
		address struct<street:string,city:string,state:string,zip:int>
)
comment 'description of the table'
tblproperties('creator'='Bob','date'='2012-01-02')
location '/my/prefer/directory';

-- Copy the schema of an existing table (the data is not copied); the copy needs its own name, here employee2
hive>create table if not exists mydb.employee2 like mydb.employee;

hive>describe formatted mydb.employee;

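Managed (internal) tables and external tables

Dropping a managed table deletes both the metadata and the data.
Dropping an external table deletes only the metadata; the data stays in place.

An external table can also be copied. If the source is a managed table and external is not specified, the copy is a managed table; if the source is an external table, the copy is external as well.
An external table is created with the location its data is read from. For a managed table, the specified path is where Hive writes the data.

create external table if not exists financial2 
like mydb.financials 
location '/path/to/data';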

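Partitioned tables

Partitioned tables: partitioning speeds up queries that filter on the partition columns. Each partition gets its own subdirectory under the table directory.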

create table if not exists mydb.employee (
		name string comment 'employee name',
		salary  float comment 'employee salary',
		subordinates array<string> comment 'names of subordinates',
		deductions map<string,float>,
		address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,state string);

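-- Strict and nonstrict mode for partitioned tables: in strict mode, a query on a partitioned table must filter on a partition in its where clause, otherwise the job is rejected
hive>set hive.mapred.mode=strict;
hive>set hive.mapred.mode=nonstrict;

-- Add a partition
hive>alter table log add partition(year=2012,month=1,day=2) 
	>location 'hdfs://master_server/data/log/2012/1/2';
	
-- Show partitions
hive>show partitions log;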

Altering tables

-- Rename
alter table log rename to logs;
-- Add, modify, and drop partitions
alter table log add if not exists 
partition(year=2012,month=1,day=2) location '/logs/2012/1/2'
partition(year=2012,month=1,day=3) location '/logs/2012/1/3'
partition(year=2012,month=1,day=4) location '/logs/2012/1/4';
-- Change a partition's location
alter table log partition(year=2011,month=12,day=4) 
set location 'hdfs://logs/2011/12/4';
-- Drop a partition
alter table log 
drop if exists partition(year=2011,month=12,day=2);

-- Change column information
alter table log 
change column hms hours_minutes_seconds int
comment 'timestamp'
after severity;	-- move the column to just after severity

-- Add columns
alter table log
add columns (
	app_name string comment 'Application name',
	session_id bigint comment 'The current session id'
);

-- Drop or replace columns
alter table log 
replace columns (
	...
);

-- Change table properties
alter table log
set tblproperties(
	...
);

-- Change storage properties
alter table log partition(year=2011,month=12,day=2)
set fileformat sequencefile;

Data operations

-- Loading data

-- Load with load data
load data local inpath '/path/employee'	-- with local: the file is copied from the local file system into the distributed file system. Without local: the file is moved within the distributed file system
overwrite into table employees	-- overwrite replaces existing data. Without it, the data is appended
partition(country='US',state='CA');	-- partition spec, omitted for non-partitioned tables

-- Load from a query
insert overwrite table employees partition (country='US',state='CA') -- use insert into instead of insert overwrite to append
select * from state_employees se
where se.cnty='US' and se.st='CA';

-- Load several partitions from one query (the source table is scanned only once)
from state_employees se
insert overwrite table employees partition (country='US',state='CA')
	select * where se.cnty='US' and se.st='CA'
insert overwrite table employees partition (country='US',state='OR')
	select * where se.cnty='US' and se.st='OR'
insert overwrite table employees partition (country='US',state='IL')
	select * where se.cnty='US' and se.st='IL';

-- Dynamic partition insert
insert overwrite table employees partition (country,state) -- partition values are matched by position
select ...,se.cnty,se.st from state_employees se; 	-- the last two select columns supply the partition values

-- Static and dynamic partitions can be mixed (static partition keys come first)
insert overwrite table employees partition (country='US',state) 
select ..., se.st from state_employees se
where se.cnty='US';
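
The positional rule for dynamic partitions can be illustrated with a small plain-Python model (not Hive code): the trailing select columns become the partition values, and each distinct combination ends up in its own partition directory. The function name and sample rows are hypothetical.

```python
from collections import defaultdict

def dynamic_partition_insert(rows, names=("country", "state")):
    """Group rows by their trailing partition columns, keyed by partition directory."""
    partitions = defaultdict(list)
    for row in rows:
        data, keys = row[:-len(names)], row[-len(names):]
        path = "/".join(f"{n}={k}" for n, k in zip(names, keys))
        partitions[path].append(data)
    return dict(partitions)

# Hypothetical rows: (name, salary, cnty, st); the last two feed the partition spec.
rows = [("Ann", 10.0, "US", "CA"), ("Bob", 20.0, "US", "OR"), ("Cy", 30.0, "US", "CA")]
result = dynamic_partition_insert(rows)
for path, data in sorted(result.items()):
    print(path, data)
```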

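-- Dynamic partitioning is off by default in older Hive releases
hive>set hive.exec.dynamic.partition=true; -- enable dynamic partitioning
hive>set hive.exec.dynamic.partition.mode=nonstrict; -- nonstrict mode: all partition columns may be dynamic
hive>set hive.exec.max.dynamic.partitions.pernode=100; -- maximum number of partitions each mapper/reducer may create (100 is the default)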

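-- Create a table and populate it from another table
create table ca_employee 
as 
select name,salary,address 
from employees se
where se.state='CA';

-- Exporting data
$ hadoop fs -cp source_path target_path	# copy the table's files directly

insert overwrite local directory '/tmp/employees'
select name,salary,address 
from employees se
where se.state='CA';

-- Aggregating at several granularities
group by os,device,city
grouping sets((os,device),(city),());	-- aggregate at each listed granularity and combine the results into one result set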
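
As a rough illustration of what grouping sets produces, the plain-Python sketch below counts rows once per grouping set and reports the columns outside the set as None (Hive shows NULL there). The sample rows are invented, and count(*) is assumed as the aggregate.

```python
from collections import Counter

def grouping_sets(rows, columns, sets):
    """Count rows per grouping set; columns outside the set are reported as None."""
    out = Counter()
    for row in rows:
        record = dict(zip(columns, row))
        for gset in sets:
            key = tuple(record[c] if c in gset else None for c in columns)
            out[key] += 1
    return out

columns = ("os", "device", "city")
rows = [("ios", "phone", "NYC"), ("ios", "tablet", "NYC"), ("android", "phone", "LA")]
result = grouping_sets(rows, columns, [("os", "device"), ("city",), ()])
for key, count in result.items():
    print(key, count)
```

The empty set () yields the grand total, which is why the combined result has one extra row with every column set to None.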

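Queries

Mostly the same as everyday SQL; only a few points are worth noting.

Type conversion function: cast(<expr> as <type>)
String concatenation function: concat(string s1, string s2)
A where clause cannot contain aggregate functions or refer to column aliases; a nested select works around both. Filtering also speeds up execution.
group by cannot refer to column aliases.
Floating-point comparison: float and double have different precision. The value 0.2 is stored as about 0.2000000030 in a float but as about 0.2000000000000000111 in a double, so comparing a float column with the double literal 0.2 can give surprising results. Cast the literal instead, for example cast(0.2 as float).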
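
The float versus double pitfall is not specific to Hive and can be reproduced in plain Python. The sketch below uses the struct module to round 0.2 to single precision, which is an assumption about how a FLOAT column would hold the value; the helper name as_float32 is made up.

```python
import struct

def as_float32(x):
    """Round a Python float (a double) to single precision and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

f = as_float32(0.2)   # what a FLOAT column would hold for 0.2
d = 0.2               # what the DOUBLE literal 0.2 holds
print(f"{f:.20f}")
print(f"{d:.20f}")
print(f > d)          # the single-precision value is slightly larger
```

This is why a filter such as "float_column > 0.2" can return rows whose displayed value is exactly 0.2.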

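Join optimization

Join order: arrange the joined tables so that their sizes increase from left to right.
left semi join: returns the rows of the left table that have a match in the right table. It is more efficient than an inner join because, for each left-table row, scanning of the right table stops as soon as one match is found. Only left-table columns can be selected.
Map-side join: when one of the joined tables is small, it can be held in memory and the join done in the mappers. Enable it with:
```
hive>set hive.auto.convert.join=true;
```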
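
The difference between an inner join and a left semi join is easiest to see when the right table has duplicate keys. The plain-Python sketch below is a model, not Hive's implementation; the tables are hypothetical, and the in-memory key set also hints at how a map-side join keeps the small table in memory.

```python
def inner_join(left, right, key):
    """Every matching pair is emitted, so duplicate keys on the right multiply rows."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def left_semi_join(left, right, key):
    """Emit each left row at most once if any right row matches."""
    right_keys = {r[key] for r in right}   # small table held in memory
    return [l for l in left if l[key] in right_keys]

# Hypothetical tables.
stocks = [{"symbol": "AAPL", "price": 1.0}, {"symbol": "IBM", "price": 2.0}]
dividends = [{"symbol": "AAPL", "amount": 0.1}, {"symbol": "AAPL", "amount": 0.2}]

joined = inner_join(stocks, dividends, "symbol")
semi = left_semi_join(stocks, dividends, "symbol")
print(len(joined))   # AAPL appears once per matching dividend row
print(semi)          # AAPL appears once, with left-table columns only
```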

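order by and sort by

Both are applied on the reducer side:

order by: total ordering of the whole result, so it runs long on large data.
sort by: ordering within each reducer only, so it runs comparatively fast.

sort by with distribute by

distribute by sends records with the same key to the same reducer.
group by and distribute by control how records are routed to reducers, while sort by controls how each reducer orders its records. distribute by must be written before sort by.
cluster by is equivalent to distribute by and sort by on the same columns.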
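
A small plain-Python model may help separate the two steps: distribute by decides which reducer a row goes to, and sort by orders rows inside each reducer. The CRC32 hash and the sample rows are stand-ins chosen for this sketch; Hive uses its own hash function.

```python
import zlib

def distribute_and_sort(rows, dist_key, sort_key, n_reducers):
    """Route rows to reducers by the distribute key, then sort inside each reducer."""
    reducers = [[] for _ in range(n_reducers)]
    for row in rows:
        slot = zlib.crc32(str(row[dist_key]).encode()) % n_reducers
        reducers[slot].append(row)
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]

rows = [
    {"symbol": "IBM", "ymd": "2010-01-03"},
    {"symbol": "AAPL", "ymd": "2010-01-02"},
    {"symbol": "IBM", "ymd": "2010-01-01"},
    {"symbol": "AAPL", "ymd": "2010-01-01"},
]
parts = distribute_and_sort(rows, "symbol", "ymd", n_reducers=2)
for i, part in enumerate(parts):
    print(i, part)
```

Each reducer's output is sorted, but concatenating the outputs does not give a globally sorted result, which is the difference from order by.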

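Sampling

TABLESAMPLE (BUCKET x OUT OF y [ON colname])  -- rows are hashed on colname into y buckets numbered 1 to y, and the rows in bucket x are returned
TABLESAMPLE (0.1 percent)  -- block sampling of roughly 0.1% of the input (write 10 percent for 10%). The smallest unit is an HDFS block, so a table smaller than one block is returned in full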
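
Bucket sampling can be sketched in plain Python as a hash followed by a modulo. Python's built-in hash is used here as a stand-in for Hive's hash function (for small integers it is the integer itself); the table of numbers is made up.

```python
def bucket_sample(rows, x, y, key):
    """Return rows whose hashed key lands in bucket x out of y (buckets start at 1)."""
    return [r for r in rows if hash(r[key]) % y + 1 == x]

numbers = [{"n": i} for i in range(1, 11)]
third_of_ten = bucket_sample(numbers, x=3, y=10, key="n")
first_of_two = bucket_sample(numbers, x=1, y=2, key="n")
print(third_of_ten)
print(first_of_two)
```

Because the bucket depends only on the column value, the same query returns the same sample every time, unlike sampling on rand().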

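Views

A view saves a query and lets it be treated like a table. (Views are read-only.)

Views reduce query complexity.
Views restrict data by condition: the view exposes only the chosen columns of a table.
Views avoid exposing every column of the underlying table.

-- Create a view
create view shorter_view as 
select * 
from person join cart
on cart.person=person.id
where firstname='maple';


create view shorter_view(time,part) as 
select cols["time"],cols["parts"] 
from person 
where cols["type"]='maple';
-- Drop a view
drop view if exists shorter_view;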

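Bucketing

-- When populating a bucketed table, the number of reducers must equal the number of buckets. This property makes Hive pick the right number of reducers.
hive>set hive.enforce.bucketing=true;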

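Performance tuning

Join optimization: order joins so that table sizes increase, and enable map-side joins for small data sets.

Local mode: for small inputs, local mode can be switched on temporarily, for example:

hive>set oldjobtracker=${hiveconf:mapred.job.tracker};	-- remember the current setting
hive>set mapred.job.tracker=local;	-- switch to local mode
hive>set mapred.tmp.dir=/home/edward/tmp;	-- temporary directory
hive>set mapred.job.tracker=${oldjobtracker};	-- restore the previous setting
Or let Hive decide automatically:
hive>set hive.exec.mode.local.auto=true;	-- Hive applies this optimization automatically

Parallel execution: stages that do not depend on each other can run in parallel. Enable it with set hive.exec.parallel=true;

Tuning the number of mappers and reducers:

hive>set mapred.reduce.tasks=3;	-- set the number of reducers
hive>set hive.exec.reducers.max=2;	-- cap the number of reducers a query may use so that it cannot block other jobs. Suggested value: (total reducers in the cluster * 1.5) / (number of queries running)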
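
The suggested cap is simple arithmetic, shown here as a plain-Python helper. The function name and the cluster figures are hypothetical, and rounding down with a floor of 1 is a choice made for this sketch.

```python
def suggested_reducers_max(total_reduce_slots, concurrent_queries):
    """(total reduce slots * 1.5) / number of queries typically running at once."""
    return max(1, int(total_reduce_slots * 1.5 / concurrent_queries))

# Hypothetical cluster: 40 reduce slots shared by about 6 concurrent queries.
cap = suggested_reducers_max(total_reduce_slots=40, concurrent_queries=6)
print(cap)
```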

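JVM reuse: set mapred.job.reuse.jvm.num.tasks to the number of tasks one JVM may run, which cuts JVM start-up cost. The drawback is that the reused JVMs hold their task slots until all tasks have finished.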

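Strict mode:

Turning on strict mode blocks three kinds of queries:

Queries on a partitioned table without a where clause that filters on a partition;
Queries that use order by without a limit;
Queries that produce a Cartesian product.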

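Compression

I/O-bound versus CPU-bound.
Hadoop jobs tend to be I/O-bound. Compression and decompression cost CPU time but reduce the amount of data read and written, so compression usually improves performance on Hadoop.

$ hive -e "set io.compression.codecs"	# list the configured codecs

Comparison of BZip2, GZip, Snappy, and LZO

Property	GZip	BZip2	Snappy	LZO
Compression/decompression speed	relatively slow	relatively slow	fast	fast
Compression ratio	high	highest	lower	lower
Splittable	no	yes	no	yes (once indexed)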
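
The ratio and speed trade-off can be tried out with Python's standard library, which ships gzip and bzip2 (Snappy and LZO need third-party packages, so they are left out). The sample data is made up and highly repetitive, so the printed numbers only illustrate how to measure, not how real tables behave.

```python
import bz2
import gzip
import time

# Hypothetical sample: repetitive log-like text, about a megabyte.
data = b"2012-01-02 12:00:00 INFO app_name=demo session_id=42 status=ok\n" * 20000

sizes = {}
for name, codec in (("gzip", gzip), ("bzip2", bz2)):
    start = time.perf_counter()
    packed = codec.compress(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert codec.decompress(packed) == data   # the round trip must be lossless
    sizes[name] = len(packed)
    print(f"{name}: {len(data) / len(packed):.0f}x smaller in {elapsed_ms:.1f} ms")
```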

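inputformat: defines how input is read and split
outputformat: defines how output is written

Sequence file: a sequence file divides a file into blocks and compresses the blocks in a way that keeps the file splittable. To use it, specify stored as sequencefile in the create table statement. Sequence files offer three compression types, NONE, RECORD (the default), and BLOCK, selected with mapred.output.compression.type.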

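Enabling intermediate compression

In the configuration file, set hive.exec.compress.intermediate to true to compress intermediate data. Hadoop's default codec is DefaultCodec. SnappyCodec is a good codec for intermediate files and can be set through mapred.map.output.compression.codec.

Enabling compression of final output

In the configuration file, set hive.exec.compress.output to true to compress query results. GZip gives a high compression ratio for results, but its files cannot be split by a subsequent MapReduce job, so choose according to how the output is used. The codec is set through mapred.output.compression.codec.

Enabling both

hive>set hive.exec.compress.intermediate=true;	-- enable intermediate compression
hive>set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;	-- codec for intermediate data
hive>set hive.exec.compress.output=true;	-- enable compression of final output
hive>set mapred.output.compression.type=BLOCK;	-- block compression for sequence files
hive>set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;	-- codec for final output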

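Hadoop Archive

HAR archives relieve pressure on the NameNode.
For a partitioned table that files have been loaded into, Hive can archive individual partitions with the following statements:

alter table ... archive partition	-- pack the partition's files into a Hadoop archive (HAR)
alter table ... unarchive partition	-- extract the files from the archive back into ordinary files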

Functions

hive>show functions;	-- list functions
hive>describe function concat;	-- short description of a function
hive>describe function extended concat;	-- detailed description of a function


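UDF:
Collapsing several rows into one delimited value (collect_set is an aggregate function)
concat(constellation, ",",blood_type) 
concat_ws('|',collect_set(t1.name)) 

UDTF:
Expanding one row into several rows
lateral view udtf(expression) tableAlias AS columnAlias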
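
Both directions can be modelled in plain Python to show the shape of the result. This is a sketch rather than Hive's behaviour in every detail: collect_set does not guarantee element order in Hive, and lateral view normally keeps the original array column, which is dropped here for brevity. All names and rows are made up.

```python
def collapse(rows, group_col, value_col, sep="|"):
    """Like concat_ws(sep, collect_set(value_col)) with group by group_col."""
    groups = {}
    for row in rows:
        values = groups.setdefault(row[group_col], [])
        if row[value_col] not in values:      # collect_set drops duplicates
            values.append(row[value_col])
    return {k: sep.join(v) for k, v in groups.items()}

def explode(rows, array_col, alias):
    """Like lateral view explode(array_col) t as alias: one output row per element."""
    return [{**{k: v for k, v in row.items() if k != array_col}, alias: item}
            for row in rows for item in row[array_col]]

people = [
    {"name": "Ann", "sign": "Leo"},
    {"name": "Bob", "sign": "Leo"},
    {"name": "Ann", "sign": "Leo"},
    {"name": "Cy", "sign": "Aries"},
]
collapsed = collapse(people, "sign", "name")
staff = [{"name": "Dee", "subordinates": ["Eve", "Fay"]}]
exploded = explode(staff, "subordinates", "sub")
print(collapsed)
print(exploded)
```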

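Custom functions:

Extend UDF, implement evaluate(), and package the class in a jar.
Extend GenericUDF to handle more complex logic that depends on the input.
A function registered this way is valid only in the current session unless the statements are added to .hiverc. To make it permanently available, the function can be added to the Hive source and Hive recompiled.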

Standard functions, UDF (one or more inputs, a single output)

Extending UDF

hive>add jar /full/path/to/zodiac.jar;	-- add the jar
hive>create temporary function zodiac 	
	>as 'org.apache.hadoop.hive.contrib.udf.example.UDFZodiacSign';	-- bind the function name to the implementing class
hive>drop temporary function if exists zodiac;	-- drop the function

Extending GenericUDF

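Aggregate functions, UDAF (zero or more inputs, a single output)

Extending UDAF
```
select avg(price) from stocks;
```
Extending AbstractGenericUDAFResolver (the generic counterpart for aggregate functions)

Table-generating functions, UDTF (zero or more inputs, multiple outputs)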

	select explode(array(1,2,3)) from dual;
	
	select name,sub
	from employees
	lateral view explode(split(subordinates,",")) subView as sub;