Study notes on Programming Hive (《Hive编程指南》), part 1: filling in the basics

Article link: https://blog.csdn.net/qq_41059320/article/details/93507245

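0. Preface

I have never studied Hive very systematically, so it is time to do that properly.

This series is aimed at people who already know the basics of Hive. It skips the common knowledge and records the things I had not paid attention to, or that deserve a deeper look.

Reference: 《Hive编程指南》 (Programming Hive)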

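1. How Hive stores data on HDFS

Everyone knows that Hive is a data warehouse built on top of Hadoop and that its data lives on HDFS. But what does that storage actually look like? I took a look today.

Put simply, it is still just directories and files on HDFS. If you have configured Apache Hadoop before, you know it has several modes: standalone (the default), pseudo-distributed, and fully distributed, all configured through property files (core-site.xml, hdfs-site.xml). Hive, being built on Hadoop, works the same way: the storage location is set by a property, hive.metastore.warehouse.dir, whose default values are shown in the table below: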

Mode	Default path
Standalone (local)	file:///user/hive/warehouse
(Pseudo-)distributed	hdfs://namenode_server/user/hive/warehouse

(Source: 《Hive编程指南》)

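In Hive, a database is essentially just a catalog or namespace of tables. On large clusters with many teams and users, this is a good way to avoid table name collisions.

Hive creates one directory per database, and the tables in that database are stored as subdirectories of it. The database directory itself sits under the top-level directory given by hive.metastore.warehouse.dir.

Let's map this out, assuming the default directory is /user/hive/warehouse.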

Creating a database creates the directory /user/hive/warehouse/mydb.db (yes, a directory whose name ends in .db):
1. create database mydb;


Creating a table then creates the directory /user/hive/warehouse/mydb.db/mytable:
2. use mydb; create table mytable (...);  -- full column list in step 3


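When partitions are created, Hive creates nested subdirectories, one level per partition column, in the order the columns are declared:
/user/.../mytable/region=1/class=A
/user/.../mytable/region=1/class=B
...
/user/.../mytable/region=26/class=Z
3. create table mytable (
name string,
age int
)
-- region and class are declared only here, not in the column list above
partitioned by (region string, class string);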


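When data is inserted, files are written under the corresponding partition directory. I have not yet looked at what those files look like in detail.
4. insert into table mytable partition (region = 1, class = 'A') ...

So there you have it: a partition is just another level of subdirectory. A query that filters on the partition columns can go straight to the matching directories instead of scanning every file, which can improve query performance considerably.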
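
To make the layout concrete, here is a small Python sketch (not Hive code) that builds the directory for a partition the way the walkthrough above describes it, and then picks out the directories a query on one region would need to read. The function names partition_dir and prune are made up for illustration, and the sketch assumes the .db suffix applies to the database being used.

```python
def partition_dir(warehouse, db, table, spec):
    """Return <warehouse>/<db>.db/<table>/<col>=<val>/... for one partition.

    spec is an ordered list of (partition_column, value) pairs.
    """
    levels = [f"{col}={val}" for col, val in spec]
    return "/".join([warehouse.rstrip("/"), f"{db}.db", table] + levels)


def prune(dirs, column, value):
    """Keep only the partition directories that have a column=value level."""
    wanted = f"{column}={value}"
    return [d for d in dirs if wanted in d.split("/")]


warehouse = "/user/hive/warehouse"
all_dirs = [
    partition_dir(warehouse, "mydb", "mytable", [("region", r), ("class", c)])
    for r in (1, 2)
    for c in ("A", "B")
]
region_1 = prune(all_dirs, "region", 1)

print(all_dirs[0])    # /user/hive/warehouse/mydb.db/mytable/region=1/class=A
print(len(region_1))  # 2 of the 4 directories need to be read
```

Filtering on region leaves half of the directories untouched, which is the whole point of partitioning.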

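2. Simple queries (select *) skip MapReduce

Reference
hive ：简单查询不走mapreduce (Hive: simple queries do not run MapReduce)

Anyone who has studied Hive has probably heard that Hive compiles SQL-like statements into MapReduce jobs. There is one exception: for the statements below, Hive skips MapReduce for efficiency, reads the files under the table's storage directory directly, and prints the formatted rows to the console.

1. Selecting everything
select  * from table [limit xx] 

2. When the fields in the where clause are partition columns
select  * from table where field = xx [limit xx]

3. Whether or not limit is used makes no difference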

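Two settings are worth knowing about:

1. hive.fetch.conversion controls which queries skip MapReduce. With more, simple queries (column selection, filters on any column, limit) are answered by a direct fetch; with minimal, only select *, filters on partition columns, and limit are; with none, every query runs as a job.
set hive.fetch.conversion = more;

2. With this enabled, Hive tries to run other, sufficiently small jobs in local mode. Setting it to true is recommended.
set hive.exec.mode.local.auto = true;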

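3. Hive did not originally support row-level insert, update, delete

I mentioned this in an earlier weekly report: Hive originally had no row-level versions of these operations, and changes could only be made to a whole table at once. According to the official documentation they have been supported since Hive 0.14, but the feature is off by default and comes with conditions.

References
Official documentation on insert, update, delete
Hive实现update和delete (Implementing update and delete in Hive)

In other words, row-level operations require the table to support ACID, and for that all of the following must hold:

The table must be stored as ORC;
The table must be bucketed;
The table property transactional must be set to true;
Some configuration is needed on both the client and the server side.

I won't go into more detail for now; see the references above.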

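4. How joins handle NULL values

In standard ANSI SQL, any operation on NULL (numeric comparison, string concatenation, and so on) yields NULL, and Hive mostly behaves the same way. Joins are where it gets interesting.

It is sometimes said that Hive treats null = null as true when NULL shows up in a join key. As far as I can tell that is not how the ordinary = operator behaves: a NULL key matches nothing, and Hive has a separate null-safe operator, <=>, for cases where NULLs should match each other. What NULL keys do affect is execution: in a shuffle join, rows are sent to reducers by join key, so all rows with a NULL key can land on the same reducer.

An example:

1. NULLs not filtered
	select t1.uid, sum(t2.money)
	  from t1
	  join t2
	    on t1.uid = t2.uid
  group by t1.uid

If both tables contain many rows with uid = null, those rows do not join with each other, but they can still pile up on a single reducer. This is the familiar data skew caused by too many NULL join keys, and it hurts most in outer joins, where the NULL-key rows also have to be emitted.

To keep these rows out of the join altogether, rewrite the query to filter NULLs explicitly: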

	select t1.uid, sum(t2.money)
	  from t1
	  join t2
	    on (t1.uid = t2.uid and t1.uid is not null and t2.uid is not null)
  group by t1.uid
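
The skew is easy to picture with a toy partitioner. The Python sketch below is not how Hive assigns rows to reducers; it only assumes that all NULL keys go to one reducer and that other keys are spread by a hash, and it uses a made-up reducer count of 4.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 4  # hypothetical cluster setting


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Toy partitioner: every NULL (None) key goes to reducer 0, others by CRC32."""
    if key is None:
        return 0
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def rows_per_reducer(keys):
    """Count how many rows each reducer receives."""
    return Counter(reducer_for(k) for k in keys)


# 1000 rows with a NULL uid plus 100 rows with real uids.
uids = [None] * 1000 + list(range(100))

skewed = rows_per_reducer(uids)
filtered = rows_per_reducer(k for k in uids if k is not None)

print(sorted(skewed.items()))    # reducer 0 carries at least 1000 rows
print(sorted(filtered.items()))  # only the 100 real rows are left to shuffle
```

Filtering the NULL keys before the join removes the hot reducer, which is what the rewritten query above does with is not null.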

5. Two syntaxes for inserting data from a select

When inserting data with insert, you will often see this form:

insert overwrite table mytable
partition(region=2, class='B')
select * from t1 where t1.region = 2 and t1.class = 'B'

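This works fine. But what if I need to fill several partitions, say 10? Do I run this statement 10 times? That means 10 from operations, that is, 10 scans of t1, and if the table is large this gets very slow.

For this case Hive offers another insert syntax that scans the input once and splits it in several ways.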

from t1
insert overwrite table mytable
	partition(region=2, class='B')
	select * where t1.region = 2 and t1.class = 'B'
insert overwrite table mytable
	partition(region=3, class='C')
	select * where t1.region = 3 and t1.class = 'C';
...

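Here t1 is scanned only once (there is a single from), and each row is checked against the conditions to decide where it is inserted.

Note that this is not an if ... else ... structure: every row of the table is tested against every condition and is inserted wherever it matches. So a row of the source table may be written to several partitions of the target table, or to none at all.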
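
The "every row against every condition" behaviour can be sketched in a few lines of Python. This only models the semantics described above, not Hive's execution; the function multi_insert, the sample rows, and the third target all_class_B are invented for the illustration.

```python
def multi_insert(rows, targets):
    """Scan rows once and append each row to every target whose predicate matches."""
    out = {name: [] for name, _ in targets}
    for row in rows:                    # a single pass over the source
        for name, predicate in targets:  # every condition is checked, no else branch
            if predicate(row):
                out[name].append(row)
    return out


rows = [
    {"name": "a", "region": 2, "class": "B"},
    {"name": "b", "region": 3, "class": "C"},
    {"name": "c", "region": 9, "class": "Z"},
]
targets = [
    ("region=2/class=B", lambda r: r["region"] == 2 and r["class"] == "B"),
    ("region=3/class=C", lambda r: r["region"] == 3 and r["class"] == "C"),
    ("all_class_B", lambda r: r["class"] == "B"),
]

result = multi_insert(rows, targets)
for name, inserted in result.items():
    print(name, [r["name"] for r in inserted])
```

Row a matches two conditions and is written twice, while row c matches none and is written nowhere.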

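6. Combining static and dynamic partitions

Dynamic partitions
When inserting into a partitioned table, the usual approach is the one in section 5. But if there are many partitions and I want to overwrite every one of them, that takes a lot of repetitive SQL. Dynamic partitioning helps here: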

insert overwrite table mytable
partition(region, class)
select ...,t1.reg, t1.cla from t1

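Hive now derives the partitions from the last two columns automatically. Note that dynamic partition columns are matched by position, not by name. To see what that means, change the code above slightly.

insert overwrite table mytable
partition(region, class)
select reg, cla, id, name, age from t1

Now the last two columns of the select (name, age) are used as the partition values and matched against partition(region, class): name fills region and age fills class. Names play no role; only the last positions of the select list matter.

A fairly detailed reference post: HIVE动态分区实战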
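
A tiny Python model of the positional rule, in case it helps. It is only a sketch of the matching described above; split_dynamic_partition and the sample row are made up.

```python
def split_dynamic_partition(select_columns, row, partition_columns):
    """Split one selected row into table data and partition values.

    The last len(partition_columns) values of the row become the partition
    values, purely by position; the select column names are ignored for them.
    """
    n = len(partition_columns)
    data = dict(zip(select_columns[:-n], row[:-n]))
    partition = dict(zip(partition_columns, row[-n:]))
    return data, partition


row = (2, "B", 7, "Tom", 18)

# Wrong order: reg and cla come first, so name and age end up as partition values.
wrong_data, wrong_partition = split_dynamic_partition(
    ["reg", "cla", "id", "name", "age"], row, ["region", "class"]
)
print(wrong_partition)  # {'region': 'Tom', 'class': 18}

# Right order: the partition source columns are listed last.
right_data, right_partition = split_dynamic_partition(
    ["id", "name", "age", "reg", "cla"], (7, "Tom", 18, 2, "B"), ["region", "class"]
)
print(right_partition)  # {'region': 2, 'class': 'B'}
```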

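Combining static and dynamic partitions
What if I only want to overwrite all the classes within region = 2? I still pin the partition region = 2 statically and let class be dynamic. That is all the combination amounts to.

insert overwrite table mytable
partition(region = 2, class)
select ..., t1.class from t1
where t1.region = 2

Two things to note. Static partition keys must come before dynamic ones, which makes sense given how the directories nest. And the select list supplies only the dynamic partition column (class) at the end, since region is already fixed in the partition clause.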

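7. Pitfalls when comparing floating-point numbers

Here is a picture of my notes for now; I will go through the details later.

[Image: photo of notes]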

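Prefer double for floating-point columns. Where a column is already float, cast the literal in comparisons to float as well, for example cast(0.2 as float). To convert a floating-point value to an integer, prefer round() or floor() over cast.
Avoid floating-point types altogether for anything involving money.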
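
The underlying problem is not specific to Hive, so it can be reproduced in plain Python. The sketch below emulates a 32-bit float column with struct; to_float32 is a helper written for this example, and the last line assumes that a plain cast to an integer truncates toward zero.

```python
import math
import struct
from decimal import Decimal


def to_float32(x):
    """Round a Python float (a 64-bit double) to 32-bit precision and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


stored = to_float32(0.2)  # what a 32-bit float column would hold for 0.2
literal = 0.2             # the same literal read as a double

print(stored > literal)               # True: the widened float is slightly above 0.2
print(stored == to_float32(literal))  # True: compare at float precision instead

# Money: binary floating point drifts, exact decimals do not.
print(0.1 + 0.2 == 0.3)                                      # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True

# Float to integer: truncation versus floor and round.
print(int(-2.7), math.floor(-2.7), round(-2.7))  # -2 -3 -3
```

So a filter like "column > 0.2" on a float column can return rows that hold exactly 0.2, which is why the literal should be cast to float too.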

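8. left semi join and cross join

I. left semi join

Hive's support for in/exists subqueries is limited. Versions before 0.13 do not accept them in the where clause at all, and on the version I used a query like the following fails:

select * 
  from tmp_fltdb.tmp_from
 where orderid in (select orderid from tmp_fltdb.tmp_back)
with the error:
FAILED: SemanticException [Error 10249]: Line 2:6 Unsupported SubQuery Expression 'orderid': Correlating expression cannot contain unqualified column references.

The message complains about the unqualified column reference, so on versions that do support subqueries, qualifying orderid with a table alias may be enough. Either way, left semi join gives the same result in Hive: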

	select * 
	  from tmp_fltdb.tmp_from t1
left semi join tmp_fltdb.tmp_back t2
	 	on t1.orderid = t2.orderid


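2. Or use a left join with is not null. Unlike left semi join, this returns a tmp_from row once for every matching row in tmp_back, so the two are equivalent only when orderid is unique in tmp_back.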
select t1.orderid 
	  from tmp_fltdb.tmp_from t1
 left join tmp_fltdb.tmp_back t2
		on t1.orderid = t2.orderid
	 where t2.orderid is not null

not in can be rewritten the same way, as a left join that keeps only the unmatched rows:

	select t1.orderid 
	  from tmp_fltdb.tmp_from t1
 left join tmp_fltdb.tmp_back t2
		on t1.orderid = t2.orderid
	 where t2.orderid is null
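
The difference between these rewrites shows up as soon as the right-hand table has duplicate keys. The sketch below uses Python's built-in sqlite3 as a stand-in, not Hive: SQLite has no left semi join, so exists plays that role here, and the sample order ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tmp_from (orderid integer);
    create table tmp_back (orderid integer);
    insert into tmp_from values (1), (2), (3);
    insert into tmp_back values (1), (1), (2);  -- orderid 1 appears twice
""")


def query(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# Semi join semantics: each tmp_from row appears at most once.
semi = query("""
    select t1.orderid from tmp_from t1
     where exists (select 1 from tmp_back t2 where t2.orderid = t1.orderid)
""")

# left join + is not null: one output row per matching tmp_back row.
left_not_null = query("""
    select t1.orderid from tmp_from t1
      left join tmp_back t2 on t1.orderid = t2.orderid
     where t2.orderid is not null
""")

# left join + is null: the "not in" style anti join.
anti = query("""
    select t1.orderid from tmp_from t1
      left join tmp_back t2 on t1.orderid = t2.orderid
     where t2.orderid is null
""")

print(semi)           # [1, 2]
print(left_not_null)  # [1, 1, 2]
print(anti)           # [3]
```

If duplicates in tmp_back are possible, the left join version needs a distinct (or a deduplicated subquery) to match the semi join result.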

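II. cross join

This is the Cartesian product: a cross join of an n-row table with an m-row table yields n*m rows, and no join key is needed (every row is paired with every row anyway). In strict mode Hive rejects Cartesian products, hence the set hive.mapred.mode=nonstrict below.

1. cross join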
set hive.mapred.mode=nonstrict;
	select * 
	  from tmp_fltdb.tmp_from t1
cross join tmp_fltdb.tmp_back t2

2. Same result as above
set hive.mapred.mode=nonstrict;
	select * 
	  from tmp_fltdb.tmp_from t1
, tmp_fltdb.tmp_back t2
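
As a quick sanity check on the n*m claim, here is the same pairing done with itertools.product in Python. The two small lists are invented stand-ins for the tables.

```python
from itertools import product

tmp_from = [("f1",), ("f2",), ("f3",)]  # n = 3 rows
tmp_back = [("b1",), ("b2",)]           # m = 2 rows

# Pair every row of the first table with every row of the second.
crossed = [left + right for left, right in product(tmp_from, tmp_back)]

print(len(crossed))  # 6, that is n * m
print(crossed[0])    # ('f1', 'b1')
```

The row count grows multiplicatively, which is why a cross join on two large tables is something to run deliberately.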

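An additional reference post:
[一起学Hive]之十一-Hive中Join的类型和用法 (Join types and usage in Hive)

In the Hive versions that post covers, joins support only equality conditions, and or is not allowed in the join condition. Newer releases (2.2.0 and later, as far as I know) relax this.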

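9. "Using cluster by takes away the parallelism of sort by, yet the data in the output files can be totally ordered."

I have never understood this sentence and remain skeptical about it, even though it is quoted from 《Hive编程指南》. The book does not explain why parallelism would be lost, and a web search turns up only the same sentence with no further explanation. The part about achieving a total order I can follow. For what it is worth, the English original describes these clauses as a way to "exploit the parallelism of SORT BY", so the Chinese wording may simply be a mistranslation.

I ran a few tests myself:

1. Unordered, not much use on its own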
select * from dim_fltdb.dimcity
where countrycode = 'CN'
distribute by provinceid

2. Globally sorted by cityid
select * from dim_fltdb.dimcity
where countrycode = 'CN' sort by cityid

3. Globally sorted by cityid
select * from dim_fltdb.dimcity
where countrycode = 'CN'
distribute by provinceid sort by cityid

4. Globally sorted by provinceid
select * from dim_fltdb.dimcity
where countrycode = 'CN'
distribute by provinceid sort by provinceid

5. Same as above, globally sorted by provinceid
select cityid,provinceid from dim_fltdb.dimcity
where countrycode = 'CN'
cluster by provinceid

6. Sorted locally, unordered globally
select cityid,provinceid from dim_fltdb.dimcity
where countrycode = 'CN'
distribute by provinceid sort by cityid desc
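
One possible reading of these results is that whether the output looks globally sorted depends on how many reducers the query ran with, which the tests above do not record. The Python sketch below is a toy model of distribute by and sort by under that assumption; run_query, the modulo "hash", and the sample rows are all invented for the illustration.

```python
def run_query(rows, distribute_by=None, sort_by=None, num_reducers=1):
    """Return one list of rows per reducer, that is, one list per output file."""
    files = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        if distribute_by is None:
            slot = i % num_reducers                   # no distribute by: arbitrary spread
        else:
            slot = row[distribute_by] % num_reducers  # toy hash for integer keys
        files[slot].append(row)
    if sort_by is not None:
        for f in files:
            f.sort(key=lambda r: r[sort_by])          # sort by orders each reducer separately
    return files


def is_sorted(rows, key):
    values = [r[key] for r in rows]
    return values == sorted(values)


def concat(files):
    """Read the output files one after another, as a client would."""
    return [row for f in files for row in f]


rows = [
    {"cityid": 5, "provinceid": 2},
    {"cityid": 1, "provinceid": 1},
    {"cityid": 4, "provinceid": 1},
    {"cityid": 2, "provinceid": 2},
    {"cityid": 3, "provinceid": 3},
    {"cityid": 6, "provinceid": 4},
]

# distribute by provinceid sort by cityid, with one and with two reducers
one = run_query(rows, "provinceid", "cityid", num_reducers=1)
two = run_query(rows, "provinceid", "cityid", num_reducers=2)
# cluster by provinceid is shorthand for distribute by + sort by on the same column
clustered = run_query(rows, "provinceid", "provinceid", num_reducers=2)

print(is_sorted(concat(one), "cityid"))          # True: a single reducer sorts everything
print([is_sorted(f, "cityid") for f in two])     # [True, True]: each file is sorted
print(is_sorted(concat(two), "cityid"))          # False: the files together are not
print([[r["provinceid"] for r in f] for f in clustered])  # [[2, 2, 4], [1, 1, 3]]
```

With a single reducer every variant looks globally sorted; with several, only each output file is sorted, and cluster by additionally keeps rows with the same provinceid together in one file.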