Hive bucketing:
Used to split data at a finer granularity than partitions.
Syntax:
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
Typical uses:
sampling queries
speeding up joins
Partitioning & bucketing can be combined:
buckets can be created inside each partition
or directly under a plain (non-partitioned) table
How bucketing is implemented:
By default, the bucket a row lands in is the hash of the bucketing column modulo the total number of buckets.
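The default assignment rule can be sketched in Python (a simulation only; for Hive's integral types the hash of a value is the value itself, so the bucket index is just the value modulo the bucket count; non-negative ids assumed):

```python
def bucket_of(key: int, num_buckets: int) -> int:
    # For Hive integral types, hash(v) == v, so the bucket
    # index is simply the value modulo the bucket count.
    return key % num_buckets

# ids 1..8 spread over 4 buckets (bucket files 0..3)
assignment = {i: bucket_of(i, 4) for i in range(1, 9)}
```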
Create a bucketed table:
create table if not exists buc1(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
A bucketed table must not be loaded with LOAD DATA: the statement succeeds, but it merely copies the file into the table directory as-is, so the data is never actually split into buckets.
load data local inpath '/home/hivedata/buc1' into table buc1;
-------create table
create table if not exists buc2(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
Load a bucketed table with insert ... select instead:
set hive.exec.mode.local.auto=true;
set hive.enforce.bucketing=true;
insert overwrite table buc2
select id,name,age from temp1
cluster by(id)
;
---implementing partitioning and bucketing together---------------------
set hive.exec.mode.local.auto=true;
create table pabuck3(
id int,
name string,
age int
)
partitioned by (year int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
---insert the data
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- required for the dynamic-partition insert below
insert into table pabuck3 partition(year)
select id,name,age,year from temp1
cluster by(id);
Partitioning + bucketing example:
the partition clause comes first, the bucket clause second:
partitioned by (...)
clustered by(id) into 8 buckets
----------------------------------------------
Note:
partitioning uses a pseudo-column outside the table schema; bucketing uses a real table column
bucketing manages data at a finer granularity and is mostly used for sampling and joins
----------------------------------------------
--Sampling query:
select * from <table_name> tablesample(bucket x out of y on id)
x must not be greater than y.
x: which bucket to start taking data from
y: usually a multiple or a factor of the total bucket count
In short: rows whose bucketing-column value modulo y, plus 1, equals x are returned.
Buckets: 001 002 003 004
e.g. id % 4 = 1, and 1 + 1 = 2, so the row goes into bucket 2; bucket 1 holds the rows whose modulus is 0
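The selection rule of `tablesample(bucket x out of y on id)` can be simulated with plain integers (assuming the bucketing column is an int, whose hash is the value itself):

```python
def sample_bucket(ids, x, y):
    # rows where (id mod y) + 1 == x are selected,
    # i.e. id % y == x - 1
    return [i for i in ids if i % y == x - 1]

ids = list(range(1, 13))
# with 4 buckets, "bucket 2 out of 4" keeps ids whose id % 4 == 1
```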
Query:
select * from buc2;
select * from buc2 tablesample(bucket 1 out of 1 on id);
Query a specific bucket:
select * from buc2 tablesample(bucket 1 out of 4 on id);
select * from buc2 tablesample(bucket 2 out of 4 on id);
select * from buc2 tablesample(bucket 1 out of 2 on id);
select * from buc2 tablesample(bucket 2 out of 2 on id);
select * from buc2 tablesample(bucket 1 out of 8 on id);
select * from buc2 tablesample(bucket 2 out of 3 on id);
select
*
from buc2
tablesample(bucket 1 out of 4 on id)
;
select
*
from buc2
tablesample(bucket 2 out of 3 on id)
where id > 5
;
Random sampling:
select * from buc1 order by rand() limit 3;
select * from buc1 tablesample(3 rows);
select * from buc1 tablesample(30 percent);
select * from buc1 tablesample(3M); -- size units: B\K\M\G
Query statement (clause order):
select
from
join
on
where
group by
[grouping sets/with cube/with rollup]
having
order by/sort by
limit
union/union all
;
Execution order:
FROM
<left_table>
ON
<join_condition>
<join_type>
JOIN
<right_table>
WHERE
<where_condition>
GROUP BY
<group_by_list>
HAVING
<having_condition>
SELECT
DISTINCT
<select_list>
ORDER BY
<order_by_condition>
LIMIT
<limit_number>
Try not to use subqueries; try to avoid in and not in.
Avoid join queries where possible, although in practice they can never be avoided entirely.
Always let the small table drive the large table (i.e. the small result set drives the large result set).
inner join
outer join
left join
right join
full join
Most commonly used:
left join
inner join
left join: the left table is the base table; unmatched right-table columns are filled with NULL
left join \ left semi join \ left outer join
Differences between left semi join and left join:
1. both are left-side joins, but with semi join a left row that has no match in the right table is dropped; left join keeps it
2. semi join can only select columns from the left table; left join can select from both sides
3. semi join is an optimization of left join
4. semi join is generally used for "exists" style queries
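The differences above can be checked on small in-memory lists (a sketch of the semantics, not Hive's execution): left join null-fills unmatched left rows and can duplicate a left row when it matches several right rows, while left semi join is an EXISTS check that returns each left row at most once and only its left-side columns.

```python
def left_join(left, right):
    # left/right: lists of (id, name); NULL-fill when no match
    out = []
    for lid, lname in left:
        matches = [r for r in right if r[0] == lid]
        if matches:
            out.extend((lid, lname, r[1]) for r in matches)
        else:
            out.append((lid, lname, None))
    return out

def left_semi_join(left, right):
    # EXISTS semantics: left columns only, one row per left row
    right_ids = {r[0] for r in right}
    return [(lid, lname) for lid, lname in left if lid in right_ids]

u1 = [(1, 'a'), (2, 'b'), (3, 'c')]
u2 = [(2, 'bb'), (3, 'cc'), (3, 'ccc')]
```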
Example:
Prepare the data:
1,a
2,b
3,c
4,d
7,y
8,u
2,bb
3,cc
7,yy
9,pp
create table if not exists u1(
id int,
name string
)
row format delimited fields terminated by ','
;
create table if not exists u2(
id int,
name string
)
row format delimited fields terminated by ','
;
load data local inpath '/home/hivedata/u1' into table u1;
load data local inpath '/home/hivedata/u2' into table u2;
Rows that exist in u1 but not in u2:
set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left join u2 u2
on u1.id = u2.id
where u2.id is null
;
set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left outer join u2 u2
on u1.id = u2.id
where u2.id is null
;
set hive.exec.mode.local.auto=true;
select
u1.id
from u1 u1
left semi join u2 u2
on u1.id = u2.id
;
right semi join is not supported in Hive; right join and right outer join are supported.
full outer join: returns the union of both sides
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
full outer join u2 u2
on u1.id = u2.id
;
inner join (either `join` or comma-separated tables in `from`):
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
inner join u2 u2
on u1.id = u2.id
;
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1, u2 u2
where u1.id = u2.id
;
Streaming hint:
Hive provides the STREAMTABLE hint: /*+STREAMTABLE(alias)*/ marks the table to be streamed through the join (normally the large one) while the other tables are buffered in memory; by default the last table in the query is streamed.
set hive.exec.mode.local.auto=true;
select
/*+STREAMTABLE(d)*/
d.name,
e.name
from u1 e
join u2 d
on d.id = e.id
;
Subqueries:
Support is limited; in particular `=` against a subquery is poorly supported.
select
e.*
from u1 e
where e.id = (
select
d.id
from u2 d
where d.id = 2
)
;
select
e.*
from u1 e
where e.id in (
select
d.id
from u2 d
limit 1
)
;
Differences between inner join and outer join:
a partition filter placed in the on condition of an outer join does not take effect; in an inner join it does
there is inner join but no "full inner join"; there is full outer join but no bare "outer join"
all joins support only equi-conditions (= combined with and); != , < , > , <> , >= , <= and or are not supported
Map-side join:
Enabled by default in Hive 1.x; the small-table size threshold can be adjusted.
set hive.auto.convert.join=true; -- whether to enable the map-side join optimization
set hive.mapjoin.smalltable.filesize=25000000; -- small-table size threshold, about 23.8 MB
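The mechanism behind a map-side join can be sketched: the small table is materialized as an in-memory hash map and the large table is streamed against it, so the join completes in the map phase with no shuffle (a conceptual simulation with hypothetical data, not Hive's implementation):

```python
def map_side_join(small, large):
    # build phase: hash the small table by join key
    lookup = {}
    for sid, sval in small:
        lookup.setdefault(sid, []).append(sval)
    # probe phase: stream the large table, emit matching rows
    out = []
    for lid, lval in large:
        for sval in lookup.get(lid, []):
            out.append((lid, lval, sval))
    return out

small = [(1, 'x'), (2, 'y')]             # hypothetical small table
large = [(1, 'A'), (2, 'B'), (3, 'C')]   # hypothetical large table
```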
Example:
select
e.*
from u1 d
join u2 e
on d.id = e.id
;
create table if not exists comment(
id string,
commentid string,
dt string,
comm string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/wb.db/ods_bhv_comment'
;
set hive.exec.mode.local.auto=true;
select
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;
The mapjoin hint:
select
/*+MAPJOIN(c)*/
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;
**ps:** where cannot contain aggregate functions; ordinary functions are fine
group by: grouping, usually used together with aggregate functions.
If the statement has a group by clause, every field after select must either appear in the group by or sit inside an aggregate function.
A group by will normally introduce reduce tasks.
having: filters the result again after grouping.
limit: how many rows to return. (The limit-optimization defaults are 100000 rows and 10 files: hive.limit.row.max.size / hive.limit.optimize.limit.file.)
**cluster by:** combines distribute by with the ascending sort of sort by.
The sort can only be ascending (the default rule); asc or desc cannot be specified.
set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
cluster by id
limit 10
;
set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
distribute by id
sort by id asc
limit 10
;
distribute by: together with the reducer count, decides which reducer each map output row goes to, based on the by field.
By default the hash of the first selected column decides the target reducer. With a single reducer it has no visible effect.
sort by: local sort; only guarantees order within a single reducer.
order by: global sort; guarantees that the data across all reducers is ordered.
If there is only one reducer, the two behave roughly the same.
Both are usually combined with desc / asc; the default is ascending (asc).
Setting the reducer count manually:
set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
order by e.id desc
;
set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
sort by e.id desc
;
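The contrast between the two queries above can be sketched in Python: order by performs one global sort, while distribute by routes rows to reducers by hash and sort by only sorts within each reducer (a simulation of the concept with hypothetical integer rows, not Hive's actual shuffle):

```python
def distribute_and_sort(rows, key, num_reducers):
    # distribute by: route each row to a reducer by key hash;
    # sort by: each reducer sorts only its own partition
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        parts[key(row) % num_reducers].append(row)
    return [sorted(p, key=key) for p in parts]

def order_by(rows, key):
    # order by: a single global sort, i.e. one reducer
    return sorted(rows, key=key)

rows = [5, 3, 8, 1, 4, 2]  # hypothetical e.id values
```

Each inner list of `distribute_and_sort` is sorted, but their concatenation is not globally ordered, which is exactly the sort by guarantee.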
Drawback of order by:
Because it is a global sort, all data goes through a single reducer; with a large result set that single reducer badly hurts performance.
Characteristics of sort by:
Commonly used for sorting within local groups.
Notes:
When MR strict mode is enabled, order by must come with a limit clause, otherwise the query fails.
As soon as order by is used, the reducer count will be 1.
If sort by and distribute by appear together, distribute by must come first.
When they appear together on the same field and sort by is ascending, the combination is equivalent to cluster by on that field.
union and union all:
union: merges multiple result sets and deduplicates (in classic MR Hive the group-by dedup also leaves the output sorted)
union all: merges multiple result sets without deduplicating or sorting.
set mapreduce.job.reduces=1; -- set the number of reducers to one
select
d.id as deptno,
d.name as dname
from u1 d
union
select
e.id as deptno,
e.name as dname
from u2 e
;
set mapreduce.job.reduces=1;
select
d.id as deptno,
d.name as dname
from u1 d
union all
select
e.id as deptno,
e.name as dname
from u2 e
;
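The two statements above boil down to the following (a simulation on plain values; modeling union as sorted(set(...)) mirrors the group-by dedup of classic MR Hive):

```python
def union(a, b):
    # UNION: merge and deduplicate (the dedup here also sorts)
    return sorted(set(a) | set(b))

def union_all(a, b):
    # UNION ALL: plain concatenation, duplicates kept
    return a + b

u1 = [1, 2, 3]
u2 = [2, 3, 4]
```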
An individual union branch does not support order by, cluster by, distribute by, sort by or limit.
All union branches must have the same number of fields, in the same order, with matching types (for union).
A union branch cannot use group by \ distribute by \ order by directly (they can be used inside a nested subquery within the branch).
select
*
from (
select
a.deptno,
a.dname
from(
select
d.id as deptno,
d.name as dname
from u1 d
limit 1 ) a -- nesting limit or group by directly inside a union / union all branch fails;
union all   -- wrapping the branch in an extra subquery, as shown here, works around it
select
e.id as deptno,
e.name as dname
from u2 e
) tmp
;
distinct: deduplication
select
id,
count(distinct commentid) cnt
from comment c
group by id
limit 3
;
Hive data types:
Primitive types:
tinyint    1 byte   -128 ~ 127
smallint   2 bytes  -2^15 ~ 2^15-1
int        4 bytes
bigint     8 bytes
float      4 bytes
double     8 bytes
boolean    1 byte   true/false
string
binary     byte sequence
timestamp  e.g. 2017-06-02 11:36:22
Types Java has that Hive lacks:
long
short
byte
char (later added to Hive, in 0.13, as CHAR/VARCHAR)
create table if not exists bs1(
id1 tinyint,
id2 smallint,
id3 int,
id4 bigint,
sla float,
sla1 double,
isok boolean,
content binary,
dt timestamp
)
row format delimited fields terminated by '\t'
;
233 12 342523 455345345 30000 60000 nihao helloworld 2017-06-02
126 13 342526 455345346 80000 100000 true helloworld1 2017-06-02 11:41:30
load data local inpath '/home/hivedata/bs' into table bs1;
Complex types:
array: col array<primitive type>; indexes start at 0; out-of-range access does not error, it returns NULL
map: col map<string,string>
struct: col struct<...>
array example:
zhangsan 78,89,92,96
lisi 67,75,83,94
create table if not exists arr1(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
;
Note the order of the terminated clauses: arr1 above has no `collection items terminated by`, so the whole score string is loaded as a single array element; arr2 declares it:
create table if not exists arr2(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;
Import the data:
load data local inpath '/home/hivedata/arr1' into table arr1;
Query:
select * from arr1;
select
*
from arr2
where score[1] > 80;
select
name,
score[1],
score[20]
from arr2
where score[1] > 80;
select
name,
score
from arr2
where size(score) > 3;
Expanding the array (one row per element):
zhangsan 78,89,92,96
becomes
zhangsan 78
zhangsan 89
zhangsan 92
zhangsan 96
explode: expands a collection into rows.
Note: `select name,explode(score) score from arr2;` fails, because a UDTF cannot be selected together with other columns.
lateral view: creates a virtual table so explode can be joined back to the other columns:
select
name,
sum(cj) sum_cj
from arr2
lateral view explode(score) score as cj
group by name
;
Total score per student:
select name,sum(cj) as totalscore from arr2 lateral view explode(score) score as cj group by name;
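What lateral view explode does here can be simulated: each array element becomes its own row carrying the outer columns, after which an ordinary group-by sum applies (a sketch using the sample data above):

```python
def lateral_view_explode(rows):
    # rows: (name, [scores]); one output row per array element
    return [(name, s) for name, scores in rows for s in scores]

def total_per_name(rows):
    # group by name, sum the exploded scores
    totals = {}
    for name, s in lateral_view_explode(rows):
        totals[name] = totals.get(name, 0) + int(s)
    return totals

arr = [('zhangsan', ['78', '89', '92', '96']),
       ('lisi', ['67', '75', '83', '94'])]
```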
How to write data into an array column:
Prepare the data:
create table arr_temp
as
select name,cj from arr2 lateral view explode(score) score as cj;
1. the collect_set (and collect_list) functions:
create table if not exists arr3(
name string,
score array<int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
;
Write the data back in array form:
insert into arr3
select
name,
collect_set(cast(cj as int))
from arr_temp
group by name;
insert into arr3
select
name,
collect_list(cast(cj as int))
from arr_temp
group by name;
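The difference between the two functions can be simulated: collect_set deduplicates while collect_list keeps every occurrence (note that in Hive the element order of collect_set is not guaranteed; this sketch keeps first-seen order):

```python
def collect_list(values):
    # keeps every occurrence, in arrival order
    return list(values)

def collect_set(values):
    # deduplicates; first-seen order kept for this simulation
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out
```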
map example:
zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:
wangwu chinese:89,math:,english:81,nature:9
create table if not exists map2(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
;
Load the data:
load data local inpath '/home/hivedata/map' into table map2;
Query:
Get the english and nature scores of students whose math score is above 35:
select
m.name,
m.score['english'] ,
m.score['nature']
from map2 m
where m.score['math'] > 35
;
Expanding the map:
explode applied to a map yields two columns, key and value:
select explode(score) as (m_class,m_score) from map2;
Combining lateral view explode with a regular query:
select
name,
collect_list(concat_ws(":",m_class,cast(m_score as string)))
from map2
lateral view explode(score) score as m_class,m_score
group by name;