The Road to Advanced Hive, Part 2

Hive bucketing:

Its purpose is to divide data at a finer granularity.

Syntax:

[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

Typical uses:

sampling queries

more efficient joins

Partitioning & bucketing:

a bucketed table can be created under partitions

or directly under a plain table

How bucketing is implemented

By default, a row's bucket is determined by taking the hash of the bucketing column modulo the total number of buckets.
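As a quick check (a sketch, assuming the temp1 staging table used further below), the target bucket of each row can be previewed with Hive's built-in hash and pmod functions; for int columns hash(id) is just the value itself:

select id, pmod(hash(id), 4) as bucket_index from temp1;   -- 4 buckets, matching the DDL below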

Create a bucketed table:

create table if not exists buc1(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;

A bucketed table must not be loaded with a blunt LOAD (the data will not come out bucketed). The load itself succeeds, but LOAD simply copies the file into the table directory, so no bucketing takes place:
load data local inpath '/home/hivedata/buc1' into table  buc1;
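To see the difference, list the table directory (a sketch; the warehouse path is an assumption and depends on your configuration). After LOAD there is only the original file, while a proper bucketed insert produces one file per bucket (000000_0 ... 000003_0):

dfs -ls /user/hive/warehouse/buc1;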
------- create the table
create table if not exists buc2(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;

Load a bucketed table with INSERT ... SELECT instead:
set hive.exec.mode.local.auto=true;
set hive.enforce.bucketing=true;
insert overwrite table buc2
select id,name,age from temp1
cluster by(id)
;
--- implementing partitioning and bucketing together ---------------------
set hive.exec.mode.local.auto=true;
create table pabuck3(
id int,
name string,
age int
)
partitioned by (year int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
--- insert data
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- required here: the year partition is fully dynamic
insert into table pabuck3 partition(year)
select id,name,age,year from temp1
cluster by(id);
Partition + bucket recap:
the partition keyword comes first, the bucket keyword second
partitioned by ...
clustered by(id) into 8 buckets
----------------------------------------------
Note:
Partitioning uses a column outside the table schema; bucketing uses a column inside it.
Bucketing manages data at a finer granularity and is mainly used for sampling and joins.

----------------------------------------------

-- Sampling queries:
select * from <table_name> tablesample(bucket x out of y on id);
x must not be greater than y.
x: which sample group to start reading from
y: usually a multiple or a factor of the total bucket count. With 4 buckets, y = 2 returns 4/2 = 2 buckets of data (groups x and x + 2), while y = 8 returns 4/8 = half a bucket.


In short: the rows returned are those whose bucketing-column value satisfies (hash % y) + 1 == x.
Example with ids 001 002 003 004 and y = 4: 1 % 4 = 1 and 1 + 1 = 2, so id 1 falls into the second sample group; the first group gets the rows whose remainder is 0.

Queries:
select * from buc2;
select * from buc2 tablesample(bucket 1 out of 1 on id);

Query a specific bucket:
select * from buc2 tablesample(bucket 1 out of 4 on id);
select * from buc2 tablesample(bucket 2 out of 4 on id);
select * from buc2 tablesample(bucket 1 out of 2 on id);
select * from buc2 tablesample(bucket 2 out of 2 on id);
select * from buc2 tablesample(bucket 1 out of 8 on id);

select * from buc2 tablesample(bucket 2 out of 3 on id);


select 
* 
from buc2 
tablesample(bucket 1 out of 4 on id)
;

select 
* 
from buc2 
tablesample(bucket 2 out of 3 on id)
where id > 5
;

Random sampling:
select * from buc1 order by rand() limit 3;
select * from buc1 tablesample(3 rows);
select * from buc1 tablesample(30 percent);
select * from buc1 tablesample(3M);   -- size unit: B\K\M\G


Query clause order:
select
from
join
on
where 
group by
[grouping sets/with cube/with rollup]
having
order by/sort by 
limit
union/union all
;

Execution order:
FROM <left_table>
ON <join_condition>
<join_type> JOIN <right_table>
WHERE <where_condition>
GROUP BY <group_by_list>
HAVING <having_condition>
SELECT
DISTINCT <select_list>
ORDER BY <order_by_condition>
LIMIT <limit_number>

Try not to use subqueries, and avoid in and not in.
Avoid join queries where possible, although in practice they can never be eliminated entirely.
Always let the small table drive the big table (a small result set driving a large result set).

inner join
outer join
left join
right join
full join

Commonly used:
left join 
inner join

left join: the left table is the base table; rows with no match on the right are filled with NULL.
Variants: left join \ left semi join \ left outer join


Differences between left semi join and left join:
1. Both join from the left table, but left semi join returns only the left rows that have a match on the right; left join returns every left row.
2. left semi join can select columns from the left table only; left join can select from both.
3. left semi join can be seen as an optimized form of left join.
4. left semi join is typically used for existence checks (like IN / EXISTS).

Example:
Prepare the data:
1,a
2,b
3,c
4,d
7,y
8,u

2,bb
3,cc
7,yy
9,pp

create table if not exists u1(
id int,
name string
)
row format delimited fields terminated by ','
;

create table if not exists u2(
id int,
name string
)
row format delimited fields terminated by ','
;

load data local inpath '/home/hivedata/u1' into table u1;
load data local inpath '/home/hivedata/u2' into table u2;

Rows that exist in u1 but not in u2:
set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left join u2 u2
on u1.id = u2.id
where u2.id is null
;

set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left outer join u2 u2
on u1.id = u2.id
where u2.id is null
;


set hive.exec.mode.local.auto=true;
select
u1.id
from u1 u1
left semi join u2 u2
on u1.id = u2.id
;


Hive does not support right semi join; right join and right outer join are both supported.
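If the right-semi behavior is needed, a sketch of a workaround is simply to swap the two tables:

select
u2.id
from u2 u2
left semi join u1 u1
on u2.id = u1.id
;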


full outer join: takes the union of both sides
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
full outer join u2 u2
on u1.id = u2.id
;


inner join (JOIN, or comma-separated tables in FROM):
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
inner join u2 u2
on u1.id = u2.id
;

set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1, u2 u2
where u1.id = u2.id
;

Streaming hint:
Hive provides the /*+STREAMTABLE(alias)*/ hint to name the table that should be streamed through the join (normally the large table; the other tables are buffered in memory). By default the rightmost table is the one streamed.
set hive.exec.mode.local.auto=true;
select
/*+STREAMTABLE(d)*/
d.name,
e.name
from u1 e
join u2 d 
on d.id = e.id
;

Subqueries:
Support is limited; = comparisons against a subquery work especially poorly.
select
e.*
from u1 e
where e.id = (
select 
d.id
from u2 d
where d.id = 2
)
;

select
e.*
from u1 e
where e.id in (
select 
d.id
from u2 d
limit 1
)
;


Differences between inner join and outer join:
A partition-column predicate in the ON clause is ignored for an outer join but takes effect for an inner join.

There is inner join but no full inner join; there is full outer join but no bare outer join.
All joins support only equi-conditions (= combined with and); != , < , > , <> , >= , <= and or are not supported.
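A sketch illustrating the partition-column point, reusing the partitioned table pabuck3 from above (the year value 2020 is hypothetical): in the outer join the p.year predicate does not remove unmatched u1 rows, while in the inner join it filters as expected.

select e.id, p.year
from u1 e
left outer join pabuck3 p
on e.id = p.id and p.year = 2020;   -- every u1 row still appears, with NULLs

select e.id, p.year
from u1 e
inner join pabuck3 p
on e.id = p.id and p.year = 2020;   -- only matches from year=2020 appear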
Map-side join:
Enabled by default since Hive 1.x; the small-table threshold can be adjusted as needed.
set hive.auto.convert.join=true;               -- enable the map-join optimization
set hive.mapjoin.smalltable.filesize=25000000; -- small-table threshold, about 23.8 MB

Example:
select
e.*
from u1 d 
join u2 e
on d.id = e.id
;

create table if not exists comment(
id string,
commentid string,
dt string,
comm string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/wb.db/ods_bhv_comment'
;

set hive.exec.mode.local.auto=true;
select
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;


The MAPJOIN hint:
select
/*+MAPJOIN(c)*/
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;

**PS:** aggregate functions cannot follow WHERE; ordinary functions can.
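A minimal sketch of the point, reusing the comment table from above:

-- fails: aggregate function in WHERE
-- select id from comment where count(commentid) > 10 group by id;
-- works: filter on the aggregate in HAVING instead
select id, count(commentid) cnt
from comment
group by id
having count(commentid) > 10
;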

group by: groups rows; usually paired with aggregate functions.

If a statement has a GROUP BY clause, every column after SELECT must either appear in the GROUP BY list or be wrapped in an aggregate function.

A query with GROUP BY will generally launch a reduce task.

having: filters the result again after grouping.

limit: how many rows to take. The limit optimization defaults to 100,000 rows from at most 10 files.
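Those defaults appear to come from Hive's limit-optimization settings (a sketch; confirm against your Hive version):

set hive.limit.optimize.enable=true;    -- sample the input instead of scanning everything
set hive.limit.row.max.size=100000;     -- default row allowance
set hive.limit.optimize.limit.file=10;  -- default maximum number of files sampled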

**cluster by:** combines distribute by with an ascending sort by.
The sort can only be ascending (the default ordering); asc or desc cannot be specified.

set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
cluster by id
limit 10
;

set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
distribute by id
sort by id asc
limit 10
;

distribute by: the named column(s) and the reducer count decide which reducer each map output row is sent to.
By default the hash of the first selected column decides the target reducer. With a single reducer there is no visible effect.

sort by: local sort; order is guaranteed only within each reducer.
order by: global sort; the output across all reducers is fully ordered.
With a single reducer the two behave about the same.
Both are usually paired with desc or asc; the default is ascending (asc).

Manually set the reducer count:
set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
order by e.id desc 
;

set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
sort by e.id desc 
;

Drawback of order by:
Because it is a global sort, all the data is funneled through one reducer; on large result sets that single reducer becomes a serious bottleneck.

Characteristics of sort by:
Commonly used to sort within local groups (per reducer).

Note:
When MR strict mode is enabled, ORDER BY must come with a LIMIT clause, otherwise the query is rejected.
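A sketch (hive.mapred.mode is the classic strict-mode switch; newer versions split it into hive.strict.checks.* options):

set hive.mapred.mode=strict;
-- select * from u1 order by id;         -- rejected: no LIMIT
select * from u1 order by id limit 10;   -- accepted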


Whenever order by is used, the reducer count becomes 1.
When sort by and distribute by appear together, distribute by goes first.
If the two act on the same column and sort by is ascending, the combination is equivalent to cluster by on that column.
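A sketch of the equivalence; both queries should shuffle and order the data identically:

select id, name from u1 distribute by id sort by id;
select id, name from u1 cluster by id;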

union and union all:
union: merges multiple result sets, deduplicating and sorting the result.
union all: merges multiple result sets without deduplicating or sorting.

set mapreduce.job.reduces=1; -- set the reducer count to 1
select
d.id as deptno,
d.name as dname
from u1 d
union
select
e.id as deptno,
e.name as dname
from u2 e
;

set mapreduce.job.reduces=1;
select
d.id as deptno,
d.name as dname
from u1 d
union all
select
e.id as deptno,
e.name as dname
from u2 e
;


A single union branch does not support order by, cluster by, distribute by, sort by, or limit.
Every branch of a union must have the same number of columns, in the same order, with matching types (for union).
group by \ distribute by \ order by are not allowed directly in a union branch, though each branch may use them inside a nested subquery.

select
*
from (
select 
a.deptno,
a.dname
from(
select
d.id as deptno,
d.name as dname
from u1 d
limit 1 ) a  -- with union / union all, putting limit or group by directly in a branch fails
union all    -- wrapping the branch in a subquery, as done here, works around it
select
e.id as deptno,
e.name as dname
from u2 e
) tmp
;

distinct: deduplication

select
id,
count(distinct commentid) cnt
from comment c
group by id
limit 3
;

Hive data types:

Basic types:
tinyint     1 byte    -128 ~ 127
smallint    2 bytes   -2^15 ~ 2^15 - 1
int         4 bytes
bigint      8 bytes
float       4 bytes
double      8 bytes
boolean     1 byte    true/false
string
binary      byte sequence
timestamp   e.g. 2017-06-02 11:36:22

Types that exist in Java but not in Hive:
long
char
short
byte


create table if not exists bs1(
id1 tinyint,
id2 smallint,
id3 int,
id4 bigint,
sla float,
sla1 double,
isok boolean,
content binary,
dt timestamp
)
row format delimited fields terminated by '\t'
;

233 12 342523 455345345 30000 60000 nihao helloworld 2017-06-02
126 13 342526 455345346 80000 100000 true helloworld1 2017-06-02 11:41:30

load data local inpath '/home/hivedata/bs' into table bs1;
Complex data types:
array : col array<basic type>; indexing starts at 0, and an out-of-bounds index returns NULL instead of failing
map   : col map<string,string>
struct: col struct<...>
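The examples below cover array and map; for completeness, a minimal struct sketch (table name, data layout, and fields are assumptions):

create table if not exists st1(
id int,
info struct<name:string,age:int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;
-- sample row: 1<TAB>zhangsan,23
-- struct fields are read with dot notation:
select id, info.name, info.age from st1;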

array example:
zhangsan	78,89,92,96
lisi	67,75,83,94

create table if not exists arr1(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
;

Note the order of the terminated-by clauses: fields first, then collection items:
create table if not exists arr2(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;

Load the data:
load data local inpath '/home/hivedata/arr1' into table arr1;
load data local inpath '/home/hivedata/arr1' into table arr2;  -- the queries below read arr2

Queries:
select * from arr1;
select
*
from arr2
where score[1] > 80;

select
name,
score[1],
score[20]
from arr2
where score[1] > 80;

select 
name,
score
from arr2 
where size(score) > 3;


Expanded view (one row per score):
zhangsan	78,89,92,96
zhangsan        78
zhangsan        89
zhangsan        92
zhangsan        96

Flattening queries:
explode: expands an array into rows. Note that explode cannot be selected next to an ordinary column, so the following fails; that is what lateral view is for:
select name,explode(score) score from arr2;   -- invalid: UDTF mixed with a plain column

lateral view: builds a virtual table from the exploded output
select 
name,
sum(cj) sum_cj 
from arr2 
lateral view explode(score) score as cj
group by name
;


Total score per student:
select name,sum(cj) as totalscore from arr1 lateral view explode(score) score as cj group by name;

How to write data into an array column:
Prepare the data:
create table arr_temp
as
select name,cj from arr2 lateral view explode(score) score as cj;

1. The collect_set / collect_list functions (collect_set deduplicates values, collect_list keeps duplicates):
create table if not exists arr3(
name string,
score array<int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
;

Write the data back in array form:
insert into arr3
select 
name,
collect_set(cast(cj as int)) 
from arr_temp 
group by name;

insert into arr3
select 
name,
collect_list(cast(cj as int)) 
from arr_temp 
group by name;
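A sketch comparing the two side by side; collect_set drops duplicate scores while collect_list keeps every occurrence:

select
name,
collect_set(cj)  as set_scores,
collect_list(cj) as list_scores
from arr_temp
group by name;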



map example:
zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:
wangwu chinese:89,math:,english:81,nature:9

create table if not exists map2(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
;

Load the data:

load data local inpath '/home/hivedata/map' into table map2;

Queries:
English and nature scores of students with a math score above 35:
select 
m.name,
m.score['english'] ,
m.score['nature']
from map2 m
where m.score['math'] > 35
;

Flatten the map:
explode
select explode(score) as (m_class,m_score) from map2;

Query combining lateral view and explode:
select 
name,
collect_list(concat_ws(":",m_class,cast(m_score as string)))
from map2 
lateral view explode(score) score as m_class,m_score
group by name;
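Going the other way, a sketch that rebuilds a map column from the exploded rows with the built-in str_to_map:

select
name,
str_to_map(
concat_ws(',', collect_list(concat(m_class, ':', cast(m_score as string)))),
',', ':'
) as score_map
from map2
lateral view explode(score) score as m_class, m_score
group by name;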

