Hive bucketing:
Used to split data at a finer granularity than partitions.
Syntax:
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
Typical uses:
sampling queries
speeding up joins
Partitioning & bucketing can be combined:
buckets can be created inside each partition
or directly under a plain (non-partitioned) table
How bucketing is implemented:
By default, the bucket a row lands in is the hash of the bucketing column modulo the total number of buckets.
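The default assignment rule can be sketched in Python (a simulation only; for Hive's integral types the hash of a value is the value itself, so the bucket index is just the value modulo the bucket count; non-negative ids assumed):

```python
def bucket_of(key: int, num_buckets: int) -> int:
    # For Hive integral types, hash(v) == v, so the bucket
    # index is simply the value modulo the bucket count.
    return key % num_buckets

# ids 1..8 spread over 4 buckets (bucket files 0..3)
assignment = {i: bucket_of(i, 4) for i in range(1, 9)}
```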
Create a bucketed table:
create table if not exists buc1(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
A bucketed table must not be loaded with LOAD DATA: the statement succeeds, but it merely copies the file into the table directory as-is, so the data is never actually split into buckets.
load data local inpath '/home/hivedata/buc1' into table buc1;
-------create table
create table if not exists buc2(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
Load a bucketed table with insert ... select instead:
set hive.exec.mode.local.auto=true;
set hive.enforce.bucketing=true;
insert overwrite table buc2
select id,name,age from temp1
cluster by(id)
;
---implementing partitioning and bucketing together---------------------
set hive.exec.mode.local.auto=true;
create table pabuck3(
id int,
name string,
age int
)
partitioned by (year int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ','
;
---insert the data
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- required for the dynamic-partition insert below
insert into table pabuck3 partition(year)
select id,name,age,year from temp1
cluster by(id);
Partitioning + bucketing example:
the partition clause comes first, the bucket clause second:
partitioned by (...)
clustered by(id) into 8 buckets
----------------------------------------------
Note:
partitioning uses a pseudo-column outside the table schema; bucketing uses a real table column
bucketing manages data at a finer granularity and is mostly used for sampling and joins
----------------------------------------------
--Sampling query:
select * from <table_name> tablesample(bucket x out of y on id)
x must not be greater than y.
x: which bucket to start taking data from
y: usually a multiple or a factor of the total bucket count
In short: rows whose bucketing-column value modulo y, plus 1, equals x are returned.
Buckets: 001 002 003 004
e.g. id % 4 = 1, and 1 + 1 = 2, so the row goes into bucket 2; bucket 1 holds the rows whose modulus is 0
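The selection rule of `tablesample(bucket x out of y on id)` can be simulated with plain integers (assuming the bucketing column is an int, whose hash is the value itself):

```python
def sample_bucket(ids, x, y):
    # rows where (id mod y) + 1 == x are selected,
    # i.e. id % y == x - 1
    return [i for i in ids if i % y == x - 1]

ids = list(range(1, 13))
# with 4 buckets, "bucket 2 out of 4" keeps ids whose id % 4 == 1
```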
Query:
select * from buc2;
select * from buc2 tablesample(bucket 1 out of 1 on id);
Query a specific bucket:
select * from buc2 tablesample(bucket 1 out of 4 on id);
select * from buc2 tablesample(bucket 2 out of 4 on id);
select * from buc2 tablesample(bucket 1 out of 2 on id);
select * from buc2 tablesample(bucket 2 out of 2 on id);
select * from buc2 tablesample(bucket 1 out of 8 on id);
select * from buc2 tablesample(bucket 2 out of 3 on id);
select
*
from buc2
tablesample(bucket 1 out of 4 on id)
;
select
*
from buc2
tablesample(bucket 2 out of 3 on id)
where id > 5
;
Random sampling:
select * from buc1 order by rand() limit 3;
select * from buc1 tablesample(3 rows);
select * from buc1 tablesample(30 percent);
select * from buc1 tablesample(3M); -- size units: B\K\M\G
Query statement (clause order):
select
from
join
on
where
group by
[grouping sets/with cube/with rollup]
having
order by/sort by
limit
union/union all
;
Execution order:
FROM
<left_table>
ON
<join_condition>
<join_type>
JOIN
<right_table>
WHERE
<where_condition>
GROUP BY
<group_by_list>
HAVING
<having_condition>
SELECT
DISTINCT
<select_list>
ORDER BY
<order_by_condition>
LIMIT
<limit_number>
Try not to use subqueries; try to avoid in and not in.
Avoid join queries where possible, although in practice they can never be avoided entirely.
Always let the small table drive the large table (i.e. the small result set drives the large result set).
inner join
outer join
left join
right join
full join
Most commonly used:
left join
inner join
left join: the left table is the base table; unmatched right-table columns are filled with NULL
left join \ left semi join \ left outer join
Differences between left semi join and left join:
1. both are left-side joins, but with semi join a left row that has no match in the right table is dropped; left join keeps it
2. semi join can only select columns from the left table; left join can select from both sides
3. semi join is an optimization of left join
4. semi join is generally used for "exists" style queries
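The differences above can be checked on small in-memory lists (a sketch of the semantics, not Hive's execution): left join null-fills unmatched left rows and can duplicate a left row when it matches several right rows, while left semi join is an EXISTS check that returns each left row at most once and only its left-side columns.

```python
def left_join(left, right):
    # left/right: lists of (id, name); NULL-fill when no match
    out = []
    for lid, lname in left:
        matches = [r for r in right if r[0] == lid]
        if matches:
            out.extend((lid, lname, r[1]) for r in matches)
        else:
            out.append((lid, lname, None))
    return out

def left_semi_join(left, right):
    # EXISTS semantics: left columns only, one row per left row
    right_ids = {r[0] for r in right}
    return [(lid, lname) for lid, lname in left if lid in right_ids]

u1 = [(1, 'a'), (2, 'b'), (3, 'c')]
u2 = [(2, 'bb'), (3, 'cc'), (3, 'ccc')]
```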
Example:
Prepare the data:
1,a
2,b
3,c
4,d
7,y
8,u
2,bb
3,cc
7,yy
9,pp
create table if not exists u1(
id int,
name string
)
row format delimited fields terminated by ','
;
create table if not exists u2(
id int,
name string
)
row format delimited fields terminated by ','
;
load data local inpath '/home/hivedata/u1' into table u1;
load data local inpath '/home/hivedata/u2' into table u2;
Rows that exist in u1 but not in u2:
set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left join u2 u2
on u1.id = u2.id
where u2.id is null
;
set hive.exec.mode.local.auto=true;
select
u1.*
from u1 u1
left outer join u2 u2
on u1.id = u2.id
where u2.id is null
;
set hive.exec.mode.local.auto=true;
select
u1.id
from u1 u1
left semi join u2 u2
on u1.id = u2.id
;
right semi join is not supported in Hive; right join and right outer join are supported.
full outer join: returns the union of both sides
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
full outer join u2 u2
on u1.id = u2.id
;
inner join (either `join` or comma-separated tables in `from`):
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1
inner join u2 u2
on u1.id = u2.id
;
set hive.exec.mode.local.auto=true;
select
u1.*,
u2.*
from u1 u1, u2 u2
where u1.id = u2.id
;
Streaming hint:
Hive provides the STREAMTABLE hint: /*+STREAMTABLE(alias)*/ marks the table to be streamed through the join (normally the large one) while the other tables are buffered in memory; by default the last table in the query is streamed.
set hive.exec.mode.local.auto=true;
select
/*+STREAMTABLE(d)*/
d.name,
e.name
from u1 e
join u2 d
on d.id = e.id
;
Subqueries:
Support is limited; in particular `=` against a subquery is poorly supported.
select
e.*
from u1 e
where e.id = (
select
d.id
from u2 d
where d.id = 2
)
;
select
e.*
from u1 e
where e.id in (
select
d.id
from u2 d
limit 1
)
;
Differences between inner join and outer join:
a partition filter placed in the on condition of an outer join does not take effect; in an inner join it does
there is inner join but no "full inner join"; there is full outer join but no bare "outer join"
all joins support only equi-conditions (= combined with and); != , < , > , <> , >= , <= and or are not supported
Map-side join:
Enabled by default in Hive 1.x; the small-table size threshold can be adjusted.
set hive.auto.convert.join=true; -- whether to enable the map-side join optimization
set hive.mapjoin.smalltable.filesize=25000000; -- small-table size threshold, about 23.8 MB
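The mechanism behind a map-side join can be sketched: the small table is materialized as an in-memory hash map and the large table is streamed against it, so the join completes in the map phase with no shuffle (a conceptual simulation with hypothetical data, not Hive's implementation):

```python
def map_side_join(small, large):
    # build phase: hash the small table by join key
    lookup = {}
    for sid, sval in small:
        lookup.setdefault(sid, []).append(sval)
    # probe phase: stream the large table, emit matching rows
    out = []
    for lid, lval in large:
        for sval in lookup.get(lid, []):
            out.append((lid, lval, sval))
    return out

small = [(1, 'x'), (2, 'y')]             # hypothetical small table
large = [(1, 'A'), (2, 'B'), (3, 'C')]   # hypothetical large table
```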
Example:
select
e.*
from u1 d
join u2 e
on d.id = e.id
;
create table if not exists comment(
id string,
commentid string,
dt string,
comm string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/wb.db/ods_bhv_comment'
;
set hive.exec.mode.local.auto=true;
select
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;
The mapjoin hint:
select
/*+MAPJOIN(c)*/
c.id,
count(id) cnt
from comment c
group by c.id
order by cnt desc
limit 3
;
**ps:** where cannot contain aggregate functions; ordinary functions are fine
group by: grouping, usually used together with aggregate functions.
If the statement has a group by clause, every field after select must either appear in the group by or sit inside an aggregate function.
A group by will normally introduce reduce tasks.
having: filters the result again after grouping.
limit: how many rows to return. (The limit-optimization defaults are 100000 rows and 10 files: hive.limit.row.max.size / hive.limit.optimize.limit.file.)
**cluster by:** combines distribute by with the ascending sort of sort by.
The sort can only be ascending (the default rule); asc or desc cannot be specified.
set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
cluster by id
limit 10
;
set hive.exec.mode.local.auto=true;
select
id,
commentid
from comment c
distribute by id
sort by id asc
limit 10
;
distribute by: together with the reducer count, decides which reducer each map output row goes to, based on the by field.
By default the hash of the first selected column decides the target reducer. With a single reducer it has no visible effect.
sort by: local sort; only guarantees order within a single reducer.
order by: global sort; guarantees that the data across all reducers is ordered.
If there is only one reducer, the two behave roughly the same.
Both are usually combined with desc / asc; the default is ascending (asc).
Setting the reducer count manually:
set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
order by e.id desc
;
set mapreduce.job.reduces=2;
set hive.exec.mode.local.auto=true;
select
e.id,
e.name
from u1 e
sort by e.id desc
;
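The contrast between the two queries above can be sketched in Python: order by performs one global sort, while distribute by routes rows to reducers by hash and sort by only sorts within each reducer (a simulation of the concept with hypothetical integer rows, not Hive's actual shuffle):

```python
def distribute_and_sort(rows, key, num_reducers):
    # distribute by: route each row to a reducer by key hash;
    # sort by: each reducer sorts only its own partition
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        parts[key(row) % num_reducers].append(row)
    return [sorted(p, key=key) for p in parts]

def order_by(rows, key):
    # order by: a single global sort, i.e. one reducer
    return sorted(rows, key=key)

rows = [5, 3, 8, 1, 4, 2]  # hypothetical e.id values
```

Each inner list of `distribute_and_sort` is sorted, but their concatenation is not globally ordered, which is exactly the sort by guarantee.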
Drawback of order by:
Because it is a global sort, all data goes through a single reducer; with a large result set that single reducer badly hurts performance.
Characteristics of sort by:
Commonly used for sorting within local groups.
Notes:
When MR strict mode is enabled, order by must come with a limit clause, otherwise the query fails.
As soon as order by is used, the reducer count will be 1.
If sort by and distribute by appear together, distribute by must come first.
When they appear together on the same field and sort by is ascending, the combination is equivalent to cluster by on that field.
union and union all:
union: merges multiple result sets and deduplicates (in classic MR Hive the group-by dedup also leaves the output sorted)
union all: merges multiple result sets without deduplicating or sorting.
set mapreduce.job.reduces=1; -- set the number of reducers to one
select
d.id as deptno,
d.name as dname
from u1 d
union
select
e.id as deptno,
e.name as dname
from u2 e
;
set mapreduce.job.reduces=1;
select
d.id as deptno,
d.name as dname
from u1 d
union all
select
e.id as deptno,
e.name as dname
from u2 e
;
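The two statements above boil down to the following (a simulation on plain values; modeling union as sorted(set(...)) mirrors the group-by dedup of classic MR Hive):

```python
def union(a, b):
    # UNION: merge and deduplicate (the dedup here also sorts)
    return sorted(set(a) | set(b))

def union_all(a, b):
    # UNION ALL: plain concatenation, duplicates kept
    return a + b

u1 = [1, 2, 3]
u2 = [2, 3, 4]
```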
An individual union branch does not support order by, cluster by, distribute by, sort by or limit.
All union branches must have the same number of fields, in the same order, with matching types (for union).
A union branch cannot use group by \ distribute by \ order by directly (they can be used inside a nested subquery within the branch).
select
*
from (
select
a.deptno,
a.dname
from(
select
d.id as deptno,
d.name as dname
from u1 d
limit 1 ) a -- nesting limit or group by directly inside a union / union all branch fails;
union all   -- wrapping the branch in an extra subquery, as shown here, works around it
select
e.id as deptno,
e.name as dname
from u2 e
) tmp
;
distinct: deduplication
select
id,
count(distinct commentid) cnt
from comment c
group by id
limit 3
;
Hive data types:
Primitive types:
tinyint    1 byte   -128 ~ 127
smallint   2 bytes  -2^15 ~ 2^15-1
int        4 bytes
bigint     8 bytes
float      4 bytes
double     8 bytes
boolean    1 byte   true/false
string
binary     byte sequence
timestamp  e.g. 2017-06-02 11:36:22
Types Java has that Hive lacks:
long
short
byte
char (later added to Hive, in 0.13, as CHAR/VARCHAR)
create table if not exists bs1(
id1 tinyint,
id2 smallint,
id3 int,
id4 bigint,
sla float,
sla1 double,
isok boolean,
content binary,
dt timestamp
)
row format delimited fields terminated by '\t'
;
233 12 342523 455345345 30000 60000 nihao helloworld 2017-06-02
126 13 342526 455345346 80000 100000 true helloworld1 2017-06-02 11:41:30
load data local inpath '/home/hivedata/bs' into table bs1;
Complex types:
array: col array<primitive type>; indexes start at 0; out-of-range access does not error, it returns NULL
map: col map<string,string>
struct: col struct<...>
array example:
zhangsan 78,89,92,96
lisi 67,75,83,94
create table if not exists arr1(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
;
Note the order of the terminated clauses: arr1 above has no `collection items terminated by`, so the whole score string is loaded as a single array element; arr2 declares it:
create table if not exists arr2(
name string,
score array<String>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
;
Import the data:
load data local inpath '/home/hivedata/arr1' into table arr1;
Query:
select * from arr1;
select
*
from arr2
where score[1] > 80;
select
name,
score[1],
score[20]
from arr2
where score[1] > 80;
select
name,
score
from arr2
where size(score) > 3;
Expanding the array (one row per element):
zhangsan 78,89,92,96
becomes
zhangsan 78
zhangsan 89
zhangsan 92
zhangsan 96
explode: expands a collection into rows.
Note: `select name,explode(score) score from arr2;` fails, because a UDTF cannot be selected together with other columns.
lateral view: creates a virtual table so explode can be joined back to the other columns:
select
name,
sum(cj) sum_cj
from arr2
lateral view explode(score) score as cj
group by name
;
Total score per student:
select name,sum(cj) as totalscore from arr2 lateral view explode(score) score as cj group by name;
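What lateral view explode does here can be simulated: each array element becomes its own row carrying the outer columns, after which an ordinary group-by sum applies (a sketch using the sample data above):

```python
def lateral_view_explode(rows):
    # rows: (name, [scores]); one output row per array element
    return [(name, s) for name, scores in rows for s in scores]

def total_per_name(rows):
    # group by name, sum the exploded scores
    totals = {}
    for name, s in lateral_view_explode(rows):
        totals[name] = totals.get(name, 0) + int(s)
    return totals

arr = [('zhangsan', ['78', '89', '92', '96']),
       ('lisi', ['67', '75', '83', '94'])]
```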
How to write data into an array column:
Prepare the data:
create table arr_temp
as
select name,cj from arr2 lateral view explode(score) score as cj;
1. the collect_set (and collect_list) functions:
create table if not exists arr3(
name string,
score array<int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
;
Write the data back in array form:
insert into arr3
select
name,
collect_set(cast(cj as int))
from arr_temp
group by name;
insert into arr3
select
name,
collect_list(cast(cj as int))
from arr_temp
group by name;
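The difference between the two functions can be simulated: collect_set deduplicates while collect_list keeps every occurrence (note that in Hive the element order of collect_set is not guaranteed; this sketch keeps first-seen order):

```python
def collect_list(values):
    # keeps every occurrence, in arrival order
    return list(values)

def collect_set(values):
    # deduplicates; first-seen order kept for this simulation
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out
```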
map example:
zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:
wangwu chinese:89,math:,english:81,nature:9
create table if not exists map2(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
;
Load the data:
load data local inpath '/home/hivedata/map' into table map2;
Query:
Get the english and nature scores of students whose math score is above 35:
select
m.name,
m.score['english'] ,
m.score['nature']
from map2 m
where m.score['math'] > 35
;
Expanding the map:
explode applied to a map yields two columns, key and value:
select explode(score) as (m_class,m_score) from map2;
Combining lateral view explode with a regular query:
select
name,
collect_list(concat_ws(":",m_class,cast(m_score as string)))
from map2
lateral view explode(score) score as m_class,m_score
group by name;