Hive Module

我要改名字qWq

Last modified 2022-03-25 15:00:16

Column: 面试-2021-02 (Interview-2021-02)   Tags: hive

First published 2021-02-20 21:32:45

Article link: https://blog.csdn.net/wenqiangW_/article/details/113897489


1. Partitioned Tables and Bucketed Tables

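Partitioned table - the value of the partition column is stored in the HDFS directory structure, one subdirectory per partition value
Layout, with address as the partition column:
    /user/hadoop/test.db/stu/address=beijing
    /user/hadoop/test.db/stu/address=shanghai
    /user/hadoop/test.db/stu/address=wuhan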


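Bucketed table - the bucketing column is hashed and taken modulo the number of buckets, which spreads the rows roughly evenly across the bucket files
When two bucketed tables are joined bucket by bucket, both must be bucketed on the join column and their bucket counts must be multiples of each other
Layout, bucketing by age into 3 buckets:
    Original: /user/hadoop/test.db/stu
    Bucketed: by age into 3 buckets, so one storage file becomes three
    /user/hadoop/test.db/stu/part-r-00000   age%3=0
    /user/hadoop/test.db/stu/part-r-00001   age%3=1
    /user/hadoop/test.db/stu/part-r-00002   age%3=2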
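
The bucket assignment above can be sketched in a few lines of Python. This is a minimal simulation, not Hive code: it assumes an integer bucketing column whose hash is the value itself, and `bucket_for`, `assign_buckets`, and the sample ages are hypothetical names introduced for illustration.

```python
from collections import defaultdict

def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket index for an integer bucketing column.

    Assumption: the hash of an int is the int itself; the sign bit is
    masked off so the modulo result is never negative.
    """
    return (value & 0x7FFFFFFF) % num_buckets

def assign_buckets(ages, num_buckets):
    """Group ages by the bucket file they would land in."""
    buckets = defaultdict(list)
    for age in ages:
        buckets[bucket_for(age, num_buckets)].append(age)
    return dict(buckets)

ages = [18, 19, 20, 21, 22, 23]
layout = assign_buckets(ages, 3)
# layout == {0: [18, 21], 1: [19, 22], 2: [20, 23]}
```

Because `x % 6` fully determines `x % 3`, every bucket of a 6-bucket table corresponds to exactly one bucket of a 3-bucket table. That is the reason the bucket counts of two joined tables need to be multiples of each other.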

2. DDL - Data Definition

CREATE  [EXTERNAL]  TABLE  [IF NOT EXISTS]  table_name
[( col_name   data_type   [COMMENT col_comment], ... ) ]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ... ) ]
[CLUSTERED BY (col_name, col_name, ... ) [SORTED BY (col_name [ASC|DESC], ... ) ] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format] 
[LOCATION hdfs_path]

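3. DML - Data Manipulation

load data [local] inpath 'path' [overwrite] into table tablename;
load data [local] inpath 'path' into table tablename partition (partition_col=value);

Static partition insert (the partition value is written in the statement)
insert into table tablename partition (partition_col=value) select ... from ... where ...

Dynamic partition insert (the partition value comes from the last column of the select list)
insert into table tablename partition (partition_col) select ..., partition_col from ... where ...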

4. Queries

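full outer join - keeps every row from both tables; where a row has no match, the other table's columns are null
left join - the left table drives: every left row is kept, and the right table's columns are null for left rows with no match
right join - the right table drives: every right row is kept, and the left table's columns are null for right rows with no match
inner join - returns only the rows that match in both tables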
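
The four join types can be compared side by side with a small simulation. This is a sketch under simple assumptions: each table is a list of `(key, value)` pairs, `None` stands in for SQL null, and the function `join` and the sample tables `a` and `b` are hypothetical.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs; None stands in for SQL NULL."""
    rows = []
    matched_right = set()
    for lk, lv in left:
        hits = [(i, rv) for i, (rk, rv) in enumerate(right) if rk == lk]
        for i, rv in hits:
            rows.append((lk, lv, rv))
            matched_right.add(i)
        # Unmatched left rows survive only in left and full outer joins.
        if not hits and how in ("left", "full"):
            rows.append((lk, lv, None))
    # Unmatched right rows survive only in right and full outer joins.
    if how in ("right", "full"):
        for i, (rk, rv) in enumerate(right):
            if i not in matched_right:
                rows.append((rk, None, rv))
    return rows

a = [(1, "a1"), (2, "a2")]
b = [(2, "b2"), (3, "b3")]
# join(a, b, "inner") == [(2, "a2", "b2")]
# join(a, b, "left")  == [(1, "a1", None), (2, "a2", "b2")]
```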

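in - can be implemented with left semi join, or with left join ... where b.** is not null (b.** is the join column of b; if b contains duplicate keys, the left join form returns duplicate rows)
not in - can be implemented with left join ... where b.** is null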
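
The difference between these rewrites shows up when the right side has duplicate keys. The sketch below assumes rows are tuples whose first element is the join key and that no key is null; all function names are hypothetical.

```python
def left_semi_join(left, right_keys):
    """IN: keep left rows whose key appears on the right, each at most once."""
    keys = set(right_keys)
    return [row for row in left if row[0] in keys]

def left_join_not_null(left, right_keys):
    """IN via left join ... is not null: one output row per matching right key."""
    return [row for row in left for k in right_keys if k == row[0]]

def left_anti_join(left, right_keys):
    """NOT IN via left join ... is null: keep only unmatched left rows."""
    keys = set(right_keys)
    return [row for row in left if row[0] not in keys]

rows = [(1, "x"), (2, "y"), (3, "z")]
right_keys = [2, 2, 4]
# left_semi_join(rows, right_keys)     == [(2, "y")]
# left_join_not_null(rows, right_keys) == [(2, "y"), (2, "y")]
# left_anti_join(rows, right_keys)     == [(1, "x"), (3, "z")]
```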

where and having
where filters individual rows before grouping; having filters groups after the aggregates have been computed
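
A small simulation makes the order of the two filters visible. It is only a sketch: `group_sum` is a hypothetical helper that sums a salary per department, with `where` applied to rows and `having` applied to the aggregated totals.

```python
from collections import defaultdict

def group_sum(rows, where=None, having=None):
    """rows are (dept, salary) pairs; returns {dept: total salary}."""
    totals = defaultdict(int)
    for dept, salary in rows:
        # Row-level filter, applied before grouping.
        if where is None or where(dept, salary):
            totals[dept] += salary
    # Group-level filter, applied after aggregation.
    return {d: t for d, t in totals.items()
            if having is None or having(d, t)}

rows = [("dev", 10), ("dev", 30), ("ops", 5), ("ops", 8)]
result = group_sum(rows,
                   where=lambda dept, salary: salary > 6,
                   having=lambda dept, total: total > 20)
# result == {"dev": 40}
```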

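Sorting clauses compared:
sort by         local sort: rows are sorted within each reducer; with a single reducer the result is the same as order by
order by        global sort: all rows go through one reducer
cluster by      equivalent to distribute by col sort by col when both use the same column
distribute by   decides which reducer each row goes to (like a bucketing column); usually combined with sort by for a local sort, and the number of reduce tasks has to be set manually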
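
The following sketch contrasts a local sort with a global one. It assumes integer rows that hash to themselves, and the helper names `distribute_by`, `sort_by`, and `order_by` are hypothetical stand-ins for the clauses, not Hive APIs.

```python
def distribute_by(rows, key, num_reducers):
    """Send each row to reducer key(row) % num_reducers."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers

def sort_by(reducers, key):
    """Local sort: each reducer sorts only its own rows."""
    return [sorted(part, key=key) for part in reducers]

def order_by(rows, key):
    """Global sort: all rows pass through a single reducer."""
    return sorted(rows, key=key)

rows = [5, 2, 9, 4, 7, 1]
ident = lambda r: r
parts = sort_by(distribute_by(rows, ident, 2), ident)
# parts == [[2, 4], [1, 5, 7, 9]]   each reducer is sorted, the whole is not
# order_by(rows, ident) == [1, 2, 4, 5, 7, 9]
```

Distributing and sorting on the same key, as done here, corresponds to cluster by. With `num_reducers=1` the local sort produces the same result as the global one.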

5. Data Types

ids array<string>,
event_info map<string,string>,
info struct<name:string, address:string>

Access:
ids[0]
event_info['category_code']
info.name



array
create table if not exists test_array(id int,work_add   array<string>) 
row format delimited
fields terminated by '\t' 
collection items terminated by ',' ;

map
create table if not exists test_map(id int , piaofang  map<string,int>) 
row format delimited
fields terminated by '\t' 
collection items terminated by ','  -- separator between map entries
map keys terminated by ':'  ;       -- separator between a key and its value
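
To see how the three delimiters split a stored line, the parsing can be imitated in Python. This is a sketch of the text layout only, assuming tab-separated fields as declared above; the sample lines and the helper names are hypothetical.

```python
def parse_array_row(line):
    """Parse 'id<TAB>item,item,...' as stored for test_array."""
    id_field, items = line.split("\t")
    return int(id_field), items.split(",")

def parse_map_row(line):
    """Parse 'id<TAB>key:value,key:value,...' as stored for test_map."""
    id_field, entries = line.split("\t")
    pairs = (entry.split(":") for entry in entries.split(","))
    return int(id_field), {k: int(v) for k, v in pairs}

array_row = parse_array_row("1\tbeijing,shanghai")
map_row = parse_map_row("1\tfilm_a:100,film_b:200")
# array_row == (1, ["beijing", "shanghai"])
# map_row   == (1, {"film_a": 100, "film_b": 200})
```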

6. Functions

List the functions whose names match a pattern
show functions like 'pattern';

Show the detailed description of a function
desc function extended funname;

Explode functions (one row in, many rows out)
array:
   select
   id, col1.col2
   from test_array
   lateral view explode(array_column) col1 as col2;   -- col1: view alias, col2: column alias

1,['aaa','bbb'] -> 1,'aaa'
                   1,'bbb'
map:
   select
   id, col1.col2, col1.col3
   from test_map
   lateral view explode(map_column) col1 as col2, col3;   -- col1: view alias, col2: key alias, col3: value alias
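
What lateral view explode produces can be mimicked with two comprehensions. This is an illustrative sketch in which rows are `(id, collection)` tuples; `explode_array` and `explode_map` are hypothetical names.

```python
def explode_array(rows):
    """rows: (id, list) -> one (id, element) row per element."""
    return [(rid, item) for rid, items in rows for item in items]

def explode_map(rows):
    """rows: (id, dict) -> one (id, key, value) row per map entry."""
    return [(rid, k, v) for rid, mapping in rows for k, v in mapping.items()]

array_rows = explode_array([(1, ["aaa", "bbb"]), (2, [])])
map_rows = explode_map([(1, {"a": 10, "b": 20})])
# array_rows == [(1, "aaa"), (1, "bbb")]
# map_rows   == [(1, "a", 10), (1, "b", 20)]
```

Note that id 2, whose array is empty, produces no output row at all, which matches lateral view without the outer keyword.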

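Window functions:
        Inside over(), either pair of keywords can be used:
        distribute by window column + sort by ordering column within the window
        partition by window column + order by ordering column within the window
eg:
Number the rows of each department by age, oldest first
select id,name,sex,age,dept,
row_number() over(partition by dept order by age desc) as index
from stu_managed;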

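Running maximum age per department, up to and including the current row's id. Once the window has a sort, the default frame runs from the start of the partition to the current row
select id,name,sex,age,dept,
max(age) over(distribute by dept sort by id) as agemax
from stu_managed;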
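
The two window queries can be traced by hand with a short simulation. This is a sketch assuming unique ids and rows represented as dicts; `row_number_by_dept`, `running_max_age`, and the sample data `stu` are hypothetical.

```python
from itertools import groupby

def row_number_by_dept(rows):
    """Add rn: position by age descending within each dept."""
    out = []
    ordered = sorted(rows, key=lambda r: (r["dept"], -r["age"]))
    for _, group in groupby(ordered, key=lambda r: r["dept"]):
        for rn, row in enumerate(group, start=1):
            out.append({**row, "rn": rn})
    return out

def running_max_age(rows):
    """Add agemax: max age from the start of the dept partition up to the current id."""
    out = []
    ordered = sorted(rows, key=lambda r: (r["dept"], r["id"]))
    for _, group in groupby(ordered, key=lambda r: r["dept"]):
        current = None
        for row in group:
            current = row["age"] if current is None else max(current, row["age"])
            out.append({**row, "agemax": current})
    return out

stu = [
    {"id": 1, "dept": "cs", "age": 20},
    {"id": 2, "dept": "cs", "age": 25},
    {"id": 3, "dept": "cs", "age": 22},
    {"id": 4, "dept": "ma", "age": 30},
]
numbered = [(r["id"], r["rn"]) for r in row_number_by_dept(stu)]
maxima = [(r["id"], r["agemax"]) for r in running_max_age(stu)]
# numbered == [(2, 1), (3, 2), (1, 3), (4, 1)]
# maxima   == [(1, 20), (2, 25), (3, 25), (4, 30)]
```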

7. JSON Parsing

8. How Hive Executes Queries

How several core statements translate to MapReduce:
group by --- the grouping columns become the map output key

select
uid,datetime,sum(viewcount) totalcount
from mianshi01
group by uid,datetime
Map side:
    processes one row at a time
    key: uid, datetime
    value: viewcount
Reduce side:
    rows of the same group arrive together
    sum(viewcount)
In short, the group by columns are the key emitted by the map side
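
The map, shuffle, and reduce steps of this group by can be imitated in plain Python. This is a single-process sketch, not how Hive is implemented; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, value) pair per row: key = grouping columns, value = viewcount."""
    for uid, datetime, viewcount in rows:
        yield (uid, datetime), viewcount

def shuffle(pairs):
    """Bring all values with the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values of each group."""
    return {key: sum(values) for key, values in grouped.items()}

rows = [("u1", "2021-01", 3), ("u1", "2021-01", 4), ("u2", "2021-01", 5)]
totals = reduce_phase(shuffle(map_phase(rows)))
# totals == {("u1", "2021-01"): 7, ("u2", "2021-01"): 5}
```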


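join --- executed either as a map join or as a reduce join
select * from a join b on a.id=b.id;

reduce join:
Map side:
    key: id
    value: a tag identifying the source table + the other columns
Reduce side:
    combines the rows from the two tables that share the same key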
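
A reduce join along these lines can be sketched as follows. The tags "a" and "b", the helper names, and the sample tables are assumptions made for illustration; rows are `(id, other_columns)` tuples.

```python
from collections import defaultdict

def map_tagged(table_a, table_b):
    """Emit (id, (tag, rest)) so the reducer can tell the two tables apart."""
    for id_, rest in table_a:
        yield id_, ("a", rest)
    for id_, rest in table_b:
        yield id_, ("b", rest)

def reduce_join(pairs):
    """For each id, pair every row from a with every row from b."""
    grouped = defaultdict(lambda: {"a": [], "b": []})
    for id_, (tag, rest) in pairs:
        grouped[id_][tag].append(rest)
    return [(id_, ra, rb)
            for id_, sides in grouped.items()
            for ra in sides["a"] for rb in sides["b"]]

a = [(1, "a1"), (2, "a2")]
b = [(2, "b2"), (2, "b2x"), (3, "b3")]
joined = reduce_join(map_tagged(a, b))
# joined == [(2, "a2", "b2"), (2, "a2", "b2x")]
```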
    
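map join:
    the smaller table is loaded into the cache of every task node
    Map side:
        setup: read the cached table into a map<join key, other columns>
        map:
            read the other table row by row and join it against the cached map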
            
distinct --- the deduplicated column becomes the map output key
select distinct dept from stu;

One row is kept per group
map:
    key: dept
    value: null
reduce:
    emit the key once for each group
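
The deduplication can be written out step by step. This is a sketch only; `distinct` is a hypothetical helper, and the output order here follows first appearance, which Hive does not guarantee.

```python
def distinct(values):
    """Simulate select distinct: map emits (value, None), reduce keeps one key per group."""
    pairs = [(v, None) for v in values]       # map: key = dept, value = null
    groups = {}
    for key, _ in pairs:                      # shuffle: equal keys end up together
        groups.setdefault(key, []).append(None)
    return list(groups)                       # reduce: output each key once

depts = distinct(["cs", "ma", "cs", "ph", "ma"])
# depts == ["cs", "ma", "ph"]
```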
    
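count --- each map task counts its own rows and sends the partial count to a single reduce task
select count(*) from stu;

Map side:
    key: null
    value: 1 per row
    each map task totals its own rows and sends the count to the reducer
Reduce side: only one reduce task is needed
    sum the values in the iterator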
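
The two-stage count can be sketched as below, assuming each input split is handled by one map task; the splits and function names are hypothetical.

```python
def map_task_count(split):
    """Each map task counts its own rows and emits a single (None, partial_count)."""
    return None, sum(1 for _ in split)

def reduce_count(partials):
    """The single reduce task adds up the partial counts."""
    return sum(count for _, count in partials)

splits = [["r1", "r2", "r3"], ["r4", "r5"]]
total = reduce_count(map_task_count(s) for s in splits)
# total == 5
```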