Hive表操作--分区表与分桶表

এ 运筹vivo

已于 2024-04-10 09:55:21 修改

阅读量205

点赞数 7

文章标签： hive 大数据 big data sql

于 2024-04-09 21:01:47 首次发布

本文链接：https://blog.csdn.net/weixin_65265621/article/details/137566376

版权

一、Hive的第1代客户端的作用

1. 第1代客户端

Hive的第1代客户端指的是:  Linux路径下  /export/server/hive/bin/hive   这个脚本.
它的作用有 4 个:
    1. 可以启动元数据服务 和 hiveserver2服务.
        nohup hive --service metastore &
        nohup hive --service hiveserver2 &
    2. 可以充当客户端使用, 编写并执行HQL代码.
    3. 可以临时执行一次 1个 HQL语句.
        hive -e 'Hive SQL语句'    -- execute: 执行的意思.
    4. 可以临时执行一次 1个 HQL脚本.
        hive -f hive脚本          -- 建议后缀名为.hql   file: 文件

2. 案例

-- ------------------------- Hive的第1代客户端的作用 --------------------------
-- 1. hive -e的方式 导出数据,  注意: 如下的命令要在Linux中执行.
-- hive -e 'select * from day06.t_all_hero;'  > 1.txt        把HQL执行结果 覆盖写入到 1.txt文件中, 注意: 在Linux中执行.

-- 2. hive -f的方式 导出数据. 注意: 脚本后缀名最好是 hql
 -- hive -f my.xyz >> 1.txt     把my.xyz脚本的执行结果, 追加到1.txt文件中.
 -- hive -f my.hql >> 1.txt     hive sql脚本文件的后缀名最好为: hql

二、分区表-由来

-- ----------------------------- 准备动作 -----------------------------
-- 1. 建库.
create database day07;
-- 2. 切库.
use day07;
-- 3. 查表.
show tables;

-- -------------------------- 案例1: HQL DDL语句--分区表的由来 -----------------------
-- 需求: 创建t_all_hero表, 存储所有的英雄数据, 要求: 不使用分区表.
-- 1. 建表.
create table t_all_hero(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
row format delimited fields terminated by '\t';

-- 2. 上传(6个)源文件到该hive表的HDFS路径下.
-- 3. 查看表数据.
select * from t_all_hero;

-- 思考: 目前我们已经把所有的数据放到了一张表中, 思考, 这种方式好吗?
-- 4. 新需求: 查询出所有的射手数据.
select * from t_all_hero where role_main='archer';
-- 遇到的问题: 虽然我们实现了需求, 但是需要进行全表扫描, 如何精准的获取到我们想要的数据呢?
-- 答案: 可以采用分区表的思路来管理, 把各个职业的数据放到不同的文件夹中即可.
-- 例如: role=sheshou, 这个文件夹中都是射手的数据
-- role=cike, 这个文件夹中都是刺客的数据, 这样我们检索时, 就可以避免全表扫描, 减少扫描次数, 提高查询效率.

三、分区表-静态分区

-- --------------------- 案例2: HQL DDL语句--分区表-单级分区 静态分区 -------------------
-- 1. 创建分区表, 指定分区字段.
create table t_all_hero_part(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
partitioned by (role string comment '角色字段-充当分区字段')  -- 核心细节: 分区字段必须是表中没有的字段.
row format delimited fields terminated by '\t';

-- 2. 请问, 上述的分区表, 一共有多少个字段.
desc t_all_hero_part;   -- 10个, 9个基础字段 + 1个分区字段

-- 3. 如何往分区表中添加数据呢?
-- 方式1: 静态分区, 即: 手动指定分区字段 和 字段值.   类似于: 静态ip
-- 细节: partition(role='sheshou') 静态分区
load data local inpath '/export/hivedata/archer.txt' into table t_all_hero_part partition(role='sheshou');
load data local inpath '/export/hivedata/assassin.txt' into table t_all_hero_part partition(role='cike');
load data local inpath '/export/hivedata/mage.txt' into table t_all_hero_part partition(role='fashi');
load data local inpath '/export/hivedata/support.txt' into table t_all_hero_part partition(role='fuzhu');
load data local inpath '/export/hivedata/tank.txt' into table t_all_hero_part partition(role='tanke');
load data local inpath '/export/hivedata/warrior.txt' into table t_all_hero_part partition(role='zhanshi');

-- 方式2: 动态分区, 即: 手动指定分区字段即可, 分区字段值一样的数据, 会被放到一起.    类似于: 动态ip
-- 细节: partition(role)  这就是动态分区
-- 下个案例讲解.

-- 4. 查询分区表的数据.
select * from t_all_hero_part;

-- 5. 回顾刚才的新需求: 查询出所有的射手数据.
-- 细节: 如果你设置了分区表, 查询时, 建议带上分区字段.
select * from t_all_hero_part where role_main='archer';     -- 依旧进行了全表扫描.
select * from t_all_hero_part where role='sheshou';         -- 精准扫描某个分区(目录), 避免全表扫描.

四、分区表-动态分区

1. 动态分区

之前我们已经实现了静态分区, 即: 手动指定分区字段 和 分区字段值, 如果分区过多, 每次写分区字段值比较繁琐, 且有可能写错.
如何解决这个问题呢?
非常简单, 可以通过 动态分区的方式实现, 即: 手动指定分区字段即可, 该字段值一样的数据, 会被自动放到一起.

细节:
    1. 在进行动态分区的时候, 建议: 手动关闭严格模式.
    2. 分区的严格模式要求: 在进行动态分区的时候, 至少要有1个静态分区. 如果都是动态分区, 则: 报错.
       静态分区:   partition(role='sheshou')        partition(分区字段='值')
       动态分区:   partition(role)                  partition(分区字段)
    3. 动态分区不支持load方式加载数据, 采用 insert into | overwrite方式导入数据.
    4. set 参数名=值; 是在 设置参数值.
       set 参数名;    是在 是在获取(查看)该参数的值.

2. 案例–动态分区

-- --------------- 案例3: HQL DDL语句--分区表-单级分区 动态分区 --------------------
-- 1. 创建分区表.
create table t_all_hero_part_dynamic(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
partitioned by (role string comment '角色字段-充当分区字段')  -- 核心细节: 分区字段必须是表中没有的字段.
row format delimited fields terminated by '\t';

-- 2. 尝试用load方式加载数据.
-- 报错, 动态分区不支持load语法, 只支持 insert into | overwrite语法.
load data local inpath '/export/hivedata/archer.txt' into table t_all_hero_part_dynamic partition(role);

-- 3. 关闭严格模式.
set hive.exec.dynamic.partition.mode=nonstrict;     -- nonstrict 非严格模式, strict: 严格模式(默认)

-- 4. 动态分区的方式, 添加数据.
insert into table t_all_hero_part_dynamic partition(role)
select *, role_main from t_all_hero;

-- 5. 查询分区表的数据.
select * from t_all_hero_part_dynamic;

-- 5. 回顾刚才的新需求: 查询出所有的射手数据.
-- 细节: 如果你设置了分区表, 查询时, 建议带上分区字段.
select * from t_all_hero_part_dynamic where role_main='archer';     -- 依旧进行了全表扫描.
select * from t_all_hero_part_dynamic where role='sheshou';         -- 精准扫描某个分区(目录), 避免全表扫描.

五、分区表–相关操作

1. 注意点

到目前为止, 我们已经学会了静态分区和动态分区, 但是上述的分区案例都是单级分区, 实际开发中, 如果数据量比较大的情况下,
可以考虑采用多级分区的思路来解决, 多级分区一般用时间来分区, 可以是: 年, 月, 日…

细节: 多级分区的时候, 分区层级不建议超过 3级, 一般是: 年, 月2级就够了.

2. 案例–多级分区

-- --------------------- 案例4: HQL DDL语句--分区表-多级分区和分区常见语法 ---------------
-- 1. 创建商品表, 按照: 年, 月分区.
create table products(
    pid int,
    pname string,
    price int,
    cid string
) comment '商品表'
partitioned by (year int, month int)        -- 按照年, 月分区, 2级分区
row format delimited fields terminated by ',';

-- 2. 查看所有的分区信息.
show partitions products;

-- 3. 手动添加分区.
-- 3.1 添加1个分区.
alter table products add partition(year=2023, month=1);
-- 3.2 添加多个分区.
alter table products add partition(year=2023, month=4) partition(year=2023, month=5) partition(year=2023, month=11);
alter table products add partition(year=2024, month=1) partition(year=2024, month=5) partition(year=2024, month=10);

-- 4. 修改分区.  例如: 把2024年10月 => 2024年8月
alter table products partition(year=2024, month=10) rename to partition(year=2024, month=08);

-- 5. 删除分区.
alter table products drop partition(year=2023, month=4);    -- 删除2023年4月 这个分区
alter table products drop partition(month=1);               -- 删除所有的1月 这个分区
alter table products drop partition(year=2023);             -- 删除2023年 及其所有的子分区.

-- 6. 查看商品表数据.
select * from products;

-- 7. 精准查看某个分区的数据.
select * from products where year=2023;             -- 查找2023年 分区内, 所有的数据.
select * from products where year=2023 and month=1; -- 查找2023年, 1月 分区内, 所有的数据.

六、分桶表-入门

1.分通表

分桶表介绍:
    概述:
        分桶 = 分文件, 相当于把数据 根据分桶字段, 拆分成N个文件.
    作用:
        1. 方便进行数据采样.
        2. 减少join的次数, 提高查询效率.
    细节:
        1. 分桶字段必须是表中已有的字段.
        2. 分桶数量 = HDFS文件系统中, 最终的文件数量.
        3. 分桶规则用的是: 哈希取模分桶法, 简单来说, 就是根据分桶字段计算它的哈希值, 然后和桶的个数取余, 余数为几, 就进哪个桶.
            哈希值: 程序根据值的内容, 内存地址值等信息, 计算出来的1个数字.
                例如:
                    select hash('乔峰');          哈希值为: -870432061
                    select hash('乔峰') % 3;      取余结果为: -1           3是假设一共有3个桶.
                    select abs(-10);             计算绝对值的: 10
                    select abs(hash('乔峰')) % 3; 取余结果为: 1
        4. set mapreduce.job.reduces=n;     可以设置ReduceTask任务的数量, 这个设置只在 分桶查询中会用到.
           分桶建表的时候, 不用该参数.
        5. 分桶表的数据不建议load data方式 或者 手动上传, 而是: insert into | overwrite的方式添加.

2. 案例–分桶表+不排序

-- -------------------------- 案例5: HQL DDL语句--分桶表 不排序 ------------------------
-- 1. 创建普通的学生表(充当数据源表), 上传源文件, 然后查看数据.
create table student(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
row format delimited fields terminated by ',';

select * from student;

-- 2. 创建分桶表-学生表, 按照学生id进行分桶, 分成 3 个桶.
create table student_buckets(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
clustered by (sid) into 3 buckets       -- 按照学生id进行分桶, 分成 3 个桶.  即: 根据sid的哈希值 和 3取余, 余数为几, 就进哪个桶.
row format delimited fields terminated by ',';

-- 3. 往 分桶表中插入数据.
insert into table student_buckets select * from student;

-- 4. 查看分桶表结果.
select * from student_buckets;

2. 案例–分桶表+排序

-- ----------------------------- 案例6: HQL DDL语句--分桶表 排序 ------------------------
-- 1. 创建分桶表-学生表, 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 age 降序排列.
drop table student_buckets_sort;
create table student_buckets_sort(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
-- clustered by (sid) sorted by (sid) into 3 buckets      -- 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 sid 升序排列.
clustered by (sid) sorted by (age desc) into 3 buckets    -- 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 age 降序排列.
row format delimited fields terminated by ',';

-- 2.往上述的分桶排序表中, 添加数据.
insert into student_buckets_sort select * from student;

-- 3. 查询结果.
select * from student_buckets_sort;

七、分桶表–分桶规则

-- ----------------------------- 案例7: HQL DDL语句--分桶表 分桶规则 ------------------------
-- 分桶规则:  分桶字段的哈希值(绝对值形式) % 桶的个数,    余数是几, 就进哪个分桶.
-- 加密算法.
select md5('pwd111');       -- 130353326a7bfab601f57757033b5b4a
select sha1('pwd111');      -- 5434831585d4109e2fba9b12a2ca26f8e1734c2e
select sha2('pwd111', 256);      -- 07c9f005ea0fb70fc48531e25a3a13e434e721dbc858c0b799568c3cafb2534c
select sha2('pwd111', 512);      -- 56c6ac5aec911eda1ff78fde625d26a49b1a43a5770dd5059e84e051ffda3e0a66e8b05dd724bd7e57631e395ddccb9a2dd7eb6ff8bfd8e17c765c2316df8365
select crc32('pwd111');          -- 4152129734

-- 查看函数的说明文档.
describe function extended sha2;

-- 获取哈希值.
select hash(123);       -- 整数的哈希值, 是本身.
select hash('乔峰');     -- 哈希值: -870432061
select hash('乔峰') % 3; -- -1
select abs(-10);        -- 计算绝对值的, 10

-- 分桶规则如下
select abs(hash('乔峰')) % 3;     -- 1

八、HQL DDL语句–复杂类型建表及解析

1. array: 数组(列表, 序列)类型

-- ------------------ 案例8: HQL DDL语句--复杂类型 array 数组类型 ---------------------
-- 数据格式为: "zhangsan	beijing,shanghai,tianjin,hangzhou"
-- 1. 建表.
create table t_array(
    name string comment '姓名',
    city array<string> comment '出差城市'   -- array类似于Python的列表, 就是容器类型.  <string>意思是: 泛型, array中只能存储字符串类型数据.
)
row format delimited fields terminated by '\t'  -- 切割后, 数据格式为: "zhangsan",  "beijing,shanghai,tianjin,hangzhou"
collection items terminated by ',';             -- 切割后, 数据格式为: "zhangsan",  ["beijing", "shanghai" ,"tianjin", "hangzhou"]

-- 2. 上传源文件, Hive表自动解析.

-- 3. 查看表结果.
select * from t_array;

-- 4. 完成如下的需求.
-- 查询所有数据
select * from t_array;

-- 查询city数组中第一个元素
select name, city[0] from t_array;     -- 字段名[索引], 索引从刚开始计数.

-- 查询city数组中元素的个数
select name, city, size(city) from t_array;     -- size(复杂类型) 可以查看该复杂类型的元素个数.

-- 查询city数组中包含tianjin的信息
-- 格式: array_contains(复杂类型字段, 要判断的值) 判断该值是否在 该复杂类型的变量中, True: 在,  False: 不在.
select name, city, array_contains(city, 'tianjin') from t_array;
-- 只获取 city列 包含 'tianjin' 的数据
select name, city from t_array where array_contains(city, 'tianjin') = true;
-- 最终版写法: 因为 array_contains()函数返回结果就是True 或者False, 所以可以直接写.
select name, city from t_array where array_contains(city, 'tianjin');       -- 要city列 包含tianjin的
select name, city from t_array where not array_contains(city, 'tianjin');   -- 要city列 不包含tianjin的

2. struct: 结构体

-- ----------------- 案例9: HQL DDL语句--复杂类型 struct 结构体类型 -------------------
-- 结构体类型较之于数组类型, 区别是: 结构体类型可以设置 子列的类型和名称.
-- 对比为: array<string>,    struct<name:string, age:int>
-- 相同点是: 都可以存储多个元素.
-- 1. 建表.       数据源格式为: "1#周杰轮:11"
create table t_struct(
    id int,
    info struct<name:string, age:int>
)
row format delimited fields terminated by '#'  -- 切割后: "1", "周杰轮:11"
collection items terminated by ':';     -- collection items 是负责切割: 数组, 结构体的

-- 2. 上传源文件.

-- 3. 查看表数据.
select * from t_struct;

-- 4. 细节: 列名.子列名 可以获取子列的信息.
select id, info.name, info.age from t_struct;

3. map: 映射

-- ------------------------ 案例10: HQL DDL语句--复杂类型 map 映射类型 ----------------
-- map映射类型: 类似于Python的字典, Java中的Map集合, 存储的是键值对数据, 左边的叫: 键(key), 右边的叫: 值(value)
-- 数据格式为:  "1,林杰均,father:林大明#mother:小甜甜#brother:小甜,28"
-- 1. 建表.
create table t_map(
    id int comment '编号',
    name string comment '姓名',
    members map<string, string> comment '家庭成员',     -- 左边的string: 键的类型,  右边的string: 值的类型.
    age int comment '年龄'
)
row format delimited fields terminated by ','  -- 切完后, 数据为: 1, "林杰均", "father:林大明#mother:小甜甜#brother:小甜", 28
collection items terminated by '#'  -- 切完后, 数据为: 1, "林杰均", ["father:林大明", "mother:小甜甜", "brother:小甜"], 28
map keys terminated by  ':';  -- 切完后, 数据为: 1, "林杰均", {"father" : "林大明", "mother" : "小甜甜", "brother" : "小甜"}, 28
-- 2. 上传源文件.

-- 3. 查看表数据.
select * from t_map;


-- 4. 完成如下的需求.
-- 4.1 查询全部
select * from t_map;
-- 4.2 查询father、mother这两个map的key
select id, name, age, members['father'], members['mother'] from t_map;
-- 4.3 查询全部map的key，使用map_keys函数，结果是array类型
select *, map_keys(members) from t_map;
-- 4.4 查询全部map的value，使用map_values函数，结果是array类型
select *, map_values(members) member_value from t_map;      -- 给列起别名, as可以省略不写.
select *, map_values(members) `值` from t_map;              -- 如果列名和关键字重名了, 或者有中文, 加上: 反引号``即可.
-- 4.5 查询map类型的KV对数量
select *, size(members) from t_map;

-- 4.6 查询map的key中有brother的数据
-- step1: 获取所有的键.
select *, map_keys(members) from t_map;
-- step2: 判断是否包含 brother 这个"键"
select *, array_contains(map_keys(members), 'brother') from t_map;
-- step3: 筛选出指定的数据即可.
select * from t_map where array_contains(map_keys(members), 'brother');

-- 查看hive中所有的函数.
show functions ;

-- 查看函数的说明文档及测试用例.
describe function extended abs;     -- abs是函数名

4. union: 联合体

-- union联合查询--HQL DQL语句
union 联合查询: 对表数据做纵向拼接的, 类似于书本的 上, 下册, 有两种拼接方式.
--HQL DQL语句: 详解见**Hive表操作----HQL**篇中

এ 运筹vivo

关注

7
点赞
踩
5

收藏

觉得还不错? 一键收藏
1
评论
Hive表操作--分区表与分桶表

Hive表操作--分区表与分通表详情
复制链接

扫一扫