Day 2: Multi-Column Partitioning, Bucketing, and Data Types in Hive

笔记分享

Published 2024-04-04 21:42:32

Article link: https://blog.csdn.net/m0_73450879/article/details/137383280

Hive

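Table Structure

Partitioned Tables

Multi-column partitioning: when a table is partitioned by more than one column, the columns form nested directories. The directory for an earlier partition column contains the directories for the next one, which gives a multi-level classification. Typical cases are product category / subcategory / sub-subcategory, province / city / county, or grade / class.

Example

Raw data (grade, class, name, separated by spaces)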

1 1 bob
1 1 amy
1 1 alex
1 2 david
1 2 cindy
1 2 bruce
1 3 balley
1 3 danniel
1 3 grace
2 1 henry
2 1 hack
2 1 grace
2 2 jack
2 2 john
2 2 lucy

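Multi-column partitioning

-- Create a staging table that matches the layout of the raw file
create table students_tmp (
    grade int,
    class int,
    name  string
) row format delimited fields terminated by ' ';
-- Load the raw data
load data local inpath '/opt/hive_data/students' into table students_tmp;
-- Create the partitioned table; partition columns are not repeated in the column list
create table students (
    name string
) partitioned by (grade int, class int);
-- Allow every partition column to be dynamic (strict mode requires at least one static partition)
set hive.exec.dynamic.partition.mode = nonstrict;
-- Dynamic-partition insert: the partition columns come last in the select list,
-- in the same order as in the partitioned by clause
insert into students partition (grade, class)
select name, grade, class
from students_tmp distribute by grade, class;
-- Look at a few rows
select * from students tablesample (5 rows);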
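
To check the nested directory structure described above, the partitions can be listed and queried directly. This is a minimal sketch that assumes the `students` table from the example has been populated; the warehouse path mentioned in the comment is the usual default and is an assumption that may differ on your cluster.

```sql
-- List the partitions created by the dynamic-partition insert;
-- each one is printed in the form grade=<g>/class=<c>
show partitions students;

-- Filtering on the partition columns lets Hive read only the matching directory,
-- e.g. .../students/grade=1/class=2 under the warehouse directory
-- (/user/hive/warehouse by default; assumed here)
select name
from students
where grade = 1 and class = 2;

-- Partition columns can be used like ordinary columns in queries
select grade, class, count(*) as student_cnt
from students
group by grade, class;
```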

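Bucketed Tables

When the data set is large but only a rough, quick analysis is needed, sampling the data is an option. The column used for sampling should not be correlated with the column being analyzed, otherwise the sample is biased.
Hive offers many ways to sample, and bucketing is one of them: Hive computes the hash of the bucketing column, takes it modulo the number of buckets, and uses the remainder to decide which bucket a row goes into.
Note: in older Hive versions a bucketed table cannot be populated with load; the data has to be written with insert so that it is actually bucketed. Hive 3.x (3.1.3, for example) accepts load on a bucketed table, but it is very slow and the data may end up not bucketed.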
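
The hash-then-modulo rule can be tried out with Hive's built-in `hash` and `pmod` functions. This is only a sketch of the idea: the bucket count 4 and the sample names are arbitrary, and `hash()` is not guaranteed to be the same hash function Hive applies internally when it writes bucket files (that depends on the Hive version and the table's bucketing version), so the numbers may not match the actual bucket files.

```sql
-- pmod keeps the remainder non-negative even when hash() returns a negative int
select pmod(hash('amy'), 4) as bucket_no;

-- The same value always maps to the same bucket number,
-- so both 'amy' rows below get an identical bucket_no
select name, pmod(hash(name), 4) as bucket_no
from (
    select 'amy' as name
    union all
    select 'bob' as name
    union all
    select 'amy' as name
) t;
```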

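Example

-- Hive 1.x does not enforce bucketing by default, so it has to be switched on first.
-- Hive 2.0 and later always enforce bucketing, and this setting is no longer needed there.
set hive.enforce.bucketing = true;
-- Preview the source table (heros is assumed to exist with columns id, name, country)
select *
from heros;
-- Create the bucketed table
-- With n buckets the insert runs n ReduceTasks and writes n result files,
-- so more buckets means more ReduceTasks and more cluster resources
create table hero_buckets (
    id      int,
    name    string,
    country string
) clustered by (name) into 4 buckets
    row format delimited fields terminated by ' ';
-- Insert data into the bucketed table
-- Rows are bucketed by name: Hive hashes the name value, takes it modulo the bucket count,
-- and writes the row to the bucket with that remainder
insert overwrite table hero_buckets
select id, name, country
from heros;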
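-- Sample from the bucketed table
-- Syntax: bucket x out of y on col
-- Per the Hive language manual, rows are hashed on col into y groups numbered 1 to y,
-- and the rows that fall into group x are returned
-- bucket 1 out of 2 therefore returns roughly half of the rows
-- y should be a multiple or a factor of the bucket count so that the sample lines up with whole buckets
select *
from hero_buckets tablesample (bucket 1 out of 2 on name);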
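
A few variations on the sampling query can make the roles of x and y easier to see. This sketch assumes the `hero_buckets` table above has been populated; actual row counts depend on how the names hash, so no expected output is given.

```sql
-- Confirm the bucketing definition (look for "Num Buckets" and "Bucket Columns")
describe formatted hero_buckets;

-- One group out of four: roughly a quarter of the rows
select * from hero_buckets tablesample (bucket 1 out of 4 on name);

-- Groups are numbered from 1, so this returns the other group of the two
select * from hero_buckets tablesample (bucket 2 out of 2 on name);

-- Compare the sample size with the full table
select count(*) as sampled_rows from hero_buckets tablesample (bucket 1 out of 2 on name);
select count(*) as total_rows from hero_buckets;
```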

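Data Types

Overview

Hive provides many data types, which fall into two groups: primitive types and complex types.
Primitive types

Hive type    Java type
tinyint      byte
smallint     short
int          int
bigint       long
float        float
double       double
boolean      boolean
string       String
binary       byte[]
timestamp    Timestamp
There are three main complex types: array, map, and struct.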
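
As a small illustration of the primitive types listed above, the sketch below declares a table that uses several of them and converts between types with `cast`. The table `type_demo` and its columns are hypothetical and not part of the original notes.

```sql
-- Hypothetical table using several primitive types
create table type_demo (
    id         bigint,
    age        tinyint,
    score      double,
    passed     boolean,
    note       string,
    created_at timestamp
);

-- Explicit conversions between primitive types
select cast('42' as int)         as str_to_int,
       cast(42 as string)        as int_to_str,
       cast('3.14' as double)    as str_to_double;
```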

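array type

array: an ordered list of elements, corresponding to an array or a collection (such as List) in Java

Example

Raw data (battle id, group A members, group B members)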

1 amy,bob tom,simon,peter
2 lucy,lily,jack thomas,tony
3 perl,john alex,adair,dell
4 hack,henry vincent,william,vivian

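Creating and querying the table

-- Create the table
create table battles (
    battle_id int,
    group_a   array<string>,
    group_b   array<string>
) row format delimited
    fields terminated by ' ' -- columns are separated by a space
    collection items terminated by ','; -- array elements are separated by a comma
-- Load the data
load data local inpath '/opt/hive_data/battles' into table battles;
-- Query all rows
select *
from battles;
-- Get the members of group A
select group_a from battles;
-- Get the first member of group A (array indexes start at 0)
select group_a[0] from battles;
-- Get the third member of group A, skipping rows that have fewer than three members
select group_a[2] from battles where group_a[2] is not null;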
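
Beyond index access, Hive's built-in collection functions cover the most common array operations. The queries below are a sketch against the `battles` table from the example; the alias names are chosen only for illustration.

```sql
-- Number of members in each group
select battle_id, size(group_a) as a_cnt, size(group_b) as b_cnt
from battles;

-- Battles in which 'lucy' fought in group A
select battle_id
from battles
where array_contains(group_a, 'lucy');

-- Turn each array element into its own row
select battle_id, member
from battles
lateral view explode(group_b) g as member;
```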

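map type

map: a collection of key-value pairs, corresponding to a Map in Java

Example

Raw data (id, member A as name,age, member B as name,age)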

1 amy,19 lucy,18
2 david,18 alex,19
3 henry,18 hack,18

Creating and querying the table

-- Create the table
create table members (
    id    int,
    mem_a map<string,int>,
    mem_b map<string,int>
) row format delimited
    fields terminated by ' '
    map keys terminated by ','; -- key and value are separated by a comma
-- Load the data
load data local inpath '/opt/hive_data/members' into table members;
-- Query all rows
select * from members;
-- Get member B of every row
select mem_b from members;
-- Look up the value stored under the key 'hack', skipping rows that do not have it
select mem_b['hack'] from members where mem_b['hack'] is not null;
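
Maps can also be taken apart with Hive's built-in functions instead of looking up one key at a time. The queries below are a sketch against the `members` table from the example; the alias names are chosen only for illustration.

```sql
-- Keys and values as separate arrays
select id, map_keys(mem_a) as a_names, map_values(mem_a) as a_ages
from members;

-- Number of entries in each map
select id, size(mem_b) as b_entries
from members;

-- One row per key-value pair
select id, name, age
from members
lateral view explode(mem_b) m as name, age;
```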