Hive syntax: Hive DDL, partitioning and bucketing

Latest recommended article published on 2023-12-21 14:23:50

LIUERTOU

Category: hive  Tags: hive

Article link: https://blog.csdn.net/LIUERTOU/article/details/120778340

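Syntax in Hive

Hive data types

Primitive data types

Numeric
int, float (also tinyint, smallint, bigint, double, decimal)
String
string
Date
date
Boolean
boolean (true/false)

Complex data types

array: an array of elements
map: key-value pairs
struct: a structure, {int, string, …}
uniontype: a union of several types

Data type conversion

Implicit conversion
Explicit conversion
select cast('100' as int);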
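As a rough illustration of the two kinds of conversion, the statements below cast values explicitly and also let Hive convert a value implicitly. The literal values are made up for the example. Note that a cast that cannot succeed returns NULL in Hive rather than raising an error.

```sql
-- Explicit conversion: string to int
select cast('100' as int);

-- An explicit cast that cannot succeed yields NULL instead of failing
select cast('abc' as int);

-- Implicit conversion: the string operand is converted to a number
-- before the addition is evaluated
select '100' + 1;

-- Explicit conversion between other primitive types
select cast('2021-10-15' as date), cast(3.7 as int);
```

Explicit casts are usually the safer choice when loading text data, because the NULL result makes bad values easy to find with a `where ... is null` filter.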

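Reading and writing Hive files

Definition: the mapping between file data in Hadoop and Hive tables

# Default SerDe, you only specify the delimiters
row format delimited
# Specify a SerDe class yourself
row format serde

Read process: deserialization

Mapping file data onto a table:
Hive converts the SQL statement into a MapReduce (MR) program, locates the matching data files, splits each record according to the format declared for the table, and returns the data as the declared columns.

Write process: serialization

Writing table data to a file:
The data is written to the file in the declared format.

Specifying the data format

The format specification makes Hive read and write data in the declared format.

Delimiter between fields
zhangsan wangwu
Delimiter between collection elements
zhangsan, gender:boy-age:18
Delimiter between a map key and its value
key : value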
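The three delimiter levels above can be sketched with one small table. The table name tb_person_info and its columns are hypothetical, and the sketch assumes the data line is written without a space after the comma, that is, zhangsan,gender:boy-age:18.

```sql
-- Hypothetical table for a line such as: zhangsan,gender:boy-age:18
create table tb_person_info
(
    name string,
    info map<string,string>
) row format delimited
    fields terminated by ','            -- separates name from info
    collection items terminated by '-'  -- separates the entries of the map
    map keys terminated by ':';         -- separates each key from its value

-- Once a file containing that line is loaded, the map is read by key
select name, info['gender'], info['age']
from tb_person_info;
```

During a read (deserialization) Hive applies the delimiters from the outside in: first the field delimiter, then the collection delimiter, then the map key delimiter.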

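Dropping and altering databases and tables


-- Force-drop a database that is not empty
drop database hive cascade;


-------------------------------------------

-- Convert an external table to a managed (internal) table
alter table team_player
    set tblproperties ('EXTERNAL' = 'FALSE');

-- team_player is now a managed table, so dropping it also deletes its data
drop table team_player;
-- Convert a managed table to an external table
alter table students_txt
    set tblproperties ('EXTERNAL' = 'TRUE');

-- The "Table Type" field shows MANAGED_TABLE or EXTERNAL_TABLE
desc formatted students_txt;

-- Drop the external table students_txt (only the metadata is removed)
drop table students_txt;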

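Hive DDL

① Create the table and declare how its data is parsed first; only then can the data be queried or modified.

-- Define the table columns and specify the field delimiter
create table tb_archer
(
    id           int,
    name         string comment 'hero name',
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range int,
    role_main    string,
    role_assist  string
) row format delimited fields terminated by '\t';

-- The clause that sets the field delimiter:
-- fields terminated by

-- After uploading the file, view the table contents
select * from tb_archer;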
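The upload step mentioned above can also be done from inside Hive with load data. The sketch below reuses the local path /root/hero/archer.txt that appears later in this article; the HDFS path in the second statement is a hypothetical placeholder.

```sql
-- Option 1: copy a file from the local file system of the machine
-- that runs the Hive service into the table's directory
load data local inpath '/root/hero/archer.txt' into table tb_archer;

-- Option 2: move a file that is already on HDFS (hypothetical path)
load data inpath '/tmp/hero/archer.txt' into table tb_archer;

-- Check that the tab-separated columns were parsed as expected
select * from tb_archer limit 5;
```

If every column except the first comes back as NULL, the declared delimiter most likely does not match the one used in the file.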

----------------------------------------
-- Create a table with a complex data type (map)
create table tb_hot_hero_skin_price
(
    id         int,
    name       string,
    win_rate   int,
    skin_price map<string,int>
) row format delimited fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';
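Once data is loaded, the map column can be queried with Hive's built-in collection functions. This is a sketch only; the sample data line in the comment and the filter threshold are assumptions, not taken from the original data set.

```sql
-- Assumed shape of one data line:
--   1,hero_a,53,skin_a:888-skin_b:1688

select name,
       skin_price,               -- the whole map
       map_keys(skin_price),     -- array of skin names
       map_values(skin_price),   -- array of prices
       size(skin_price)          -- number of skins
from tb_hot_hero_skin_price;

-- Heroes that have more than two skins
select name
from tb_hot_hero_skin_price
where size(skin_price) > 2;
```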

---------------------------------------
-- Create a table with the default field delimiter ('\001', Ctrl-A)
create table tb_team_ace_player
(
    id              int,
    team_name       string,
    ace_player_name string
);

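Hive table types

Managed (internal) tables
The table and its columns are created before any data file exists, and the data is then loaded into the table's directory.

Any table created without the external keyword is a managed table by default.

Hive manages both the metadata and the data of a managed table; once the table is dropped, both are deleted.

External tables
The data already exists in the warehouse storage, and the table is created on top of it. Because the data is not stored under the default path, location is needed to point at it.

The keyword for external tables is external.

Hive manages only the metadata of an external table; dropping the table removes the metadata but leaves the data files on HDFS.

Creating an external table


-- External table: the external keyword, with location pointing at the data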
create external table student_txt
(
    id   int,
    name string,
    sex  string,
    age  int,
    dept string
)
    row format delimited fields terminated by ','
location '/python';
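The difference between the two table types shows up when the table is dropped. The following sketch assumes /python already contains comma-separated data files and that the statements run in the Hive CLI or Beeline, where the dfs command is available.

```sql
-- "Table Type" should read EXTERNAL_TABLE
desc formatted student_txt;

-- Dropping an external table removes only its metadata
drop table student_txt;

-- The data files are expected to still be listed here
dfs -ls /python;

-- Re-creating the table over the same location makes the data queryable again
create external table student_txt
(
    id   int,
    name string,
    sex  string,
    age  int,
    dept string
)
    row format delimited fields terminated by ','
location '/python';

select count(*) from student_txt;
```

Running the same drop against a managed table would delete the directory as well, which is why shared raw data is usually exposed through external tables.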

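Partitioned tables

Partitioning splits the data files into separate directories. A query can name the partition it needs and read only that directory.

Keyword: partitioned by (partition_column column_type)
A partition column must not have the same name as a column defined in the table.

Depending on how the data is loaded, partitioning is either static or dynamic.

Notes on partitioned tables

Partitioning is not required when creating a table; it is an optional optimization.
A partition column cannot be an existing column of the table; names must not repeat.
A partition column is a virtual column; its values are not stored in the underlying files (they come from the partition directory names).
Hive supports multi-level partitioning, that is, partitioning further within a partition for finer granularity.

Loading data with static partitioning

load data local inpath '/root/<path to file>' into table table_name partition (partition_column = 'partition_value');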

-- Create the partitioned table
create table if not exists tb_hero_part(
    id           int,
    name         string comment 'hero name',
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range int,
    role_main    string,
    role_assist  string
    )partitioned by (role string) row format delimited fields terminated by ',';
-----------------------------------------
-- Show the table details
desc formatted tb_hero_part;
-- List the partitions of the table
show partitions tb_hero_part;
------------------------------------------

-- Load each file into a manually specified (static) partition
load data local inpath '/root/hero/archer.txt' into table tb_hero_part partition (role = 'sheshou');
load data local inpath '/root/hero/assassin.txt' into table tb_hero_part partition (role = 'cike');
load data local inpath '/root/hero/mage.txt' into table tb_hero_part partition (role = 'fashi');
load data local inpath '/root/hero/support.txt' into table tb_hero_part partition (role = 'fuzhu');
load data local inpath '/root/hero/tank.txt' into table tb_hero_part partition (role = 'tanke');
load data local inpath '/root/hero/warrior.txt' into table tb_hero_part partition (role = 'zhanshi');

-- Filter on the partition column
select count(*) from tb_hero_part where hp_max > 6000 and role_main='archer' and role='sheshou';
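To see what the partition filter buys, the two counts below can be compared. This is a sketch that assumes the six load statements above have been run; explain prints the query plan, whose exact layout varies between Hive versions.

```sql
-- No partition filter: every partition directory has to be scanned
select count(*)
from tb_hero_part
where hp_max > 6000 and role_main = 'archer';

-- With the partition filter: only the role=sheshou directory is read
select count(*)
from tb_hero_part
where hp_max > 6000 and role = 'sheshou';

-- Inspect the plan of the pruned query
explain
select count(*)
from tb_hero_part
where hp_max > 6000 and role = 'sheshou';

-- The partition column behaves like an ordinary column in queries
select role, count(*) as heroes
from tb_hero_part
group by role;
```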

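Loading data with dynamic partitioning

insert into table table_name partition (partition_column) select * from tmp_table;
Hive derives the partition from the data itself and creates the matching partition directories and data files automatically.

-- Enable dynamic partitioning (on by default in recent Hive versions)
set hive.exec.dynamic.partition=true;
-- Allow all partition columns to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

-- Create the table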
create table t_all_hero_part_d
(
    id           int,
    name         string,
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range string,
    role_main    string,
    role_assist  string
) partitioned by (role string)
    row format delimited
        fields terminated by "\t";

-- Load the partitions dynamically; the last selected column supplies the value of role
insert into table t_all_hero_part_d partition (role)
select th.*, th.role_assist
from tb_heros as th;
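After the dynamic insert, the generated partitions can be checked as sketched below. This assumes tb_heros, which is not defined in this article, is an unpartitioned staging table with the same eight columns, and that the insert above has completed.

```sql
-- One partition per distinct value of the last selected column
show partitions t_all_hero_part_d;

-- Row count per generated partition
select role, count(*) as heroes
from t_all_hero_part_d
group by role;

-- The staging table and the partitioned table should hold the same number of rows
select count(*) from tb_heros;
select count(*) from t_all_hero_part_d;
```

Because the partition value is taken from the final column of the select list, changing that column (for example to role_main) changes how the rows are distributed across directories.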

Multi-level partitioned tables


-- Create a table with multiple partition levels
create table test_student
(
    id   int,
    name string,
    sex  string,
    age  int,
    dept string
) partitioned by (year string, month string,day string) row format delimited fields terminated by ',';

-- Load data into a multi-level partition
load data local inpath '/root/hero/archer.txt' into table test_student partition (year = '2012', month = '01', day = '10');
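A multi-level partition becomes a nested directory such as year=2012/month=01/day=10, and a query can filter on any prefix of those levels. A minimal sketch, assuming the load above has been run:

```sql
-- Each partition is listed as a nested path, for example year=2012/month=01/day=10
show partitions test_student;

-- A partial partition spec lists only the partitions below that level
show partitions test_student partition (year = '2012');

-- Filtering on all three levels reads a single directory
select *
from test_student
where year = '2012' and month = '01' and day = '10';

-- Filtering on the first level only reads every month and day of that year
select month, day, count(*) as students
from test_student
where year = '2012'
group by month, day;
```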

Dropping partitions

alter table test_student drop partition (year = '2012');
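A partial partition spec drops every partition beneath it, so it is worth being explicit about the level being removed. The sketch below shows both cases; if exists is optional and only suppresses the error when nothing matches.

```sql
-- Drop one specific day
alter table test_student
    drop if exists partition (year = '2012', month = '01', day = '10');

-- Drop every month and day of 2012 in one statement
alter table test_student
    drop if exists partition (year = '2012');

-- Confirm what is left
show partitions test_student;
```

Since test_student is a managed table, dropping a partition also deletes the data files in that partition's directory.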

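Bucketed tables

Definition: bucketing splits the data at the column level, and the result is usually more even than partitioning.

Hive hashes the bucketing column, takes the remainder after dividing by the number of buckets, and puts rows with the same remainder into the same bucket.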
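The hash-then-remainder idea can be approximated with Hive's built-in hash and pmod functions. This is an illustration of the principle only: the state names are example values, 5 is an assumed bucket count, and the bucket number Hive assigns internally may differ in detail (for instance for negative hash values or between bucketing versions).

```sql
-- Remainder of the hash when divided by 5 buckets
select pmod(hash('California'), 5) as bucket_no;
select pmod(hash('Texas'), 5) as bucket_no;

-- Equal input values always produce the same remainder,
-- so rows with the same value end up in the same bucket
select pmod(hash('Texas'), 5) = pmod(hash('Texas'), 5) as same_bucket;
```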

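Benefits of bucketing

Queries that filter on the bucketing column avoid a full table scan.
Joins on the bucketing column make the MR job more efficient by reducing the number of row combinations (the Cartesian product) to compare.
Bucketed tables support data sampling.

Bucketing steps

-- Create the source table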
create table if not exists t_usa_covid19
(
    count_date string,
    county     string,
    state      string,
    fips       int,
    cases      int,
    deaths     int
)
    row format delimited fields terminated by ',';


-- Load the data to be bucketed into t_usa_covid19 (from the client)
-- Create the bucketed table
create table tb_usa_covid19_bucket_sort
(
    count_date string,
    county     string,
    state      string,
    fips       int,
    cases      int,
    deaths     int
) clustered by (state) sorted by (cases desc) into 5 buckets;

-- Populate the bucketed table
insert into tb_usa_covid19_bucket_sort
select *
from t_usa_covid19;
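With the bucketed table populated, the benefits listed earlier can be tried out as sketched below. The state value 'New York' is an assumed example from the data set.

```sql
-- On Hive 1.x, bucketing had to be switched on before the insert with
--   set hive.enforce.bucketing=true;
-- From Hive 2.0 on it is always enforced and the setting no longer exists.

-- Sampling: read roughly one fifth of the data by taking bucket 1 of 5
select *
from tb_usa_covid19_bucket_sort
    tablesample (bucket 1 out of 5 on state);

-- Filtering on the bucketing column; rows of one state live in one bucket,
-- sorted by cases in descending order
select *
from tb_usa_covid19_bucket_sort
where state = 'New York'
limit 10;
```

The table directory should contain five files after the insert, one per bucket.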
