Hadoop Hive data tables

光军233

Tags: hadoop, hive, big data

Article link: https://blog.csdn.net/qq_18628605/article/details/132657047

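Partitioned tables

When a table holds too much data, it can be split by some rule into smaller files so that queries only touch the smaller files; each partition is a directory.
Partitioned tables can noticeably improve Hive query performance, because a query that filters on the partition column only has to read the matching partition directories.
Partitions are physically separate in storage.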

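Multi-level partitions

Hive supports using several columns as partition columns; the partitions are nested in a hierarchy.
For example:

year
- month
  - week

The data is partitioned level by level in this order.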

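Basic syntax for partitioned tables

create table table_name(...) partitioned by (partition_col col_type, ...) row format delimited fields terminated by '';

Each partition column is one level of the hierarchy. The field delimiter goes between the quotes after terminated by.
Example: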

create table score (s_id string, c_id string, s_score integer) partitioned by (month string) row format delimited fields terminated by '\t';

This creates a score table partitioned by month.
The table has one more column than the data file: the partition column, named month.

create table score2 (s_id string, c_id string, c_score integer) partitioned by (year string, month string, day string) row format delimited fields terminated by '\t';

This creates a table with multiple partition levels.
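
To illustrate how the three levels map to directories, here is a minimal sketch that loads one file into a single year/month/day partition of score2. Reusing the sample file /export/server/hivedatas/score.txt and the partition values '2020', '06', '01' are assumptions made only for illustration.

```sql
-- Assumption: the sample file is reused and the partition values are made up.
load data local inpath '/export/server/hivedatas/score.txt'
into table score2 partition (year='2020', month='06', day='01');

-- List the partitions; each one is shown as a nested path
-- such as year=2020/month=06/day=01.
show partitions score2;
```

Each partition column contributes one directory level, in the order given in partitioned by.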

load data local inpath '/export/server/hivedatas/score.txt' into table score partition (month='202006');

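This loads data into the partitioned table.
The rows come from the file that is read; the value of the partition column comes from the partition clause.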
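
Once a partition has been loaded, it can be inspected and queried. The following is a small sketch against the score table above, assuming the month='202006' partition has been loaded as shown.

```sql
-- List the partitions that exist for the table.
show partitions score;

-- Filtering on the partition column lets Hive read only the month=202006 directory.
select s_id, c_id, s_score
from score
where month = '202006';

-- The partition column can be selected and grouped like any other column.
select month, count(*) as row_count
from score
group by month;
```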

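Bucketed tables

Partitioning: splits a table into separate subdirectories.
Bucketing: splits a table into a fixed number of files.

Whatever the input looks like, the data is split into a fixed number of files; for example, with 3 buckets the data is split into 3 files.
A table can be both partitioned and bucketed: the data is partitioned first, and the files inside each partition directory are then bucketed.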
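
As a sketch of combining the two, the table below is partitioned by month and bucketed within each partition. The table name score_pb and the choice of s_id as the bucketing column are hypothetical and not taken from the original notes.

```sql
-- Hypothetical table: partitioned by month, then bucketed by s_id
-- into 3 files inside each partition directory.
create table score_pb (s_id string, c_id string, s_score int)
partitioned by (month string)
clustered by (s_id) into 3 buckets
row format delimited fields terminated by '\t';
```

The partitioned by clause comes before clustered by, which mirrors the storage layout: directories first, bucket files inside them.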

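Creating a bucketed table

Enable automatic bucketing enforcement

set hive.enforce.bucketing=true;

This makes the number of reduce tasks match the number of buckets automatically. The setting applies to Hive 0.x and 1.x; from Hive 2.x on, bucketing is always enforced and the setting is not needed.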

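Create the bucketed table

create table course(c_id string, c_name string, t_id string) clustered by (c_id) into 3 buckets row format delimited fields terminated by '\t';

This creates a course table clustered by the c_id column: rows are assigned to buckets based on c_id, with three buckets in total.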

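Loading data into a bucketed table

Data cannot be loaded into a bucketed table with load data; it has to go through insert select.

So a bucketed table is usually populated in two steps:

Create an intermediate regular table and load the data into it with load data.
Use insert select to copy the data from the intermediate table into the bucketed table.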

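create table course_common(c_id string, c_name string, t_id string) row format delimited fields terminated by '\t';
-- create a regular (non-bucketed) table
load data local inpath 'file_path' into table course_common;
-- load data into the regular table; replace file_path with the actual path of the data file
insert overwrite table course select * from course_common cluster by (c_id);
-- load the data from the regular table into the bucketed table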

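When inserting into a bucketed table, add cluster by at the end of the select. Note that the insert uses cluster by, not the past-tense clustered by used in create table.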

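Why load data cannot be used

When data is written to the bucketed table it is split three ways; rows are assigned to buckets by hashing the value of the bucketing column.
load data does not start a MapReduce job, so no computation is performed on the data and the hash cannot be evaluated; it only moves files, so it cannot populate a bucketed table.
The hash function maps the value of the clustered by column to an integer (this is hashing, not encryption), and the result is taken modulo the number of buckets.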
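
The hash-then-modulo idea can be tried out directly with Hive's built-in hash() and pmod() functions. This is only an illustration: the hash function Hive uses internally when writing bucket files can differ between versions, so the computed bucket_no is not guaranteed to match the actual bucket file of a row.

```sql
-- Illustration only: compute a bucket number per row with hash() and pmod().
-- Assumption: 3 buckets, as in the course table.
select c_id,
       pmod(hash(c_id), 3) as bucket_no
from course_common;
```

Rows with the same c_id always get the same bucket_no, which is the property that bucketing relies on.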

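Performance benefits of bucketed tables

Operations keyed on the bucketing column can benefit, for example filtering, joins, and grouping.
A bucketed table is effectively already grouped by the bucketing column, since rows with the same value always land in the same bucket.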
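
One operation that can use the bucket layout directly is sampling. The sketch below reads one of the three buckets of the course table; because the on column matches the clustered by column, Hive only needs to scan that bucket.

```sql
-- Read only the first of the 3 buckets, bucketing on c_id.
select * from course tablesample (bucket 1 out of 3 on c_id);
```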

Modifying tables

Rename a table

alter table _old_name rename to _new_name;

Set table properties

alter table table_name set tblproperties ('_property_name' = '_property_value');
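
As a concrete sketch, table properties are key/value string pairs attached to the table. Using 'comment' as the key and the score table as the target are assumptions chosen for illustration.

```sql
-- Assumption: 'comment' is used as the property key purely for illustration.
alter table score set tblproperties ('comment' = 'student scores partitioned by month');

-- Show the properties currently set on the table.
show tblproperties score;
```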

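Add a partition

alter table tbl_name add partition (_part_name = 'partition_value');

The new partition is empty; it amounts to creating a new directory, and data has to be added to it manually.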
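
A possible way to add and then fill a partition of the score table is sketched below. The partition value '202007' and the file path are hypothetical.

```sql
-- Add an empty partition; "if not exists" avoids an error when it is already there.
alter table score add if not exists partition (month='202007');

-- Fill the new partition.
-- Assumption: a hypothetical local file with the same tab-separated layout.
load data local inpath '/export/server/hivedatas/score_202007.txt'
into table score partition (month='202007');
```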

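Change a partition value

alter table tbl_name partition (_part_name = 'old_value') rename to partition (_part_name = 'new_value');

The partition name is updated in the Hive metadata (the records kept in the relational database behind the metastore). For an external table the HDFS directory keeps its old name; for a managed (internal) table Hive normally moves the directory to match the new name.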

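Drop a partition

alter table tbl_name drop partition (_part_name = 'partition_value');

The partition is removed from the Hive metadata. For an external table the corresponding HDFS directory still exists; for a managed (internal) table the partition's data is removed as well.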
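
A short sketch of dropping a partition of the score table and checking the result, assuming the month='202006' partition from the earlier example:

```sql
-- "if exists" makes the statement a no-op when the partition is absent.
alter table score drop if exists partition (month='202006');

-- Confirm that the partition is gone from the metadata.
show partitions score;
```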

Regular table operations

Add columns

alter table table_name add columns (v1 int, v2 string);

Rename a column

alter table tbl_name change v1 v1new int;

The column type must stay the same.
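
Applied to the score table, a rename that keeps the type unchanged could look like the following. The new name score_value is hypothetical.

```sql
-- Rename s_score to score_value, keeping the int type.
alter table score change s_score score_value int;

-- Inspect the resulting column list.
describe score;
```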

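Drop a table

drop table tbl_name;

Truncate a table

truncate table tbl_name;

This only works on managed (internal) tables. An external table merely points to files stored elsewhere, so it cannot be truncated this way and the external files are not affected.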
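
Before truncating, the table type can be checked from the table's metadata. A minimal sketch using the score table:

```sql
-- Look for the "Table Type" row: MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted score;

-- For a managed table, truncate removes all rows but keeps the table definition.
truncate table score;
```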
