hive分区、分桶、sql语句

最新推荐文章于 2024-08-05 13:26:31 发布

wtzhm

最新推荐文章于 2024-08-05 13:26:31 发布

阅读量6.7k

点赞数 1

分类专栏： hive 文章标签： hive 分桶 hive分区 hive sql语句

本文链接：https://blog.csdn.net/wtzhm/article/details/79077878

版权

hive 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

hive分区、分桶、sql语句

1. sql语句

1.1 创建表

CREATE TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[fields terminated by ] 
[STORED AS file_format] SEQUENCEFILE|TEXTFILE|RCFILE
//如果文件数据是纯文本，可以使用 STORED AS TEXTFILE。如果数据需要压缩，使用 STORED AS SEQUENCEFILE。
[LOCATION hdfs_path]  //文件的路径
[collection items terminated by]  //集合之间的元素分隔
[map keys terminated by]  //map的可以和value之间的分隔

eg1: 普通表
create table stu_buck(Sno int comment ‘学号’,Sname string comment ‘姓名’,Sex string,Sage int,Sdept string)
comment ‘个人信息表’ sorted by(Sno DESC)
row format delimited //以行为单位 -->划定界限的
fields terminated by ‘,’; //“，”间隔字段 -->终止

eg2: 带集合的例子

	create table tb_test(
		    id      int
		   ,name    string
		   ,hobby   array<string>
		   ,addr     map<string,string>
		)
		row format delimited
		fields terminated by ','
		collection items terminated by '-'
		map keys terminated by ':'
		;
    插入的数据格式：
    1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
	2,lilei,book-code,nanjing:jiangning-taiwan:taibei
	3,lihua,music-book,heilongjiang:haerbin

1.2 Load 插入数据

 ```
语法结构：
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO 
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

说明：
1、Load 操作只是单纯的复制/移动操作，将数据文件移动到 Hive 表对应的位置。

2、filepath：
相对路径，例如：project/data1 
绝对路径，例如：/user/hive/project/data1 
包含模式的完整 URI，列如：
hdfs://namenode:9000/user/hive/project/data1

3、LOCAL关键字
如果指定了 LOCAL， load 命令会去查找本地文件系统中的 
filepath。如果没有指定 LOCAL 关键字，则根据inpath中的uri[如果指定了 LOCAL，那么：load 命令会去查找本地文件系统中的 filepath。如果发现是相对路径，则路径会被解释为相对于当前用户的当前路径。load 命令会将 filepath中的文件复制到目标文件系统中。目标文件系统由表的位置属性决定。被复制的数据文件移动到表的数据对应的位置。

如果没有指定 LOCAL 关键字，如果 filepath 指向的是一个完整的 URI，hive 会直接使用这个 URI。 否则：如果没有指定 schema 或者 authority，Hive 会使用在 hadoop 配置文件中定义的 schema 和 authority，fs.default.name 指定了 Namenode 的 URI。 
如果路径不是绝对的，Hive 相对于/user/进行解释。Hive 会将 filepath 中指定的文件内容移动到 table （或者 partition）所指定的路径中。]查找文件

4、OVERWRITE 关键字
如果使用了 OVERWRITE 关键字，则目标表（或者分区）中的内容会被删除，然后再将 filepath 指向的文件/目录中的内容添加到表/分区中。 如果目标表（分区）已经有一个文件，并且文件名和filepath 中的文件名冲突，那么现有的文件会被新文件所替代。

eg: load data local inpath '/hadoopdata/hivetest/test' into table test1;

1.3 select 语句

语法结构
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[CLUSTER BY col_list 
  | [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list] 
] 
[LIMIT number]

1、order by 会对输入做全局排序，因此只有一个reducer，会导致当输入规模较大时，需要较长的计算时间。
2、sort by不是全局排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序。
3、distribute by根据distribute by指定的内容将数据分到同一个reducer。
4、Cluster by 除了具有Distribute by的功能外，还会对该字段进行排序。因此，常常认为cluster by = distribute by + sort by

1.4 其他sql语句

创建数据库
如果库不存在就创建，注释，指定数据库在hdfs上的路径
create database if not exists t1 comment ‘learning hive’ location ‘/etljob/warehouse/t1.db’ ;
查看数据库
查看有哪些数据库：show databases;
查看具体书库的详细属性信息： show database t1 ;
查看正在使用的数据库： select current_database();
查看创建数据库语句： show create database t1；
删除数据库
删除不带表的数据库： drop database t1;
删除带表的数据库：默认情况下，hive 不允许删除包含表的数据库，使用 cascade 关键字。
drop database if exists dbname cascade;
使用CTAS创建表
create table student_ctas as select * from student where id < 95;
复制表结构
create table student_copy like student
查看数据库中以xxx开头的表
show tables like ‘student_c*’;
查看表的详细信息（格式友好）
desc formatted student;
修改表名
alter table student rename to new_student;
增加一个字段（注意不支持删除字段）
alter table new_student add columns (score int);
修改一个字段的定义
alter table new_student change name new_name string;
添加分区
添加分区时，必须有分区字段eg: city是分区字段，如果没有分区字段，无法添加分区值
alter table student_ptn add partition(city=“chongqing”);
查看分区
show partitions student_ptn;
-删除分区
alter table student_ptn drop partition ( city=“chongqing”)
删除表
drop table new_student;
清空表
truncate table student_ptn;

2. Hive表分区

2.1 为什么要创建分区表?

select查询中会扫描整个表内容，会消耗大量时间。由于相当多的时候人们只关心表中的一部分数据.

2.2 建表的语法

PARTITIONED BY指定按照什么分区，一个表可以拥有一个或者多个分区，每个分区以文件夹的形式单独存在表文件夹的目录下。

2.3 单表分区与多表分区

单表分区:create table partition_table (id int, content string) partitioned by (day string)单分区表，按天分区，在表结构中存在id，content，day三列

注意1：创建分区表的hql语句的时候，partitioned by ，必须紧跟表的字段后面
注意2：partitioned by (分区字段）在 create table partition_table (id int, content string)里面是没有该字段的，分区字段，直接在partitioned by(这里面指定）
create table partition_table (id int, content string) row format delimited fields terminated by ‘,’ partitioned by (day string) 这样就是错误的。

多表分区: create table partition_table (id int, content string) partitioned by (day string, hour string);双分区表，按天和小时分区，在表结构中新增加了day和hour两列。

2.4 添加分区

注意：表已创建该字段的分区，才能添加分区值，在此基础上添加分区:

ALTER TABLE table_name ADD
partition_spec [ LOCATION ‘location1’ ]
partition_spec [ LOCATION ‘location2’ ] …
eg: alter table partition_table add partition(day=‘2018-1-16’);

eg: 重命名分区：alter table sp_trans2 partition (dt=‘2016-01-13’,shop=‘001’) rename to partition (dt=‘2016-01-13’,shop=‘000’)

2.5 删除分区

删除分区语法：ALTER TABLE table_name DROP
partition_spec, partition_spec,…
eg: alter table partition_table drop partition (day=‘2018-1-16’);

2.6 数据加载进分区表中语法

LOAD DATA [LOCAL] INPATH ‘filepath’ [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …
指定分区加载数据：
eg load data local inpath ‘/hadoopdata/hive/test.txt’ into table partition_table partition (day=‘2018-1-15’);

2.7 查看分区

show partitions partition_table;

2.8 静态分区与动态分区

分区表有静态分区和动态分区两种。若分区的值是确定的，那么称为静态分区字段，反之，若分区的值是非确定的，那么称之为动态分区字段。默认是采用静态分区。 ---->将查询结果插入Hive表
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …)] select_statement1 FROM from_statement
eg: insert overwrite table student partition (dy=‘2018-1-15’) select id,age,name form student where stat_date=‘2018-1-12’;

2.9 导出表数据

eg: insert overwrite local directory /hadoopdata/hive/testdata/student1’ select * from student;

3 分桶

3.1 概念

分桶是相对分区进行更细粒度的划分。分桶将整个数据内容安装某列属性值得hash值进行区分，如要安装name属性分为3个桶，就是对name属性值的hash值对3取摸，按照取模结果对数据分桶。如取模结果为0的数据记录存放到一个文件，取模为1的数据存放到一个文件，取模为2的数据存放到一个文件。桶对应于MapReduce的输出文件分区：一个作业产生的桶（输出文件）和reduce任务个数相同

3.2 创建分桶表

create table t_buck(id string,name string)
clustered by (id) sort by(id) into 4 buckets;
指定了根据id分成4个桶,最好的导入数据方式是insert into table.
注意：要开启模式开关
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
查询时cluster by指定的字段就是partition时分区的key

3.3 分桶的作用

join操作：join操作两个表有一个相同的列，如果对这两个表都进行了桶操作，那么将保存相同列值的桶进行join操作，可以大大减少join的数据量.
高效查询效率：当where语句后面的条件是分桶字段，在查询的时候根据字段的值取hash值，然后按照取模结果对数据分桶，到指定的桶去查询。不需要遍历整个文件。

3.4 分区又分桶

create table logs(ts int ,line string) partitioned by (dt String) clustered by (ts) into 4 buckets row format delimited fields terminated by ‘,’;
分区之后继续分桶，我们在hdfs文件系统上看不出分桶的多个数据表文件，只能看见一个文件，但是能从文件路径上看出分区的信息。
注意：clustered by 必须紧跟表的字段后面

4.问题总结

###4.1 包不兼容异常 ###
hive 运行删除表语句出现异常：“we don’t support retries at the client level"问题

情况一：mysql-connector-java与hive的版本不兼容，升级mysql-connector-jave
情况二：mysql字符集问题：修改mysql的字符集，alter database hive character set latin1;
###4.2 hadoop集群时间不同步问题###
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container
This token is expired. current time is 1526411235575 found 1516159081055
出现这个问题一般是因为hadoop集群中出现了时间不一致问题，同步时间，再运行,ok
hadoop集群同步时间
查看当前时间： date
选择时区，在每台机子上都要执行： cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
网络同步： ntpdate cn.pool.ntp.org ,如果ntp命令不存在，centos 安装 ntpdate 并同步时间
在命令行中做如下操作，来安装ntpdate，yum install -y ntp
或者本地设置：将时间设置为2014年6月18日14点16分30秒（MMDDhhmmYYYY.ss）---- date 0618141614.30

wtzhm

关注

1
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
hive分区、分桶、sql语句

hive分区、分桶、sql语句1. sql语句1.1 创建表CREATE TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)][COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT
复制链接

扫一扫

专栏目录