hive内部表，外部表，分区表，桶表，函数及自定义函数

最新推荐文章于 2023-02-20 13:37:17 发布

卡奥斯道

最新推荐文章于 2023-02-20 13:37:17 发布

阅读量2.4k

点赞数

分类专栏： hive

版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。

本文链接：https://blog.csdn.net/kaaosidao/article/details/77622711

版权

hive 专栏收录该内容

19 篇文章 1 订阅

订阅专栏

Hive的概念：
Hive是大数据领域中的数据仓库的基础框架
Hive是一个SQL解析引擎
Hive操作方式：
hive -S -f hive-srcipt.hql > result.csv
hive -S -e "" > result.csv
Hive表的操作：
创建的方式和目前mysql一致
数据的加载方式：load data local inpath 'linux_path' into table tableName;
数据类型：
符合数据类型：
array ------>java中的array|arrayList
调用方式arr[index] index从0开始
map ------>java中的map
调用方式map["key"]
struct ------>java中的object
调用方式obj.property
分隔符：
行的默认分隔符：\n
列的默认分隔符：\001 ---->ctrl+v ctrl+a
array的默认分隔符：\002 ---->ctrl+v ctrl+b
map中的k和v之间的默认分隔符：\003 ---->ctrl+v ctrl+c

创建一张完整的包涵了各种数据类型的表，以及有相应自定义分隔符的表的定义：
create table emp(
empid int,
ename string,
salary float,
subordinate array<int>,
tax map<string, float>,
address struct<province:string, city:string, zip:int>
) row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

常见的分隔符：| , ## \t \001 \002 \003
不要出现：& $

两种数据加载的模式：
读模式
写模式

hive采用的是读模式
-----------------------------------------------------------------------------------------------
Hive表的分类
分为两种表：内部表和外部表
分为两种功能表：分区表和桶表

内部表
所谓的内部表，指的是表的数据的存在与否受到了表定义(元数据)的影响，当删除表定义的时候，表中的数据随之一并被删除。
内部表，MANAGED_TABLE，也罢内部表称之为受控表
外部表
所谓的外部表，指的是表的数据的存在不受表定义(元数据)的影响，当删除表定义的时候，表中的数据依然存在，表中的数据
相当于是表对数据的一个引用，当删除表的时候，删除了表和数据之间的一个引用关系。
外部表，EXTERNAL_TABLE
外部表的DDL：
create external table t5_external(
id int
);
删除外部表
drop table t5_external;
加载数据：
（不建议）load data local inpath '/opt/data/hive/hive-t5-external.txt' into table t5_external;
引用的方式：
alter table t5_external set location 'hdfs_path';
eg:
alter table t5_external set location '/input/hive';
在创建表的时候指定数据：
create external table t5_external_1(
id int
) location '/input/hive';
作用：
可以提高数据的安全性
通过对数据的引用，达到多部门共同使用同一份数据，而不用做冗余
在hive和外部框架进行集成的时候，多用外部表去关联操作相应框架，比如hbase
内部表和外部表之间进行转换：
内部表----->外部表
alter table t set tblproperties('EXTERNAL'='TRUE');
外部表----->内部表
alter table t set tblproperties('EXTERNAL'='FALSE');
分区表：
为了提高单表的处理能力，满足只操作一部分数据的需求，而不全量进行操作，所以产生了分区表。
分区需要指定分区字段及其对应的分区值。
DDL
创建：
create table t6_partition(
id int,
name string,
birthday date,
online boolean
) partitioned by(dt date comment "partition field day time");

查看分区：
show partitions t6_partition;
增加分区：
alter table t6_partition add partition(dt="2017-07-20");
删除分区：
alter table t6_partition drop partition(dt="2017-07-20");
DML
加载数据：
load data local inpath '/opt/data/hive/hive-t6-partion.txt' into table t6_partition partition(dt='2017-07-19');
DQL
查询操作和之前的一模一样，将分区字段作为表中的一列来进行对待即可。
如果有多个统计维度的时候，可以采用多个分区来设置
create table t6_partition_1(
id int,
name string,
birthday date,
online boolean
) partitioned by(year int, class string);

桶表：（分区表搞不定分区数据量大小相差不大）（本地模式不起作用）
在hive中，将数据进行打散的另一种方式就是分桶，将数据打散到一个个的文件，这一个个文件称
之为桶文件，在创建表的时候，需要预先设定需要几个桶来进行存储数据，同时需要明确指定是按照哪
一个字段进行分桶操作。本地模式不起作用
DDL：
create table t7_bucket(id int) clustered by (id) into 3 buckets;
查看信息：desc extended t7_bucket
DML:
加载数据：
不要执行load
字段类型要一致
insert into t7_bucket select id from t5_external_1;
作用：（图）
提高多表关联的查询效率
做数据的抽样采集
视图：
创建一个视图：
create view t8_view as select * from t1;
执行查询操作和操作table一样的

复制一张表(复制的表的结构)：
create table tt like t1;
Hive的数据加载方式总结
[]==>可选，<> ==>必须
加载
load
load data [local] inpath 'path' [overwrite] into table [partition_psc];
local：
有==>从linux本地加载数据
无==>从hdfs加载数据，相当于h7dfs dfs -mv 移动数据
overwrite
有==>覆盖掉表中原来的数据
无==>在原来的基础上追加新的数据
从其他表加载
insert <overwrite|into> t_des select [...] from t_src [...];
overwrite
有==>覆盖掉表中原来的数据
无==>在原来的基础上追加新的数据
==>会转化成为MR执行
需要注意的地方：t_des中列要和select [...] from t_src这里面的[...]一一对应起来。
创建表的时候加载
create table t_des as select [...] from t_src [...];
这样会创建一张表，表结构为select [...] from t_src中的[...]

动态分区的加载
快速复制表结构
create table t_d_partition like t_partition_1;
hive (default)> show partitions t_partition_1;
OK
partition
year=2015/class=bigdata
year=2015/class=linux
year=2016/class=bigdata
year=2016/class=linux
要将2016的数据都到入到t_d_partition的相关的分区中
insert into table t_d_partition partition(class, year=2016) select id, name, class from t_partition_1 where year=2016;
要将t_partition_1中所有数据都到入到t_d_partition的相关的分区中
insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1;
从hdfs上面删除的数据，并没有删除表结构，我们show partitions t_d_partition;是从metastore中查询出来的内容，如果你
之手动删除的hdfs上面数据，它的元数据信息依然在。
需要将hive.exec.dynamic.partition.mode设置为nonstrict
<property>
<name>hive.exec.max.dynamic.partitions</name>
<value>1000</value>
<description>Maximum number of dynamic partitions allowed to be created in total.</description>
</property>
import
import table stu from '/data/stu';
导出
1°、在hdfs的直接上操作
hadoop fs -cp src_uri dest_uri
或者
hive> export table tblName to 'hdfs_uri';
2°、在终端使用directory
insert overwrite [local] directory 'linux_fs_path' select ...from... where ...;
Hive的函数&自定义函数
查看Hive提供的函数
show functions;
查看具体的函数：
desc function [extended] func_name;
count
if
case when
split
explode
row_number
collect_set

使用case when对结果进行分类输出
select id, case id
when 1 then "TEACHER"
when 2 then "STUDENT"
when 3 then "SALER"
when 4 then "OFFICER"
else "OTHER" end
from t7_bucket;
使用hql编写wordcount的案例(alter table t rename to t_wc;修改表名)
列名line
hello you
hello me
hello xiaomei

---->
count(1) group by line
---->
hello
you
hello
me
hello
xiaomei
------------
hello you.split(" ") ---->["hello", "you"]
---->hello
you
step 1.将表中某一列的每一行数据使用split函数切割成一个个的单词
select split(line, " ") from t_wc;
["hello","you"]
["hello","xiao","mei"]
["mei","i","hate","you"]
step 2.需要将这个数组中的数据，立起来，相当于行列的转置
select explode(split(line, " ")) as word from t_wc;
hello
you
hello
xiao
mei
mei
i
hate
you
step 3.在step2的基础之上统计，每一个单词出现的次数
select w.word, count(w.word) as count
from (select explode(split(line, " ")) as word from t_wc) w
group by w.word
order by count desc;
在工作过程中，尽量少用order by进行排序，
sort by column ---->按照column进行局部排序
distribute by column 按照某一列进行分开
distribute by column sort by column;这个就和order by是等效的
-----上述的简写方式
cluster by column;
说明这种排序默认就是asc，不支持在最后面跟asc和desc
对表中的数据进行分组排序
row_number
根据员工、部分、薪资，这三张表，
1、分组显示每一个部分员工的信息（启动显示部分名称，员工姓名，员工性别[男|女]，员工薪资），同时分组按照员工薪资降序排序
select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null;
2、获取显示部门薪资top2的员工信息
select
tmp.*
from
(select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null) tmp
where tmp.rank < 3;
当Hive提供的函数满足不了业务需求的时候，就需要自定义函数来满足业务需要(UDF User Define Function, UDTF User Define Table Function, UDAF User Define Aggregation Function)。
UDF:一个输入，一个输出，比如year，date_sub
UDAF：多个输入，一个输出，比如count，sum，avg
UDTF：操作行列，比如explode
步骤：
1、编写一个类继承UDF
2、覆盖其中的evaluate()方法
3、打成jar包，上传到部署hive的及其上面
4、添加进hive的运行时classpath
hive> add jar /opt/jars/hive/zc.jar
5、创建一个临时函数并调用之
create temporary function z_c as 'com.uplooking.bigdata.udf.ZodiacAndConstellationUDF';
6、销毁临时函数(只在当前会话hive客户端有效)---》一般

创建永久函数：extends UDFE ，在相应地方进行注册 hive项目重新编译打包。
maven依赖：
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hive-api.version>2.1.0</hive-api.version>
<hadoop-api.version>2.6.4</hadoop-api.version>
<hadoop-core.version>1.2.1</hadoop-core.version>
</properties>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>${hadoop-core.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-service</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-cli</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libfb303</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>
Hive的JDBC
6个步骤：
1、注册驱动
2、获取Connection
3、获得执行器Statment
4、执行SQL
5、封装结果集ResultSet
6、关闭资源(rs.close(), st.close(), con.close())
小的操作技巧：
在hive终端可以执行hdfs shell 和linux的命令：
hdfs shell操作方式：
hive> dfs xxxx;
eg. hive> dfs -ls /input;
linux命令：
hive> ! linux_cmd;
eg.hive> !ls;

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
hive内部表，外部表，分区表，桶表，函数及自定义函数

环境：CentOS-6.5、hive-2.1.0。首先安装好JDK、hadoop、mysql。mysql对远程可访问的机器进行授权
复制链接

扫一扫

专栏目录

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。