Advanced Hive

liweihope

Last modified: 2023-03-10 23:39:28

Tags: hive, big data, hadoop

First published: 2019-03-10 11:30:18

Article link: https://blog.csdn.net/liweihope/article/details/88375740

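This section covers:

1. Partitioned tables (static and dynamic partitions)

2. Accessing hive with hiveserver2 and beeline

3. Complex data types (how to store and read them)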

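1. Partitioned tables (static and dynamic partitions)

Partitioned tables (PARTITION):

Splitting tables in an rdbms
   Call records, log records
   Record tables like these have to be split, with each day's records going into its own table: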
       call_record_20190808
       call_record_20190809
       call_record_20190810

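Partitioned tables in big data
   /user/hive/warehouse/emp/d=20190808/.....
   /user/hive/warehouse/emp/d=20190809/.....
To query one day you only need: select .... from table where d='20190808'

With the partition condition in the where clause, hive reads only the matching partition directory instead of scanning the whole table.

This can speed up queries considerably; scanning the entire table is very slow.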
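As a minimal sketch of the difference, assume a hypothetical table call_record partitioned by a string column d. The table and its columns are made up for illustration and are not part of the exercise in this article.

```sql
-- Hypothetical table, for illustration only
create table call_record(
caller string,
callee string,
duration int
)
PARTITIONED BY(d string);

-- Partition condition present: only .../call_record/d=20190808/ is read
select caller,callee,duration from call_record where d='20190808';

-- No partition condition: every partition directory has to be read
select caller,callee,duration from call_record;
```

If you want to check which partitions a query would read, prefixing it with explain dependency should list them under input_partitions.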

The bottleneck that big data systems hit most often is IO.

There are two kinds: disk IO and network IO.

Both have to be kept in mind in any later optimization work.

Here is a partitioning exercise.

The directory /home/hadoop/data/ contains an order file, order.txt, with two fields: order number and time.

Create a partitioned table order_partition:

create table order_partition(
order_no string,
event_time string
)
PARTITIONED BY(event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data into the table:

load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_partition
PARTITION (event_month='2014-05');

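After loading, query the table.

The result has three columns. The first two are real columns, the table's fields. The last one is not a real column: it is only a marker for the partition, a pseudo-column. You can see this with desc.

Next, manually create a partition directory event_month=2014-06 under the order_partition table directory.

Then put the order.txt file into it.

Then query the table in hive again.

The 2014-06 partition does not show up. Why?

Because the 2014-06 partition was created by hand, it is not recorded in the metadata in mysql, so a query cannot find it.

You can confirm this by checking the partitions table in mysql.

How do we add this partition now? The official documentation gives the syntax.

Following the documentation, add the partition with this statement: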

ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION (event_month='2014-06') ;

Query again and the partition is there.

How do you see which partitions a table has in hive?

show partitions table_name;

(All of this is read from the metadata in mysql.)
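When many partition directories have been created by hand, adding them one at a time gets tedious. As a sketch of an alternative that the exercise above does not use, MSCK REPAIR TABLE scans the table directory and registers any partition directories that are missing from the metastore. This assumes the directories follow the key=value naming shown earlier.

```sql
-- Register partition directories that exist on HDFS but not in the metastore
MSCK REPAIR TABLE order_partition;

-- Verify: the manually created partition should now be listed
show partitions order_partition;
```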

The above creates a single-level partition. How do you create multi-level partitions?

create table order_mulit_partition(
order_no string,
event_time string
)
PARTITIONED BY(event_month string, step string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Then load data into the table:

load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_mulit_partition
PARTITION (event_month='2014-05', step='1');

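(The resulting directories can be checked in the web UI.)

When are multi-level partitions used?

For example, the data volume is large, so you partition by day; if each day is still large, you partition again by hour. Queries that filter by hour are then faster.

In production, tables with large data volumes often have several levels of partitions.

When writing a query, always specify partition conditions down to the lowest level. Otherwise, on a large table, far more data is read and the output can flood the screen: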

select * from order_mulit_partition where event_month='2014-05' and step='1';
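A small sketch of how the partition levels can be inspected, using the table created above. Only the statements are shown; the output depends on which partitions have been loaded.

```sql
-- List only the sub-partitions under one first-level partition
show partitions order_mulit_partition PARTITION(event_month='2014-05');

-- Filtering on the first level alone still prunes other months,
-- but every step partition under 2014-05 is read
select * from order_mulit_partition where event_month='2014-05';
```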

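The single-level and multi-level partitions above are all static partitions.

There are also dynamic partitions.

(Tip: to get the statement that created a table, use show create table table_name)

First create a static partition table emp_static_partition: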

CREATE TABLE `emp_static_partition`(
`empno` int,
`ename` string,
`job` string,
`mgr` int,
`hiredate` string,
`sal` double,
`comm` double)
partitioned by(deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Then insert data into the table (this runs a MapReduce job):

Insert the rows with deptno=10 into the deptno=10 partition:

insert into table emp_static_partition PARTITION (deptno=10)
select empno,ename,job,mgr,hiredate,sal,comm from emp
where deptno=10;

Then check the data: select * from emp_static_partition where deptno=10;

Insert the rows with deptno=20 into the deptno=20 partition:

insert into table emp_static_partition PARTITION (deptno=20)
select empno,ename,job,mgr,hiredate,sal,comm from emp
where deptno=20;

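And then deptno=30, 40, 50, and so on.

What if there are 10,000 departments? Would you have to run insert 10,000 times?

This is what dynamic partitions are for.

Dynamic partitioning: rows are written to the matching partition according to their department number.

First create a dynamic partition table (the create statement is the same as for a static partition table):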

CREATE TABLE `emp_dynamic_partition`(
`empno` int,
`ename` string,
`job` string,
`mgr` int,
`hiredate` string,
`sal` double,
`comm` double)
partitioned by(deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Then insert data into the dynamic partition table:

insert into table emp_dynamic_partition PARTITION (deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;

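(Note: the deptno in the PARTITION clause names the partition column. Give it only the key, with no value and no condition. In the select, the columns before deptno are the fields being inserted, and the trailing deptno decides which partition each row goes to. Hive matches dynamic partition columns by position, so they must come last in the select list.)

(This statement fails at first because dynamic partitioning is in strict mode, which requires at least one static partition. As the error message suggests, switch to nonstrict mode:)

set hive.exec.dynamic.partition.mode=nonstrict;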
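A slightly fuller sketch of the session settings around a dynamic partition insert. The property names are standard hive settings, but the two limit values are illustrative assumptions and should be sized to the real number of partitions. hive.exec.dynamic.partition may already be true in your version.

```sql
-- Dynamic partitioning has to be enabled
set hive.exec.dynamic.partition=true;
-- nonstrict: all partition columns may be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper bounds on partitions created by one statement (illustrative values)
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=500;

insert into table emp_dynamic_partition PARTITION (deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;

-- One partition per distinct deptno value should be listed
show partitions emp_dynamic_partition;
```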

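After the insert, each deptno value has its own partition.

For multi-level partitions (the target table must be partitioned by both deptno and step, and the source must have both columns):

insert into table emp_dynamic_partition PARTITION (deptno,step)
select empno,ename,job,mgr,hiredate,sal,comm,deptno,step from emp;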

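2. Accessing hive with hiveserver2 and beeline

Everything above was done by typing hive, pressing Enter, and entering commands in that shell.

What other ways are there?

Besides the hive command line used so far, there are beeline and hiveserver2.

hiveserver2 and beeline are used together. (The thriftserver + beeline combination in the later spark course works exactly the same way.)

From the official documentation:

Once the service is running, a client can connect to it and execute sql. HiveServer1 is obsolete. HiveServer2 supports concurrent clients and authorization.

It is a service plus a client: start the service first, then connect with the client.

Now start hiveserver2. It can run in the background,

for example with nohup: nohup /home/hadoop/app/hive-1.1.0-cdh5.7.0/bin/hiveserver2 &

or in the foreground. (When it runs in the background the terminal window can be closed; when it runs in the foreground the window must stay open.)

Then open another window and start beeline.

Usage (see the official documentation):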

beeline -u jdbc:hive2://10-9-140-90:10000/d6_test -n hadoop

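Now you are connected.

(When a sql statement succeeds in this window, an OK appears in the foreground hiveserver2 window. If it fails, that window shows the failure and its cause.)

You can also open several more windows and run beeline in each (concurrent access).

hiveserver2 plus beeline is just one way of accessing hive. Whether to use it is a matter of personal preference.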

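3. Complex data types (what you need to know: how to store them and how to read them)

See the official documentation.

Everything so far used primitive_type, the basic data types. There are other data types as well: array_type, map_type, struct_type, and so on.

array_type:

Suppose there is a file hive_array.txt.

Create a table for it: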

create table hive_array(
name string,
work_locations array<string>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
-- ',' is the delimiter between collection items
COLLECTION ITEMS TERMINATED BY ',';

Then load the data:

load data local inpath '/home/hadoop/data/hive_array.txt'
overwrite into table hive_array;

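So how do we read the values?

-- first element of the array: array_name[index]
select name,work_locations[0] from hive_array;

-- size(array_name) returns the number of elements, here how many work locations each person has
select name,size(work_locations) from hive_array;

-- array_contains(array_name,'element'): rows for people who work in tianjin
select * from hive_array where array_contains(work_locations,'tianjin');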
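Index access and size() work on the array inside each row. To turn the array elements into rows of their own, one common approach is LATERAL VIEW with explode. The sketch below uses the hive_array table from above; the aliases wl and loc are arbitrary names chosen for illustration.

```sql
-- One output row per (name, work location) pair
select name,loc
from hive_array
LATERAL VIEW explode(work_locations) wl AS loc;

-- Number of people per work location
select loc,count(*) as cnt
from hive_array
LATERAL VIEW explode(work_locations) wl AS loc
group by loc;
```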

map_type:

map: key-value pairs

Suppose there is a file hive_map.txt.

Create a table for it:

create table hive_map(
id int,
name string,
members map<string,string>,
age int
)
-- ',' separates fields
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- '#' separates the entries of the collection
COLLECTION ITEMS TERMINATED BY '#'
-- ':' separates each key from its value
MAP KEYS TERMINATED BY ':';

Then load the data:

load data local inpath '/home/hadoop/data/hive_map.txt'
overwrite into table hive_map;

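Check the table contents.

So how do we read the data?

-- value stored under the key 'father'
select id,name,age,members['father'] from hive_map;

-- map_keys(map_name) returns all the keys
select map_keys(members) from hive_map;

-- map_values(map_name) returns all the values
select map_values(members) from hive_map;

-- size(map_name) returns the number of entries, here how many family relations each person has
select size(members) from hive_map;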
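As with arrays, a map can be expanded into one row per entry with LATERAL VIEW and explode, which for a map yields two columns. The sketch below uses the hive_map table from above; the aliases m, relation and member_name are arbitrary names chosen for illustration.

```sql
-- One output row per key/value pair of the map
select id,name,relation,member_name
from hive_map
LATERAL VIEW explode(members) m AS relation,member_name;

-- Looking up a key that is absent returns NULL,
-- so this keeps only the rows that have a 'father' entry
select id,name,members['father']
from hive_map
where members['father'] is not null;
```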

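struct_type: a structure type whose fields can each have a different type

Suppose there is a file hive_struct.txt.

(Each line has an IP first, followed by user information; a user could be described by name, age, occupation, hobbies, and so on.)

Create a table for it: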

create table hive_struct(
ip string,
userinfo struct<name:string,age:int>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'
COLLECTION ITEMS TERMINATED BY ':';

Then load the data:

load data local inpath '/home/hadoop/data/hive_struct.txt'
overwrite into table hive_struct;

Check the table contents.

So how do we read the data?

-- struct fields are accessed with a dot
select userinfo.name,userinfo.age from hive_struct;
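Struct fields are not limited to the select list. As a small sketch on the hive_struct table above, a field can also be used in a where clause; the threshold 18 is an arbitrary value chosen for illustration.

```sql
-- Filter rows on a struct field
select ip,userinfo.name,userinfo.age
from hive_struct
where userinfo.age > 18;
```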

The complex data types above do not need to be memorized by rote.
