Hive笔记

最新推荐文章于 2024-08-15 19:29:58 发布

1289902828

最新推荐文章于 2024-08-15 19:29:58 发布

阅读量169

点赞数

分类专栏： Hive 文章标签： hive 数据库 hadoop

本文链接：https://blog.csdn.net/weixin_44611266/article/details/100739275

版权

Hive 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

1、创建库
create database [if not exists] 库名
[comment ‘注释’]
[location “存储路径”]（不推荐使用）
[with dbproperties (‘a’=‘aaa’,‘b’=‘bbb’)];

2、查看库
查询库列表：show databases;
查询库详细信息：desc database [extended] 库名;
查询建库的详细信息：show create database 库名;

3、删除库(通常先删表再删库)
drop database 库名;
drop database if exists 库名;
drop database if exists 库名 [restrict|cascade]; （默认restrict，cascade强制删库跑路吗？？？）

4、切换库
use 库名；

5、查看数据库里面的表
show tables;
show tables in 库名;

6、创建表
create [external] table [if not exists] [库名.]表名
[(列名数据类型 [comment ‘列注释’], …)]
[comment ‘注释’]
[partitioned by (列名数据类型 [comment ‘注释’], …)]
[clustered by (列名, 列名, …) [sorted by (列名 [asc|desc], …)] into 桶数 buckets]（asc 升序 desc降序）
[row format
delimited
[fields terminated by ‘字符’] (列)
collection items terminated by ‘字符’
map keys terminated by ‘字符’
lines terminated by ‘字符’
|serde
]
[stored as 文件格式[textfile|sequencefile|rcfile]]
[location ‘存储路径’];

对表按列值分区再对列值分桶，提高运算效率

7、查看表结构
desc formatted 表名;

8、查看表分区
show partitions 表名;

9、用like关键字复制表
create [external] table [if not exists] 新表名 like 旧表名;

10、重命名表
alter table 旧表名 rename to 新表名;

11、修改表属性
alter table 表名 set tblproperties (’ ‘=’ ', …);

12、修改SerDe信息
alter table 表名 set serdeproperties (’ ‘=’ ', …)

13、增加/改变/替换列
alter table 表名 add columns (列名数据类型, …);
alter table 表名 change columns 旧列名新列名数据类型 [first|after 列名];
alter table 表名 replace columns (列名数据类型, …);（替换所有）

14、增加/删除分区
alter table 表名 add partition(part=‘a’) partiton(part=‘b’) …[location ‘路径’];
alter table 表名 drop partition(part=’ ‘) partition(part=’ ');

防止分区被删除
alter table 表名 partition (part=’ ‘) enable nodrop;
防止分区被查询
alter table 表名 partition (part=’ ') enable offline;

15、删除表
drop table [if exists] 表名;

16、清空表
truncate table 表名 [partition(’ ‘=’ ')];

17、查询显示命令

查看库：show databases;
show databases like ‘my*’;
查看表：show tables;
show tables in 库名;
查看建表完整语法：show create table 表名;
查看内置函数库：show functions;
查看函数的详细手册：desc function extended 函数名;
查看分区：show partitions 表名;
查看表的字段：desc 表名;
查看表的详细信息：desc extended 表名;
查看表的格式化了之后的详细信息：desc formatted 表名;

18、加载数据
hive>load data [local] inpath ‘路径’ [overwrite] into table 表名;
shell命令 hadoop fs -put 文件名路径
hive>load data inpath “路径” into table 表名;
load data local inpath “路径” overwrite into table表名;(overwrite覆盖)

19、利用insert关键字往表中插入数据

单条数据插入：
insert into table 表名 values(001,‘huangbo’,‘male’,50,‘MA’);

单重插入模式： insert … select …
insert into table 表名 select id,name,sex,age,department from 表名;
注意：查询出的字段必须是student表中存在的字段

多重插入模式：
from 表0
insert into table 表1 select id,name,sex,age
insert into table 表2 select id,department;

from 表0
insert into table 表1 partition(department=‘MA’) select id,name,sex ,age where department=‘MA’
insert into table 表1 partition(department=‘CS’) select id,name,sex ,age where department=‘CS’;
（好处：减少扫描次数）

静态分区插入：
需要手动的创建分区
alter table 表名 add partition (city=“zhengzhou”)
load data local inpath ‘/root/hivedata/student.txt’ into table 表名 partition(city=‘zhengzhou’);

动态分区插入：
打开动态分区的开关：set hive.exec.dynamic.partition = true;
设置动态分区插入模式：set hive.exec.dynamic.partition.mode = nonstrict

create table 表名(name string, department string) partitioned by (id int) …
insert into table 表1 partition(id) select name,department,id from 表2;
student表字段：name,department，分区字段是id
查询字段是：name,department,id，分区字段
注意：动态分区插入的分区字段必须是查询语句当中出现的字段中的最后一个

CTAS(create table … as select …)(直接把查询出来的结果存储到新建的一张表里)
create table student as select id,name,age,department from 表名;
注意：自动新建的表中的字段和查询语句出现的字段的名称，类型，注释一模一样

限制：
1、不能创建外部表
2、不能创建分区表
3、不能创建分桶表

分桶插入：

创建分桶表：
create table 表名(id int, name string, sex string, age int, department string)
clustered by(id) sorted by(age desc) into 4 buckets
row format delimited fields terminated by ‘,’;

插入数据：
insert into table 表名 select id,name,sex,age,department from mingxing2
distribute by id sort by age desc;
注意：查询语句中的分桶信息必须和分桶表中的信息一致

20、利用insert导出数据到本地或者hdfs

单模式导出数据到本地：
insert overwrite local directory ‘/root/outputdata’ select id,name,sex,age,department from 表名 …(查询条件);

多模式导出数据到本地：
from 表名
insert overwrite local directory ‘/root/outputdata1’ select id, name
insert overwrite local directory ‘/root/outputdata2’ select id, name,age

简便路径模式导出到hdfs：
insert overwrite directory ‘/root/outputdata’ select id,name,sex,age,department from 表名;

全路径模式查询数据到hdfs：
insert overwrite directory ‘hdfs://hadoop01:9000/root/outputdata1’ select id,name,sex,age,department from 表名;

local ：导出到本地目录
overwrite ：表示覆盖

21、select查询
select [all | distinct] …
from 表1 a
[join 表2 b on a.id=b.id][join 表3 on …]
[where …]
[group by …[having …]]
[cluster by … | [distribute by …][sort by … | order by … [asc|desc]]
[limit 数字]（必须有，数据量过大不能全部输出）

order by : 全局排序
如果一个HQL语句当中设置了order by，那么最终在HQL语句执行过程中设置的
set mapreduce.job.reduces = 4 不起作用。！！

sort by ：局部排序
一般来说，要搭配分桶操作使用
distribute by id sort by age desc;

distribute by : 纯粹就是分桶
在使用distribute by的时候：要提前设置reduceTask的个数
set mapred.reduce.tasks=数字;

cluster by ：既分桶，也排序
cluster by age = distribute by age sort by age;
distribute by age sort by age,id != cluster by age sort by id

cluster by 和 sort by 不能同时使用
建议写 distribute by … sort by …

22、join查询
限制:
支持等值连接，不支持非等值连接
支持 and 操作，不支持 or
支持超过2个表的连接
经验：
当出现多个表进行连接时，最好把小表放置在前面！！把大表放置在最后
join分类
inner join
left outer join
right outer join
full outer join
left semi join
它是in、exists的高效实现

select a.* from a left semi join b on a.id = b.id
等价于：
select a.* from a where a.id in (select b.id from b);

设置本地模式运行
set hive.exec.mode.local.auto=true;

带ed的关键字都是在创建表的时候使用

array
建表
create table 表名(name string,数组字段名 array)
row format delimited fields terminated by ‘\t’
collection items terminated by ‘,’;

数据：
huangbo beijing,shanghai,tianjin,hangzhou
xuzheng changchu,chengdu,wuhan
wangbaoqiang dalian,shenyang,jilin

导入数据：
load data local inpath ‘/home/hadoop/person.txt’ into table 表名;

查询语句：
Select * from 表名;
Select name from 表名;
Select work_locations from 表名;
Select work_locations[0] from 表名;

map

建表：
create table 表名(name string, scores map<string,int>)
row format delimited fields terminated by ‘\t’
collection items terminated by ‘,’
map keys terminated by ‘:’;

数据：
huangbo yuwen:80,shuxue:89,yingyu:95
xuzheng yuwen:70,shuxue:65,yingyu:81
wangbaoqiang yuwen:75,shuxue:100,yingyu:75

导入数据：
load data local inpath ‘/home/hadoop/score.txt’ into table 表名;

查询语句：
Select * from 表名;
Select name from 表名;
Select scores from 表名;
Select s.scores[‘yuwen’] from 表名 s;

struct

建表：
create table 表名(id int,course structname:string,score:int)
row format delimited fields terminated by ‘\t’
collection items terminated by ‘,’;

数据：
1 english,80
2 math,89
3 chinese,95

导入数据：
load data local inpath ‘/ home/hadoop / structtable.txt’ into table 表名;

查询语句：
Select * from 表名;
Select id from 表名;
Select course from 表名;
Select t.course.name from 表名 t;
Select t.course.score from 表名 t;

视图

创建视图（CVAS）
create view 视图名 as select * from 表名 limit 500;

查看视图
show views; // 在新版中可以使用这个命令查看视图列表
show tables; // 可以查看表，也可以查看视图
desc 视图名 // 查看某个具体视图的信息

删除视图
drop view 视图名
drop view if exists 视图名

使用视图
create view 视图名 as select * from 表名 where rank > 3 ;
select count(distinct 字段名) from 视图名;

查看内置函数：
show functions;
显示函数的详细信息：
desc function 函数名;
显示函数的扩展信息：
desc function extended 函数名;

select str_to_map(“a:1,b:2,c:3”,",","😊;
输出 {“a”:“1”,“b”:“2”,“c”:“3”}

select explode(str_to_map(“a:1,b:2,c:3”,",","😊);
输出
a 1
b 2
c 3

UDF
UDAF
UDTF
UDF
大部分的单行函数，hive都有自定义了
少部分需求需要自定义：
class MyUDF extends UDF{
evaluate(); // 可以重载
}
map_keys
map_values
UDAF
max min sum count distinct avg
collection_set
UDTF
explode(array|map)
例子程序：
select explode(split(“a,b,c,d”, “,”));
select explode(str_to_map(“a:1,b:2,c:3”, “,”, “:”));

// 原始数据
huangbo a:1,b:2,c:3

// 需求数据
huangbo a 1
huangbo b 2
huangbo c 3

创建maven项目
创建java类，继承UDF，重载evaluate方法
打成jar包上传到服务器
将jar包添加到hive的classpath
hive>add JAR /home/hadoop/hivejar/udf.jar;
查看加入的jar的命令：
hive>list jar;
创建临时函数与开发好的class关联起来
hive>create temporary function 自定义函数名 as 类的全限定名

transform

########python######代码

vi weekday_mapper.py

#!/bin/python
import sys
import datetime
for line in sys.stdin:
line = line.strip()
movie,rate,unixtime,userid = line.split(’\t’)
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print ‘\t’.join([movie, rate, str(weekday),userid])

create table newrate(movie int, rate int, weekday int, userid int) row format delimited
fields terminated by ‘\t’;

然后，将文件加入 hive 的 classpath：
hive>add file /home/hadoop/weekday_mapper.py;

insert into table newrate
select
transform(movie,rate,unixtime,userid)
using ‘python weekday_mapper.py’
as(movie,rate,weekday,userid)
from rate;

从rate表中读取一行数据，包含(movie,rate,unixtime,userid)字段
然后使用 weekday_mapper.py 这个python脚本然后把这四个字段
解析成新的四个字段：(movie,rate,weekday,userid)
这个脚本只做了一件事：unixtime转换成weekday星期编号

创建最后的用来存储调用 python 脚本解析出来的数据的表：lastjsontable
create table lastjsontable(movie int, rate int, weekday int, userid int) row format delimited
fields terminated by ‘\t’;

最后查询看数据是否正确：
select distinct(weekday) from lastjsontable;

正则表达式大全

使用RegexSerDe通过正则表达式抽取字段

1::xuzheng::22
2::wangbo::33

create table test_doublebi (id string,name string, age int)
row format
serde ‘org.apache.hadoop.hive.serde2.RegexSerDe’
with serdeproperties(‘input.regex’=’(.):😦.):😦.)’,‘output.format.string’=’%1 $s$ s %3$s’)
stored as textfile;
(.)是任意字符串

load data local inpath “doubledl.txt” into table test_doublebi;

join

1、、只支持等值链接，支持 and ，不支持 or
例如：
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
是正确的；
然而：SELECT a.* FROM a JOIN b ON (a.id>b.id)是错误的。
2、、可以 join 多于 2 个表
例如：
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
如果 join 中多个表的 join key 是同一个，则 join 会被转化为单个 map/reduce 任务，例如：
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
被转化为单个 map/reduce 任务，因为 join 中只使用了 b.key1 作为 join key。

例如：SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
而这一 join 被转化为 2 个 map/reduce 任务。因为 b.key1 用于第一次 join 条件，而
b.key2 用于第二次 join。

3、 Join 时，每次 map/reduce 任务的逻辑
reducer 会缓存 join 序列中除了最后一个表的所有表的记录，再通过最后一个表将结果序
列化到文件系统。这一实现有助于在 reduce 端减少内存的使用量。实践中，应该把最大的
）那个表写在最后（否则会因为缓存浪费大量内存）。例如：
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
所有表都使用同一个 join key（使用 1 次 map/reduce 任务计算）。Reduce 端会缓存 a 表
和 b 表的记录，然后每次取得一个 c 表的记录就计算一次 join 结果，类似的还有：
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
这里用了 2 次 map/reduce 任务：
第一次缓存 a 表，用 b 表序列化；
第二次缓存第一次 map/reduce 任务的结果，然后用 c 表序列化。

4、、 HiveJoin 分三种：inner join, outer join, semi join
其中：outer join 包括 left join ，right join 和 full outer join, 主要用来处理 join 中空记录的
情况

tablea 表数据：
1,huangbo
2,xuzheng
4,wangbaoqiang
6,huangxiaoming
7,fengjie
10,liudehua
tableb 表的数据：
2,20
4,50
7,80
10,22
12,33
15,44

load data local inpath ‘/home/hadoop/a.txt’ into table tablea;
load data local inpath ‘/home/hadoop/b.txt’ into table tableb;

5、五种 Join 演示
a) 、inner join（内连接）（把符合两边连接条件的数据查询出来）
select * from tablea a inner join tableb b on a.id=b.id;
b) 、left join（左连接，等同于 left outer join）
1、以左表数据为匹配标准，左大右小
2、匹配不上的就是 null
3、返回的数据条数与左表相同
select * from tablea a left join tableb b on a.id=b.id;
c) 、right join（右连接，等同于 right outer join）
1、以右表数据为匹配标准，左小右大
2、匹配不上的就是 null
3、返回的数据条数与右表相同
select * from tablea a right join tableb b on a.id=b.id;
d) 、left semi join（左半连接）（因为 hive 不支持 in/exists 操作（1.2.1 版本的 hive 支持
in 的操作），所以用该操作实现，并且是 in/exists 的高效实现）
select * from tablea a left semi join tableb b on a.id=b.id;
当你需要进行关联查询，但是又只是需要从一张表中获取数据，这样的话，完美的高效实现就是semi join
e) 、full outer join（完全外链接）
select * from tablea a full outer join tableb b on a.id=b.id;

内连接和左半连接的区别

一、理论
HIVE中都是按等值连接来统计的，理论上两种写法统计结果应该是一致的；

二、实际情况
但实际使用中发现两种写法会返回的结果，总会有一些差距虽然差别不大，但让人很是困惑。

三、原因
当使用join on的时候，如果右表有重复数据就会关联更多的数据，因为它们都符合join on上的条件；
而使用left semi join的时候，当join左侧表中的记录在join右侧表上查询到符合条件的记录就会返回，
而不会再继续查找右侧表中剩余记录了，所以即使右侧表中有重复，也不会对统计结果有啥影响。

标准的mapjoin语法
// 显式指定运行成mapjoin程序
select /+mapjoin(a)/ a.* , b.* from a join b on a.id = b.id;
a是小表，用mapjoin

// 默认情况会跑成reducejoin程序。
// 如果这两张表的某一张表的大小小于25M的话，会自动转换成mapjoin
select a.* , b.* from a join b on a.id = b.id;

4、mapjoin的附带好处：
能实现非等值连接

开关（默认）：true
条件：25M

1289902828

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
Hive笔记

1、创建库create database [if not exists] 库名[comment ‘注释’][location “存储路径”]（不推荐使用）[with dbproperties (‘a’=‘aaa’,‘b’=‘bbb’)];2、查看库查询库列表：show databases;查询库详细信息：desc database [extended] 库名;查询建库的详细信息：s...
复制链接

扫一扫

专栏目录