hive操作

最新推荐文章于 2022-06-25 21:28:40 发布

果园里的码农

最新推荐文章于 2022-06-25 21:28:40 发布

阅读量317

点赞数 1

分类专栏： hive

本文链接：https://blog.csdn.net/qq_43146036/article/details/83720494

版权

hive 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

启动hive服务

1：使用服务的形式启动hive
nohup hiveserver2 >/dev/null 2>&1 &
2:链接方式
nohup hiveserver2 &
1：beenline
hive --service beeline
!connect jdbc:hive2://hdp2:10000 root 123456
2:beeline -u jdbc:hive2://hdp2:10000 -n root
在做小数据量查询测试时，可以让hive将mrjob提交给本地运行器运行，可以在hive会话中设置如下参数：
hive> set hive.exec.mode.local.auto=true;

1、创建库

hive中有一个默认的库：default
新建库：

create database db_order;

库建好后，在hdfs中会生成一个库目录：

2、表操作

2.1、创建表、字段分隔符为 “,”

create table t_order(id string,create_time string,amount float,uid string)
row format delimited
fields terminated by ',';

2.2、查看表结构

desc tablename

2.3删除表

drop table t_order;

hive会从元数据库中清除关于这个表的信息；
hive会从hdfs中删除这个表的表目录；

2.4、内部表及外部表
2.4.1、创建内部表

 create table t_user(id string,name string) 
 row format delimited 
 fields terminated by ',';

2.4.2、创建外部表

create external table t_access(ip String,url String,access_time String)
row format delimited 
fields terminated by ','
location "/data/acc";

2.4.3、外部表跟内部表的区别：

内部表的目录在hive的仓库目录中 VS 外部表的目录由用户指定
内部表放在warehouse下面，删除表的时候会把数据删除掉
外部表示需要自己制定目录，删除表的时候，不会删除数据

一个hive的数据仓库，最底层的表，一定是来自于外部系统，为了不影响外部系统的工作逻辑，在hive中可建external表来映射这些外部系统产生的数据目录；
然后，后续的etl操作，产生的各种中间表建议用managed_table（内部表）

2.5、分区表
在表目录中为数据文件创建分区子目录，以便于在查询时，可以针对分区子目录中的数据进行处理，缩减读取数据的范围。

 create table t_access(id string,url string,access_time string)
 partitioned by(dt string)
 row format delimited
 fields terminated by ',';

导入数据

  load data local inpath "/root/acc.txt" into table t_access partition(dt="2018-8-28");

注意：分区字段，不能出现在表字段里面，不同的分区数据存放在不同的目录下面

2.6、CTAS建表语法

2.6.1、通过已存在表来建表，新建的表结构定义与源表一致，但是没有数据

 create table t_user_2 like t_user;

2.6.2、在建表的同时插入数据

create table t_access_user 
as
select ip,url from t_access;

2.7、查询表数据
2.7.1、基本条件查询

select * from t_access;
select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';

2.7.2、join关联查询
假如有a.txt文件

a,1
b,2
c,3
d,4

假如有b.txt文件

a,xx
b,yy
d,zz
e,pp

1、inner join（join）

select 
a.name as aname,
a.age as aage,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| d      | 4      | d      | zz     |
+--------+--------+--------+--------+--+

2、left outer join（left join）

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| c      | 3      | NULL   | NULL    |
| d      | 4      | d      | zz     |
+--------+--------+--------+--------+--+

3、right outer join（right join）

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name

 +--------+--------+--------+--------+--+
 | aname  | anumb  | bname  | bnick  |
 +--------+--------+--------+--------+--+
 | a      | 1      | a      | xx     |
 | b      | 2      | b      | yy     |
 | d      | 4      | d      | zz     |
 | NULL   | NULL   |e       | pp     | 
 +--------+--------+--------+--------+--+

4、full outer join（full join）

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

 +--------+--------+--------+--------+--+
 | aname  | anumb  | bname  | bnick  |
 +--------+--------+--------+--------+--+
 | a      | 1      | a      | xx     |
 | b      | 2      | b      | yy     |
 | c      | 3      | NULL   | NULL   |
 | d      | 4      | d      | zz     |
 | NULL   | NULL   |e       | pp     | 
 +--------+--------+--------+--------+--+

5、left semi join
Left semi join ：相当于join连接两个表后产生的数据中的左半部分
hive中不支持exist/IN子查询，可以用left semi join来实现同样的效果：

select 
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;

 +--------+--------+--+
 | aname  | anumb  |
 +--------+--------+--+
 | a      | 1      | 
 | b      | 2      | 
 | d      | 4      | 
 +--------+--------+--+

注意： left semi join的 select子句中，不能有右表的字段

2.7.3、分组聚合查询
假设有数据如下：
20180805,192.168.20.18,http://www.edu36.cn/job
20180805,192.168.20.28,http://www.edu36.cn/login
20180805,192.168.20.28,http://www.edu36.cn/login
20180806,192.168.20.38,http://www.edu36.cn/job
20180806,192.168.20.38,http://www.edu36.cn/study

select dt,count(*),max(ip) as cnt from t_access group by dt;

select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170805';

select 
dt,count(*),max(ip) as cnt 
from t_access 
where url='http://www.edu360.cn/job'
group by dt having dt>'20170805';

注意：一旦有group by子句，那么，在select子句中就不能有（分组字段，聚合函数）以外的字段
having是对group by聚合之后的结果进行再过滤；

2.7.4、子查询
假设有数据如下：
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

select id,name,brother
from 
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;

3、数据导入导出及文件格式

3.1、 load导入数据
通过本地（linux上加载数据）

 load data local inpath "/root/userdata.txt" into table t_user;

从hdfs上加载数据

 load data inpath "/user.data2.txt" into table t_user;

3.2、导出数据
导出到本地

 insert overwrite local directory "/root/access-data"
 row format delimited fields terminated by ":"
 select * from t_user;

导出到hdfs上

 insert overwrite directory "/access-data"
 row format delimited
 fields terminated by ','
 select * from t_user;

3.3、hive文件格式

HIVE支持很多种文件格式： SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE
stored as textfile; //默认的存储格式
stored as sequencefile;
stored as parquetfile;

 create table t_user(id string,name string) 
 row format delimited 
 fields terminated by ','
 stored as sequencefile;

4、hive数据类型

4.1、数字类型

TINYINT (1字节整数)
SMALLINT (2字节整数)
INT/INTEGER (4字节整数)
BIGINT (8字节整数)
FLOAT (4字节浮点数)
DOUBLE (8字节双精度浮点数)

4.2、时间类型

TIMESTAMP (时间戳) (包含年月日时分秒的一种封装)
DATE (日期)（只包含年月日）

4.3、字符串类型

STRING
VARCHAR (字符串1-65355长度，超长截断)
CHAR (字符串，最大长度255)

4.4、复合类型

4.4.1、array数组类型
arrays: ARRAY<data_type>
例：
假设有数据如下：三生三世十里桃花,刘亦菲:痒痒,2017-08-20

create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';

select moive_name,actors[0] from t_movie;
select moive_name,actors from t_movie where array_contains(actors,'w');
select moive_name,size(actors) from t_movie;

4.4.2、map类型
maps: MAP<primitive_type, data_type>
例：
假设有数据如下：3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29

create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

select * from t_person;

## 取map字段的指定key的值
select id,name,family_members['father'] as father from t_person;

## 取map字段的所有key
select id,name,map_keys(family_members) as relation from t_person;

## 取map字段的所有value
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;

4.4.3、struct类型
structs: STRUCT<col_name : data_type, …>
例：
假设有数据如下：2,lisi,28:female:shanghai

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

select * from t_person_struct;
select id,name,info.age from t_person_struct;

4.5、其他类型

BOOLEAN（布尔类型）：true false
BINARY (二进制)：

5、hive函数使用

5.1、常用内置函数
5.1.1、类型转换函数

select cast("5" as int) ;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);

5.1.2、数学运算函数

select round(5.4);   ## 5  四舍五入
select round(5.1345,3) ;  ##5.135
select ceil(5.4) ; // select ceiling(5.4) from dual;   ## 6  向上取整
select floor(5.4);  ## 5  向下取整
select abs(-5.4) ;  ## 5.4  绝对值
select greatest(3,5,6) ;  ## 6  
select least(3,5,6) from dual;  ##求多个输入参数中的最小值
select max(age) from t_person;    
 select min(age) from t_person;

5.1.3、字符串函数

substr(string str, int start)   ## 截取子串
substring(string str, int start)
substr(string, int start, int len) 
substring(string, int start, int len)

concat(string A, string B...)  ## 拼接字符串
concat_ws(string SEP, string A, string B...)

length(string A)
split(string str, string pat)
upper(string str) ##转大写
lower(string str)

示例：

select substr("abcdefg",2) from dual;

select substr("abcdefg",2,3) from dual;

select concat("ab","xy") from dual;  ## abxy

select concat_ws(".","192","168","33","44") from dual; ## 192.168.33.44

select length("192.168.33.44") from dual;  ## 13

select split("192.168.33.44","\\.") from dual;

5.1.4、时间函数

select current_timestamp; ## 获取当前的时间戳(详细时间信息)
select current_date;   ## 获取当前的日期

## 取当前时间的秒数时间戳--(距离格林威治时间1970-1-1 0：0：0秒的差距) 
select unix_timestamp();

## unix时间戳转字符串
from_unixtime(bigint unixtime[, string format])
示例：select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

## 字符串转unix时间戳
unix_timestamp(string date, string pattern)
示例： select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");

## 将字符串转成日期date
select to_date("2017-09-17 16:58:32");

5.1.5、条件函数

case when

语法：
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
…
WHEN conditionn THEN resultn
ELSE result
END

select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴刚'),'好电影',’烂片儿’) 
from t_movie;

5.1.6、集合函数
array_contains(Array, value) 返回boolean值

示例：

select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c') from dual;

sort_array(Array) 返回排序后的数组

示例：

select sort_array(array('c','b','a')) from dual;
select 'haha',sort_array(array('c','b','a')) as xx from (select 0) tmp;

size(Array) 返回一个集合的长度，int值

示例：

select moive_name,size(actors) as actor_number from t_movie;

size(Map<K.V>) 返回一个int值
map_keys(Map<K.V>) 返回一个map字段的所有key，结果类型为：数组
map_values(Map<K.V>) 返回一个map字段的所有value，结果类型为：数组

5.1.7、聚合函数
sum
avg
max
min
count
5.1.8、表生成函数

5.1.9、json解析函数
假设有如下数据：
{“movie”:“1193”,“rate”:“5”,“timeStamp”:“978300760”,“uid”:“1”}
{“movie”:“661”,“rate”:“3”,“timeStamp”:“978302109”,“uid”:“1”}
{“movie”:“914”,“rate”:“3”,“timeStamp”:“978301968”,“uid”:“1”}

1、创建一个原始表用来对应原始的json数据

create table t_json(json string);
load data local inpath ‘/root/rating.json’ into table t_json;

2、利用json_tuple进行json数据解析

select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json limit 10;

3、真正解析整张json表，将解析结果数据插入一张新表

create table t_movie_rate
as
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json;

5.1.10、分析函数 row_number() over()——分组TOPN
假设有如下数据：
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
需要查询出每种性别中年龄最大的2条数据

create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ‘,’;

select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber；

select id,age,name,sex
from 
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank<=2;

5.2、自定义函数