启动hive服务
1:使用服务的形式启动hive
nohup hiveserver2 >/dev/null 2>&1 &
2:链接方式
nohup hiveserver2 &
1:beenline
hive --service beeline
!connect jdbc:hive2://hdp2:10000 root 123456
2:beeline -u jdbc:hive2://hdp2:10000 -n root
在做小数据量查询测试时,可以让hive将mrjob提交给本地运行器运行,可以在hive会话中设置如下参数:
hive> set hive.exec.mode.local.auto=true;
1、创建库
hive中有一个默认的库:default
新建库:
create database db_order;
库建好后,在hdfs中会生成一个库目录:
2、表操作
2.1、创建表、字段分隔符为 “,”
create table t_order(id string,create_time string,amount float,uid string)
row format delimited
fields terminated by ',';
2.2、查看表结构
desc tablename
2.3删除表
drop table t_order;
hive会从元数据库中清除关于这个表的信息;
hive会从hdfs中删除这个表的表目录;
2.4、内部表及外部表
2.4.1、创建内部表
create table t_user(id string,name string)
row format delimited
fields terminated by ',';
2.4.2、创建外部表
create external table t_access(ip String,url String,access_time String)
row format delimited
fields terminated by ','
location "/data/acc";
2.4.3、外部表跟内部表的区别:
- 内部表的目录在hive的仓库目录中 VS 外部表的目录由用户指定
- 内部表放在warehouse下面,删除表的时候会把数据删除掉
- 外部表示需要自己制定目录,删除表的时候,不会删除数据
一个hive的数据仓库,最底层的表,一定是来自于外部系统,为了不影响外部系统的工作逻辑,在hive中可建external表来映射这些外部系统产生的数据目录;
然后,后续的etl操作,产生的各种中间表建议用managed_table(内部表)
2.5、分区表
在表目录中为数据文件创建分区子目录,以便于在查询时,可以针对分区子目录中的数据进行处理,缩减读取数据的范围。
create table t_access(id string,url string,access_time string)
partitioned by(dt string)
row format delimited
fields terminated by ',';
导入数据
load data local inpath "/root/acc.txt" into table t_access partition(dt="2018-8-28");
注意:分区字段,不能出现在表字段里面,不同的分区数据存放在不同的目录下面
2.6、CTAS建表语法
2.6.1、 通过已存在表来建表,新建的表结构定义与源表一致,但是没有数据
create table t_user_2 like t_user;
2.6.2、在建表的同时插入数据
create table t_access_user
as
select ip,url from t_access;
2.7、查询表数据
2.7.1、基本条件查询
select * from t_access;
select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';
2.7.2、join关联查询
假如有a.txt文件
a,1
b,2
c,3
d,4
假如有b.txt文件
a,xx
b,yy
d,zz
e,pp
1、inner join(join)
select
a.name as aname,
a.age as aage,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name
+--------+--------+--------+--------+--+
| aname | anumb | bname | bnick |
+--------+--------+--------+--------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
+--------+--------+--------+--------+--+
2、left outer join(left join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name
+--------+--------+--------+--------+--+
| aname | anumb | bname | bnick |
+--------+--------+--------+--------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
+--------+--------+--------+--------+--+
3、right outer join(right join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name
+--------+--------+--------+--------+--+
| aname | anumb | bname | bnick |
+--------+--------+--------+--------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
| NULL | NULL |e | pp |
+--------+--------+--------+--------+--+
4、full outer join(full join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;
+--------+--------+--------+--------+--+
| aname | anumb | bname | bnick |
+--------+--------+--------+--------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| c | 3 | NULL | NULL |
| d | 4 | d | zz |
| NULL | NULL |e | pp |
+--------+--------+--------+--------+--+
5、left semi join
Left semi join :相当于join连接两个表后产生的数据中的左半部分
hive中不支持exist/IN子查询,可以用left semi join来实现同样的效果:
select
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;
+--------+--------+--+
| aname | anumb |
+--------+--------+--+
| a | 1 |
| b | 2 |
| d | 4 |
+--------+--------+--+
注意: left semi join的 select子句中,不能有右表的字段
2.7.3、分组聚合查询
假设有数据如下:
20180805,192.168.20.18,http://www.edu36.cn/job
20180805,192.168.20.28,http://www.edu36.cn/login
20180805,192.168.20.28,http://www.edu36.cn/login
20180806,192.168.20.38,http://www.edu36.cn/job
20180806,192.168.20.38,http://www.edu36.cn/study
select dt,count(*),max(ip) as cnt from t_access group by dt;
select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170805';
select
dt,count(*),max(ip) as cnt
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'20170805';
注意: 一旦有group by子句,那么,在select子句中就不能有 (分组字段,聚合函数) 以外的字段
having是对group by聚合之后的结果进行再过滤;
2.7.4、子查询
假设有数据如下:
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
select id,name,brother
from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
3、数据导入导出及文件格式
3.1、 load导入数据
通过本地(linux上加载数据)
load data local inpath "/root/userdata.txt" into table t_user;
从hdfs上加载数据
load data inpath "/user.data2.txt" into table t_user;
3.2、 导出数据
导出到本地
insert overwrite local directory "/root/access-data"
row format delimited fields terminated by ":"
select * from t_user;
导出到hdfs上
insert overwrite directory "/access-data"
row format delimited
fields terminated by ','
select * from t_user;
3.3、hive文件格式
HIVE支持很多种文件格式: SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE
stored as textfile; //默认的存储格式
stored as sequencefile;
stored as parquetfile;
create table t_user(id string,name string)
row format delimited
fields terminated by ','
stored as sequencefile;
4、hive数据类型
4.1、数字类型
TINYINT (1字节整数)
SMALLINT (2字节整数)
INT/INTEGER (4字节整数)
BIGINT (8字节整数)
FLOAT (4字节浮点数)
DOUBLE (8字节双精度浮点数)
4.2、时间类型
TIMESTAMP (时间戳) (包含年月日时分秒的一种封装)
DATE (日期)(只包含年月日)
4.3、字符串类型
STRING
VARCHAR (字符串1-65355长度,超长截断)
CHAR (字符串,最大长度255)
4.4、复合类型
4.4.1、array数组类型
arrays: ARRAY<data_type>
例:
假设有数据如下:三生三世十里桃花,刘亦菲:痒痒,2017-08-20
create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
select moive_name,actors[0] from t_movie;
select moive_name,actors from t_movie where array_contains(actors,'w');
select moive_name,size(actors) from t_movie;
4.4.2、map类型
maps: MAP<primitive_type, data_type>
例:
假设有数据如下:3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
select * from t_person;
## 取map字段的指定key的值
select id,name,family_members['father'] as father from t_person;
## 取map字段的所有key
select id,name,map_keys(family_members) as relation from t_person;
## 取map字段的所有value
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;
4.4.3、struct类型
structs: STRUCT<col_name : data_type, …>
例:
假设有数据如下:2,lisi,28:female:shanghai
create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
select * from t_person_struct;
select id,name,info.age from t_person_struct;
4.5、其他类型
BOOLEAN(布尔类型):true false
BINARY (二进制):
5、hive函数使用
5.1、常用内置函数
5.1.1、类型转换函数
select cast("5" as int) ;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);
5.1.2、数学运算函数
select round(5.4); ## 5 四舍五入
select round(5.1345,3) ; ##5.135
select ceil(5.4) ; // select ceiling(5.4) from dual; ## 6 向上取整
select floor(5.4); ## 5 向下取整
select abs(-5.4) ; ## 5.4 绝对值
select greatest(3,5,6) ; ## 6
select least(3,5,6) from dual; ##求多个输入参数中的最小值
select max(age) from t_person;
select min(age) from t_person;
5.1.3、字符串函数
substr(string str, int start) ## 截取子串
substring(string str, int start)
substr(string, int start, int len)
substring(string, int start, int len)
concat(string A, string B...) ## 拼接字符串
concat_ws(string SEP, string A, string B...)
length(string A)
split(string str, string pat)
upper(string str) ##转大写
lower(string str)
示例:
select substr("abcdefg",2) from dual;
select substr("abcdefg",2,3) from dual;
select concat("ab","xy") from dual; ## abxy
select concat_ws(".","192","168","33","44") from dual; ## 192.168.33.44
select length("192.168.33.44") from dual; ## 13
select split("192.168.33.44","\\.") from dual;
5.1.4、时间函数
select current_timestamp; ## 获取当前的时间戳(详细时间信息)
select current_date; ## 获取当前的日期
## 取当前时间的秒数时间戳--(距离格林威治时间1970-1-1 0:0:0秒的差距)
select unix_timestamp();
## unix时间戳转字符串
from_unixtime(bigint unixtime[, string format])
示例:select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
## 字符串转unix时间戳
unix_timestamp(string date, string pattern)
示例: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");
## 将字符串转成日期date
select to_date("2017-09-17 16:58:32");
5.1.5、条件函数
- case when
语法:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
…
WHEN conditionn THEN resultn
ELSE result
END
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;
- if
select id,if(age>25,'working','worked') from t_user;
select moive_name,if(array_contains(actors,'吴刚'),'好电影',’烂片儿’)
from t_movie;
5.1.6、集合函数
array_contains(Array, value) 返回boolean值
示例:
select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c') from dual;
sort_array(Array) 返回排序后的数组
示例:
select sort_array(array('c','b','a')) from dual;
select 'haha',sort_array(array('c','b','a')) as xx from (select 0) tmp;
size(Array) 返回一个集合的长度,int值
示例:
select moive_name,size(actors) as actor_number from t_movie;
size(Map<K.V>) 返回一个int值
map_keys(Map<K.V>) 返回一个map字段的所有key,结果类型为:数组
map_values(Map<K.V>) 返回一个map字段的所有value,结果类型为:数组
5.1.7、聚合函数
sum
avg
max
min
count
5.1.8、表生成函数
5.1.9、json解析函数
假设有如下数据:
{“movie”:“1193”,“rate”:“5”,“timeStamp”:“978300760”,“uid”:“1”}
{“movie”:“661”,“rate”:“3”,“timeStamp”:“978302109”,“uid”:“1”}
{“movie”:“914”,“rate”:“3”,“timeStamp”:“978301968”,“uid”:“1”}
1、创建一个原始表用来对应原始的json数据
create table t_json(json string);
load data local inpath ‘/root/rating.json’ into table t_json;
2、利用json_tuple进行json数据解析
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json limit 10;
3、真正解析整张json表,将解析结果数据插入一张新表
create table t_movie_rate
as
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json;
5.1.10、分析函数 row_number() over()——分组TOPN
假设有如下数据:
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
需要查询出每种性别中年龄最大的2条数据
create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ‘,’;
select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;
select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank<=2;
5.2、自定义函数