1. Basic operations in Hive:
DDL, DML
2. Functions in Hive
User-Defined Functions:
UDF (user-defined function): one row in, one row out, e.g. upper, lower, substring (one record goes in, one record comes out)
UDAF (user-defined aggregation function): many rows in, one row out, e.g. count, max, min, sum ...
UDTF (user-defined table-generating function): one row in, many rows out
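The three shapes can be sketched in plain Python (illustrative only; this is not Hive's actual UDF API):

```python
# Three function shapes, demonstrated on a small list of "rows".
rows = ["hello,world", "hello"]

# UDF: one row in, one row out (like upper())
udf_out = [r.upper() for r in rows]                  # ['HELLO,WORLD', 'HELLO']

# UDAF: many rows in, one value out (like count())
udaf_out = len(rows)                                 # 2

# UDTF: one row in, many rows out (like explode(split(r, ",")))
udtf_out = [w for r in rows for w in r.split(",")]   # ['hello', 'world', 'hello']
```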
3. Examples
show functions lists the functions the system supports.
Function examples: split(), explode()
Exercise: use Hive to count word occurrences (wordcount)
explode turns an array into multiple rows
[hadoop@hadoop000 data]$ vi hive-wc.txt
hello,world,welcome
hello,welcome
hive> create table hive_wc(sentence string);
OK
Time taken: 1.083 seconds
hive> load data local inpath '/home/hadoop/data/hive-wc.txt' into table hive_wc;
Loading data to table default.hive_wc
Table default.hive_wc stats: [numFiles=1, totalSize=35]
OK
Time taken: 1.539 seconds
hive> select * from hive_wc;
OK
hello,world,welcome
hello,welcome

Time taken: 0.536 seconds, Fetched: 3 row(s)
hive> select split(sentence,",") from hive_wc;
OK
["hello","world","welcome"]
["hello","welcome"]
[""]
Time taken: 0.161 seconds, Fetched: 3 row(s)
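Python's str.split behaves like Hive's split on these inputs, which explains the [""] row above: the file ends with an empty line, and splitting an empty string still yields a one-element array.

```python
# Splitting a normal sentence vs. an empty line (the file's trailing blank line).
print("hello,world,welcome".split(","))  # ['hello', 'world', 'welcome']
print("".split(","))                     # [''], shown by Hive as [""]
```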
"hello" "world" "welcome" "hello" "welcome"
Complete the wordcount with a single SQL statement:
hive> select word, count(1) as c
    > from (select explode(split(sentence,",")) as word from hive_wc) t
    > group by word;
Query ID = hadoop_20180613094545_920c2e72-5982-47eb-9a9c-5e5a30ebb1ae
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528851144815_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528851144815_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528851144815_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-13 10:18:53,155 Stage-1 map = 0%, reduce = 0%
2018-06-13 10:18:59,605 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.42 sec
2018-06-13 10:19:07,113 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.31 sec
MapReduce Total cumulative CPU time: 4 seconds 310 msec
Ended Job = job_1528851144815_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.31 sec   HDFS Read: 7333 HDFS Write: 29 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 310 msec
OK
	1
hello	2
welcome	2
world	1
Time taken: 26.859 seconds, Fetched: 4 row(s)
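What the query does can be sketched in Python: split each sentence on commas (split), flatten to one word per row (explode), then count per word (group by + count). The input mirrors hive-wc.txt plus the empty line seen in the query output.

```python
from collections import Counter

sentences = ["hello,world,welcome", "hello,welcome", ""]

# explode(split(sentence, ",")): one word per output row
words = [w for s in sentences for w in s.split(",")]

# group by word, count(1)
counts = Counter(words)
print(dict(counts))   # {'hello': 2, 'world': 1, 'welcome': 2, '': 1}
```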
4. JSON data
File used: rating.json
Create a table rating_json, load the data, and look at the first ten rows:
hive> create table rating_json(json string);
OK
hive> load data local inpath '/home/hadoop/data/rating.json' into table rating_json;
Loading data to table default.rating_json
Table default.rating_json stats: [numFiles=1, totalSize=34967552]
OK
hive> select * from rating_json limit 10;
OK
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.195 seconds, Fetched: 10 row(s)
Processing the JSON data: json_tuple is a UDTF introduced in Hive 0.7:
hive> select
    > json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid)
    > from rating_json limit 10;
OK
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.189 seconds, Fetched: 10 row(s)
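A rough Python equivalent of json_tuple: parse each JSON row and pull out the four fields in order (using the standard json module, on one sample row from the table):

```python
import json

line = '{"movie":"1193","rate":"5","time":"978300760","userid":"1"}'
obj = json.loads(line)

# json_tuple(json, "movie", "rate", "time", "userid")
movie, rate, time_, userid = (obj.get(k) for k in ("movie", "rate", "time", "userid"))
print(movie, rate, time_, userid)   # 1193 5 978300760 1
```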
5. Window (analytic) functions: top-N within a group
[hadoop@hadoop000 data]$ more hive_row_number.txt
1,18,ruoze,M
2,19,jepson,M
3,22,wangwu,F
4,16,zhaoliu,F
5,30,tianqi,M
6,26,wangba,F
[hadoop@hadoop000 data]$
hive> create table hive_rownumber(id int, age int, name string, sex string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.451 seconds
hive> load data local inpath '/home/hadoop/data/hive_row_number.txt' into table hive_rownumber;
Loading data to table hive3.hive_rownumber
Table hive3.hive_rownumber stats: [numFiles=1, totalSize=84]
OK
Time taken: 1.381 seconds
hive> select * from hive_rownumber;
OK
1	18	ruoze	M
2	19	jepson	M
3	22	wangwu	F
4	16	zhaoliu	F
5	30	tianqi	M
6	26	wangba	F
Time taken: 0.455 seconds, Fetched: 6 row(s)
Requirement: for each sex, return the two oldest rows (a top-N per group query).
Analysis: order by is a global sort and cannot sort within a group; for ordering within a group you need a window (analytic) function.
select id,age,name,sex
from
(select id,age,name,sex,
        row_number() over(partition by sex order by age desc) as rank
 from hive_rownumber) t
where rank<=2;
hive> select id,age,name,sex
    > from
    > (select id,age,name,sex,
    > row_number() over(partition by sex order by age desc) as rank
    > from hive_rownumber) t
    > where rank<=2;
Query ID = hadoop_20180614202525_9829dc42-3c37-4755-8b12-89c416589ebc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1528975858636_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528975858636_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528975858636_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-06-14 20:26:18,582 Stage-1 map = 0%, reduce = 0%
2018-06-14 20:26:24,010 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.48 sec
2018-06-14 20:26:31,370 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.86 sec
MapReduce Total cumulative CPU time: 3 seconds 860 msec
Ended Job = job_1528975858636_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.86 sec   HDFS Read: 8586 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 860 msec
OK
6	26	wangba	F
3	22	wangwu	F
5	30	tianqi	M
2	19	jepson	M
Time taken: 29.262 seconds, Fetched: 4 row(s)
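The row_number() over (partition by sex order by age desc) logic can be simulated in Python on the same six rows, keeping rank <= 2 per partition:

```python
from itertools import groupby

# Same rows as the hive_rownumber table: (id, age, name, sex)
rows = [(1, 18, "ruoze", "M"), (2, 19, "jepson", "M"), (3, 22, "wangwu", "F"),
        (4, 16, "zhaoliu", "F"), (5, 30, "tianqi", "M"), (6, 26, "wangba", "F")]

# partition by sex: sort by sex so groupby sees each partition contiguously;
# order by age desc: negate age as the secondary sort key.
rows.sort(key=lambda r: (r[3], -r[1]))

top2 = []
for _, grp in groupby(rows, key=lambda r: r[3]):
    for rank, row in enumerate(grp, start=1):   # rank plays row_number()'s role
        if rank <= 2:                           # where rank <= 2
            top2.append(row)
print(top2)
```

This yields the same four rows as the Hive query: the two oldest per sex.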