本篇主要介绍将存储到Mysql的示例数据库Sakila以Sqoop的方式导入到Hive,然后详细讲解Hive的分析函数和窗口函数。
一 实战环境
1 Hive版本
hive> select version();
OK
2.3.3 r8a511e3f79b43d4be41cd231cf5c99e43b248383
Time taken: 0.944 seconds, Fetched: 1 row(s)
hive>
2 Mysql数据库
mysql> select database();
+------------+
| database() |
+------------+
| sakila |
+------------+
1 row in set (0.11 sec)
注:本实战使用Mysql示例数据库,具体安装可参考:MySQL之示例数据库Sakila下载及安装。
3 数据准备
1)方法一:MySQL——>HDFS——>Hive
--将Sakila下的City导入至HDFS
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
te" --warehouse-dir /user/sqoop1 -m 1
--查看已导入的数据
[hadoop@strong ~]$ hdfs dfs -cat /user/sqoop1/city/part-m-00000
--在Hive创建city表
hive> create table city(city_id int, city string,country_id int,last_update string)
> row format delimited
> fields terminated by ',';
OK
Time taken: 1.386 seconds
--加载HDFS中的city文件
hive> load data inpath '/user/sqoop1/city/' into table city;
Loading data to table hive.city
OK
Time taken: 4.764 seconds
--查看导入到Hive的数据
hive> select *from city limit 10;
OK
1 A Corua (La Corua) 87 2006-02-15 04:45:25.0
2 Abha 82 2006-02-15 04:45:25.0
3 Abu Dhabi 101 2006-02-15 04:45:25.0
4 Acua 60 2006-02-15 04:45:25.0
5 Adana 97 2006-02-15 04:45:25.0
6 Addis Abeba 31 2006-02-15 04:45:25.0
7 Aden 107 2006-02-15 04:45:25.0
8 Adoni 44 2006-02-15 04:45:25.0
9 Ahmadnagar 44 2006-02-15 04:45:25.0
10 Akishima 50 2006-02-15 04:45:25.0
Time taken: 10.489 seconds, Fetched: 10 row(s)
2)方法二:MySQL——>Hive
--删除HDFS文件
[hadoop@strong ~]$ hdfs dfs -rm -R /user/hadoop/city
--删除City表
hive> drop table city;
OK
Time taken: 0.995 seconds
--将Sakila下的City导入至Hive下的City表
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
te" --hive-import --hive-database hive --create-hive-table --fields-terminated-by ',' -m 1
注:会出现以下错误:Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
解决办法:[hadoop@strong ~]$ cp /usr/local/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar /usr/local/sqoop-1.4.7/lib/
--查看导入Hive的数据
hive> select *from city limit 10;
OK
1 A Corua (La Corua) 87 2006-02-15 04:45:25.0
2 Abha 82 2006-02-15 04:45:25.0
3 Abu Dhabi 101 2006-02-15 04:45:25.0
4 Acua 60 2006-02-15 04:45:25.0
5 Adana 97 2006-02-15 04:45:25.0
6 Addis Abeba 31 2006-02-15 04:45:25.0
7 Aden 107 2006-02-15 04:45:25.0
8 Adoni 44 2006-02-15 04:45:25.0
9 Ahmadnagar 44 2006-02-15 04:45:25.0
10 Akishima 50 2006-02-15 04:45:25.0
Time taken: 0.544 seconds, Fetched: 10 row(s)
--查看HDFS文件
[hadoop@strong ~]$ hdfs dfs -ls /user/hadoop/city
ls: `/user/hadoop/city': No such file or directory
注:查看Sqoop Import执行的信息18/06/28 12:34:38 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.可看出不会创建HDFS文件,执行过程中HDFS文件只是一个过渡,是个临时文件,完成后即删除。
二 Hive分析函数实战
1 样例数据
hive> select *from city where country_id in (101,102,108);
OK
city.city_id city.city city.country_id city.last_update
3 Abu Dhabi 101 2006-02-15 04:45:25.0
12 al-Ayn 101 2006-02-15 04:45:25.0
88 Bradford 102 2006-02-15 04:45:25.0
149 Dundee 102 2006-02-15 04:45:25.0
280 Kragujevac 108 2006-02-15 04:45:25.0
312 London 102 2006-02-15 04:45:25.0
368 Novi Sad 108 2006-02-15 04:45:25.0
470 Sharja 101 2006-02-15 04:45:25.0
494 Southampton 102 2006-02-15 04:45:25.0
495 Southend-on-Sea 102 2006-02-15 04:45:25.0
496 Southport 102 2006-02-15 04:45:25.0
500 Stockport 102 2006-02-15 04:45:25.0
589 York 102 2006-02-15 04:45:25.0
Time taken: 0.359 seconds, Fetched: 13 row(s)
2 Row_Number函数
hive> select t.*,row_number()over(partition by country_id) rnk from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update rnk
3 Abu Dhabi 101 2006-02-15 04:45:25.0 1
470 Sharja 101 2006-02-15 04:45:25.0 2
12 al-Ayn 101 2006-02-15 04:45:25.0 3
589 York 102 2006-02-15 04:45:25.0 1
500 Stockport 102 2006-02-15 04:45:25.0 2
496 Southport 102 2006-02-15 04:45:25.0 3
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 4
494 Southampton 102 2006-02-15 04:45:25.0 5
312 London 102 2006-02-15 04:45:25.0 6
149 Dundee 102 2006-02-15 04:45:25.0 7
88 Bradford 102 2006-02-15 04:45:25.0 8
280 Kragujevac 108 2006-02-15 04:45:25.0 1
368 Novi Sad 108 2006-02-15 04:45:25.0 2
Time taken: 63.74 seconds, Fetched: 13 row(s)
3 Rank函数
--加上partition by
hive> select t.*,rank()over(partition by country_id order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update rnk
3 Abu Dhabi 101 2006-02-15 04:45:25.0 1
12 al-Ayn 101 2006-02-15 04:45:25.0 2
470 Sharja 101 2006-02-15 04:45:25.0 3
88 Bradford 102 2006-02-15 04:45:25.0 1
149 Dundee 102 2006-02-15 04:45:25.0 2
312 London 102 2006-02-15 04:45:25.0 3
494 Southampton 102 2006-02-15 04:45:25.0 4
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 5
496 Southport 102 2006-02-15 04:45:25.0 6
500 Stockport 102 2006-02-15 04:45:25.0 7
589 York 102 2006-02-15 04:45:25.0 8
280 Kragujevac 108 2006-02-15 04:45:25.0 1
368 Novi Sad 108 2006-02-15 04:45:25.0 2
Time taken: 57.047 seconds, Fetched: 13 row(s)
--不加上partition by
hive> select t.*,rank()over(order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update rnk
3 Abu Dhabi 101 2006-02-15 04:45:25.0 1
12 al-Ayn 101 2006-02-15 04:45:25.0 2
88 Bradford 102 2006-02-15 04:45:25.0 3
149 Dundee 102 2006-02-15 04:45:25.0 4
280 Kragujevac 108 2006-02-15 04:45:25.0 5
312 London 102 2006-02-15 04:45:25.0 6
368 Novi Sad 108 2006-02-15 04:45:25.0 7
470 Sharja 101 2006-02-15 04:45:25.0 8
494 Southampton 102 2006-02-15 04:45:25.0 9
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 10
496 Southport 102 2006-02-15 04:45:25.0 11
500 Stockport 102 2006-02-15 04:45:25.0 12
589 York 102 2006-02-15 04:45:25.0 13
Time taken: 73.697 seconds, Fetched: 13 row(s)
--使用Rank排序会出现跳号的情况,即
hive> select t.*,rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update rnk
3 Abu Dhabi 101 2006-02-15 04:45:25.0 1
470 Sharja 101 2006-02-15 04:45:25.0 1
12 al-Ayn 101 2006-02-15 04:45:25.0 1
589 York 102 2006-02-15 04:45:25.0 4
500 Stockport 102 2006-02-15 04:45:25.0 4
496 Southport 102 2006-02-15 04:45:25.0 4
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 4
494 Southampton 102 2006-02-15 04:45:25.0 4
312 London 102 2006-02-15 04:45:25.0 4
149 Dundee 102 2006-02-15 04:45:25.0 4
88 Bradford 102 2006-02-15 04:45:25.0 4
280 Kragujevac 108 2006-02-15 04:45:25.0 12
368 Novi Sad 108 2006-02-15 04:45:25.0 12
Time taken: 57.533 seconds, Fetched: 13 row(s)
注:Rank函数排序会出现跳号。
4 Dense_rank函数
hive> select t.*,dense_rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update rnk
3 Abu Dhabi 101 2006-02-15 04:45:25.0 1
470 Sharja 101 2006-02-15 04:45:25.0 1
12 al-Ayn 101 2006-02-15 04:45:25.0 1
589 York 102 2006-02-15 04:45:25.0 2
500 Stockport 102 2006-02-15 04:45:25.0 2
496 Southport 102 2006-02-15 04:45:25.0 2
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 2
494 Southampton 102 2006-02-15 04:45:25.0 2
312 London 102 2006-02-15 04:45:25.0 2
149 Dundee 102 2006-02-15 04:45:25.0 2
88 Bradford 102 2006-02-15 04:45:25.0 2
280 Kragujevac 108 2006-02-15 04:45:25.0 3
368 Novi Sad 108 2006-02-15 04:45:25.0 3
Time taken: 60.528 seconds, Fetched: 13 row(s)
注:Dense_rank函数排序不会出现跳号。
5 Lead函数
hive> select t.*, lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update next_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 12
12 al-Ayn 101 2006-02-15 04:45:25.0 470
470 Sharja 101 2006-02-15 04:45:25.0 NULL
88 Bradford 102 2006-02-15 04:45:25.0 149
149 Dundee 102 2006-02-15 04:45:25.0 312
312 London 102 2006-02-15 04:45:25.0 494
494 Southampton 102 2006-02-15 04:45:25.0 495
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 496
496 Southport 102 2006-02-15 04:45:25.0 500
500 Stockport 102 2006-02-15 04:45:25.0 589
589 York 102 2006-02-15 04:45:25.0 NULL
280 Kragujevac 108 2006-02-15 04:45:25.0 368
368 Novi Sad 108 2006-02-15 04:45:25.0 NULL
Time taken: 59.491 seconds, Fetched: 13 row(s)
注: lead(city_id)超出边界,将返回NULL值。
6 LAG函数
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update prior_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 NULL
12 al-Ayn 101 2006-02-15 04:45:25.0 3
470 Sharja 101 2006-02-15 04:45:25.0 12
88 Bradford 102 2006-02-15 04:45:25.0 NULL
149 Dundee 102 2006-02-15 04:45:25.0 88
312 London 102 2006-02-15 04:45:25.0 149
494 Southampton 102 2006-02-15 04:45:25.0 312
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 494
496 Southport 102 2006-02-15 04:45:25.0 495
500 Stockport 102 2006-02-15 04:45:25.0 496
589 York 102 2006-02-15 04:45:25.0 500
280 Kragujevac 108 2006-02-15 04:45:25.0 NULL
368 Novi Sad 108 2006-02-15 04:45:25.0 280
Time taken: 56.631 seconds, Fetched: 13 row(s)
--指定偏移量,默认偏移量为1
hive> select t.*, lag(city_id,2,'No Data')over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update next_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 NULL
12 al-Ayn 101 2006-02-15 04:45:25.0 NULL
470 Sharja 101 2006-02-15 04:45:25.0 3
88 Bradford 102 2006-02-15 04:45:25.0 NULL
149 Dundee 102 2006-02-15 04:45:25.0 NULL
312 London 102 2006-02-15 04:45:25.0 88
494 Southampton 102 2006-02-15 04:45:25.0 149
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 312
496 Southport 102 2006-02-15 04:45:25.0 494
500 Stockport 102 2006-02-15 04:45:25.0 495
589 York 102 2006-02-15 04:45:25.0 496
280 Kragujevac 108 2006-02-15 04:45:25.0 NULL
368 Novi Sad 108 2006-02-15 04:45:25.0 NULL
Time taken: 66.487 seconds, Fetched: 13 row(s)
--返回当前值的前一个和后一个
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city,lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update prior_city next_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 NULL 12
12 al-Ayn 101 2006-02-15 04:45:25.0 3 470
470 Sharja 101 2006-02-15 04:45:25.0 12 NULL
88 Bradford 102 2006-02-15 04:45:25.0 NULL 149
149 Dundee 102 2006-02-15 04:45:25.0 88 312
312 London 102 2006-02-15 04:45:25.0 149 494
494 Southampton 102 2006-02-15 04:45:25.0 312 495
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 494 496
496 Southport 102 2006-02-15 04:45:25.0 495 500
500 Stockport 102 2006-02-15 04:45:25.0 496 589
589 York 102 2006-02-15 04:45:25.0 500 NULL
280 Kragujevac 108 2006-02-15 04:45:25.0 NULL 368
368 Novi Sad 108 2006-02-15 04:45:25.0 280 NULL
Time taken: 52.446 seconds, Fetched: 13 row(s)
7 First_Value函数
hive> select t.*, first_value(city_id)over(partition by country_id order by city_id) first_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update first_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 3
12 al-Ayn 101 2006-02-15 04:45:25.0 3
470 Sharja 101 2006-02-15 04:45:25.0 3
88 Bradford 102 2006-02-15 04:45:25.0 88
149 Dundee 102 2006-02-15 04:45:25.0 88
312 London 102 2006-02-15 04:45:25.0 88
494 Southampton 102 2006-02-15 04:45:25.0 88
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 88
496 Southport 102 2006-02-15 04:45:25.0 88
500 Stockport 102 2006-02-15 04:45:25.0 88
589 York 102 2006-02-15 04:45:25.0 88
280 Kragujevac 108 2006-02-15 04:45:25.0 280
368 Novi Sad 108 2006-02-15 04:45:25.0 280
Time taken: 78.285 seconds, Fetched: 13 row(s)
8 Last_value函数
hive> select t.*, last_value(city_id)over(partition by country_id order by city_id rows between unbounded preceding and unbounded following) last_city from city t where country_id in(101,102,108);
t.city_id t.city t.country_id t.last_update last_city
3 Abu Dhabi 101 2006-02-15 04:45:25.0 470
12 al-Ayn 101 2006-02-15 04:45:25.0 470
470 Sharja 101 2006-02-15 04:45:25.0 470
88 Bradford 102 2006-02-15 04:45:25.0 589
149 Dundee 102 2006-02-15 04:45:25.0 589
312 London 102 2006-02-15 04:45:25.0 589
494 Southampton 102 2006-02-15 04:45:25.0 589
495 Southend-on-Sea 102 2006-02-15 04:45:25.0 589
496 Southport 102 2006-02-15 04:45:25.0 589
500 Stockport 102 2006-02-15 04:45:25.0 589
589 York 102 2006-02-15 04:45:25.0 589
280 Kragujevac 108 2006-02-15 04:45:25.0 368
368 Novi Sad 108 2006-02-15 04:45:25.0 368
Time taken: 49.807 seconds, Fetched: 13 row(s)
注:在使用first_value和last_value时需加上窗口rows between unbounded preceding and unbounded following,否则会出现预想不到的结果!!!因为默认是
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW。
9 Count函数
hive> select country_id,count(1) cqty from city where country_id in(101,102,108) group by country_id;
country_id cqty
101 3
102 8
108 2
Time taken: 56.987 seconds, Fetched: 3 row(s)
10 Sum函数
hive> select country_id,sum(city_id) csum from city where country_id in(101,102,108) group by country_id;
country_id csum
101 485
102 3123
108 648
Time taken: 57.993 seconds, Fetched: 3 row(s)
11 Avg函数
hive> select country_id,avg(city_id) cavg from city where country_id in(101,102,108) group by country_id;
country_id cavg
101 161.66666666666666
102 390.375
108 324.0
Time taken: 54.42 seconds, Fetched: 3 row(s)
12 Max函数
hive> select country_id,max(city_id) cmax from city where country_id in(101,102,108) group by country_id;
country_id cmax
101 470
102 589
108 368
Time taken: 58.02 seconds, Fetched: 3 row(s)
13 Min函数
hive> select country_id,min(city_id) cmax from city where country_id in(101,102,108) group by country_id;
country_id cmax
101 3
102 88
108 280
Time taken: 57.602 seconds, Fetched: 3 row(s)
注:以上5个函数都可以在后面添加Over和窗口。
窗口有如下规范:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING