目前 hive 支持的复合数据类型有以下几种:
arrays: ARRAY<\data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
maps: MAP<\primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
structs: STRUCT<\col_name : data_type [COMMENT col_comment], …>
union: UNIONTYPE<\data_type, data_type, …> (Note: Only available starting with Hive 0.7.0.)
一、map、struct、array 这3种的用法:
1、Array的使用
创建数据库表,以array作为数据类型
create table person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
数据
biansutao beijing,shanghai,tianjin,hangzhou
linan changchu,chengdu,wuhan
入库数据
LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;
查询
hive> select * from person;
biansutao ["beijing","shanghai","tianjin","hangzhou"]
linan ["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds
hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
hive> select work_locations from person;
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
hive> select work_locations[3] from person;
NULL
hangzhou
Time taken: 12.722 seconds
hive> select work_locations[4] from person;
NULL
NULL
Time taken: 15.958 seconds
2、Map 的使用
创建数据库表
create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
要入库的数据
biansutao '数学':80,'语文':89,'英语':95
jobs '语文':60,'数学':80,'英语':99
入库数据
LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;
查询
hive> select * from score;
biansutao {"数学":80,"语文":89,"英语":95}
jobs {"语文":60,"数学":80,"英语":99}
Time taken: 0.665 seconds
hive> select name from score;
jobs
biansutao
Time taken: 19.778 seconds
hive> select t.score from score t;
{"语文":60,"数学":80,"英语":99}
{"数学":80,"语文":89,"英语":95}
Time taken: 19.353 seconds
hive> select t.score['语文'] from score t;
60
89
Time taken: 13.054 seconds
hive> select t.score['英语'] from score t;
99
95
Time taken: 13.769 seconds
3、Struct 的使用
创建数据表
CREATE TABLE test(id int,course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
数据
1 english,80
2 math,89
3 chinese,95
入库
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;
查询
hive> select * from test;
OK
1 {"course":"english","score":80}
2 {"course":"math","score":89}
3 {"course":"chinese","score":95}
Time taken: 0.275 seconds
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
select t.course.course from test t;
english
math
chinese
Time taken: 15.827 seconds
hive> select t.course.score from test t;
80
89
95
Time taken: 13.235 seconds
4、数据组合
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
SELECT foo FROM union_test;
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
create table test1(id int,a MAP<STRING,ARRAY<STRING>>)
row format delimited fields terminated by '\t'
collection items terminated by ','
MAP KEYS TERMINATED BY ':';
1 english:80,90,70
2 math:89,78,86
3 chinese:99,100,82
LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;
二、hive中的一些不常见函数的用法:
常见的函数就不废话了,和标准sql类似,下面我们要聊到的基本是HQL里面专有的函数,
hive里面的函数大致分为如下几种:Built-in、Misc.、UDF、UDTF、UDAF
我们就挑几个标准SQL里没有,但是在HIVE SQL在做统计分析常用到的来说吧。
1、array_contains (Collection Functions)
这是内置的对集合进行操作的函数,用法举例:
create EXTERNAL table IF NOT EXISTS userInfo (id int,sex string, age int, name string,
email string,sd string, ed string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/hive/dw';
select * from (select * from userinfo where sex='male' and
!array_contains(split('1,2,3,4,5',','),cast(id as string))) tb1 where tb1.age < 30;
2、get_json_object (Misc. Functions)
测试数据:
first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],
"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],
"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third
create external table if not exists t_json(
f1 string, f2 string, f3 string)
row format delimited fields TERMINATED BY ' '
location '/test/json'
select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;
这里尤其要注意UDTF的问题,官方文档有说明:
json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string,
and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string.
which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;
should be changed to
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;
UDTF(User-Defined Table-Generating Functions) 用来解决 输入一行输出多行(On-to-many maping) 的需求。
通过Lateral view可以方便的将UDTF得到的行转列的结果集合在一起提供服务,因为直接在SELECT使用UDTF会存在限制,
即仅仅能包含单个字段,不光是多个UDTF,仅仅单个UDTF加上其他字段也是不可以,
hive提示在UDTF中仅仅能有单一的表达式。如下:
hive> select my_test(“abcef:aa”) as qq,’abcd’ from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF’s
使用Lateral view可以实现上面的需求,Lateral view语法如下:
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (‘,’ columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg ( a array, b array )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ‘\t’
> COLLECTION ITEMS TERMINATED BY ‘,’;
OK
Time taken: 1.145 seconds
hive> load data local inpath ‘/home/hjl/sunwg/sunwg.txt’ overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive>
> SELECT a, name
> FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds
hive> SELECT id, name
> FROM sunwg LATERAL VIEW explode(a) r1 AS id
> LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds
3、parse_url_tuple
测试数据:
url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example
&gs_l=serp.j4.8.serp0Hs&bav=on.
create external table if not exists t_url(f1 string, f2 string)
row format delimited fields TERMINATED BY ' '
location '/test/url';
SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
结果:
url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL
4、explode
explode 是一个 hive 内置的表生成函数:Built-in Table-Generating Functions (UDTF),
主要是解决 1 to N 的问题,即它可以把一行输入拆成多行,
比如一个 array 的每个元素拆成一行,作为一个虚表输出。它有如下需要注意的地方:
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTF's can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
从上面的原理与语法上可知,
select 列中不能 udtf 和其它非 udtf 列混用,
udtf 不能嵌套,
不支持 GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY
还有 select 中出现的 udtf 一定需要列别名,否则会报错:
SELECT explode(myCol) AS myNewCol FROM myTable;
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
SELECT explode(myCol) AS pos, myNewCol FROM myTable;
5、lateral view
lateral view 是Hive中提供给UDTF的conjunction,
它可以解决UDTF不能添加额外的select列的问题。
当我们想对hive表中某一列进行split之后,想对其转换成1 to N的模式,即一行转多列。
hive不允许我们在UDTF函数之外,再添加其它select语句。
如下,我们想将登录某个游戏的用户id放在一个字段user_ids里,对每一行数据用UDTF后输出多行。
select game_id, explode(split(user_ids,'\\[\\[\\[')) as user_id from login_game_log where dt='2014-05-15' ;
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions。
提示语法分析错误,UDTF不支持函数之外的select 语句,如果我们想支持怎么办呢?接下来就是Lateral View 登场的时候了。
Lateral view 其实就是用来和像类似explode这种UDTF函数联用的。
lateral view 会将UDTF生成的结果放到一个虚拟表中,然后这个虚拟表(1 to N)会和输入行即每个game_id进行join
来达到连接UDTF外的select字段的目的(源表和拆分的虚表按行做行内 1 join N 的直接连接),
这也是为什么 LATERAL VIEW udtf(expression) 后面需要表别名和列别名的原因。
Lateral View Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
可以看出,可以在2个地方用Lateral view:
在udtf前面用
在from baseTable后面用
例如:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
pageid adid
front_page 1
front_page 2
front_page 3
contact_page 3
contact_page 4
contact_page 5
From语句后可以跟多个Lateral View。
A FROM clause can have multiple LATERAL VIEW clauses.
Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.
给定数据:
Array<int> col1 Array<string> col2
[1, 2] [a", "b", "c"]
[3, 4] [d", "e", "f"]
转换目标:
想同时把第一列和第二列拆开,类似做笛卡尔乘积。
我们可以这样写:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;
还有一种情况,如果UDTF转换的Array是空的怎么办呢?
在Hive0.12里面会支持outer关键字,如果UDTF的结果是空,默认会被忽略输出。
如果加上outer关键字,则会像left outer join 一样,还是会输出select出的列,而UDTF的输出结果是NULL。
总结:
Lateral View通常和UDTF一起出现,为了解决UDTF不允许在select字段的问题。
Multiple Lateral View可以实现类似笛卡尔乘积。
Outer关键字可以把不输出的UDTF的空结果,输出成NULL,防止丢失数据。