【hive内置基本数据类型】和【内置复合数据类型用法】_内置数据类型分为基本数据类型和组合数据类型-CSDN博客

【hive内置数据类型】

Hive的内置数据类型可以分为两大类：(1)、基础数据类型；(2)、复杂数据类型。其中，基础数据类型包括：TINYINT,SMALLINT,INT,BIGINT,BOOLEAN,FLOAT,DOUBLE,STRING,BINARY,TIMESTAMP,DECIMAL,CHAR,VARCHAR,DATE。下面的表格列出这些基础类型所占的字节以及从什么版本开始支持这些类型。

数据类型	所占字节	开始支持版本
TINYINT	1byte，-128 ~ 127
SMALLINT	2byte，-32,768 ~ 32,767
INT	4byte,-2,147,483,648 ~ 2,147,483,647
BIGINT	8byte,-9,223,372,036,854,775,808 ~ 9,223,372,036,854,775,807
BOOLEAN
FLOAT	4byte单精度
DOUBLE	8byte双精度
STRING
BINARY		从Hive0.8.0开始支持
TIMESTAMP		从Hive0.8.0开始支持
DECIMAL		从Hive0.11.0开始支持
CHAR		从Hive0.13.0开始支持
VARCHAR		从Hive0.12.0开始支持
DATE		从Hive0.12.0开始支持

　　复杂类型包括ARRAY,MAP,STRUCT,UNION，这些复杂类型是由基础类型组成的。

　　 ARRAY：ARRAY类型是由一系列相同数据类型的元素组成，这些元素可以通过下标来访问。比如有一个ARRAY类型的变量fruits，它是由['apple','orange','mango']组成，那么我们可以通过fruits[1]来访问元素orange，因为ARRAY类型的下标是从0开始的；
　　 MAP：MAP包含key->value键值对，可以通过key来访问元素。比如”userlist”是一个map类型，其中username是key，password是value；那么我们可以通过userlist['username']来得到这个用户对应的password；
　　 STRUCT：STRUCT可以包含不同数据类型的元素。这些元素可以通过”点语法”的方式来得到所需要的元素，比如user是一个STRUCT类型，那么可以通过user.address得到这个用户的地址。
　　 UNION: UNIONTYPE，他是从Hive 0.7.0开始支持的。

　　创建一个包含复制类型的表格可以如下

[java]view plaincopy 
   
 
   
 CREATE TABLE employees (  
     name STRING,  
     salary FLOAT,  
     subordinates ARRAY<STRING>,  
     deductions MAP<STRING, FLOAT>,  
     address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>  
 ) PARTITIONED BY (country STRING, state STRING);  

【复合数据类型用法】

一、map、struct、array 这3种的用法：
1、Array的使用
2、Map 的使用
3、Struct 的使用
4、数据组合（不支持组合的复杂数据类型）
二、hive中的一些不常见函数的用法：
1、array_contains （Collection Functions）
2、get_json_object （Misc. Functions）
3、parse_url_tuple
三、ref：

目前 hive 支持的复合数据类型有以下几种：

map
(key1, value1, key2, value2, ...) Creates a map with the given key/value pairs
struct
(val1, val2, val3, ...) Creates a struct with the given field values. Struct field names will be col1, col2, ...
named_struct
(name1, val1, name2, val2, ...) Creates a struct with the given field names and values. (as of Hive 0.8.0)
array
(val1, val2, ...) Creates an array with the given elements
create_union
(tag, val1, val2, ...) Creates a union type with the value that is being pointed to by the tag parameter

一、map、struct、array 这3种的用法：

1、Array的使用

 
         创建数据库表，以array作为数据类型 
        
 
         create 
         table  
          person( 
         name 
         string,work_locations array<string>) 
        
 
         ROW FORMAT DELIMITED 
        
 
         FIELDS TERMINATED  
         BY 
         '\t' 
        
 
         COLLECTION ITEMS TERMINATED  
         BY 
         ',' 
         ; 
        
 
         数据 
        
 
         biansutao beijing,shanghai,tianjin,hangzhou 
        
 
         linan changchu,chengdu,wuhan 
        
 
         入库数据 
        
 
         LOAD 
         DATA  
         LOCAL 
         INPATH  
         '/home/hadoop/person.txt' 
         OVERWRITE  
         INTO 
         TABLE 
          person; 
        
 
         查询 
        
 
         hive> 
         select 
         *  
         from 
         person; 
        
 
         biansutao       [ 
         "beijing" 
         , 
         "shanghai" 
         , 
         "tianjin" 
         , 
         "hangzhou" 
         ] 
        
 
         linan   [ 
         "changchu" 
         , 
         "chengdu" 
         , 
         "wuhan" 
         ] 
        
 
         Time 
         taken: 0.355 seconds 
        
 
         hive> 
         select 
         name 
          from 
          person; 
        
 
         linan 
        
 
         biansutao 
        
 
         Time 
         taken: 12.397 seconds 
        
 
         hive> 
         select 
         work_locations[0]  
         from 
         person; 
        
 
         changchu 
        
 
         beijing 
        
 
         Time 
         taken: 13.214 seconds 
        
 
         hive> 
         select 
         work_locations  
         from 
         person;    
        
 
         [ 
         "changchu" 
         , 
         "chengdu" 
         , 
         "wuhan" 
         ] 
        
 
         [ 
         "beijing" 
         , 
         "shanghai" 
         , 
         "tianjin" 
         , 
         "hangzhou" 
         ] 
        
 
         Time 
         taken: 13.755 seconds 
        
 
         hive> 
         select 
         work_locations[3]  
         from 
         person; 
        
 
         NULL 
        
 
         hangzhou 
        
 
         Time 
         taken: 12.722 seconds 
        
 
         hive> 
         select 
         work_locations[4]  
         from 
         person; 
        
 
         NULL 
        
 
         NULL 
        
 
         Time 
         taken: 15.958 seconds 
        

2、Map 的使用

 
         创建数据库表 
        
 
         create 
         table 
          score( 
         name 
         string, score map<string, 
         int 
         >) 
        
 
         ROW FORMAT DELIMITED 
        
 
         FIELDS TERMINATED  
         BY 
         '\t' 
        
 
         COLLECTION ITEMS TERMINATED  
         BY 
         ',' 
        
 
         MAP KEYS TERMINATED  
         BY 
         ':' 
         ; 
        
 
         要入库的数据 
        
 
         biansutao 
         '数学' 
         :80, 
         '语文' 
         :89, 
         '英语' 
         :95 
        
 
         jobs 
         '语文' 
         :60, 
         '数学' 
         :80, 
         '英语' 
         :99 
        
 
         入库数据 
        
 
         LOAD 
         DATA  
         LOCAL 
         INPATH  
         '/home/hadoop/score.txt' 
         OVERWRITE  
         INTO 
         TABLE 
          score; 
        
 
         查询 
        
 
         hive> 
         select 
         *  
         from 
         score; 
        
 
         biansutao       { 
         "数学" 
         :80, 
         "语文" 
         :89, 
         "英语" 
         :95} 
        
 
         jobs    { 
         "语文" 
         :60, 
         "数学" 
         :80, 
         "英语" 
         :99} 
        
 
         Time 
         taken: 0.665 seconds 
        
 
         hive> 
         select 
         name 
          from 
          score; 
        
 
         jobs 
        
 
         biansutao 
        
 
         Time 
         taken: 19.778 seconds 
        
 
         hive> 
         select 
         t.score  
         from 
         score t; 
        
 
         { 
         "语文" 
         :60, 
         "数学" 
         :80, 
         "英语" 
         :99} 
        
 
         { 
         "数学" 
         :80, 
         "语文" 
         :89, 
         "英语" 
         :95} 
        
 
         Time 
         taken: 19.353 seconds 
        
 
         hive> 
         select 
         t.score[ 
         '语文' 
         ] 
         from 
         score t; 
        
 
         60 
        
 
         89 
        
 
         Time 
         taken: 13.054 seconds 
        
 
         hive> 
         select 
         t.score[ 
         '英语' 
         ] 
         from 
         score t; 
        
 
         99 
        
 
         95 
        
 
         Time 
         taken: 13.769 seconds 
        

3、Struct 的使用

 
         创建数据表 
        
 
         CREATE 
         TABLE 
          test(id  
         int 
         ,course struct<course:string,score: 
         int 
         >) 
        
 
         ROW FORMAT DELIMITED 
        
 
         FIELDS TERMINATED  
         BY 
         '\t' 
        
 
         COLLECTION ITEMS TERMINATED  
         BY 
         ',' 
         ; 
        
 
         数据 
        
 
         1 english,80 
        
 
         2 math,89 
        
 
         3 chinese,95 
        
 
         入库 
        
 
         LOAD 
         DATA  
         LOCAL 
         INPATH  
         '/home/hadoop/test.txt' 
         OVERWRITE  
         INTO 
         TABLE 
          test; 
        
 
         查询 
        
 
         hive> 
         select 
         *  
         from 
         test; 
        
 
         OK 
        
 
         1       { 
         "course" 
         : 
         "english" 
         , 
         "score" 
         :80} 
        
 
         2       { 
         "course" 
         : 
         "math" 
         , 
         "score" 
         :89} 
        
 
         3       { 
         "course" 
         : 
         "chinese" 
         , 
         "score" 
         :95} 
        
 
         Time 
         taken: 0.275 seconds 
        
 
         hive> 
         select 
         course  
         from 
         test; 
        
 
         { 
         "course" 
         : 
         "english" 
         , 
         "score" 
         :80} 
        
 
         { 
         "course" 
         : 
         "math" 
         , 
         "score" 
         :89} 
        
 
         { 
         "course" 
         : 
         "chinese" 
         , 
         "score" 
         :95} 
        
 
         Time 
         taken: 44.968 seconds 
        
 
         select 
         t.course.course  
         from 
         test t;  
        
 
         english 
        
 
         math 
        
 
         chinese 
        
 
         Time 
         taken: 15.827 seconds 
        
 
         hive> 
         select 
         t.course.score  
         from 
         test t; 
        
 
         80 
        
 
         89 
        
 
         95 
        
 
         Time 
         taken: 13.235 seconds<span style= 
         "font-family:微软雅黑, Verdana, sans-serif, 宋体;font-size:12.5px;line-height:1.5;" 
         >  </span> 
        

4、数据组合（不支持组合的复杂数据类型）

 
         LOAD 
         DATA  
         LOCAL 
         INPATH  
         '/home/hadoop/test.txt' 
         OVERWRITE  
         INTO 
         TABLE 
          test; 
        
 
         create 
         table 
          test1(id  
         int 
         ,a MAP<STRING,ARRAY<STRING>>) 
        
 
         row format delimited fields terminated  
         by 
         '\t' 
        
 
         collection items terminated  
         by 
         ',' 
        
 
         MAP KEYS TERMINATED  
         BY 
         ':' 
         ; 
        
 
         1 english:80,90,70 
        
 
         2 math:89,78,86 
        
 
         3 chinese:99,100,82 
        
 
         LOAD 
         DATA  
         LOCAL 
         INPATH  
         '/home/hadoop/test1.txt' 
         OVERWRITE  
         INTO 
         TABLE 
          test1; 
        

二、hive中的一些不常见函数的用法：

常见的函数就不废话了，和标准sql类似，下面我们要聊到的基本是HQL里面专有的函数，

hive里面的函数大致分为如下几种：Built-in、Misc.、UDF、UDTF、UDAF

我们就挑几个标准SQL里没有，但是在HIVE SQL在做统计分析常用到的来说吧。

1、array_contains （Collection Functions）

这是内置的对集合进行操作的函数，用法举例：

1

2

3

4

 
         create 
         EXTERNAL  
         table 
         IF  
         NOT 
         EXISTS userInfo (id  
         int 
         ,sex string, age  
         int 
         , 
         name 
         string, email string,sd string, ed string)  ROW FORMAT DELIMITED FIELDS TERMINATED 
         BY 
         '\t' 
          location  
         '/hive/dw' 
         ; 
        

            
        
 
         select 
         *  
         from 
         userinfo  
         where 
         sex= 
         'male' 
         and 
          (id!=1  
         and 
         id !=2  
         and 
         id!=3  
         and 
         id!=4  
         and 
         id!=5)  
         and 
         age < 30; 
        
 
         select 
         *  
         from 
         ( 
         select 
         *  
         from 
         userinfo  
         where 
         sex= 
         'male' 
         and 
          !array_contains(split( 
         '1,2,3,4,5' 
         , 
         ',' 
         ), 
         cast 
         (id 
         as 
         string))) tb1  
         where 
         tb1.age < 30; 
        

其中建表所用的测试数据你可以用如下链接的脚本自动生成：

http://my.oschina.net/leejun2005/blog/76631

2、get_json_object （Misc. Functions）

测试数据：

first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third

1

2

3

4

 
         create 
         external  
         table 
         if  
         not 
         exists t_json(f1 string, f2 string, f3 string) row format delimited fields TERMINATED 
         BY 
         ' ' 
          location  
         '/test/json' 
        
 
         select 
         get_json_object(t_json.f2,  
         '$.owner' 
         ) 
         from 
         t_json; 
        
 
         SELECT 
         *  
         from 
         t_json  
         where 
         get_json_object(t_json.f2,  
         '$.store.fruit[0].weight' 
         ) = 9; 
        
 
         SELECT 
         get_json_object(t_json.f2,  
         '$.non_exist_key' 
         ) 
         FROM 
         t_json; 
        

这里尤其要注意UDTF的问题，官方文档有说明：

json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.

For example,

1	`select` `a.` `timestamp` `, get_json_object(a.appevents,` `'$.eventid'` `), get_json_object(a.appenvets,` `'$.eventname'` `)` `from` `log a;`

should be changed to

1 2	`select` `a.` `timestamp` `, b.*` `from` `log a lateral` `view` `json_tuple(a.appevent,` `'eventid'` `,` `'eventname'` `) b` `as` `f1, f2;`

UDTF(User-Defined Table-Generating Functions) 用来解决输入一行输出多行(On-to-many maping) 的需求。

通过Lateral view可以方便的将UDTF得到的行转列的结果集合在一起提供服务，因为直接在SELECT使用UDTF会存在限制，即仅仅能包含单个字段，不光是多个UDTF，仅仅单个UDTF加上其他字段也是不可以，hive提示在UDTF中仅仅能有单一的表达式。如下：
hive> select my_test(“abcef:aa”) as qq,’abcd’ from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF’s

使用Lateral view可以实现上面的需求，Lateral view语法如下：
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (‘,’ columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg ( a array, b array )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ‘\t’
> COLLECTION ITEMS TERMINATED BY ‘,’;
OK
Time taken: 1.145 seconds
hive> load data local inpath ‘/home/hjl/sunwg/sunwg.txt’ overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive>
> SELECT a, name
> FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds

hive> SELECT id, name
> FROM sunwg LATERAL VIEW explode(a) r1 AS id
> LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds

3、parse_url_tuple

测试数据：

url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3...10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0...0.0...1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589

1

2

 
         create 
         external  
         table 
         if  
         not 
         exists t_url(f1 string, f2 string) row format delimited fields TERMINATED  
         BY 
         ' ' 
          location  
         '/test/url' 
         ; 
        

 
         SELECT 
         f1, b.*  
         FROM 
         t_url LATERAL  
         VIEW 
         parse_url_tuple(f2,  
         'HOST' 
         , 
         'PATH' 
         , 
         'QUERY' 
         , 
         'QUERY:k1' 
         ) b  
         as 
         host, path, query, query_id; 
        

结果：

url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL