Hive 函数小集合

(1)explode()函数(配合使用:Lateral View):

explode(array)函数接受array类型的参数,其作用恰好与collect_set相反,实现将array类型数据单列转多行或多列。explode(ARRAY)  列表中的每个元素生成一行;
explode(MAP) map中每个key-value对,生成一行,key为一列,value为一列;

限制:

1、No other expressions are allowed in SELECT
   SELECT pageid, explode(adid_list) AS myCol... is not supported;

2、UDTF's can't be nested
   SELECT explode(explode(adid_list)) AS myCol... is not supported;3、GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
   SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported;

2、lateral view
可使用lateral view解除以上限制,语法如下:

fromClause: FROM baseTable (lateralView)*
lateralView: LATERAL VIEW explode(expression) tableAlias AS columnAlias (',' columnAlias)*

解释一下:
Lateral view 其实就是用来和像类似explode这种UDTF函数联用的Lateral view 会将UDTF生成的结果放到一个虚拟表中,然后这个虚拟表会和输入行即每个game_id进行join 来达到连接UDTF外的select字段的目的。

案例:table名称为pageAds
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

3.多个lateral view
from语句后面可以带多个lateral view语句

案例:
表名:baseTable

from后只有一个lateral view:
SELECT myCol1, col2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1;

多个lateral view:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

4、Outer Lateral Views
如果array类型的字段为空,但依然需返回记录,可使用outer关键词。

比如:select * from src LATERAL VIEW explode(array()) C AS a limit 10;
这条语句中的array字段是个空列表,这条语句不管src表中是否有记录,结果都是空的。

而:select * from src LATERAL VIEW OUTER explode(array()) C AS a limit 10;
结果中的记录数为src表的记录数,只是a字段为NULL。
比如:

238 val_238 NULL
86 val_86 NULL
311 val_311 NULL
27 val_27 NULL
165 val_165 NULL
409 val_409 NULL

1.列转行

1.1 问题引入:
如何将
a       b       1,2,3
c       d       4,5,6
变为:
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6 

1.2 原始数据: 

test.txt
a b 1,2,3
c d 4,5,6

 1.3 解决方法1:

drop table test_jzl_20140701_test;
建表:
create table test_jzl_20140701_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by ' '
stored as textfile;

加载数据:
load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test;

查看表中所有数据:
select * from test_jzl_20140701_test;  
a       b       1,2,3
c       d       4,5,6

遍历数组中的每一列
select col1,col2,name 
from test_jzl_20140701_test  
lateral view explode(split(col3,',')) col3 as name;
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6 

解决方法2:

drop table test_jzl_20140701_test1;
建表:
create table test_jzl_20140701_test1
(
col1 string,
col2 string,
col3 array<int>
)
row format delimited 
fields terminated by ' '
collection items terminated by ','   //定义数组的分隔符
stored as textfile;

加载数据:
load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test1;

查看表中所有数据:
select * from test_jzl_20140701_test1; 
a       b       [1,2,3]
c       d       [4,5,6]

遍历数组中的每一列:
select col1,col2,name 
from test_jzl_20140701_test1  
lateral view explode(col3) col3 as name;
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

1.4补充知识点: 

select * from test_jzl_20140701_test; 
a       b       1,2,3
c       d       4,5,6

select t.list[0],t.list[1],t.list[2] 
from (
      select (split(col3,',')) list from test_jzl_20140701_test
     )t;

OK
1       2       3
4       5       6

--查看数组长度
select size(split(col3,',')) list from test_jzl_20140701_test;
3
3

列转行2: explode(array);

select explode(array('A','B','C'));
A
B
C
select explode(array('A','B','C')) as col;
col
A
B
C
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
A
B
C
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
col
A
B
C

explode(map):

select explode(map('A',10,'B',20,'C',30));
A	10
B	20
C	30
select explode(map('A',10,'B',20,'C',30)) as (key,value);
key value
A	10
B	20
C	30
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
A	10
B	20
C	30
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
key value
A	10
B	20
C	30

New Add函数: inline(): inline和explode函数都可以将单列扩展成多列或者多行。

inline的参数形式:inline(ARRAY<STRUCT[,STRUCT]>)
inline一般结合lateral view使用。

select t1.col1 as name,t1.col2 as sub1
from employees 
lateral view inline(array(struct(name,subordinates[0]))) t1

3007211-ff56e905173db998.png

inline 嵌套多个struct:

select t1.col1 as name,t1.col2 as sub
from employees 
lateral view inline(array(struct(name,subordinates[0]),
                          struct(name,subordinates[1]))) t1
where t1.col2 is not null

3007211-73f40121292d992f.png

还可以给inline的字段取别名:

select t1.name,t1.sub
from employees 
lateral view inline(array(struct(name,subordinates[0]),
                          struct(name,subordinates[1]))) t1 as name,sub
where t1.sub is not null

3007211-2488aeef2659cde8.png

(2)Collect_Set()/Collect_List()函数;

collect_set/collect_list函数:
collect_set(col)函数只接受基本数据类型,函数只能接受一列参数,它的主要作用是将某字段的值进行去重汇总,产生array类型字段。
collect_list函数返回的类型是array< ? >类型,?表示该列的类型 
PS:collect_list()不去重汇总,collect_set()去重汇总;

 

2.行转列:

2.1问题引入:
hive如何将
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
变为:
a       b       1,2,3
c       d       4,5,6

2,2原始数据: 

test.txt
a       b       1 
a       b       2 
a       b       3 
c       d       4 
c       d       5 
c       d       6

2.3 解决方法1: 

drop table tmp_jiangzl_test;
建表:
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;
加载数据:
load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;
处理:
select col1,col2,concat_ws(',',collect_set(col3)) 
from tmp_jiangzl_test  
group by col1,col2;

Collect_List()/Collect_Set()几点备注:

1.我们使用了concat_ws函数,但是该函数仅支持string和array< string > 所以对于该列不是string的列,需要先转为string类型. "而collect_list函数返回的类型是array< ? >类型,?表示该列的类型", 那怎么转为string类型?

select id,concat_ws(',',collect_list(cast (name as string))) from table group by id

本例中,concat_ws函数的作用是用逗号将所有的string组合到一起,用逗号分割显示。
这样,就实现了将列转行的功效,最终实现了我们需要的输出结果。

2.当collect_list()/collect_set()的参数列中包含空值(NULL)时,collect_list/collect_set函数会忽略空值,而只处理非空的值,此时需加个空值判断。如下面的例子:

I am trying to collect a column with NULLs along with some values in that column...But collect_list ignores the NULLs and collects only the ones with values in it. Is there a way to retrieve the NULLs along with other values ?

SELECT col1, col2, collect_list(col3) as col3
FROM (SELECT * FROM table_1 ORDER BY col1, col2, col3)
GROUP BY col1, col2;

Actual col3 values:

0.9
NULL
NULL
0.7
0.6 

Resulting col3 values:

[0.9, 0.7, 0.6]

I was hoping that there is a hive solution that looks like this [0.9, NULL, NULL, 0.7, 0.6] after applying the collect_list.

Answer:

This function works like this, but I've found the following workaround. Add a case when statement to your query to check and keep NULLs.

SELECT col1, 
    col2, 
    collect_list(CASE WHEN col3 IS NULL THEN 'NULL' ELSE col3 END) as col3
--或者 
--  collect_list(coalesce(col3, "NULL") as col3 
FROM (SELECT * FROM table_1 ORDER BY col1, col2, col3)
GROUP BY col1, col2

Now, because you had a string element ('NULL') the whole result set is an array of strings. At the end just convert the array of strings to an array of double values.

 

(3)日期类型函数;

(3.1)Unix_timestamp():Gets current time stamp using the default time zone. 
函数返回值的类型:bigint;

(1)select unix_timestamp();
1510302062

Unix_timestamp(string date) Converts time string in format yyyy-MM-dd HH:mm:ss to Unix time stamp, return 0 if fail: unix_timestamp(‘2009-03-20 11:30:01’) = 1237573801;
函数返回值的类型:bigint;

select unix_timestamp('2017-11-10 16:24:00');
1510302240
-- 修改时间格式后:
select unix_timestamp('2017-11-10');
NULL

Unix_timestamp(string date,string pattern): Convert time string with given pattern to Unix time stamp, return 0 if fail: unix_timestamp(‘2009-03-20’, ‘yyyy-MM-dd’) = 1237532400;
函数返回值的类型:bigint;

select unix_timestamp('2017-11-10','yyyy-MM-dd');
1510243200

select unix_timestamp('2017-11-10 12:30:10','yyyy-MM-dd HH:mm:ss');
1510288210

(3.2)from_unixtime(bigint unixtime[, string format]): Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of “1970-01-01 00:00:00”;
函数返回值的类型:string;

select from_unixtime(1510243200),from_unixtime(1510243200,'yyyy-MM-dd');
       2017-11-10 00:00:00       2017-11-10

select from_unixtime(1510288210),from_unixtime(1510288210,'yyyy-MM-dd'),from_unixtime(1510288210,'yyyy-MM-dd HH:mm:ss');
       2017-11-10 12:30:10        2017-11-10                                  2017-11-10 12:30:10

(3.3)to_date(string timestamp): Returns the date part of a timestamp string: 
                                            to_date(“1970-01-01 00:00:00”) = “1970-01-01”;
函数返回值的类型:Hive 2.1.0版本之前 string, Hive 2.1.0版本之后: date;

select to_date('2017-11-10 16:48:20');
2017-11-10

(3.4)year(string date): Returns the year part of a date or a timestamp string: 
                               year(“1970-01-01 00:00:00”) = 1970, year(“1970-01-01”) = 1970;
函数返回值的类型:int;

select year("2017-11-10"),year("2017-11-10 16:54:30");
       2017               2017

(3.5)month(string date): Returns the month part of a date or a timestamp string: 
                                  month(“1970-11-01 00:00:00”) = 11, month(“1970-11-01”) = 11
函数返回值的类型:int;

select month("2017-11-10"),month("2017-11-10 16:54:30");
       11                  11

(3.6)day(string date) dayofmonth(date): Return the day part of a date or a timestamp string: 
                                                        day(“1970-11-01 00:00:00”) = 1, day(“1970-11-01”) = 1
函数返回值的类型:int;

select day("2017-11-10 16:54:30"),day("2017-11-10");
       10                         10
select dayofmonth("2017-11-10 17:02:30"),dayofmonth("2017-11-10");
       10                                10

(3.7)Hour(string date): Returns the hour of the timestamp: 
                                hour(‘2009-07-30 12:58:59′) = 12, hour(’12:58:59’) = 12
函数返回值的类型:int;

select hour("2017-11-10 17:02:30"),hour("17:02:30");
       17                          17

(3.8)minute(string date):Returns the minute of the timestamp; 函数返回值的类型:int;

select minute("2017-11-10 17:02:30"),minute("17:02:30");
       2                              2

(3.9)second(string date):Returns the second of the timestamp; 函数返回值的类型:int;

select second("2017-11-10 17:02:30"),second("17:02:30");
       30                             30

(4.0)weekofyear(string date): Return the week number of a timestamp string: 
                                    weekofyear(“1970-11-01 00:00:00”) = 44, weekofyear(“1970-11-01”) = 44
函数返回值的类型:int;

select weekofyear("2017-11-10 17:02:30"),weekofyear("2017-11-10");
       45                                 45

(4.1)datediff(string enddate, string startdate): Return the number of days from startdate to enddate:                                                                   datediff(‘2009-03-01’, ‘2009-02-27’) = 2;
函数返回值的类型:int;

select datediff("2017-11-11","2017-11-10");
       1

(4.2)date_add(date/timestamp/string startdate, tinyint/smallint/int days): 
      Add a number of days to startdate: date_add(‘2008-12-31’, 1) = ‘2009-01-01’
函数返回值的类型:Hive 2.1.0版本之前 string, Hive 2.1.0版本之后: date;

select date_add("2017-11-10",1);
2017-11-11

(4.3)date_sub(date/timestamp/string startdate, tinyint/smallint/int days):
      Subtract a number of days to startdate: date_sub(‘2008-12-31’, 1) = ‘2008-12-30’;
函数返回值的类型:Hive 2.1.0版本之前 string, Hive 2.1.0版本之后: date;

select date_sub("2017-11-11",1);
2017-11-10

(4.4)from_utc_timestamp({any primitive type}*, string timezone): 不好理解;
     Assumes given timestamp is UTC and converts to given timezone (as of Hive 0.8.0);

     Coverts a timestamp* in UTC to a given timezone(as of Hive 0.8.0).*
     timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, 
     float/double and decimal. Fractional values are considered as seconds. 
     Integer values are considered as milliseconds.. 
     E.g:from_utc_timestamp(2592000.0,'PST');  -->1970-01-31 00:00:00;
          from_utc_timestamp(2592000000,'PST');  -->1970-01-31 00:00:00;
     and from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST'); -->1970-01-30 08:00:00
函数返回值的类型:timestamp;

select from_utc_timestamp(2592000.0,'PST');
1970-01-31 00:00:00
select from_utc_timestamp(2592000.0,'PST');
1970-01-31 00:00:00
select from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');
1970-01-30 08:00:00

(4.5)to_utc_timestamp({any primitive type}*, string timezone): 不好理解;
     Assumes given timestamp is in given timezone and converts to UTC (as of Hive 0.8.0);

     Coverts a timestamp* in a given timezone to UTC (as of Hive 0.8.0). 
   * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, 
     float/double and decimal. Fractional values are considered as seconds. 
     Integer values are considered as milliseconds.. 
     E.g: to_utc_timestamp(2592000.0,'PST');   -->1970-01-31 16:00:00
           to_utc_timestamp(2592000000,'PST'); -->1970-01-31 16:00:00
     and to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');  -->1970-01-31 00:00:00
函数返回值的类型:timestamp;

select to_utc_timestamp(2592000.0,'PST');
1970-01-31 16:00:00
select to_utc_timestamp(2592000000,'PST');
1970-01-31 16:00:00
select to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');
1970-01-31 00:00:00

(4.6)current_date(): Returns the current date at the start of query evaluation (as of Hive 1.2.0). 
                            All calls of current_date within the same query return the same value.
函数返回值的类型:date;

select current_date();
2017-11-10

(4.7)current_timestamp(): Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
函数返回值的类型:timestamp;

select current_timestamp();
2017-11-10 17:49:59.992

(4.8)dd_months(string start_date, int num_months): Returns the date that is num_months after start_date (as of Hive 1.1.0). start_date is a string, date or timestamp. num_months is an integer. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date.
函数返回值的类型:string;

select add_months('2017-10-01',1);
2017-11-01
select add_months('2017-10-31',1);
2017-11-30
select add_months('2017-02-28',1);
2017-03-31

(4.9)last_day(string date): Returns the last day of the month which the date belongs to 
      (as of Hive 1.1.0). date is a string in the format ‘yyyy-MM-dd HH:mm:ss’ or ‘yyyy-MM-dd’. 
      The time part of date is ignore;
函数返回值的类型:string;

select last_day('2017-11-10'),last_day("2017-11-10 17:48:20");
       2017-11-30	          2017-11-30

(5.0)next_day(string start_date, string day_of_week): Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored.  Example: next_day(‘2015-01-14’, ‘TU’) = 2015-01-20.
函数返回值的类型:string;

select next_day('2017-11-10','Wed');
2017-11-15

(5.1)trunc(string date, string format): Returns date truncated to the unit specified by the format (as of Hive 1.2.0).  Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. 
                 Example: trunc(‘2015-03-17’, ‘MM’) = 2015-03-01.
函数返回值的类型:string;

select trunc('2017-11-10','YYYY');
2017-01-01

select trunc('2017-11-10','MM');
2017-11-01

(5.2)months_between(date1, date2): Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer.Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH:mm:ss’. The result is rounded to 8 decimal places. Example: months_between(‘1997-02-28 10:30:00’, ‘1996-10-30’) = 3.94959677;
函数返回值的类型:double;

select months_between('2017-10-01','2017-11-01');
-1.0
select months_between('2017-11-01','2017-10-01');
1.0
select months_between('2017-11-01','2017-11-01');
0.0
select months_between('2017-11-02','2017-11-01');
0.03225806
select months_between('2017-10-31','2017-11-01');
-0.03225806
select months_between('2017-11-30','2017-10-31');
1.0


(5.3)date_format(date/timestamp/string ts, string fmt): Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). 
Supported formats are Java SimpleDateFormat 
formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html.(见下图)
The second argument fmt should be constant.
Example: date_format(‘2015-04-08’, ‘y’) = ‘2015’. 
date_format can be used to implement other UDFs, 
e.g.: dayname(date) is date_format(date, ‘EEEE’);  -->返回的是date是星期几(英文全拼)
dayofyear(date) is date_format(date, ‘D’); -->返回的是date是一年中的第几天;
函数返回值的类型:string;

115837_j8xd_3204727.png
 

(4)Mathematical Functions:

The following built-in mathematical functions are supported in Hive; most return NULL when the argument(s) are NULL:
(4.1)round(double a): Returns the rounded BIGINT value of the double;
函数返回值的类型:bigint;

select round(5.5),round(5.4);
       6           5

(4.2)round(double a, int d):Returns the double rounded to d decimal places;
函数返回值的类型:double;

select round(5.52,1),round(5.52,2),round(5.52,3);
        5.5           5.52          5.52

(4.2)+New add: bround(double a):银行家舍入法(1~4:舍,6~9:进,5->前位数是偶:舍,5->前位数是奇:进);
函数返回值的类型:double;

select bround(8.25,1),bround(8.35,1);
       8.2            8.4

(4.2)+New Add2: bround(double a, int d):

select bround(8.225,2),bround(8.235,2);
       8.22            8.24

(4.3)floor(double a):Returns the maximum BIGINT value that is equal or less than the double;
函数返回值的类型:bigint;

select floor(5.0),floor(5.2),floor(5.9);
       5           5          5 

(4.4)ceil(double a), ceiling(double a): 
Returns the minimum BIGINT value that is equal or greater than the double;
函数返回值的类型:bigint;

select ceil(5.0),ceil(5.01),ceil(5.99);
       5          6          6

(4.5)rand(), rand(int seed):
Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifiying the seed will make sure the generated random number sequence is deterministic.
函数返回值的类型:double;

select 1,rand();
第一次运行结果:1	0.16656453881197997
第二次运行结果:1	0.9930518648319836
第三次运行结果:1	0.4714714776339659
第四次运行结果:1	0.4895444194153318

select 1,rand(2);
第一次运行结果:1	0.7311469360199058
第二次运行结果:1	0.7311469360199058

(4.6)exp(double a),exp(decimal a): Returns e^a where e is the base of the natural logarithm;
函数返回值的类型:double;

select exp(0.1);
1.1051709180756477

select exp(2);
7.38905609893065

select exp(2.1);
8.166169912567652

(4.7)ln(double a), ln(DECIMAL a): Returns the natural logarithm of the argument;
函数返回值的类型:double;

select ln(2);
0.6931471805599453

select ln(2.3);
0.8329091229351039

(4.8)log10(double a), log10(DECIMAL a): Returns the base-10 logarithm of the argument
函数返回值的类型:double;

(4.9)log2(double a), log2(DECIMAL a): Returns the base-10 logarithm of the argument
函数返回值的类型:double;

(5.0)log(double base, double a), log(DECIMAL base, DECIMAL a): 
      Returns the base-10 logarithm of the argument
函数返回值的类型:double;

select log(2,4)
2.0

select log(2.1,8.2);
2.835999790572199

(5.1)pow(double a,double p),power(double a, double p): Return a^p;-->a的p次方;

(5.2)sqrt(double a), sqrt(DECIMAL a): Returns the square root of a;
函数返回值的类型:double;

(5.2)sqrt(double a), sqrt(DECIMAL a): Returns the square root of a;
函数返回值的类型:double;

select sqrt(100),sqrt(0.01);
       10         0.1

(5.3)bin(BIGINT a): Returns the number in binary format;   将"十进制"数据转换为"二进制"数;
函数返回值的类型:string;

select bin(10);
       1010

(5.4)hex(BIGINT a) hex(string a) hex(BINARY a): If the argument is an int, hex returns the number as a string in hex format. Otherwise if the number is a string, it converts each character into its hex representation and returns the resulting string. 
•十六进制函数 : hex;  
说明: 如果变量是int类型,那么返回a的十六进制表示;如果变量是string类型,则返回该字符串的十六进制表示;  函数返回值的类型:string;

select hex(10);
A(或者a)
select hex("A");
41
select hex("a");
61
select hex('ab');
6162

(5.5)unhex(string a): Inverse of hex. Interprets each pair of characters as a hexidecimal number and converts to the character represented by the number.
反转十六进制函数 : unhex;  说明: 返回该十六进制字符串所代码的字符串
函数返回值的类型:string;

select unhex('616263'),unhex(616263);
       abc             abc

(5.6)conv(BIGINT num, int from_base, int to_base), conv(STRING num, int from_base, int to_base):
Converts a number from a given base to another;
进制转换函数 : conv();
说明: 将数值num从from_base进制转化到to_base进制
函数返回值的类型:string;

select conv(17,10,16);  -- 将17 从10进制 转换为16进制数;
11
select conv(17,10,2);   -- 将17 从10进制 转换为2进制数;
10001

(5.7)abs(double a): Returns the absolute value;
绝对值函数 : abs; 说明: 返回数值a的绝对值
函数返回值的类型:double;

select abs(-3.9);
3.9
select abs(3.9);
3.9

(5.8)pmod(int a, int b) pmod(double a, double b):
     Returns the positive value of a mod b;
正取余函数 : pmod; 说明: 返回的a除以b的余数;
函数返回值的类型:int double;

select pmod(9,4);
1
select pmod(-9,4);
3

(5.9)sin(double a), sin(DECIMAL a): Returns the sine of a (a is in radians);
正弦函数 : sin;  说明: 返回a的正弦值
函数返回值的类型:double;

select sin(0.8);
0.7173560908995228

(6.0)sin(double a), asin(DECIMAL a): Returns the arc sin of x if -1<=a<=1 or null otherwise;
反正弦函数 : asin;  说明: 返回a的反正弦值
函数返回值的类型:double;

select asin(0.7173560908995228);
0.8

(6.1)cos(double a), cos(DECIMAL a): Returns the cosine of a (a is in radians);
余弦函数 : cos; 说明: 返回a的余弦值;
函数返回值的类型:double;

select cos(0.9);
0.6216099682706644

(6.2)acos(double a), acos(DECIMAL a): Returns the arc cosine of x if -1<=a<=1 or null otherwise;
反余弦函数 : acos; 说明: 返回a的反余弦值;
函数返回值的类型:double;

select acos(0.6216099682706644);
0.9

(6.3)tan(double a), tan(DECIMAL a): Returns the tangent of a (a is in radians);
正切函数 : tan;  说明: 返回a的正切值;
函数返回值的类型:double;

select tan(1);
1.5574077246549023

(6.4)atan(double a), atan(DECIMAL a): Returns the arctangent of a;
反正切函数 : atan; 说明: 返回a的反正切值;
函数返回值的类型:double;

select atan(1.5574077246549023);
1.0

(6.5)degrees(double a), degrees(DECIMAL a):
     Converts value of a from radians to degree; -->将弧度值转换角度值;
函数返回值的类型:double;

select degrees(30);
1718.8733853924698

(6.6)radians(double a): 
     Converts value of a from degrees to radians; -->将角度值转换成弧度值;
函数返回值的类型:double;

select radians(30);
0.5235987755982988

(6.7)positive(int a), positive(double a): Returns a;
函数返回值的类型: int double;

select positive(-10),positive(10);
        -10           10

(6.8)negative(int a), negative(double a): Returns -a;
函数返回值的类型: int double;

select negative(-10),negative(10);
       10             -10

(6.9)sign(double a), sign(DECIMAL a): Returns the sign of a as ‘1.0’ or ‘-1.0’
     如果a是正数则返回1.0,是负数则返回-1.0,否则返回0.0
函数返回值的类型: float;

select sign(-10),sign(0),sign(10);
       -1.0       0.0     1.0

(7.0)e(): Returns the value of e;  数学常数e;
函数返回值的类型: double;

select e();
2.718281828459045

(7.1)pi(): Returns the value of pi;  数学常数pi;
函数返回值的类型: double;

select pi();
3.141592653589793

(7.2)factorial(int a): 求a的阶乘 ;
函数返回值的类型: bigint;

select factorial(2);  <=> 2*1;
2
select factorial(3);  <=> 3*2*1;
6
select factorial(4);  <=> 4*3*2*1;
24

(7.3)cbrt(double a): 求a的立方根;
函数返回值的类型: double;

select cbrt(8),cbrt(27);
       2        3

(7.4)shiftleft(tinyint/smallint/int a, int b): 按位左移;
    Purpose: Shifts an integer value left by a specified number of bits. As the most significant 
    bit is taken out of the original value, it is discarded and the least significant bit becomes 0. 
    In computer science terms, this operation is a "logical shift".
Usage notes:
The final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero.
Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any value by 1 is the same as multiplying it by 2, as long as the value is small enough; larger values eventually become negative when shifted, as the sign bit is set. Starting with the value 1 and shifting it left by N positions gives the same result as 2 to the Nth power, or pow(2,N).

Return type: Same as the input value
Added in: CDH 5.5.0 (Impala 2.3.0)

select shiftleft(1,0); /* 00000001 -> 00000001 */
+-----------------+
| shiftleft(1, 0) |
+-----------------+
| 1               |
+-----------------+

select shiftleft(1,3); /* 00000001 -> 00001000 */
+-----------------+
| shiftleft(1, 3) |
+-----------------+
| 8               |
+-----------------+

select shiftleft(8,2); /* 00001000 -> 00100000 */
+-----------------+
| shiftleft(8, 2) |
+-----------------+
| 32              |
+-----------------+

select shiftleft(127,1); /* 01111111 -> 11111110 */
+-------------------+
| shiftleft(127, 1) |
+-------------------+
| -2                |
+-------------------+

select shiftleft(127,5); /* 01111111 -> 11100000 */
+-------------------+
| shiftleft(127, 5) |
+-------------------+
| -32               |
+-------------------+

select shiftleft(-1,4); /* 11111111 -> 11110000 */
+------------------+
| shiftleft(-1, 4) |
+------------------+
| -16              |
+------------------+

(7.5)shiftright(tinyint/smallint/int a, int b): 按位右移;
Purpose: Shifts an integer value right by a specified number of bits. As the least significant bit is taken out of the original value, it is discarded and the most significant bit becomes 0. In computer science terms, this operation is a "logical shift".
Usage notes:
Therefore, the final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero.
Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any positive value right by 1 is the same as dividing it by 2. Negative values become positive when shifted right.
Return type: Same as the input value
Added in: CDH 5.5.0 (Impala 2.3.0)

select shiftright(16,0); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 0) |
+-------------------+
| 16                |
+-------------------+

select shiftright(16,4); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 4) |
+-------------------+
| 1                 |
+-------------------+

select shiftright(16,5); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 5) |
+-------------------+
| 0                 |
+-------------------+

select shiftright(-1,1); /* 11111111 -> 01111111 */
+-------------------+
| shiftright(-1, 1) |
+-------------------+
| 127               |
+-------------------+

select shiftright(-1,5); /* 11111111 -> 00000111 */
+-------------------+
| shiftright(-1, 5) |
+-------------------+
| 7                 |
+-------------------+

(7.6)shiftrightunsigned(tinyint/smallint/int a, int b): 无符号按位右移(<<<);
      shiftrightunsigned(bigint a,int b)
函数返回值类型:int, bigint;

select shiftrightunsigned(16,4);
1

(7.7) greatest(v1,v2,v3,...):求最大值,如果其中包含空值,则返回空值;
      Returns the greatest value of the list of values (as of Hive 1.1.0). 
      Fixed to return NULL when one or more arguments are NULL, and strict type restriction relaxed, 
      consistent with ">" operator (as of Hive 2.0.0).

select greatest(2,3,1);
3
greatest(2,3,NULL);
NULL
select greatest('1','2','3');
3
select greatest('1','2','a');
a
select greatest('1','aa','a');
aa 
select greatest('zz','aa','a');
aa
select greatest('1a','aa','a');
aa 

(7.8)least(v1,v2,v3,...):求最小值;
     Returns the least value of the list of values (as of Hive 1.1.0). Fixed to return NULL when one 
     or more arguments are NULL, and strict type restriction relaxed, consistent with "<" operator 
     (as of Hive 2.0.0).

select least(1,2,3); --> 1 
select least(-100,2,-3); --> -100
select least('1','2','3'); --> 1
select least('1','2','a'); --> 1
select least('b','aa','a'); --> a 

(5)String Functions;

(5.1)ascii(string str):Returns the numeric value of the first character of str;返回str中首个ASCII字符串的整数值; 返回值类型:int;

select ascii('a'); --> 97 
select ascii('ab'); --> 97
select ascii('ba'); --> 98

(5.2)base64(binary bin):  -- 后续再补充这个函数; 返回值类型:string;
     Converts the argument from binary to a base- 64 string (as of 0.12.0);
     将二进制bin转换成64位的字符串;
If your parameter is already in binary, just use:  base64(bin_field);
Otherwise, if it is in text format and you want to convert it to Binary UTF-8 then to base 64, combine:  base64(encode(text_field, 'UTF-8'));

select base64(encode('a','UTF-8'));
YQ==
select base64(binary('1'));
MQ==

unbase64(string str): Converts the argument from a base 64 string to BINARY. (As of Hive 0.12.0.).
将64位的字符串转换二进制值; 
返回值类型:string;

select unbase64("MQ==");
1

(5.3)chr(bigint|double A): Returns the ASCII character having the binary equivalent to A (as of Hive 1.3.0 and 2.1.0). If A is larger than 256 the result is equivalent to chr(A % 256). Example: select chr(88); returns “X”.string; 返回值类型:string;

select chr(97);
a

(5.4)concat(string|binary A, string|binary B…):Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order. e.g. concat(‘foo’, ‘bar’) results in ‘foobar’. Note that this function can take any number of input strings.  返回值类型:string;

select concat('a','b');
ab

(5.5)concat_ws(string SEP, string A, string B…):Like concat() above, but with custom separator SEP.

select concat_ws('-','a','b');
a-b

(5.6)concat_ws(string SEP, array<string>):Like concat_ws() above, but taking an array of strings. (as of Hive 0.9.0); 返回值类型:string;

(5.7)/(5.8):

(5.9)decode(binary bin, string charset):Decodes the first argument into a String using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. (As of Hive 0.12.0.);
使用指定的字符集charset将二进制值bin解码成字符串,支持的字符集有:'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16',如果任意输入参数为NULL都将返回NULL;
返回值类型:string;   --> 暂时不知道如何运用;
(6.0)encode(string src, string charset):Encodes the first argument into a BINARY using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. (As of Hive 0.12.0.);
使用指定的字符集charset将字符串编码成二进制值,支持的字符集有:'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16',如果任一输入参数为NULL都将返回NULL;
返回值类型:binary;    --> 暂时不知道如何运用;
(6.1)elt(N int,str1 string,str2 string,str3 string,…): Return string at index number. 
For example elt(2,’hello’,’world’) returns ‘world’. Returns NULL if N is less than 1 or greater than the number of arguments.  
(See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt)
返回值类型:string;

select elt(1,'face','book'); -->face
select elt(0,'face','book'); -->NULL
select elt(3,'face','book'); -->NULL

(6.2)field(val T,val1 T,val2 T,val3 T,…):Returns the index of val in the val1,val2,val3,… list or 0 if not found. For example field(‘world’,’say’,’hello’,’world’) returns 3.All primitive types are supported, arguments are compared using str.equals(x). If val is NULL, the return value is 0.  
(See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field)  

返回值类型:int;

select field('world','say','hello','world'); -->3
select field('world','say','hello'); --> 0 
-- 只显示第一个出现的位置;
select field('hi','hello','hi','how','hi'); --> 2

(6.3)find_in_set(string str, string strlist):Returns the first occurance of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. e.g. find_in_set(‘ab’, ‘abc,b,ab,c,def’) returns 3;  返回值类型:int;

select find_in_set('ab', 'abc,b,ab,c,def'); --> 3
select find_in_set('a,b', 'abc,b,ab,c,def'); --> 0
select find_in_set('a,b', 'abc,NULL,ab,c,def'); --> 0
select find_in_set('NULL', 'abc,NULL,ab,c,def'); --> 2 
select find_in_set(NULL, 'abc,NULL,ab,c,def'); --> NULL

(6.4)format_number(number x, int d):Formats the number X to a format like ‘#,###,###.##’, rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (as of Hive 0.10.0);  返回值类型:string;

select format_number(100.1232,2); --> 100.12
select format_number(100.1262,2); --> 100.13

(6.5)get_json_object(string json_string, string path):Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid.NOTE: The json path can only have the characters [0-9a-z_], i.e., no upper-case or special characters. Also, the keys *cannot start with numbers.* This is due to restrictions on Hive column names.从指定路径上的JSON字符串抽取出JSON对象,并返回这个对象的JSON格式,如果输入的JSON是非法的将返回NULL,注意此路径上JSON字符串只能由数字 字母 下划线组成且不能有大写字母和特殊字符,且key不能由数字开头,这是由于Hive对列名的限制;

返回值类型:string;

select  get_json_object('{"store":
                                  {"fruit":\[{"weight":8,"type":"apple"},
                                             {"weight":9,"type":"pear"}],
                                   "bicycle":{"price":19.95,"color":"red"}
                                  },
                         "email":"amy@only_for_json_udf_test.net",
                         "owner":"amy"
                         }
                        ','$.owner'
                        );
amy


select  get_json_object('{"store":
                                  {"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
                                   "bicycle":{"price":19.95,"color":"red"}
                                  },
                         "email":"amy@only_for_json_udf_test.net",
                         "owner":"amy"
                         }
                        ','$.ownEr'
                        )
                         ;
NULL

(6.6)in_file(string str, string filename):Returns true if the string str appears as an entire line in filename. 如果文件名为filename的文件中有一行数据与字符串str匹配成功就返回true;返回值类型:boolean; --暂时没用到;

(6.7):initcap(string A):Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace. (As of Hive 1.1.0.);将字符串A转换第一个字母大写其余字母的字符串;     返回值类型:string;

select initcap('iniCapinp');
Inicapinp

(6.8)instr(string str, string substr):Returns the position of the first occurrence of substr in str. Returns null if either of the arguments are null and returns 0 if substr could not be found in str. Be aware that this is not zero based. The first character in str has index 1.查找字符串str中子字符串substr出现的位置,如果查找失败将返回0,如果任一参数为Null将返回null,注意位置为从1开始的;
返回值类型:int;

select instr('ambghabxyabef','ab'); --> 6

(6.9)length(string A):Returns the length of the string; 返回值类型:int;

select length('hello'); --> 5
select length('你好'); --> 2

(7.0)levenshtein(string A, string B):Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein(‘kitten’, ‘sitting’) results in 3. 计算两个字符串之间的差异大小 ;具体的比较方式是按字符串中的字符的顺序一个一个进行比较,相同位置上的字符不同则加1;
返回值类型:int;

select levenshtein('a','ab'); --> 1 
select levenshtein('ab','a'); --> 1 
select levenshtein('a','abc'); --> 2 
select levenshtein('abc','ad'); --> 2 
select levenshtein('abcg','adef'); --> 3
select levenshtein('abcgh','adehf'); --> 4
select levenshtein('abegh','adehf'); --> 3 
select levenshtein('abeh','adehf'); --> 2

(7.1)locate(string substr, string str[, int pos]):Returns the position of the first occurrence of substr in str after position pos; 查找字符串str中的pos位置后字符串substr第一次出现的位置; 返回值类型:int;

select locate('bar','foobarbar',5); --> 7

(7.2)lower(string A) lcase(string A): Returns the string resulting from converting all characters of B to lower case. For example, lower(‘fOoBaR’) results in ‘foobar’.  返回值类型:string;

(7.3)upper(string A) ucase(string A):Returns the string resulting from converting all characters of A to upper case e.g. upper(‘fOoBaR’) results in ‘FOOBAR’;  返回值类型:string;

(7.4)lpad(string str, int len, string pad):Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.从左边开始对字符串str使用字符串pad填充,最终len长度为止,如果字符串str本身长度比len大的话,将去掉多余的部分; 返回值类型:string;

select lpad('hellomike',10,'a'); --> ahellomike
select lpad('hellomike',15,'a'); --> aaaaaahellomike
select lpad('hellomike',15,'ab'); --> abababhellomike
select lpad('hellomike',16,'ab'); --> abababahellomike 

(7.5)rpad(string str, int len, string pad):Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.从右边开始对字符串str使用字符串pad填充,最终len长度为止,如果字符串str本身长度比len大的话,将去掉多余的部分;  返回值类型:string;

select rpad('hellomike',16,'ab'); --> hellomikeabababa
select rpad('hellomike',8,'ab'); --> hellomik

(7.6)ltrim(string A):Returns the string resulting from trimming spaces from the beginning(left hand side) of A e.g. ltrim(‘ foobar ‘) results in ‘foobar ‘; 去掉字符串A前面的空格; 返回值类型:string;

select ltrim('   hello  x'); --> hello  x
select rtrim('   hello  ');
空格空格hello

(7.7)rtrim(string A):Returns the string resulting from trimming spaces from the end(right hand side) of A e.g. rtrim(‘ foobar ‘) results in ‘ foobar’;  去掉字符串后面出现的空格; 返回值类型:string;

select ltrim('   hello  x'); --> hello  x
select rtrim('   hello  ');
空格空格hello

(7.8)parse_url(string urlString, string partToExtract [, string keyToExtract]):
Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. 
EG: parse_url(‘http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1’, ‘HOST’) ;
Returns ‘facebook.com’.
Also a value of a particular key in QUERY can be extracted by providing the key as the third argument.
EG: parse_url(‘http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1’, ‘QUERY’, ‘k1’) ; returns ‘v1’.
返回从URL中抽取指定部分的内容,参数url是URL字符串,而参数partToExtract是要抽取的部分,这个参数包含(HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO,例如:parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') ='facebook.com',如果参数partToExtract值为QUERY则必须指定第三个参数key;  
如:parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') =‘v1’;

返回值类型:string;

SELECT parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');
v1

(7.9)printf(String format, Obj… args):Returns the input formatted according do printf-style format strings (as of Hive 0.9.0); 按照printf风格格式输出字符串; 返回值类型:string;

SELECT printf("Hello World %d %s", 100, "days")FROM src LIMIT 1;
"Hello World 100 days"

(8.0)regexp_extract(string subject, string pattern, int index):Returns the string extracted using the pattern. e.g. regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 2) returns ‘bar.’ Note that some care is necessary in using predefined character classes: using ‘\s’ as the second argument will match the letter s; ‘s’ is necessary to match whitespace, etc. The ‘index’ parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the ‘index’ or Java regex group() method. 抽取字符串subject中符合正则表达式pattern的第index个部分的子字符串,注意些预定义字符的使用,如第二个参数如果使用'\s'将被匹配到s,'\\s'才是匹配空格;
参数解释:
其中:
str是被解析的字符串
regexp 是正则表达式
idx是返回结果 取表达式的哪一部分  默认值为1。
0表示把整个正则表达式对应的结果全部返回
1表示返回正则表达式中第一个() 对应的结果 以此类推 
注意点:
要注意的是idx的数字不能大于表达式中()的个数。
否则报错; ps:返回值类型:string;

如:
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',0) from default.dual;
得到的结果为:
x=18abc
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',1) from default.dual;
得到的结果为:
18
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',2) from default.dual;
得到的结果为:
abc
我们当前的语句只有2个()表达式 所以当idx>=3的时候 就会报错
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '2': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)  on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3

(8.1)regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT, e.g. regexp_replace(“foobar”, “oo|ar”, “”) returns ‘fb.’ Note that some care is necessary in using predefined character classes: using ‘\s’ as the second argument will match the letter s; ‘s’ is necessary to match whitespace, etc. 按照Java正则表达式PATTERN将字符串INTIAL_STRING中符合条件的部分成REPLACEMENT所指定的字符串,如里REPLACEMENT这空的话,抽符合正则的部分将被去掉  如:regexp_replace("foobar", "oo|ar", "") = 'fb.' 注意些预定义字符的使用,如第二个参数如果使用'\s'将被匹配到s,'\\s'才是匹配空格; ps:返回值类型:string;

select regexp_replace("foobar", "oo|ar", "");
fb

(8.2)repeat(string str, int n): Repeat str n times; 返回值类型:string;

select repeat('hello',3);
hellohellohello

(8.3)replace(string A, string OLD, string NEW): Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0). 
Example: select replace(“ababab”, “abab”, “Z”); returns “Zab”.
返回值类型:string;

select replace('Ethan-Avner','Avner','Weng');
Ethan-Weng

(8.4)reverse(string A): Returns the reversed string; 反转字符串; 返回值类型:string;

select reverse('abcd');
dcba

(8.5)sentences(string str, string lang, string locale):Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The ‘lang’ and ‘locale’ are optional arguments. 
e.g. sentences(‘Hello there! How are you?’) returns ( (“Hello”, “there”), (“How”, “are”, “you”) )
字符串str将被转换成单词数组,如:sentences('Hello there! How are you?') =( ("Hello", "there"), ("How", "are", "you") )
返回值类型:array<array>;

select sentences('Hello there! How are you?');
[["Hello","there"],["How","are","you"]]

(8.6)soundex(string A): Returns soundex code of the string (as of Hive 1.2.0). 
For example, soundex(‘Miller’) results in M460. 将普通字符串转换成soundex字符串;
返回值类型:string;

select soundex('Miller');
M460

(8.7)space(int n): Return a string of n spaces; 返回n个空格的字符串;  返回值类型:string;

select space(2); --> ' '(2个空格);

(8.8)split(string str, string pat): Split str around pat (pat is a regular expression); 
按照正则表达式pat来分割字符串str,并将分割后的数组字符串的形式返回;
返回值类型:array;

SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
["one", "two", "three"]

(8.9)str_to_map(text[, delimiter1, delimiter2]): Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ‘,’ for delimiter1 and ‘=’ for delimiter2. 将字符串str按照指定分隔符转换成Map,第一个参数是需要转换字符串,第二个参数是键值对之间的分隔符,默认为逗号;第三个参数是键值之间的分隔符,默认为"=";
返回值类型:map<string,string> ;

select str_to_map('a=1,b=2,c=3,d=4',',','=');
{"a":"1","b":"2","c":"3","d":"4"}
select str_to_map('a1b2c3d4',',','=');
{"a1b2c3d4":null}

str_to_map(text[, delimiter1, delimiter2]):
Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.

使用两个分隔符将文本拆分为键值对。 Delimiter1将文本分成K-V对,Delimiter2分割每个K-V对。
对于delimiter1默认分隔符是',',对于delimiter2默认分隔符是'='。

案例1:hive> select str_to_map('aaa:11&bbb:22', '&', ':')
-- {"bbb":"22","aaa":"11"}

案例2: hive> select str_to_map('aaa:11&bbb:22', '&', ':')['aaa']
-- 11

(9.0)substr(string|binary A, int start) substring(string|binary A, int start):
Returns the substring or slice of the byte array of A starting from start position till the end of string A e.g. substr(‘foobar’, 4) results in ‘bar’; 对于字符串A,从start位置开始截取字符串并返回;
返回值类型: string;

select substr('abcd',2);
bcd

(9.1)substr(string|binary A, int start, int len) substring(string|binary A, int start, int len):
Returns the substring or slice of the byte array of A starting from start position with length len. 
e.g. substr(‘foobar’, 4, 1) results in ‘b’;
对于二进制/字符串A,从start位置开始截取长度为length的字符串并返回;
返回值类型: string;

select substr('abcd',1,3);
abc

(9.2)translate(string|char|varchar input, string|char|varchar from, string|char|varchar to):
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well (available as of Hive 0.10.0; char/varchar support added as of Hive 0.14.0.); 将input出现在from中的字符串替换成to中的字符串 如:translate("MOBIN","BIN","M")="MOM";
返回值类型: string;

select translate('abcdef', 'ada', '192') ;
1bc9ef

(6)Collection Functions 集合函数;

(6.1)size(Map):Returns the number of elements in the map type; 求map的长度; 返回值类型: int;

select size(map('100','tom','101','mary'));
2

size(Array):Returns the number of elements in the array type; 求数组的长度; 返回值类型: int;

select size(array('100','101','102','103'));
4

(6.2)map_keys(Map): Returns an unordered array containing the keys of the input map;
返回map中的所有key;  返回值类型: array;

select map_keys(map('100','tom','101','mary'));
["100","101"]

map_values(Map): Returns an unordered array containing the values of the input map;
返回map中的所有value; 返回值类型: array;

select map_values(map('100','tom','101','mary'));
["tom","mary"]

(6.3)array_contains(Array, value): Returns TRUE if the array contains value;如该数组Array<T>包含value返回true,否则返回false; 返回值类型: boolean;

select array_contains(array(1,2,3),2);
true;
select array_contains(array('100','101','102','103'),'101');
true

(6.4)sort_array(Array): Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0); 按自然顺序对数组进行排序并返回;
返回值类型: array;

select sort_array(array('100','104','102','103'));
["100","102","103","104"]

(7)Built-in Aggregate Functions (UDAF);

(7.1)collect_set(col):Returns a set of objects with duplicate elements eliminated;
     collect_list(col):Returns a list of objects with duplicates. (As of Hive 0.13.0.)
-- 参见头部;
(7.2)ntile(INTEGER x): Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.); --> 暂时不知道怎么用;
返回值类型:integer;

(7.3)histogram_numeric(col, b):Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights; 返回值类型: array<struct {‘x’,’y’}>;

SELECT histogram_numeric(val, 3) FROM src;
[{"x":100,"y":14.0},{"x":200,"y":22.0},{"x":290.5,"y":11.0}]

(8)Built-in Table-Generating Functions (UDTF) : 表生成函数;

Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows. 
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTF's can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported

(8.1)inline(ARRAY<STRUCT[,STRUCT]>): Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.);  
将结构体数组提取出来并插入到表中; 
返回值类型: inline(ARRAY<STRUCT[,STRUCT]>);

select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
A	10	2015-01-01
B	20	2016-02-02
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-02-02
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
A	10	2015-01-01
B	20	2016-02-02
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-02-02

(8.2)explode(ARRAY<T> a): explode() takes in an array as an input and outputs the elements of the array as separate rows. UDTF’s can be used in the SELECT expression list and as a part of LATERAL VIEW. 对于a中的每个元素,将生成一行且包含该元素;返回值类型: explode(ARRAY<T> a);参见本文头部;
explode(ARRAY):Returns one row for each element from the array..;每行对应数组中的一个元素;
explode(MAP): Returns one row for each key-value pair from the input map with two columns in each row: one for the key and another for the value. (As of Hive 0.8.0.).
每行对应每个map键-值,其中一个字段是map的键,另一个字段是map的值;

(8.3)posexplode(ARRAY<T> a):Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array. 与explode类似,不同的是还返回各元素在数组中的位置;

select posexplode(array('A','B','C'));
0	A
1	B
2	C
select posexplode(array('A','B','C')) as (pos,val);
pos val
0	A
1	B
2	C
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
0	A
1	B
2	C
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
pos val
0	A
1	B
2	C

(8.4)stack(int r,T1 V1,...,Tn/r Vn): Breaks up n values V1,...,Vn into r rows. Each row will have n/r columns. r must be constant. 把N列转换成R行,每行有N/R个字段,其中R必须是个常数;

select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
A	10	2015-01-01
B	20	2016-01-01
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-01-01
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
A	10	2015-01-01
B	20	2016-01-01
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-01-01

(8.5)json_tuple(string jsonStr,string k1,...,string kn): Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call. 从一个JSON字符串中获取多个键并作为一个元组返回,与get_json_object不同的是此函数能一次获取多个键值;

json_tuple() UDTF于Hive 0.7版本引入,它输入一个names(keys)的set集合和一个JSON字符串,返回一个使用函数的元组。从一个JSON字符串中获取一个以上元素时,这个方法比GET_JSON_OBJECT更有效。在任何一个JSON字符串被解析多次的情况下,查询时只解析一次会更有效率,这就是JSON_TRUPLE的目的。JSON_TUPLE作为一个UDTF,你需要使用LATERAL VIEW语法。

select a.timestamp, 
       get_json_object(a.appevents, '$.eventid'), 
       get_json_object(a.appenvets, '$.eventname') 
from log a;
可以转换为:
select a.timestamp, 
       b.*
from log a 
lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

(8.6)parse_url_tuple(string urlStr,string p1,...,string pn):Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

parse_url_tuple() UDTF与parse_url()类似,但是可以抽取指定URL的多个部分,返回一个元组。将key添加在QUERY关键字与:后面,例如:parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2')返回一个具有v1和v2的元组。这个方法比多次调用parse_url() 更有效率。所有的输入和输出类型均为string。

SELECT b.*
FROM src 
LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;

(9)Functions for Text Analytics

(9.1)context_ngrams(array<array>, array, int K, int pf):Returns the top-k contextual N-grams from a set of tokenized sentences, given a string of “context”. See StatisticsAndDataMining
(https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining) for more information. 
context_ngram()允许你预算指定上下文(数组)来去查找子序列从给定一个字符串上下文中,返回出现频率在Top-K内,符合指定pattern的词汇
返回值类型:array<struct<string,double>>;

SELECT context_ngrams(sentences(lower(tweet)), 2, 100 , 1000) FROM twitter;

(9.2)ngrams(array<array<string>>, int N, int K, int pf): Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF. See StatisticsAndDataMining 
(https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining) for more information. 
for more information. 
从给定一个字符串上下文中,返回出现次数TOP K的的子序列,n表示子序列的长度,
返回值类型:array<struct<string,double>>;

SELECT ngrams(sentences(lower(tweet)), 2, 100, 1000) FROM twitter;

 

 

 

 

 

 

转载于:https://my.oschina.net/u/3204727/blog/1571101

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Hive是一个开源的数据仓库工具,它基于Hadoop构建,具有SQL的能力。Hive提供了许多内置的函数,用于处理和转换数据。以下是Hive函数的一些常用分类和相关函数的中文文档: 1. 字符串函数: - CONCAT_WS:串联多个字符串,并用指定的分隔符进行分隔。 - LENGTH:返回字符串的长度。 - SUBSTRING:返回指定位置的字符串子串。 - TRIM:去除字符串两侧的空格。 - LOWER/UPPER:将字符串转换为小写/大写。 2. 数值函数: - ROUND/FLOOR/CEIL:对数值进行四舍五入、向下取整、向上取整。 - ABS:返回数值的绝对值。 - MAX/MIN:返回一组数值的最大值和最小值。 - RAND:返回一个0到1之间的随机数。 3. 时间函数: - FROM_UNIXTIME:将Unix时间戳转换为指定格式的日期字符串。 - DATE_FORMAT:将日期字符串按照指定格式进行格式化。 - CURRENT_TIMESTAMP:返回当前的系统时间戳。 4. 条件函数: - CASE WHEN THEN ELSE END:实现条件判断和分支处理。 - COALESCE:返回第一个非NULL表达式的值。 - IF:根据条件判断返回不同的值。 5. 集合函数: - COUNT:统计表中行的数量。 - SUM/AVG:计算数值列的和/平均值。 - MAX/MIN:计算数值列的最大值/最小值。 这些只是Hive函数中的一小部分,涵盖了字符串处理、数值计算、日期时间处理、条件判断以及集合统计等方面的功能。详细的Hive函数大全中文文档可以在Hive官方文档或相关技术网站上找到。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值