Hive Functions

1. Built-in Functions

Type conversion.

cast(expr as <type>)

For example: cast('1' as bigint)

select cast(money as bigint);

String splitting.

split(string str, string pat)

select split('nihao|hello|nice','\\|');  -- returns ["nihao","hello","nice"]

Extracting substrings with a regular expression.

regexp_extract(string subject, string pattern, int index)

select regexp_extract('hello<B>nice</B>haha','<B>(.*)</B>',1);  -- returns 'nice'

select regexp_extract(column_name, pattern, group_index)

Trimming leading and trailing whitespace from a string.

trim(string A)

Trim both sides:

select trim('            hello         ');

Trim the left side:

select ltrim('           hello         ');

Trim the right side:

select rtrim('           hello         ');

Aggregate functions over a specified column.

sum(col)

avg(col)

min(col)

max(col)

select subject, sum(score)
from table_name
group by subject;

Every column in the select list must either appear in the group by clause or be used inside an aggregate function.

String concatenation.

concat(string A, string B...)
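A quick illustration of concat; the expected output is shown as a comment:

select concat('hello', ' ', 'world');  -- returns 'hello world'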

Substring extraction.

select substr('abcde',3,2);  -- returns 'cd' (the index is 1-based)

The explode function.

select explode(split("nice|good|well","\\|"));

The string nice|good|well is split into an array, and each element becomes its own row:

nice
good
well



case when

Typically used when scoring or grading.

Prepare the data.

tom,95

hua,90

hong,100

lele,85

kaka,70

kebi,60

ming,55

kang,78

lolo,93

Create the table:

create table student_level(name string, score int)
row format delimited fields terminated by ",";

Load the data and check it:

load data local inpath '/opt/testdata/hive/student_score.txt' into table student_level;
select * from student_level;
select name, score,
  case when score >= 90 then 'very good'
       when score >= 80 and score < 90 then 'double good'
       when score >= 70 and score < 80 then 'good'
       when score >= 60 and score < 70 then 'go on'
       else 'zhencai'
  end as level
from student_level;

A practical application of explode.

tom a,b,c

jim b,c,d

tony a,c,d

Create the table:

create table test1(name string, subject string)
row format delimited
fields terminated by " ";

Load the data:

load data local inpath '/opt/testData/hive/student_info.txt' into table test1;

Query with LATERAL VIEW:

select name, sub
from test1
LATERAL VIEW explode(split(subject, ',')) temp as sub;

| tom   | a    |
| tom   | b    |
| tom   | c    |
| jim   | b    |
| jim   | c    |
| jim   | d    |
| tony  | a    |
| tony  | c    |
| tony  | d    |

Date functions.

The date_format function:

hive (gmall)> select date_format('2020-03-05','yyyy-MM');  -- returns '2020-03'

 

 

2. Processing JSON Data

Given the following JSON data.

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

Typical big-data sources: web crawling (Python + Scrapy), front-end and back-end event-tracking data, and business data.

Create the table and load the data.

create table json(data string);

load data local inpath '/opt/testData/hive/json.txt' into table json;

Query the JSON data:

select get_json_object(data, '$.movie') as movie,
  get_json_object(data, '$.rate') as rate
from json;
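A side note: when several fields are pulled from the same JSON string, each get_json_object call re-parses it. Hive's json_tuple UDTF parses the string once; a sketch of the equivalent query:

select t.movie, t.rate
from json
lateral view json_tuple(data, 'movie', 'rate') t as movie, rate;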

3. Window Functions

Prepare the data.

cookie1,2015-04-10,1
cookie1,2015-04-13,3
cookie1,2015-04-11,5
cookie1,2015-04-12,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie1,2015-04-14,4
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6

Create the table.

create table cookie1(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';

Load the data.

load data local inpath "/usr/datadir/cookie1.txt" into table cookie1;

sum(pv) over()

Code:

select cookieid, createtime, pv,
   sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- running total: first row through current row
   sum(pv) over (partition by cookieid order by createtime) as pv2, -- default frame ends at the current row, so this matches pv1 here
   sum(pv) over (partition by cookieid) as pv3, -- total pv of the whole group
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, -- current row plus the 3 rows before it
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5, -- 3 rows before through 2 rows after
   sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 -- current row through the end of the group
from cookie1;

3.1 Window Slicing (ntile)

Prepare the data.

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Create the table.

create table cookie2(cookieid string, createtime string, pv int)
row format delimited
fields terminated by ',';

Load the data.

load data local inpath '/opt/testdata/hive/cookie2.txt' into table cookie2;

Note: the functions below do not support rows between.

 

ntile(n) over()

Splits each group's rows into n buckets in order; typically used to select the top 1/n of the data.

ntile(2) over (partition by cookieid order by createtime) as rn1

ntile(3) over (partition by cookieid order by createtime) as rn2,

Without partition by, all rows are divided into buckets together:

ntile(4) over (order by createtime) as rn3

Code:

select cookieid,createtime,pv,
  ntile(2) over (partition by cookieid order by createtime) as rn1,
  ntile(3) over (partition by cookieid order by createtime) as rn2,
  ntile(4) over (order by createtime) as rn3
from cookie2
order by cookieid,createtime;

Result:

For example: for each cookie, find the top third of days by pv.

create table cookie_temp
as
select cookieid,createtime,pv,
  ntile(3) over (partition by cookieid order by pv desc) as rn 
from cookie2;

Result:

Taking the rows where rn = 1 gives the top third of days by pv; a follow-up query is sketched below.
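A minimal query to pull those rows from the temporary table:

select cookieid, createtime, pv
from cookie_temp
where rn = 1;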

 

3.2 Window Ranking

row_number() over(), rank() over(), dense_rank() over()

Rank rows within each group and record the position; typically used to take the top n. They differ in how ties are handled: rank gives tied rows the same position and skips the following positions, dense_rank gives tied rows the same position without skipping, and row_number assigns a unique, consecutive number regardless of ties.

Code:

select cookieid,createtime,pv,
  rank() over (partition by cookieid order by pv desc) as rn1,
  dense_rank() over (partition by cookieid order by pv desc) as rn2,
  row_number() over (partition by cookieid order by pv desc) as rn3
from cookie2 
where cookieid='cookie1';

Result:
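Building on this, a common top-n pattern wraps row_number in a subquery; a sketch using the cookie2 table above to take each cookie's 3 highest-pv days:

select cookieid, createtime, pv
from (
  select cookieid, createtime, pv,
    row_number() over (partition by cookieid order by pv desc) as rn
  from cookie2
) t
where rn <= 3;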

3.3 Shifting Rows (LAG and LEAD)

Prepare the data.

cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55

Create the table.

create table cookie3(cookieid string, createtime string, url string)
row format delimited fields terminated by ',';

Load the data.

load data local inpath '/opt/testData/hive/cookie3.txt' into table cookie3;


LAG(col,n,DEFAULT)

Returns the value of col from the row n positions earlier in the partition, i.e. the column is shifted so each row can see an earlier value.

The first argument is the column name.

The second argument is n, how many rows back to look (optional, default 1).

The third argument is the default value, returned when the row n positions back does not exist (NULL if not specified).

Code:

select cookieid,createtime,url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as last_1_time, 
  LAG(createtime,2) over (partition by cookieid order by createtime) as last_2_time 
from cookie3;

Result:
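As an illustrative application not in the original post, LAG can measure the gap between consecutive page views; a sketch, assuming createtime matches unix_timestamp's default 'yyyy-MM-dd HH:mm:ss' format:

select cookieid, createtime, url,
  unix_timestamp(createtime) - unix_timestamp(lag(createtime, 1) over (partition by cookieid order by createtime)) as gap_seconds
from cookie3;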

LEAD(col,n,DEFAULT)

Similar to LAG, but takes the value from the row n positions later in the partition.

Code:

select cookieid,createtime,url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  LEAD(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as next_1_time,
  LEAD(createtime,2) over (partition by cookieid order by createtime) as next_2_time 
from cookie3;

Result:

3.4 First and Last Values

FIRST_VALUE(url) over ()

The first value in the window, after partitioning and ordering, up to the current row.

FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1

LAST_VALUE(url) over ()

The last value in the window, after partitioning and ordering, up to the current row.

LAST_VALUE(url) over (partition by cookieid order by createtime desc) as last2

Code:

select cookieid, createtime, url,
  row_number() over (partition by cookieid order by createtime) as rn,
  FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1,
  LAST_VALUE(url) over (partition by cookieid order by createtime desc) as last2
from cookie3;

Result:
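One caveat worth noting (an addition to the original post): because the default frame stops at the current row, LAST_VALUE as written returns the current row's url. To get the true last value of the whole partition, the frame must be stated explicitly; a sketch:

select cookieid, createtime, url,
  LAST_VALUE(url) over (partition by cookieid order by createtime desc
                        rows between unbounded preceding and unbounded following) as really_last
from cookie3;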

4. User-Defined Functions

When Hive's built-in functions cannot meet the needs of business processing, user-defined functions are worth considering.

A UDF (user-defined function) operates on a single row and produces a single row as output.

A UDAF (user-defined aggregation function) takes multiple input rows and produces one output row, similar to max and min.

A UDTF (user-defined table-generating function) takes one input row and produces multiple output rows, similar to explode.

Writing a custom UDF

1. Create a Maven project and add the dependency below (in Eclipse, you can instead copy the hive-exec jar into the project and add it to the build path).

      <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.3</version>
            <exclusions>
                <exclusion>
                    <groupId>jdk.tools</groupId>
                    <artifactId>jdk.tools</artifactId>
                </exclusion>
            </exclusions> 
       </dependency>

2. Write a Java class that extends UDF and overloads the evaluate method.

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLower extends UDF {
    // Hive calls evaluate once per row, matching the argument types to this signature.
    public String evaluate(String field) {
        if (field == null) {
            return null; // pass NULL through rather than throwing a NullPointerException
        }
        return field.toLowerCase();
    }
}

3. Package it into a jar and upload it to the server.

Add the jar to Hive:

add JAR /opt/testData/hive/udf.jar;

Use it in HQL.
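The registration step itself isn't shown above; assuming the class name used for the permanent function below, a temporary function would be created and called like this:

create temporary function tolowercase as 'cn.jixiang.udf.ToLower';
select tolowercase('HELLO');  -- returns 'hello'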

Making the function permanent.

If the custom function will be used often, consider creating a permanent function:

Copy the jar into Hive's lib directory.

Create the permanent function.

create function tolowercase as 'cn.jixiang.udf.ToLower';

Dropping functions.

Drop a temporary function:

drop temporary function tolowercase;

Drop a permanent function:

drop function tolowercase;

5. Hive Shell Operations

hive -e executes the given HQL from the command line.

Example: hive -e "select * from student"

hive -f executes an HQL script.

Example: echo "select * from student" > hive.sql

    hive -f hive.sql