Common Hive Functions

Latest recommended article published on 2024-09-11 22:16:31

敲代码的羊

Tags: hive, big data, hadoop

Article link: https://blog.csdn.net/iyak78/article/details/122141084

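This article covers commonly used Hive functions, including string handling, date conversion, aggregate functions, and window functions, and shows how Hive is used for building data warehouses and analyzing data. It also touches on Hive metadata management and on how alternative tools such as Presto and Impala work.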

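Purpose

Computing with Hive mainly means letting Hive translate SQL into MR/Spark/Tez jobs. This is rarely done in practice, because many alternative engines exist (Presto, Impala, SparkSQL).

Think 🤔: how do these alternative engines know which databases and tables exist in Hive?
Hive needs two processes to be started: metastore and hiveserver2.

Think 🤔: why both processes? Is it enough to start only HiveServer2?
It is, but other software has to fetch metadata through the metastore. The metastore's job is to share metadata: it serves all metadata requests from outside.

Mapping HDFS files to tables (important) and building a data warehouse

Scenario: building a data warehouse with Hive

Core: SQL + functions

SQL syntax + Hive-specific syntax: sort by / lateral view / distribute by / partition by / clustered by
- Querying, grouping, sorting, filtering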
```
-- Basic SQL clause order
1 select 2 from 3 where 4 group by 5 having 6 order by 7 limit
```
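Functions: conditional functions, window functions, special functions

Strings:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `substring` / `substr` | Extract a substring | `substring(str, start[, length])` | `select substr('abcde',3)` → `cde`; `select substr('abcde',-1)` → `e`; `select substr('abcde',3,2)` → `cd` |
| `split` | Split a string | `split(string str, string delimiter)` | `select split('abtcdtef','t')` → `["ab","cd","ef"]` |
| `length` | String length | `length(string A)` | `select length('abcedfg')` → `7` |
| `regexp_replace` | Regex replace | `regexp_replace(string A, string B, string C)`: replaces the parts of A that match regex B with C | `select regexp_replace('foobar', 'oo\|ar', '')` → `fb` |
| `regexp_extract` | Regex extract | `regexp_extract(string subject, string pattern, int index)`: matches `pattern` against `subject` and returns the capture group at `index`; index 0 is the whole match | `select regexp_extract('foothebar', 'foo(.*?)(bar)', 1)` → `the`; `select regexp_extract('foothebar', 'foo(.*?)(bar)', 2)` → `bar`; `select regexp_extract('foothebar', 'foo(.*?)(bar)', 0)` → `foothebar` |
| `trim` | Strip leading and trailing spaces | `trim(string A)` | `select trim(' abc ')` → `abc` |
| `concat` | Concatenate | `concat(string A, string B, ...)` | `select concat('abc','def','gh')` → `abcdefgh` |
| `concat_ws` | Concatenate with a separator | `concat_ws(string separator, string A, string B, ...)` | `select concat_ws('!','abc','def','gh')` → `abc!def!gh` |
| `instr` | Find a substring | `instr(string str, string substr)`: 1-based position of the first occurrence of `substr` in `str` | `select instr('abcdf','df')` → `4` |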
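To make the capture-group indexing of `regexp_extract` concrete, here is a minimal Python sketch that mimics the table's examples with the standard `re` module. The helper names `regexp_extract` and `regexp_replace` are hypothetical stand-ins written for this illustration, not Hive code, and Python's regex dialect differs from the Java dialect Hive uses, so treat it as an approximation only.

```python
import re


def regexp_extract(subject, pattern, index):
    """Hypothetical stand-in for Hive's regexp_extract (illustration only)."""
    match = re.search(pattern, subject)
    # Assumption of this sketch: return None when nothing matches.
    return match.group(index) if match else None


def regexp_replace(subject, pattern, replacement):
    """Hypothetical stand-in for Hive's regexp_replace (illustration only)."""
    return re.sub(pattern, replacement, subject)


# Index 1 and 2 select capture groups; index 0 selects the whole match.
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 1))
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 2))
print(regexp_extract("foothebar", r"foo(.*?)(bar)", 0))
print(regexp_replace("foobar", r"oo|ar", ""))
```

Note that the `\|` in the table is only markdown escaping; the actual pattern is `oo|ar`.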

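Dates:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `date_add` | Add days to a date | `date_add(string startdate, int days)`: returns the date `days` days after `startdate` | `select date_add('2012-12-08',10)` → `2012-12-18`; `select date_add('2012-12-08',-5)` → `2012-12-03` |
| `date_sub` | Subtract days from a date | `date_sub(string startdate, int days)`: returns the date `days` days before `startdate` | `select date_sub('2012-12-08',5)` → `2012-12-03` |
| `datediff` | Difference between dates | `datediff(string enddate, string startdate)`: number of days from `startdate` to `enddate` (end minus start) | `select datediff('2012-12-08','2012-05-09')` → `213` |
| `unix_timestamp` | Formatted date to UNIX timestamp | `unix_timestamp(string date[, string pattern])`, returns bigint: converts a date in `pattern` format to a UNIX timestamp; the default pattern is `yyyy-MM-dd HH:mm:ss`; returns 0 if the conversion fails | `select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss')` → `1323234063` (this value corresponds to a UTC+8 session time zone) |
| `from_unixtime` | UNIX timestamp to date | `from_unixtime(bigint unixtime[, string format])`, returns string | `select from_unixtime(1323308943,'yyyyMMdd')` → `20111208` |
| `year` | Year of a date | `year(string date)` | `select year('2011-12-08 10:03:01')` → `2011` |
| `month` | Month of a date | `month(string date)` | `select month('2011-12-08 10:03:01')` → `12` |
| `weekofyear` | Week of the year | `weekofyear(string date)` | `select weekofyear('2011-12-08 10:03:01')` → `49` |
| `day` | Day of the month | `day(string date)` | `select day('2011-12-08 10:03:01')` → `8` |
| `hour` | Hour | `hour(string date)` | `select hour('2011-12-08 10:03:01')` → `10` |
| `minute` | Minute | `minute(string date)` | `select minute('2011-12-08 10:03:01')` → `3` |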
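The date arithmetic in this table can be cross-checked outside Hive. The following is a small Python sketch using only the standard `datetime` module. The helper functions are hypothetical re-implementations for illustration, they take Python `strptime`/`strftime` patterns instead of Hive's Java-style patterns, and the UTC+8 zone is an assumption chosen because it reproduces the timestamp in the table.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: the table's timestamps were produced in a UTC+8 session.
UTC8 = timezone(timedelta(hours=8))


def date_add(start, days):
    """Hypothetical stand-in for date_add on 'yyyy-MM-dd' strings."""
    return (date.fromisoformat(start) + timedelta(days=days)).isoformat()


def datediff(end, start):
    """Hypothetical stand-in for datediff: end date minus start date, in days."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def unix_timestamp(text, pattern, tz):
    """Parse text with a Python strptime pattern and return epoch seconds."""
    return int(datetime.strptime(text, pattern).replace(tzinfo=tz).timestamp())


def from_unixtime(seconds, pattern, tz):
    """Format epoch seconds with a Python strftime pattern in the given zone."""
    return datetime.fromtimestamp(seconds, tz).strftime(pattern)


print(date_add("2012-12-08", 10))
print(date_add("2012-12-08", -5))
print(datediff("2012-12-08", "2012-05-09"))
print(unix_timestamp("20111207 13:01:03", "%Y%m%d %H:%M:%S", UTC8))
print(from_unixtime(1323308943, "%Y%m%d", UTC8))
print(date(2011, 12, 8).isocalendar()[1])  # ISO week number
```

The point of passing the zone explicitly is that `unix_timestamp` results depend on the time zone in which the string is interpreted, so the same query can return different numbers on differently configured clusters.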

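Aggregates:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `sum` | Sum | `sum(col)`, `sum(DISTINCT col)` | `select sum(t) from lxw_dual` → `100`; `select sum(distinct t) from lxw_dual` → `70` |
| `max` | Maximum | `max(col)`: returns the largest value | `select max(t) from lxw_dual` → `120` |
| `min` | Minimum | `min(col)`: returns the smallest value | `select min(t) from lxw_dual` → `20` |
| `count` | Count | `count(*)`: number of rows retrieved, including rows with NULL values; `count(expr)`: number of non-NULL values of the expression; `count(DISTINCT expr[, expr...])`: number of distinct non-NULL values | `select count(*) from lxw_dual` → `20`; `select count(distinct t) from lxw_dual` → `10` |
| `avg` | Average | `avg(col)`, `avg(DISTINCT col)` | `select avg(t) from lxw_dual` → `50`; `select avg(distinct t) from lxw_dual` → `30` |
| `ceil` / `ceiling` | Round up | `ceil(double a)`, returns bigint | `select ceiling(3.1415926)` → `4`; `select ceiling(46)` → `46` |
| `floor` | Round down | `floor(double a)`, returns bigint | `select floor(3.1415926)` → `3`; `select floor(25)` → `25` |
| `round` | Round | `round(double a[, int d])`, returns double: rounds `a` to an integer value, or to `d` decimal places when `d` is given | `select round(3.1415926)` → `3`; `select round(3.1415926,4)` → `3.1416` |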
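The difference between `count(*)`, `count(expr)` and the `DISTINCT` variants is easiest to see on a tiny table that contains a duplicate and a NULL. The sketch below runs plain SQL through Python's built-in `sqlite3` module as a stand-in for Hive; the table `demo` and its data are made up for illustration, and only the standard aggregate semantics shown here are assumed to carry over.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (t INTEGER)")
# Hypothetical data: one duplicate value and one NULL.
conn.executemany("INSERT INTO demo VALUES (?)", [(10,), (10,), (40,), (None,)])

stats = conn.execute(
    """
    SELECT count(*),            -- all rows, NULL included
           count(t),            -- non-NULL values only
           count(DISTINCT t),   -- distinct non-NULL values
           sum(t),
           sum(DISTINCT t),
           avg(t),              -- NULLs are ignored by avg
           avg(DISTINCT t)
    FROM demo
    """
).fetchone()

print(stats)
```

With this data the duplicate `10` is counted once by the `DISTINCT` variants, and the NULL row only contributes to `count(*)`.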

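Conditionals:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `if` | If | `if(A, B, C)`: returns B if A is true, otherwise C | `select if(1=2,100,200)` → `200`; `select if(1<2,100,200)` → `100` |
| `coalesce` | First non-NULL value | `COALESCE(T v1, T v2, ...)`: returns the first non-NULL argument | `select COALESCE(null,'100','50')` → `100` |
| `case` | Conditional (value form) | `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`: returns c if a equals b, e if a equals d, otherwise f | `select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end` → `mary` |
| `case` | Conditional (boolean form) | `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`: returns b if a is TRUE, d if c is TRUE, otherwise e | `select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end` → `tom` |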
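As a quick check of the two `CASE` forms and `COALESCE`, the following sketch runs the table's examples through Python's built-in `sqlite3` module, used here as a stand-in for Hive. Since `if(A, B, C)` is not assumed to exist in SQLite, it is written as the equivalent boolean `CASE`, which is an illustration choice rather than anything from the original post.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")

result = conn.execute(
    """
    SELECT CASE WHEN 1 = 2 THEN 100 ELSE 200 END,            -- same as if(1=2,100,200)
           coalesce(NULL, '100', '50'),                      -- first non-NULL argument
           CASE 100 WHEN 50 THEN 'tom'
                    WHEN 100 THEN 'mary' ELSE 'tim' END,     -- value form
           CASE WHEN 1 = 1 THEN 'tom'
                WHEN 2 = 2 THEN 'mary' ELSE 'tim' END        -- boolean form, first true branch wins
    """
).fetchone()

print(result)
```

In the boolean form both conditions are true, but only the first matching branch is returned.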

Window function syntax: function() OVER (PARTITION BY xxx ORDER BY xxx)

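Aggregate window functions:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `sum` | Sum | `sum(col)`, `sum(DISTINCT col)` | `select sum(t) from lxw_dual` → `100`; `select sum(distinct t) from lxw_dual` → `70` |
| `max` | Maximum | `max(col)`: returns the largest value | `select max(t) from lxw_dual` → `120` |
| `min` | Minimum | `min(col)`: returns the smallest value | `select min(t) from lxw_dual` → `20` |
| `count` | Count | `count(*)`: number of rows retrieved, including rows with NULL values; `count(expr)`: number of non-NULL values of the expression; `count(DISTINCT expr[, expr...])`: number of distinct non-NULL values | `select count(*) from lxw_dual` → `20`; `select count(distinct t) from lxw_dual` → `10` |
| `avg` | Average | `avg(col)`, `avg(DISTINCT col)` | `select avg(t) from lxw_dual` → `50`; `select avg(distinct t) from lxw_dual` → `30` |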
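When one of these aggregates is combined with `OVER (...)`, it returns a value for every row instead of collapsing the group. Here is a minimal sketch, run through Python's built-in `sqlite3` module as a stand-in for Hive (this needs SQLite 3.25 or newer for window functions). The table `visits`, its columns, and its data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (cookieid TEXT, createtime TEXT, pv INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("c1", "2015-04-10", 1),
        ("c1", "2015-04-11", 5),
        ("c1", "2015-04-12", 7),
        ("c2", "2015-04-10", 2),
        ("c2", "2015-04-11", 3),
    ],
)

running_totals = conn.execute(
    """
    SELECT cookieid, createtime, pv,
           -- with ORDER BY: sum from the start of the partition up to the current row
           sum(pv) OVER (PARTITION BY cookieid ORDER BY createtime) AS running_pv,
           -- without ORDER BY: sum over the whole partition
           sum(pv) OVER (PARTITION BY cookieid) AS total_pv
    FROM visits
    ORDER BY cookieid, createtime
    """
).fetchall()

for row in running_totals:
    print(row)
```

The only difference between the two columns is the `ORDER BY` inside `OVER`, which switches the window from the whole partition to a running frame.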

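Analytic window functions:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `first_value` | First value within the partition after sorting, up to the current row | | `FIRST_VALUE(url) over (partition by cookieid order by createtime desc)` |
| `last_value` | Last value within the partition after sorting, up to the current row | | `LAST_VALUE(url) over (partition by cookieid order by createtime) as last1` |
| `lag` | Value from the n-th row above within the window | `LAG(col, n[, x])`: takes `col` from the n-th row above the current row; returns `x` when there is no such row, or NULL if `x` is not given | |
| `lead` | Value from the n-th row below within the window | `LEAD(col, n[, x])`: takes `col` from the n-th row below the current row; returns `x` when there is no such row, or NULL if `x` is not given | |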
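The phrase "up to the current row" matters most for `last_value`: with an `ORDER BY` and the default frame, it simply returns the current row's value. The sketch below shows this together with `lag` and `lead`, using Python's built-in `sqlite3` module as a stand-in for Hive (SQLite 3.25 or newer). The table `visits` and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("c1", "10:00", "url1"),
        ("c1", "10:01", "url2"),
        ("c1", "10:02", "url3"),
    ],
)

analytic_rows = conn.execute(
    """
    SELECT createtime, url,
           first_value(url) OVER w AS first_url,
           last_value(url)  OVER w AS last_url,   -- default frame ends at the current row
           lag(url, 1, 'none')  OVER w AS prev_url,
           lead(url, 1, 'none') OVER w AS next_url
    FROM visits
    WINDOW w AS (PARTITION BY cookieid ORDER BY createtime)
    ORDER BY createtime
    """
).fetchall()

for row in analytic_rows:
    print(row)
```

To get the true last value of the partition on every row, a common approach is to widen the frame explicitly (for example `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`) or to use `first_value` with a descending sort.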

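Ranking window functions:

| Function | Description | Syntax |
| --- | --- | --- |
| `rank` | Ranks may repeat and leave gaps | `rank() over(partition by ... order by ...)` |
| `dense_rank` | Ranks may repeat but leave no gaps | `dense_rank() over(partition by ... order by ...)` |
| `row_number` | Ranks are unique and consecutive; simply returns the row number | `row_number() over(partition by ... order by ...)` |
| `ntile` | Splits the sorted rows of a partition into x buckets and returns the bucket number | `ntile(x) over(partition by ... order by ...)` |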
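The four ranking functions only differ in how they treat ties, which a small example with one tied score makes visible. This sketch uses Python's built-in `sqlite3` module as a stand-in for Hive (SQLite 3.25 or newer). The table `scores` and its data are hypothetical, and `name` is added as a tie-breaker for `row_number` and `ntile` so that the output is deterministic.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (class TEXT, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("A", "a", 90),
        ("A", "b", 90),  # tie with "a"
        ("A", "c", 80),
        ("A", "d", 70),
        ("B", "e", 60),
    ],
)

ranked = conn.execute(
    """
    SELECT class, name, score,
           rank()       OVER by_score      AS rk,        -- ties share a rank, gaps follow
           dense_rank() OVER by_score      AS dense_rk,  -- ties share a rank, no gaps
           row_number() OVER by_score_name AS rn,        -- always unique
           ntile(2)     OVER by_score_name AS bucket     -- two buckets per class
    FROM scores
    WINDOW by_score      AS (PARTITION BY class ORDER BY score DESC),
           by_score_name AS (PARTITION BY class ORDER BY score DESC, name)
    ORDER BY class, score DESC, name
    """
).fetchall()

for row in ranked:
    print(row)
```

In class `A` the two scores of 90 both get rank 1; `rank` then jumps to 3 while `dense_rank` continues with 2.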

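Special functions:

| Function | Description | Syntax | Example |
| --- | --- | --- | --- |
| `explode` | Expands a collection into multiple rows | `explode(iterable data)` | `select explode(array(1,2,3))` → `1` `2` `3` (one row each) |
| `collect_list` | Collects a column's values into one array (ordered, duplicates kept) | `collect_list(col)[int index]` | `select username, collect_list(video_name) from t_visit_video group by username` |
| `collect_set` | Collects a column's values into one array (unordered, duplicates removed) | `collect_set(col)[int index]` | |
| `json_tuple` | JSON parsing | `json_tuple(jsonStr, k1, k2, ...)`: takes a JSON string and a set of keys k1, k2, ... and returns a tuple of the corresponding values | |
| `get_json_object` | JSON parsing | `get_json_object(string json_string, string path)`: parses `json_string` and returns the content at `path`; returns NULL if the input JSON string is invalid | `select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner')` → `amy` |
| `parse_url` | URL parsing | `parse_url(string url, string partToExtract[, string keyToExtract])`: returns the requested part of the URL; valid values for `partToExtract` are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO | `select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST')` → `facebook.com`; `select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1')` → `v1` |
| `cast` | Type conversion | `cast(A as <type>)`: converts A to the given type | `select cast(1 as bigint)` → `1` |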
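`collect_list`, `collect_set` and `explode` are roughly inverse operations, which is easier to see on plain data structures. The following Python sketch imitates them with lists, sets and the standard `json` module. It is only an analogy for the semantics, not Hive code: the `visits` rows are hypothetical, and the JSON path `$.store.fruit[0].type` is written out as ordinary Python indexing.

```python
import json
from collections import defaultdict

# Hypothetical rows of (username, video_name), like t_visit_video in the table.
visits = [("tom", "v1"), ("tom", "v2"), ("tom", "v1"), ("amy", "v3")]

# collect_list: group rows by user and keep every value, duplicates included.
grouped = defaultdict(list)
for username, video_name in visits:
    grouped[username].append(video_name)
collected_list = dict(grouped)

# collect_set: same grouping, but duplicates are removed and order is not kept.
collected_set = {user: set(videos) for user, videos in grouped.items()}

# explode: turn each array back into one row per element.
exploded = [(user, video) for user, videos in collected_list.items() for video in videos]

# get_json_object: navigate a JSON document by path.
doc = json.loads(
    '{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],'
    '"bicycle":{"price":19.95,"color":"red"}},'
    '"email":"amy@only_for_json_udf_test.net","owner":"amy"}'
)
owner = doc["owner"]                               # like '$.owner'
first_fruit_type = doc["store"]["fruit"][0]["type"]  # like '$.store.fruit[0].type'

print(collected_list)
print(collected_set)
print(exploded)
print(owner, first_fruit_type)
```

In Hive itself, `explode` is usually combined with `lateral view` when other columns of the row have to be kept next to the exploded values.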