Hive Built-in Functions, User-Defined Functions, and Performance Tuning

Latest recommended article published 2023-03-11 14:36:45

May--J--Oldhu

Category: Hive. Tags: hadoop, hive, big data

Article link: https://blog.csdn.net/May_J_Oldhu/article/details/108759556

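This article introduces Hive's basic built-in functions, including string, type conversion, date, collection, conditional, and aggregate functions. It then walks through the steps for creating user-defined functions and discusses Hive transactions, their ACID properties, and their limitations. It also covers Hive's PL/SQL support and performance tuning: the EXPLAIN and ANALYZE tools, design optimization, job optimization strategies, and query optimization techniques such as enabling map-side joins, preventing data skew, and using the CBO.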

I. Built-in Functions

1. String functions

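Return type	Function	Description
string	concat(string|binary A, string|binary B...)	Concatenates the strings or binary values in the order given
int	instr(string str, string substr)	Returns the position of the first occurrence of substr in str (positions start at 1)
int	length(string A)	Returns the length of the string
int	locate(string substr, string str[, int pos])	Returns the position of the first occurrence of substr in str, starting the search at position pos
string	lower(string A) / upper(string A)	Converts all letters in A to lowercase / uppercase
string	regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)	Replaces every part of INITIAL_STRING that matches the regular expression PATTERN with REPLACEMENT
array	split(string str, string pat)	Splits str around matches of the regular expression pat
string	substr(string|binary A, int start, int len) / substring(string|binary A, int start, int len)	Returns the substring of A of length len starting at position start
string	trim(string A)	Removes leading and trailing spaces from A
map	str_to_map(text[, delimiter1, delimiter2])	Splits text into key-value pairs using the given delimiters and returns them as a map
binary	encode(string src, string charset)	Encodes the string src into a binary value using the character set charset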

Examples:

--concat(): concatenate two strings
0: jdbc:hive2://localhost:10000> select concat("hello ","world");
+--------------+--+
|     _c0      |
+--------------+--+
| hello world  |
+--------------+--+
--instr(): find the position of a substring
0: jdbc:hive2://localhost:10000> select instr('hello hello hello world','llo');
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
--locate(): find the position, starting the search at the 4th character
0: jdbc:hive2://localhost:10000> select locate('llo','hello hello hello world',4);
+------+--+
| _c0  |
+------+--+
| 9    |
+------+--+
--lower(),upper()
0: jdbc:hive2://localhost:10000> select lower('hello WORLD'),UPPER('hello WORLD');
+--------------+--------------+--+
|     _c0      |     _c1      |
+--------------+--------------+--+
| hello world  | HELLO WORLD  |
+--------------+--------------+--+
--regexp_replace(): replace 123 with world
0: jdbc:hive2://localhost:10000> select regexp_replace('hello 123 world','[1-9]{3}','world');
+--------------------+--+
|        _c0         |
+--------------------+--+
| hello world world  |
+--------------------+--+
--split()
0: jdbc:hive2://localhost:10000> select split('hello world',' ');
+--------------------+--+
|        _c0         |
+--------------------+--+
| ["hello","world"]  |
+--------------------+--+
--substring(): extract part of a string
0: jdbc:hive2://localhost:10000> select substring('hello world',7,5);
+--------+--+
|  _c0   |
+--------+--+
| world  |
+--------+--+
1 row selected (0.036 seconds)
--trim(): remove leading and trailing spaces
0: jdbc:hive2://localhost:10000> select trim('         hello world     ');
+--------------+--+
|     _c0      |
+--------------+--+
| hello world  |
+--------------+--+
1 row selected (0.021 seconds)
--remove whitespace with a regular expression
0: jdbc:hive2://localhost:10000> select regexp_replace('hello world','\\s+','');
+-------------+--+
|     _c0     |
+-------------+--+
| helloworld  |
+-------------+--+
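
The table also lists length, substr, str_to_map, and encode, which the examples above do not exercise. The following is a minimal sketch, assuming the same Beeline session; the literals are illustrative and are not taken from the original examples.

```sql
-- length(): number of characters in the string
select length('hello world');                -- 11

-- substr() behaves like substring(); positions start at 1
select substr('hello world', 1, 5);          -- hello

-- str_to_map(): ',' separates the pairs, ':' separates key from value
select str_to_map('a:1,b:2', ',', ':');      -- {"a":"1","b":"2"}

-- instr() and locate() both return 0 when the substring is not found
select instr('hello world', 'xyz'), locate('xyz', 'hello world');   -- 0, 0

-- encode(): string to binary using the named character set
select encode('hello world', 'UTF-8');
```

Checking for 0 is the usual way to test whether instr() or locate() found anything, since valid positions start at 1.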

2. Type conversion and mathematical functions

Type conversion functions

Return type	Function	Description
<type>	cast(expr as <type>)	Converts expr to the given type, e.g. cast('1' as BIGINT) converts the string '1' to BIGINT
binary	binary(string|binary)	Casts the input value to binary

--cast()
0: jdbc:hive2://localhost:10000> select cast(1 as double);
+------+--+
| _c0  |
+------+--+
| 1.0  |
+------+--+
--binary() casts the input to the BINARY type; when the result is displayed, the bytes are rendered back as text
0: jdbc:hive2://localhost:10000> select binary('hello world');
+--------------+--+
|     _c0      |
+--------------+--+
| hello world  |
+--------------+--+
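
cast() is most often used to turn strings into numbers. A small sketch with illustrative literals (not from the original) is given below; note that in Hive a string that cannot be parsed as the target numeric type yields NULL rather than an error.

```sql
-- string to BIGINT, as in the table's example
select cast('1' as bigint);      -- 1

-- string to DOUBLE
select cast('3.14' as double);   -- 3.14

-- a string that is not a number casts to NULL
select cast('abc' as int);       -- NULL
```

Because a failed cast is silent, it is worth filtering or counting NULLs after casting columns loaded from raw text.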

Mathematical functions

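Return type	Function	Description
DOUBLE	round(DOUBLE a)	Rounds a to the nearest whole number
DOUBLE	round(DOUBLE a, INT d)	Rounds a to d decimal places
BIGINT	floor(DOUBLE a)	Rounds down to the nearest integer, e.g. 6.10 -> 6, -3.4 -> -4
DOUBLE	rand(INT seed)	Returns a random DOUBLE; seed is the random seed
DOUBLE	power(DOUBLE a, DOUBLE p)	Returns a raised to the power p
DOUBLE	abs(DOUBLE a)	Returns the absolute value of a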

--round()
0: jdbc:hive2://localhost:10000> select round(3.14159265);
+------+--+
| _c0  |
+------+--+
| 3.0  |
+------+--+
1 row selected (0.043 seconds)
--round(DOUBLE a, INT d)
0: jdbc:hive2://localhost:10000> select round(3.14159265,4);
+---------+--+
|   _c0   |
+---------+--+
| 3.1416  |
+---------+--+
1 row selected (0.044 seconds)
--floor()
0: jdbc:hive2://localhost:10000> select floor(3.8);
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
1 row selected (0.024 seconds)
--rand()
0: jdbc:hive2://localhost:10000> select rand(10);
+---------------------+--+
|         _c0         |
+---------------------+--+
| 0.7304302967434272  |
+---------------------+--+
1 row selected (0.024 seconds)
--power(): 2 to the power of 3
0: jdbc:hive2://localhost:10000> select power(2,3);
+------+--+
| _c0  |
+------+--+
| 8.0  |
+------+--+
1 row selected (0.023 seconds)
--square root of 4
0: jdbc:hive2://localhost:10000> select power(4,0.5);
+------+--+
| _c0  |
+------+--+
| 2.0  |
+------+--+
--abs(): absolute value
0: jdbc:hive2://localhost:10000> select abs(-10);
+------+--+
| _c0  |
+------+--+
| 10   |
+------+--+
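
The examples above use only positive inputs for floor() and round(). The sketch below, with illustrative literals, covers the negative case described in the table (floor rounds toward negative infinity, not toward zero).

```sql
-- floor(): values taken from the table's description
select floor(6.10), floor(-3.4);   -- 6, -4

-- round() with a precision argument on a negative number
select round(-3.14159265, 2);      -- -3.14

-- power() returns a DOUBLE even for whole-number results
select power(2, 10);               -- 1024.0

-- abs() on a fractional value
select abs(-3.5);                  -- 3.5
```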

3. Date functions

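Return type	Function	Description
string	from_unixtime(bigint unixtime[, string format])	Converts a Unix timestamp (in seconds) to a string in the given format
bigint	unix_timestamp()	Returns the current Unix timestamp in seconds
bigint	unix_timestamp(string date)	Converts a time string in yyyy-MM-dd HH:mm:ss format to a Unix timestamp
string	to_date(string timestamp)	Returns the date part of a time string
int	year(string date) / month / day / hour / minute / second / weekofyear	Returns the year part of a time string; the others return the month / day / hour / minute / second / week of the year
int	datediff(string enddate, string startdate)	Returns the number of days from startdate to enddate
string	date_add(string startdate, int days)	Adds the given number of days to startdate
string	date_sub(string startdate, int days)	Subtracts the given number of days from startdate
date	current_date	Returns the current date
timestamp	current_timestamp	Returns the current timestamp
string	date_format(date/timestamp/string ts, string fmt)	Formats ts using the pattern fmt, e.g. date_format('2016-06-22', 'MM-dd') = 06-22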
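
The date arithmetic and formatting functions in the table can be tried out as follows. This is a minimal sketch with illustrative dates chosen for this example; only the date_format call comes from the table itself.

```sql
-- to_date(): keep only the date part
select to_date('2020-08-25 12:22:12');                              -- 2020-08-25

-- year() / month(): extract individual fields
select year('2020-08-25 12:22:12'), month('2020-08-25 12:22:12');   -- 2020, 8

-- datediff(): end date first, start date second
select datediff('2020-08-25', '2020-08-20');                        -- 5

-- date_add() / date_sub(): shift a date by a number of days
select date_add('2020-08-25', 7);                                   -- 2020-09-01
select date_sub('2020-08-25', 25);                                  -- 2020-07-31

-- date_format(): the example from the table
select date_format('2016-06-22', 'MM-dd');                          -- 06-22

-- from_unixtime(): the result depends on the session time zone
select from_unixtime(1598329332, 'yyyy-MM-dd HH:mm:ss');
```

The argument order of datediff() is easy to get wrong: swapping the two dates flips the sign of the result.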

--timestamps are returned in seconds
--unix_timestamp(): convert a time string to a Unix timestamp in seconds
0: jdbc:hive2://localhost:10000> select unix_timestamp('2020-8-25 12:22:12');
+-------------+--+
|     _c0     |
+-------------+--+
| 1598329332  |
+-------------+--+
1 row selected (0.024 seconds)
0: jdbc:hive2://localhost:10000> select unix_timestamp();
+-------------+--+
|     _c0     |
+-------------+--+
| 1600763677  |