SQL之一天一个小技巧：如何使用HQL从不固定位置提取字符串元素【详解Hive字符串位置查找函数】

莫叫石榴姐

已于 2022-04-19 10:45:50 修改

阅读量3.2k

点赞数 1

分类专栏： SQLBOY1000题 sql HiveSql面试题文章标签： sql hive 大数据

于 2021-08-15 19:24:20 首次发布

本文链接：https://blog.csdn.net/godlovedaniel/article/details/119714663

版权

sql 同时被 3 个专栏收录

111 篇文章 43 订阅

订阅专栏

SQLBOY1000题

103 篇文章 216 订阅

订阅专栏

HiveSql面试题

98 篇文章 114 订阅

订阅专栏

0 问题描述

1 问题解决

2 小结

0 问题描述

SQL 从不固定位置提取字符串的元素，你有一个字符串，其中包含一段连续的日志数据。你想解析该字符串，并从中提取出部分信息。不过，你需要的信息并不存在于字符串的固定位置。因此，你必须借助目标信息附近的某些字符来定位并提取所需的内容。例如，考虑下面的字符串。

xxxxxabc[867]xxx[-]xxxx[5309]xxxxx
xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx
call:[F_GET_ROWS( )]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx
film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx

你希望提取出方括号内的值，返回如下所示的结果集。

FIRST_VAL       SECOND_VAL      LAST_VAL
--------------- --------------- ---------------
867            -                5309
11271978       4                Joe
F_GET_ROWS( )  ROSEWOOD...SIR   44400002
non_marked     unit             withabanana?

1 问题解决

主要思路：分析字符串，抓主要特征。

尽管不知道我们所需要的的字符的确切位置，但我们确定它们是被包含在方括号“[]”中的，并且知道有 3 组这样的值。如果能够找出方括号的位置，并使用内置函数 SUBSTR 从字符串中提取所需要的值，那么该问题就可以得到解决，因此本问题的关键还是找出特征方括号的位置。

(1)数据准备

create table log
as
select 'xxxxxabc[867]xxx[-]xxxx[5309]xxxxx' msg
 union all
select 'xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx' msg
 union all
select 'call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx' msg
 union all
select 'film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx' msg

(2) 数据分析

Hive中获取字符串位置函数有以下几种：

instr(String str, String substr)函数

返回str中第一次出现substr的索引

instr函数返回字符串str中子字符串substr第一次出现的位置，其中第一字符的位置是1,如果 str不含substr

返回0。

eg:

0: jdbc:hive2://10.9.4.117:10000> select instr("abcde",'b');
+------+--+
| _c0  |
+------+--+
| 2    |
+------+--+

但是要注意在hive中该函数只能获取字符串首次出现的位置，也就是只能传两个参数，不能获取第二次、第三次出现的位置。在oracle数据库中该函数可以获取第二次、第三次出现的位置

集合查找函数: find_in_set

语法: find_in_set(string str, string strList)
返回值: int
说明: 返回str在strlist第一次出现的位置，strlist是用逗号分割的字符串。如果没有找该str字符，则返回0

hive> select find_in_set('de','ef,ab,de') ;
3
hive> select find_in_set('at','ef,ab,de') ;
0

同样该函数也只能获取字符串首次出现的位置，且字符串是以逗号隔开的字符串集合。

字符串查找函数:locate

语法: locate(string substr, string str, int pos)
返回值: int
说明:返回字符串 substr 在 str 中从 pos 后查找,首次出现的位置

举例如下：

hive> select locate('a','abcda',2) ;
OK
5
Time taken: 0.14 seconds, Fetched: 1 row(s)
hive> select locate('a','abcdabaaaa',2) ;
OK
5
Time taken: 0.182 seconds, Fetched: 1 row(s)
hive> select locate('a','abcdabaaaa',3) ;
OK
5
Time taken: 0.167 seconds, Fetched: 1 row(s)

通过上述Hive中字符串位置查找函数对比，本题显然locate（）函数比较合适。但需要适当转换。

具体SQL如下：

select msg
      ,locate('[',msg,1) as s_1
      ,locate(']',msg,1) as e_1
      ,locate('[',msg,locate('[',msg,1)+1) s_2
      ,locate(']',msg,locate(']',msg,1)+1) e_2
      ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3
      ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3
from log

注意边界问题：一般都是左闭右开的，所以再求第二、第三索引时候，我们在嵌套计算的时候都进行了加1操作。计算结果如下：

xxxxxabc[867]xxx[-]xxxx[5309]xxxxx	9	13	17	19	24	29
xxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxx	11	20	28	30	34	38
call:[F_GET_ROWS()]b1:[ROSEWOOD...SIR]b2:[44400002]77.90xxxxx	6	19	23	38	42	51
film:[non_marked]qq:[unit]tailpipe:[withabanana?]80sxxxxx	6	17	21	26	36	49
Time taken: 0.677 seconds, Fetched: 4 row(s)

针对上述求出的结果，我们利用substr()函数根据求得的索引位置进行截串获取对应的值。

select substr(msg,s_1+1,e_1-s_1-1) as fst_val
      ,substr(msg,s_2+1,e_2-s_2-1) as snd_val
      ,substr(msg,s_3+1,e_3-s_3-1) as thd_val
from(
    select msg
          ,locate('[',msg,1) as s_1
          ,locate(']',msg,1) as e_1
          ,locate('[',msg,locate('[',msg,1)+1) s_2
          ,locate(']',msg,locate(']',msg,1)+1) e_2
          ,locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1) s_3
          ,locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1) e_3
    from log
) t

这里需要注意提取值时，其实位置为方括号"["索引值+1从该处开始截取，截取长度为"]"位置的索引值减去"["位置处的索引值再减1.结果如下：

867	        -	        5309
11271978	4	        Joe
F_GET_ROWS()	ROSEWOOD...SIR	44400002
non_marked	unit	        withabanana?

代码优化调整如下：

select substr(msg,
        locate('[',msg,1)+1,
        locate(']',msg,1)-locate('[',msg,1)-1) as fst_val
      ,substr(msg,
        locate('[',msg,
        locate('[',msg,1)+1)+1,
        locate(']',msg,locate(']',msg,1)+1)-
        locate('[',msg,locate('[',msg,1)+1)-1) as snd_val
      ,substr(msg,
        locate('[',msg,
        locate('[',msg,locate('[',msg,1)+1)+1)+1,
        locate(']',msg,locate(']',msg,locate(']',msg,1)+1)+1)-
        locate('[',msg,locate('[',msg,locate('[',msg,1)+1)+1)-1) as thd_val
from log

2 小结

本文通过HQL语言对从不固定位置提取字符串的元素这一问题进行了分析，其解决方案主要采用了locate()函数及substr()函数进行分析，文中对Hive中获取字符串位置函数进行了详细分析。通过本文你可以收获如下：

（1）如何通过HQL来从不固定位置提取字符串的元素的方法和技巧
（2）掌握Hive中字符串位置获取函数使用方法，并在实际业务中使用

欢迎关注石榴姐公众号"我的SQL呀"，关注我不迷路

莫叫石榴姐

关注

1
点赞
踩
14

收藏

觉得还不错? 一键收藏
3
评论
SQL之一天一个小技巧：如何使用HQL从不固定位置提取字符串元素【详解Hive字符串位置查找函数】

目录0 问题描述1 问题解决2 小结0 问题描述SQL 从不固定位置提取字符串的元素，你有一个字符串，其中包含一段连续的日志数据。你想解析该字符串，并从中提取出部分信息。不过，你需要的信息并不存在于字符串的固定位置。因此，你必须借助目标信息附近的某些字符来定位并提取所需的内容。例如，考虑下面的字符串。xxxxxabc[867]xxx[-]xxxx[5309]xxxxxxxxxxtime:[11271978]favnum:[4]id:[Joe]xxxxxcall:[F_GET_R
复制链接

扫一扫