The third parameter (index) of Hive's regexp_extract function

老胡当道卧

Article link: https://blog.csdn.net/sinat_27339001/article/details/78594476

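The raw data in the table looks like this. There is a single column, a. A typical row reads "within 152 days, 67 days had no call records", and one row reads "the phone was never switched off for more than 1 day":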

a
152天内有67天无通话记录
71天内有58天无通话记录
154天内有8天无通话记录
178天内有76天无通话记录
NULL
159天内有69天无通话记录
手机关机时间从未超过1天
171天内有63天无通话记录
163天内有90天无通话记录
160天内有35天无通话记录

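Now we want to compute the share of days with no call records, but without putting any Chinese characters in the code (to avoid encoding problems). How can this be done?

With a regular expression, of course!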

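Hive's regular-expression function is regexp_extract, which takes three parameters:
regexp_extract(string subject, string pattern, int index)
The first two are easy to understand, but what does the third one mean?
Let's try it and see: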

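select
    -- rate = days without calls / total days; the pattern contains ASCII only
    case
        when regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',3)!=''
            then regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',3)/regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',1)
        when a is null then -9999990   -- sentinel value for NULL rows
        else 0 end as rate,            -- rows that do not match the pattern
    regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',0),  -- index 0: the entire match
    regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',1),  -- index 1: first group (total days)
    regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',2),  -- index 2: second group (text in between)
    regexp_extract(a,'([0-9]+)([^0-9]+)([0-9]+)',3),  -- index 3: third group (days without calls)
    a

-- "table" is a placeholder name; it is a reserved word, hence the backticks
from `table`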

rate	_c0	_c1	_c2	_c3	a
0.440789474	152天内有67	152	天内有	67	152天内有67天无通话记录
0.816901408	71天内有58	71	天内有	58	71天内有58天无通话记录
0.051948052	154天内有8	154	天内有	8	154天内有8天无通话记录
0.426966292	178天内有76	178	天内有	76	178天内有76天无通话记录
-9999990	NULL	NULL	NULL	NULL	NULL
0.433962264	159天内有69	159	天内有	69	159天内有69天无通话记录
0					手机关机时间从未超过1天
0.368421053	171天内有63	171	天内有	63	171天内有63天无通话记录
0.552147239	163天内有90	163	天内有	90	163天内有90天无通话记录
0.21875	160天内有35	160	天内有	35	160天内有35天无通话记录
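
To make the result easier to reason about, here is a minimal sketch in Python (not Hive) that imitates what the output above shows. The helper names regexp_extract and no_call_rate are hypothetical and exist only for this illustration. Hive uses Java regular expressions, and the sketch assumes that for this simple pattern Python's re module numbers the groups the same way. The Chinese strings appear only as sample input, never in the pattern.

```python
import re

# Same pattern as in the query: digits, non-digits, digits (ASCII only).
PATTERN = re.compile(r'([0-9]+)([^0-9]+)([0-9]+)')


def regexp_extract(subject, pattern, index):
    """Hypothetical stand-in for Hive's regexp_extract, for illustration only."""
    if subject is None:
        return None                  # NULL in, NULL out (as in the NULL row above)
    m = pattern.search(subject)      # first match anywhere in the string
    if m is None:
        return ''                    # no match: empty string, as in the output above
    return m.group(index)            # 0 = entire match, n = n-th parenthesized group


def no_call_rate(a):
    """Mirror of the CASE expression in the query."""
    # The NULL check comes first here: in SQL, NULL != '' is not true, so the
    # query falls through to its second branch; in Python, None != '' is True.
    if a is None:
        return -9999990
    days_without_calls = regexp_extract(a, PATTERN, 3)
    if days_without_calls != '':
        return int(days_without_calls) / int(regexp_extract(a, PATTERN, 1))
    return 0


row = '152天内有67天无通话记录'
for i in range(4):
    print(i, regexp_extract(row, PATTERN, i))
print(no_call_rate(row))
```

The loop prints the same four values as columns _c0 to _c3 of the first output row, and the last line prints the ratio 67/152. A row with only one number, such as the "never switched off" row, does not match the pattern at all, which is why every extracted column is empty and the rate falls back to 0.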