Regular Expressions (2): Subexpressions and Backreferences

Article link: https://blog.csdn.net/yeshang_lady/article/details/100576700

1. Subexpressions

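A subexpression is usually part of a longer expression and can be used as a single unit. Subexpressions are defined with parentheses, (). As noted in the previous article, a quantifier metacharacter applies only to the character immediately before it; once subexpressions are introduced, a whole subexpression can be repeated as a unit. Both Python and Hive provide syntax for extracting the text matched by a subexpression.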

Python version:

import re
result=re.search(r'(\d{4})-(\d{8})','0571-68819999')
print(result.groups())

Output:

('0571', '68819999')

Hive version:

select '0' as num, regexp_extract('0571-68819999','^(\\d{4})-(\\d{8})$',0) as result
union
select '1' as num, regexp_extract('0571-68819999','^(\\d{4})-(\\d{8})$',1) as result
union
select '2' as num, regexp_extract('0571-68819999','^(\\d{4})-(\\d{8})$',2) as result

Output (the results show that group indexes in Hive start at 1, and that index 0 returns the entire match):

num	result
0	0571-68819999
1	0571
2	68819999
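
To come back to the earlier point that a quantifier repeats a whole subexpression once it is wrapped in parentheses, here is a minimal Python sketch. The sample strings and variable names are made up for illustration and are not from the original examples.

```python
import re

# Without a group, {3} repeats only the preceding character 'b'.
single_char = re.fullmatch(r'ab{3}', 'abbb')      # matches 'abbb'
no_group = re.fullmatch(r'ab{3}', 'ababab')       # None: 'ab' is not repeated

# With a group, {3} repeats the whole subexpression 'ab'.
with_group = re.fullmatch(r'(ab){3}', 'ababab')   # matches 'ababab'

# A repeated group keeps only the text of its last iteration.
ip = re.fullmatch(r'(\d{1,3}\.){3}\d{1,3}', '192.168.0.1')

print(single_char is not None, no_group is not None, with_group is not None)
print(ip.group(1))  # '0.' (the third and last repetition)
```

Note that a group followed by a quantifier captures only its final repetition, so it is useful for repeating a unit but not for collecting every repetition.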

Subexpressions can be nested, but once they are nested the group indexes easily become confusing.

Python version:

import re
str_1='135-0577-2345'
match=re.search(r'(((\d{3})-(\d{4}))-(\d{4}))',str_1)
print(match.groups())

Output:

('135-0577-2345', '135-0577', '135', '0577', '2345')

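The results and their indexes show that groups are numbered by the position of their opening parenthesis, counted from left to right: an outer group comes before the groups nested inside it, in depth-first order.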
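
The numbering rule can be checked by printing each group next to its index and its span in the string. This is a small sketch using the same pattern; the named-group variant at the end is one possible way to avoid counting parentheses, and the names prefix, middle and last are hypothetical.

```python
import re

pattern = re.compile(r'(((\d{3})-(\d{4}))-(\d{4}))')
match = pattern.search('135-0577-2345')

# group(0) is the whole match; groups 1..n follow the order of their '('.
numbered = {i: match.group(i) for i in range(pattern.groups + 1)}
for i, text in numbered.items():
    print(i, text, match.span(i))

# Named groups (hypothetical names) sidestep index counting entirely.
named = re.search(
    r'(?P<prefix>\d{3})-(?P<middle>\d{4})-(?P<last>\d{4})',
    '135-0577-2345',
)
print(named.groupdict())
```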

Hive version:

select '1' as num, regexp_extract('135-0577-2345','(((\\d{3})-(\\d{4}))-(\\d{4}))',1) as result
union 
select '2' as num, regexp_extract('135-0577-2345','(((\\d{3})-(\\d{4}))-(\\d{4}))',2) as result
union 
select '3' as num, regexp_extract('135-0577-2345','(((\\d{3})-(\\d{4}))-(\\d{4}))',3) as result
union 
select '4' as num, regexp_extract('135-0577-2345','(((\\d{3})-(\\d{4}))-(\\d{4}))',4) as result
union 
select '5' as num, regexp_extract('135-0577-2345','(((\\d{3})-(\\d{4}))-(\\d{4}))',5) as result

Output (the group indexes in Hive follow the same order as in Python):

num	result
1	135-0577-2345
2	135-0577
3	135
4	0577
5	2345

2. Backreferences

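A backreference lets a regular expression pattern refer to text that was matched earlier in the same pattern. Backreferences must be used together with subexpressions, and they are mainly used to match constructs that have to appear in matching pairs. Take a Python program as an example (in the code, \1 refers to the text matched by group 1):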

import re
str_1='<h1>helloworld</h1>'
str_2='<h1>pythonclass</h2>'
match_1=re.search(r'<(h[0-9])>\w*?</\1>',str_1)
print(match_1[0] if match_1 else 'null')
match_2=re.search(r'<(h[0-9])>\w*?</\1>',str_2)
print(match_2[0] if match_2 else 'null')

Output:

<h1>helloworld</h1>
null
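
Backreferences are also handy outside of tag matching. The sketch below, with a made-up sample sentence, finds accidentally doubled words and collapses them with re.sub, and then shows the named form (?P=tag) as an alternative to \1. The group name tag is my own choice, not something from the examples above.

```python
import re

text = 'this is is a test test sentence'

# \1 must match exactly the same text that group 1 captured.
doubled = re.compile(r'\b(\w+)\s+\1\b')
repeated_words = doubled.findall(text)   # the word captured by group 1
collapsed = doubled.sub(r'\1', text)     # \1 also works in the replacement

print(repeated_words)
print(collapsed)

# Named backreference: (?P=tag) refers to the group named 'tag'.
tag_pair = re.compile(r'<(?P<tag>h[1-6])>\w*?</(?P=tag)>')
ok = tag_pair.search('<h1>helloworld</h1>')
bad = tag_pair.search('<h1>pythonclass</h2>')
print(ok is not None, bad is not None)
```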

Hive and MySQL also support backreferences. Note that I am using MySQL 8.0; according to other books, MySQL 5 may not support backreferences.

select '1' as num, '<h1>hello123world</h1>' regexp '<(h[1-6])>\\w+</\\1>' as result
union
select '2' as num, '<h1>hello123world</h2>' regexp '<(h[1-6])>\\w+</\\1>' as result

The result of running this in Hive:

num	result
1	true
2	false

3. Regular expression functions supported in MySQL 8.0

Name	Description
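regexp	Whether the string contains a substring that matches the pattern; returns 1 (true) if it does and 0 (false) if it does not.
not regexp	Negates the result of regexp.
regexp_like	Same result as regexp.
rlike	Same result as regexp.
regexp_instr	Returns the starting index of the substring that matches the pattern, or 0 if no substring matches.
regexp_replace	Replaces the substrings that match the pattern with another string.
regexp_substr	Returns the substring that matches the pattern.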
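
For readers coming from Python, the functions in the table map roughly onto the re module as sketched below. This is an approximate correspondence, not an exact one: MySQL 8.0 uses a different regex engine, and its case sensitivity depends on the collation, whereas Python's re is case-sensitive by default. The sample string and variable names are made up.

```python
import re

s, pat = 'abc42def', r'[0-9]+'
m = re.search(pat, s)

like = m is not None                 # regexp / rlike / regexp_like
not_like = not like                  # not regexp
instr = m.start() + 1 if m else 0    # regexp_instr (1-based, 0 if no match)
replaced = re.sub(pat, '#', s)       # regexp_replace (replaces every match)
substr = m.group(0) if m else None   # regexp_substr (NULL in MySQL if no match)

print(like, not_like, instr, replaced, substr)
```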

regexp_instr code (string indexes in MySQL start at 1):

select '1' as num,regexp_instr('hello123world456','[0-9]+') AS result
union all
-- The argument 7 specifies the position at which the search starts; it defaults to 1 when omitted
select '2' as num,regexp_instr('hello123world456','[0-9]+',7) as result
union all
-- The argument 2 selects the occurrence: when several substrings match the pattern, locate the 2nd match and return its starting index
select '3' as num,regexp_instr('hello123world456','[0-9]+',1,2)as result
union all
select '4' as num,regexp_instr('hello123world456hello123','[0-9]+',1,3)as result

Output:

num	result
1	6
2	7
3	14
4	22
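
To make the roles of the start position and the occurrence arguments concrete, here is a small Python imitation of the four queries. The helper regexp_instr_py is hypothetical; it only mimics the two optional arguments used above and ignores MySQL's other parameters.

```python
import re
from itertools import islice

def regexp_instr_py(text, pattern, pos=1, occurrence=1):
    """Hypothetical helper: 1-based start of the n-th match, or 0 if none."""
    # Start scanning at the 1-based position `pos`; match indexes stay absolute.
    matches = re.compile(pattern).finditer(text, pos - 1)
    # Skip the first (occurrence - 1) matches and take the next one, if any.
    m = next(islice(matches, occurrence - 1, None), None)
    return m.start() + 1 if m else 0

results = [
    regexp_instr_py('hello123world456', r'[0-9]+'),
    regexp_instr_py('hello123world456', r'[0-9]+', 7),
    regexp_instr_py('hello123world456', r'[0-9]+', 1, 2),
    regexp_instr_py('hello123world456hello123', r'[0-9]+', 1, 3),
]
print(results)  # [6, 7, 14, 22]
```

In the second call the search starts at position 7, in the middle of 123, so the match found is 23 and its start index is 7.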

regexp_replace and regexp_substr code:

select '1' as num,regexp_replace('hello123world','[0-9]+',' ') as result
union ALL
select '2' as num,regexp_substr('hello123world','[0-9]+') as result

Output: