学习正则表达式 - 边界

wzy0623

已于 2023-05-19 11:27:50 修改

阅读量1.6k

点赞数 1

分类专栏：用MySQL学正则文章标签：正则表达式 mysql

于 2023-04-27 11:31:59 首次发布

本文链接：https://blog.csdn.net/wzy0623/article/details/130402746

版权

用MySQL学正则专栏收录该内容

26 篇文章 0 订阅

订阅专栏

一、零宽断言

断言（assertions）从字面上理解就是判定是还是否。在正则表达式的系统里，也就是匹配或者不匹配。随便写一个正则表达式，都能产生匹配或者不匹配的结果，所以可以这样说，所有的正则表达式都可以叫断言。

我们也经常会看到零宽断言（zero-width assertions）这个概念。普通的断言，比如 \d+ （匹配一个或者多个数字），它所匹配的内容有长度的；而有些断言比如 ^ 和 $ （分别匹配行开头和结尾）不匹配字符，而是匹配字符串中的位置，这样可以理解为它所匹配的内容长度为0，所以称这类断言为零宽断言(zero-width assertions)。

实际中很多时候提到断言，通常是指零宽断言。零宽断言又可分为锚位符（anchor）和环视（Lookarounds）两类。

锚位符会根据字符串中的当前位置导致匹配成功或失败，但它们不会导致引擎在字符串中前进或消耗字符。下表中列出的元字符是锚位符。

断言	描述	模式	匹配
^	匹配字符串或行的开头。	^\d{3}	901 in 901-333-
$	匹配字符串或行的末尾。	-\d{3}$	-333 in -901-333
\A	匹配字符串的开头。	\A\d{3}	901 in 901-333-
\Z	匹配字符串的末尾，或字符串末尾的 \n 之前。	-\d{3}\Z	-333 in -901-333
\z	匹配字符串的末尾。	-\d{3}\z	-333 in -901-333
\G	匹配必须出现在上一个匹配结束的位置。	\G$\d$	(1)、(3)、(5) in (1)(3)(5)[7](9)
\b	匹配 \w 和 \W字符之间的边界上。	\b\w+\s\w+\b	them theme、them them in them theme them them
\B	匹配非边界。	\Bend\w*\b	ends、ender in end sends endure lender

环视就是要求匹配部分的前面或后面要满足（或不满足）某种规则，如下表所示。

断言	名称	含义	示例
(?<=Y)	肯定逆序环视positive-lookahead	左边是Y	(?<=\d)th 左边是数字的th，能匹配 9th
(?<!Y)	否定逆序环视negative-lookahead	左边不是Y	(?<!\d)th 左边不是数字的th，能匹配 health
(?=Y)	肯定顺序环视positive-lookbehind	右边是Y	six(?=\d)右边是数字的six，能匹配six6
(?!Y)	否定顺序环视negative-lookbehind	右边不是Y	hi(?!\d)右边不是数字的hi，能匹配high

二、行的开始和结束

1. ^ 与 $

就像之前看到的那样，要匹配行或字符串的起始要使用脱字符（U+005E）^。根据上下文，^ 会匹配行或者字符串的起始位置，有时还会匹配整个文档的起始位置。而上下文则依赖于应用程序和在应用程序中所使用的选项。

若要匹配行或字符串的结尾位置要使用美元符 $。正如前一篇中介绍的单行模式与多行模式所述，如果不选择多行模式，整个目标文本被视做一个字符串。

对于上一篇生成的测试数据来说，使用多行模式时，^How.*Country$ 匹配两行，否则只匹配一行。

mysql> select * from t_regexp where regexp_like(a,'^How.*Country$','m')\G
*************************** 1. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
*************************** 2. row ***************************
a: How a Ship having passed the Line was driven by Storms to the cold Country
2 rows in set (0.00 sec)

2. dotall 模式

正则表达式中的 dotall 模式，表示 . 匹配行结束符，而缺省 . 遇到行结束符时会终止匹配。在MySQL的正则表达式函数中，使用 match_type 的 n 值表示使用 dotall 模式。看如下正则表达式：

^THE.*\?$

我们想匹配以THE开头，以 ? 结束的字符串，如果不指定 dotall 模式，不会返回任何记录。指定 dotall 模式后，可以看到它匹配了整个文本。注意MySQL正则中的转义要写两个 \ 。

mysql> select * from t_regexp where regexp_like(a,'^THE.*\\?$')\G
Empty set (0.00 sec)

mysql> select * from t_regexp where regexp_like(a,'^THE.*\\?$','n')\G
*************************** 1. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
1 row in set (0.00 sec)

dotall 选项表示点号除了匹配其他字符之外，还会匹配换行符。取消 dotall 选项，表达式 ^THE.* 则匹配第一行；使用 dotall 选项后，全部文本都会被匹配，不需要使用 \?$ 来匹配文本的结尾。

三、单词边界和非单词边界

\b 匹配单词边界，如 \bTHE\b 匹配单词 THE。就像 ^ 和 $ 一样，\b 是个零宽度断言，表面上它会匹配空格或者是行起始，而实际上它匹配的是个零宽度的不存在的东西。这个理解起来不是很容易，但可以通过观察它匹配和不匹配的内容来理解。

mysql> select regexp_like('the m','\\bTHE\\b');
+----------------------------------+
| regexp_like('the m','\\bTHE\\b') |
+----------------------------------+
|                                1 |
+----------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_like('them','\\bTHE\\b');
+---------------------------------+
| regexp_like('them','\\bTHE\\b') |
+---------------------------------+
|                               0 |
+---------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_like('the m','\\bTHE\\b','c');
+--------------------------------------+
| regexp_like('the m','\\bTHE\\b','c') |
+--------------------------------------+
|                                    0 |
+--------------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_like('the m','\\bTHE\\b','i');
+--------------------------------------+
| regexp_like('the m','\\bTHE\\b','i') |
+--------------------------------------+
|                                    1 |
+--------------------------------------+
1 row in set (0.00 sec)

还可以匹配非单词边界。非单词边界匹配除单词边界之外的位置，比如单词或者字符串中的字母或数字。例如 \Be\B 匹配字母e，而匹配的字母 e 的两边都是其他字母或者是非单词字符。零宽度断言不会匹配两边的字符，但它会识别文字 e 的两边是否是非单词边界。

下面看两个具体的应用。

1. 统计某个单词出现的次数

假设要统计 THE 出现的次数，不区分大小写，实现如下，结果为9。

select (length(a) - length(convert(regexp_replace(a,'\\bthe\\b','') using utf8mb4)))/3 a 
  from t_regexp 
 where regexp_instr(a,'\\n');

说明：

regexp_instr(a,'\\n') 条件只返回带有换行符的多行字符串。
regexp_replace(a,'\\bthe\\b','') 将原字符串中的 the 单词替换掉，用 \b 确定单词边界。regexp_replace函数缺省不区分大小写。
用原字符串长度减去替换掉 the 后的字符串长度，再除以 the 这个单词的长度，结果即为 the 出现的次数。
convert 转换字符集的原因是，MySQL 8.0.17之前有bug，结果返回的是UTF-16字符集，而不是原字符串的字符集，导致 length 函数的返回值会翻倍。（Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)）
MySQL没有提供类似于Oracle的regexp_count()函数，因此只能用替换掉需统计字符串再取长度差的通用方法。

2. 统计单词个数

统计多行字符串中的单词个数（即著名的Wordcount），实现如下，结果为95。

select a, length(convert(regexp_replace(regexp_replace(regexp_replace(regexp_replace(a,'\\s+$','',1,0,'m'),'\\.|,|\\?|"|:|;',' '),'\\s{2,}',' '),'\\w','') using utf8mb4)) c
  from t_regexp 
 where regexp_instr(a,'\\n')\G

说明：

regexp_instr(a,'\\n') 条件只返回带有换行符的多行字符串。
regexp_replace(a,'\\s+$','',1,0,'m') 使用多行模式替换掉所有空行。\s 匹配一个空白字符，包括空格、制表符、换页符和换行符；+ 匹配前面一个字符重复一次或更多次；$ 匹配字符串的结束。多行空行即为以空格开头开头，中间重复多个空格或换行符，再加此字符串结束的一串字符。
regexp_replace(..., '\\.|,|\\?|"|:|;',' ') 将所有相关标点符号替换成空格，用于外层的 \w+ 匹配。
regexp_replace(..., '\\s{2,}',' ') 将多个空格压缩为一个，避免统计多次。
length(convert(regexp_replace(..., '\\w','') using utf8mb4)) 将所有匹配单词替换掉后，剩下的空格个数即为单词数量。\w+ 匹配的结果如下图所示。

四、主题词的起始与结束位置

与锚位符 ^ 相似，简写式 \A 匹配主题词的起始。要匹配主题词的结尾，可以使用 \Z 或 \z。 \Z 和 \z 之间的不同在于当遇到换行符时 \Z 会将其看做字符串结尾匹配，而 \z 只匹配字符串结尾。所谓主题词，简单但不严谨的理解就是将被测试字符串看成一个单一字符串，其首尾的单词。\A \Z \z 不受回车、换行、空行的影响，因此与匹配模式无关。从下面的例子可以看到，即使使用多行模式，\A 也不会匹配除首行外目标字符串。

mysql> select regexp_like(
    -> 'aaa
    '> the bbb','\\A\\s*(THE|The|the)') m;
+---+
| m |
+---+
| 0 |
+---+
1 row in set (0.00 sec)

mysql> select regexp_like(
    -> 'aaa
    '> the bbb','\\A\\s*(THE|The|the)','m') m;
+---+
| m |
+---+
| 0 |
+---+
1 row in set (0.00 sec)

\A\s*(THE|The|the) 匹配单词the出现在行首位置且之前有零个或多个空格。同样是这个正则表达式，测试表数据中可以匹配两行。

mysql> select * from t_regexp where regexp_like(a,'\\A\\s*(THE|The|the)')\G
*************************** 1. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
*************************** 2. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
2 rows in set (0.00 sec)

(MARINERE|Marinere)(.)?\Z 会匹配一行（主题词）尾部的MARINERE或Marinere，之后是任何可选字符。在本例中可选字符就是标点符号或者字母S。点号两边的括号不是必需的。

mysql> select * from t_regexp where regexp_like(a,'(MARINERE|Marinere)(.)?\\Z');
+------------------------------------+
| a                                  |
+------------------------------------+
| 1       It is an ancyent Marinere, |
+------------------------------------+
1 row in set (0.00 sec)

如果将 MARINERE|Marinere 换成 me，则可以匹配多行字符串的结尾。

mysql> select * from t_regexp where regexp_like(a,'(me)(.)?\\Z')\G
*************************** 1. row ***************************
a: THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
*************************** 2. row ***************************
a: 4          "Now wherefore stoppest me?
2 rows in set (0.00 sec)

五、使用元字符的字面值

可以用 \Q 和 \E 之间的字符集匹配字符串字面值。为了展示这一点，考虑如下字符串：

.^$*+?|(){}[]\-

这15个元字符在正则表达式中有特殊含义，用来编写匹配模式。连字符在字符组的方括号中用来表示范围，但在其他情况下无特殊含义。

在尝试匹配这些字符时，可能被MySQL判定为非法的正则表达式，但如果放在 \Q 和 \E 之间则会匹配字面值，因为\Q和\E之间的任意字符都会被解释为普通字符。当然也可以只用转义匹配字面值。

mysql> select regexp_like('.^$*+?|(){}[]\-','+');
ERROR 3688 (HY000): Syntax error in regular expression on line 1, character 1.
mysql> select regexp_like('.^$*+?|(){}[]\-','\\Q+\\E');
+------------------------------------------+
| regexp_like('.^$*+?|(){}[]\-','\\Q+\\E') |
+------------------------------------------+
|                                        1 |
+------------------------------------------+
1 row in set (0.00 sec)

mysql> select regexp_like('.^$*+?|(){}[]\-','\\+');
+--------------------------------------+
| regexp_like('.^$*+?|(){}[]\-','\\+') |
+--------------------------------------+
|                                    1 |
+--------------------------------------+
1 row in set (0.00 sec)

六、在段首加标签

上篇中我们为每行文本加了标签，现在要在段首添加HTML标签。可以利用 \A 的特性轻松实现。如前所述，无论是否使用多行模式，替换结果都一样。(\\A.*) 捕获分组获取第一行，<h1>$1<\h1> 在第一行首尾加标签。

mysql> select regexp_replace(a,'(\\A.*)','<! DOCTYPE html>\n<html lang="en">\n<head><title>Rime</title></head>\n<body>\n<h1>$1<\h1>') a
    ->   from t_regexp
    ->  where regexp_instr(a,'\\n')\G
*************************** 1. row ***************************
a: <! DOCTYPE html>
<html lang="en">
<head><title>Rime</title></head>
<body>
<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.<h1>
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
1 row in set (0.00 sec)

也可以将正则表达式改为 ^(.*)$ 匹配整行，然后只替换多行模式的第一行，能达到相同的效果。

mysql> select regexp_replace(a,'^(.*)$','<! DOCTYPE html>\n<html lang="en">\n<head><title>Rime</title></head>\n<body>\n<h1>$1<\h1>',1,1,'m') a
    ->   from t_regexp
    ->  where regexp_instr(a,'\\n')\G
*************************** 1. row ***************************
a: <! DOCTYPE html>
<html lang="en">
<head><title>Rime</title></head>
<body>
<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.<h1>
ARGUMENT.
How a Ship having passed the Line was driven by Storms to the cold Country
towards the South Pole; and how from thence she made her course to the tropical
Latitude of the Great Pacific Ocean; and of the strange things that befell;
and in what manner the Ancyent Marinere came back to his own Country.
I.
1       It is an ancyent Marinere,
2          And he stoppeth one of three:
3       "By thy long grey beard and thy glittering eye
4          "Now wherefore stoppest me?
1 row in set (0.00 sec)

用 sed 命令实现（rime.txt文件中为原文本内容）：

sed '1!b; i\
<! DOCTYPE html>\
<html lang=\"en\">\
<head>\
<title>Rime</title>\
</head>\
<body>
s/^/<h1>/
s/$/<\/h1>/' rime.txt

sed中的插入命令 i 允许在文件或字符串中的某个位置之前插入文本，1!b; 只替换第一行。

wzy0623

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
学习正则表达式 - 边界

零宽断言、行的开始和结束、dotall 模式、单词边界和非单词边界、主题词的起始与结束位置、使用元字符的字面值、在段首加标签等
复制链接

扫一扫

专栏目录

学习正则表达式 - 边界

一、零宽断言

二、行的开始和结束

1. ^ 与 $

2. dotall 模式

三、单词边界和非单词边界

1. 统计某个单词出现的次数

2. 统计单词个数

四、主题词的起始与结束位置

五、使用元字符的字面值

六、在段首加标签

“相关推荐”对你有帮助么？