A translation of the documentation for pregexp.scm, a portable regular expression library for Scheme

weixin_30784945

Original link: http://www.cnblogs.com/zh-geek/p/6263026.html

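pregexp.scm is used by many Scheme implementations as their built-in regular expression engine. The regexp engine in Racket, for example, grew out of it, and even the documentation is largely the same, so most of this article applies to Racket as well. Commendably, pregexp relies on no implementation-specific syntax or features, which makes it highly portable: it runs on almost every implementation with only minor changes. That said, pregexp was written a long time ago, so the Racket implementation may include performance improvements or bug fixes that pregexp.scm lacks.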

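1. Introduction

A regular expression (regexp) is a pattern string that a regexp matcher tries to match against (a portion of) another string. The string being matched is treated as raw text, not as a pattern.

Most characters in a regexp match occurrences of themselves in the text string. Thus "abc" matches a string that contains the characters a, b, c in succession.

In a regexp pattern, some characters act as metacharacters and some character sequences act as metasequences; that is, they stand for something other than their literal selves. For example, in the regexp "a.c", the characters a and c stand for themselves, but . matches any character (other than newline). So "a.c" matches any three-character sequence that begins with a and ends with c, such as "abc", "aac", "afc", "a*c", ...

To match a literal ., escape it by putting a backslash \ in front of it. The backslash is itself a metacharacter: it matches nothing by itself but turns the metacharacter that follows it into an ordinary character. For example, "a\\.c" matches "a.c". The backslash is doubled because the backslash is also the escape character in Scheme strings, so a Scheme string must contain two backslashes to denote one, just as in C. Another example of a metasequence is \t, which is a readable way to write the tab character.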
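
The double-backslash rule is easy to check at the REPL. The following is a minimal sketch, assuming pregexp.scm can be loaded from the current directory (the load path is an assumption; adjust it for your setup).

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; The Scheme string "a\\.c" holds four characters: a, \, ., c
(string-length "a\\.c")            ; => 4

;; Unescaped, . matches any character, so the first hit is "abc".
(pregexp-match "a.c" "abc a.c")    ; => ("abc")

;; Escaped, \. matches only a literal period.
(pregexp-match "a\\.c" "abc a.c")  ; => ("a.c")
```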

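We call a regexp written as a string a U-regexp, where U can stand for Unix-style or universal, because this notation is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or S-expression. S-regexps are more verbose and harder to read, but they are much easier for Scheme's recursive procedures to process.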

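2. Regexp procedures

pregexp.scm provides the following procedures: pregexp, pregexp-match-positions, pregexp-match, pregexp-split, pregexp-replace, pregexp-replace*, and pregexp-quote. Every procedure introduced by pregexp.scm carries the prefix pregexp, so they are unlikely to clash with other names in Scheme, including the names of any regexp procedures provided natively by the implementation.

2.1 pregexp

pregexp takes a regexp pattern given as a string (a U-regexp) and returns an S-regexp.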

(pregexp "c.r")
=> (:sub (:or (:seq #\c :any #\r)))

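2.2 pregexp-match-positions

The procedure pregexp-match-positions takes a regexp and a text string. It returns a match if the regexp matches (some part of) the text string, and #f otherwise.

The regexp may be either a Unix-style regexp string or a tree-like S-regexp. Given a string, pregexp-match-positions first compiles it into an S-regexp internally and then performs the match. If you expect to use a regexp several times, it is wise to convert it explicitly to an S-regexp with pregexp and keep the result in a variable, which saves the cost of recompiling it each time.

pregexp-match-positions returns #f if the match fails, or a list of index pairs if it succeeds.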

(pregexp-match-positions "brain" "bird")
=> #f

(pregexp-match-positions "needle" "hay needle stack")
=> ((4 . 10))

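In the second example, the integers 4 and 10 identify the substring that was matched: 4 is the starting (inclusive) index and 10 the ending (exclusive) index, which is the usual convention for string indices.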

(substring "hay needle stack" 4 10)
=> "needle"

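Here pregexp-match-positions returns a list containing a single index pair, which gives the position of the matched substring within the whole string. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.

pregexp-match-positions takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place.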

(pregexp-match-positions "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
=> ((31 . 37))

Note that the returned indices are still reckoned relative to the full text string.
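
The advice about compiling a regexp once and reusing it can be sketched as follows. The names needle-re and find-needle are invented for this example, and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Compile the U-regexp once; needle-re is a name chosen for this example.
(define needle-re (pregexp "needle"))

;; find-needle is a hypothetical helper that reuses the compiled S-regexp,
;; so no recompilation happens on each call.
(define (find-needle text)
  (pregexp-match-positions needle-re text))

(map find-needle '("hay needle stack" "just hay" "needle"))
;; => (((4 . 10)) #f ((0 . 6)))
```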

2.3 pregexp-match

pregexp-match is called like pregexp-match-positions, but it returns the matching substrings instead of their index positions.

(pregexp-match "brain" "bird")
=> #f

(pregexp-match "needle" "hay needle stack")
=> ("needle")

pregexp-match also takes optional third and fourth arguments.
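
The relationship between the two procedures can be illustrated by turning index pairs into substrings by hand. This is only a sketch: positions->substrings is a hypothetical helper, not part of pregexp.scm, and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Hypothetical helper: map each (start . end) pair to the substring it
;; denotes, passing #f through for a failed match or a failed submatch.
(define (positions->substrings text positions)
  (and positions
       (map (lambda (p) (and p (substring text (car p) (cdr p))))
            positions)))

(define text "his hay needle stack -- my hay needle stack")

;; Search only between indices 24 and 43.
(positions->substrings text (pregexp-match-positions "needle" text 24 43))
;; => ("needle")

(pregexp-match "needle" text 24 43)
;; => ("needle")
```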

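2.4 pregexp-split

The procedure pregexp-split takes two arguments, a regexp and a text string, and returns a list of substrings of the text string, where the substrings matched by the regexp serve as the delimiters.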

(pregexp-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
=> ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

(pregexp-split " " "pea soup")
=> ("pea" "soup")

If the first argument can match the empty string, as "" does, the result is the list of all the single-character substrings:

(pregexp-split "" "smithereens")
=> ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")

To treat one or more spaces as the delimiter, take care to use the regexp " +", not " *", because " *" can also match the empty string:

(pregexp-split " +" "split pea     soup")
=> ("split" "pea" "soup")

(pregexp-split " *" "split pea     soup")
=> ("s" "p" "l" "i" "t" "p" "e" "a" "s" "o" "u" "p")
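
The same care applies to other delimiters. As a small sketch, the following splits on a comma followed by optional spaces; the comma is mandatory, so the delimiter can never match the empty string. split-fields is a hypothetical helper and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Hypothetical helper: split on a comma plus any spaces that follow it.
(define (split-fields line)
  (pregexp-split ", *" line))

(split-fields "pea, soup,stock")  ; => ("pea" "soup" "stock")
(split-fields "pea")              ; => ("pea")
```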

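2.5 pregexp-replace

The procedure pregexp-replace replaces the matched portion of the text string with another string. The first argument is the regexp, the second the text string, and the third the insert string.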

(pregexp-replace "te" "liberte" "ty")
=> "liberty"

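If the pattern does not occur in the text string, the text string is returned unchanged (it is the same object, in the sense of eq?).

2.6 pregexp-replace*

pregexp-replace* replaces all matches in the text string: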

(pregexp-replace* "te" "liberte egalite fraternite" "ty")
=> "liberty egality fratyrnity"

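As with pregexp-replace, if the pattern does not occur in the text string, the text string is returned unchanged.

2.7 pregexp-quote

pregexp-quote takes an arbitrary string and returns a U-regexp (a string) that matches it exactly. In particular, any characters in the input string that could serve as regexp metacharacters are escaped with a backslash, so that they safely match only themselves.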

(pregexp-quote "cons")
=> "cons"

(pregexp-quote "list?")
=> "list\\?"

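pregexp-quote is useful when building a composite regexp out of a mix of regexp strings and verbatim strings: quoting the verbatim parts ensures that any metacharacters in them are matched literally.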
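
A composite regexp of this kind might be built as in the sketch below, where a regexp fragment (the ^ assertion, described in section 3.1) is joined to a quoted verbatim string. literal-prefix-re is a hypothetical helper, and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Hypothetical helper: anchor a verbatim string at the start of the text,
;; quoting any metacharacters it contains.
(define (literal-prefix-re literal)
  (pregexp (string-append "^" (pregexp-quote literal))))

(pregexp-quote "a.c")                                    ; => "a\\.c"

(pregexp-match (literal-prefix-re "a.c") "a.c is here")  ; => ("a.c")
(pregexp-match (literal-prefix-re "a.c") "abc is here")  ; => #f

;; Without quoting, the period is a metacharacter and "abc" matches too.
(pregexp-match "^a.c" "abc is here")                     ; => ("abc")
```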

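3. The regexp pattern language

Here is a complete description of the regexp pattern language recognized by pregexp.

3.1 Basic assertions

The assertions ^ and $ identify the beginning and the end of the text string respectively. They ensure that their adjoining regexps match at the beginning or at the end of the text string. For example: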

(pregexp-match-positions "^contact" "first contact")
=> #f

The match fails because 'contact' does not occur at the beginning of the text string.

(pregexp-match-positions "laugh$" "laugh laugh laugh laugh")
=> ((18 . 23))

The regexp matches the last 'laugh'.

The metasequence \b asserts that a word boundary exists.

(pregexp-match-positions "yack\\b" "yackety yack")
=> ((8 . 12))

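The 'yack' in 'yackety' does not end at a word boundary, so it is not matched. The second 'yack' does, and so it is matched.

The metasequence \B has the opposite effect: it asserts that a word boundary does not exist.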

(pregexp-match-positions "an\\B" "an analysis")
=> ((3 . 5))

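In other words, the first 'an' is followed by a space, which is a word boundary, so it is not matched. The 'an' that begins 'analysis' is immediately followed by 'alysis' with no boundary in between, so it is matched.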
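
These assertions combine naturally. The sketch below uses \b on both sides to find a whole word, and ^ together with $ to require that the regexp account for the entire text string. The load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; \b on both sides: match "cat" only as a whole word.
;; The "cat" inside "concat" and the one inside "catalog" are skipped.
(pregexp-match-positions "\\bcat\\b" "concat cat catalog")
;; => ((7 . 10))

;; ^ and $ together: the regexp must cover the whole text string.
(pregexp-match "^hello$" "hello")        ; => ("hello")
(pregexp-match "^hello$" "hello world")  ; => #f
```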

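3.2 Characters and character classes

Typically, a character in the regexp matches the same character in the text string. Sometimes it is necessary or convenient to use a regexp metasequence to refer to a single character. Thus the metasequences \n, \r, \t, and \. match the newline, return, tab, and period characters respectively.

The metacharacter . matches any character other than newline.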

(pregexp-match "p.t" "pet")
=> ("pet")

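It also matches 'pat', 'pit', 'pot', 'put', and 'p8t', but not 'pfffft'.

A character class matches any one character from a set of characters. A typical format for this is the bracketed character class [...], which matches any one character from the nonempty sequence of characters enclosed within the brackets. Thus "p[aeiou]t" matches 'pat', 'pet', 'pit', 'pot', 'put', and nothing else.

Inside the brackets, a hyphen - between two characters specifies the range of ASCII characters between them. For example, "ta[b-dgn-p]" matches 'tab', 'tac', 'tad', 'tag', 'tan', 'tao', and 'tap'.

An initial caret ^ after the left bracket inverts the set specified by the rest of the contents, i.e., it specifies the set of characters other than those identified in the brackets. For example, "do[^g]" matches every three-character sequence starting with 'do' except 'dog'.

Note that the metacharacter ^ inside brackets means something quite different from what it means outside. Most other metacharacters (., *, +, ?, etc.) cease to be metacharacters when inside brackets, although you may still escape them for peace of mind. The hyphen - is a metacharacter only when it is inside brackets, and then only when it is neither the first nor the last character.

Bracketed character classes cannot contain other bracketed character classes (although they can contain certain other types of character classes, as described below). Thus a left bracket [ inside a bracketed character class is no longer a metacharacter and can stand for itself. For example, "[a[b]" matches 'a', '[', and 'b'.

Furthermore, since empty bracketed character classes are disallowed, a right bracket ] immediately following the opening left bracket is not a metacharacter either. For example, "[]ab]" matches ']', 'a', and 'b'.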
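
The bracket rules above can be tried out directly. This is a small sketch with made-up text strings; the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Ranges: g falls inside the class [b-dgn-p].
(pregexp-match "ta[b-dgn-p]" "stag party")  ; => ("tag")

;; Inverted class: "dog" is rejected, "dot" is accepted.
(pregexp-match "do[^g]" "dog dot")          ; => ("dot")

;; A right bracket directly after the opening bracket is literal.
(pregexp-match "[]ab]" "x]y")               ; => ("]")

;; A left bracket inside the class is literal.
(pregexp-match "[a[b]" "x[y")               ; => ("[")
```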

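3.2.1 Some frequently used character classes

Some standard character classes can be conveniently represented as metasequences instead of as explicit bracketed expressions. \d matches a digit ([0-9]); \s matches a whitespace character; and \w matches a character that could be part of a "word". Following regexp custom, we take the "word" characters to be [A-Za-z0-9_], i.e., the letters, digits, and underscore that can appear in a C identifier. This is probably too restrictive for what a Scheme programmer would consider a word, since Lisp and Scheme identifiers may use a far wider range of characters.

The uppercase versions of these metasequences stand for the inversions of the corresponding character classes: \D matches a non-digit, \S a non-whitespace character, and \W a non-"word" character.

Remember to write a double backslash when putting these metasequences in a Scheme string: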

(pregexp-match "\\d\\d"
  "0 dear, 1 have 2 read catch 22 before 9")
=> ("22")

These character classes can be used inside a bracketed expression. For example, "[a-z\\d]" matches a lowercase letter or a digit.
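
A few more uses of these metasequences, as a sketch. The + quantifier used here means "one or more" and is described in section 3.3; the text strings are made up and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; \d: runs of digits on either side of a hyphen.
(pregexp-match "\\d+-\\d+" "call 555-1234 now")  ; => ("555-1234")

;; \w: letters, digits and underscore, but not ! or spaces.
(pregexp-match "\\w+" "  hello_world42!")        ; => ("hello_world42")

;; \s: split on runs of whitespace.
(pregexp-split "\\s+" "split  pea soup")         ; => ("split" "pea" "soup")

;; \d inside a bracketed expression, next to a range.
(pregexp-match "[a-z\\d]+" "ABC abc123 DEF")     ; => ("abc123")
```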

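3.2.2 POSIX character classes

A POSIX character class is a special metasequence of the form [:...:] that can be used only inside a bracketed expression. The POSIX classes supported are:

[:alnum:]       ;; letters and digits
[:alpha:]       ;; letters
[:algor:]       ;; the letters 'c', 'h', 'a' and 'd'
[:ascii:]       ;; 7-bit ASCII characters
[:blank:]       ;; widthful whitespace, i.e., space and tab (newline and return are not included)
[:cntrl:]       ;; control characters, i.e., those with ASCII code less than 32
[:digit:]       ;; digits, same as \d
[:graph:]       ;; visible characters, i.e., those that use ink
[:lower:]       ;; lowercase letters
[:print:]       ;; visible characters plus space and tab
[:space:]       ;; whitespace, same as \s
[:upper:]       ;; uppercase letters
[:word:]        ;; letters, digits and underscore, same as \w
[:xdigit:]      ;; hexadecimal digits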

For example, the regexp "[[:alpha:]_]" matches a letter or an underscore.

(pregexp-match "[[:alpha:]_]" "--x--")
=> ("x")

(pregexp-match "[[:alpha:]_]" "--_--")
=> ("_")

(pregexp-match "[[:alpha:]_]" "--:--")
=> #f

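The POSIX class notation is valid only inside a bracketed expression. For instance, "[:alpha:]", when not inside a bracketed expression, is not read as the letter class. By the rules given earlier, it is an ordinary bracketed character class containing the characters ':', 'a', 'l', 'p', and 'h'.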

(pregexp-match "[:alpha:]" "--a--")
=> ("a")

(pregexp-match "[:alpha:]" "--_--")
=> #f

Inserting a caret ^ immediately after [: gives the inversion of a POSIX character class. Thus [:^alpha:] is the class containing all characters except the letters.
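
The following sketch shows POSIX classes, including an inverted one, inside bracketed expressions. The + quantifier ("one or more") is described in section 3.3; the text strings are made up and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; An uppercase letter followed by one or more lowercase letters.
(pregexp-match "[[:upper:]][[:lower:]]+" "hello World")  ; => ("World")

;; Inverted class: the first run of non-letters.
(pregexp-match "[[:^alpha:]]+" "abc123def")              ; => ("123")

;; Two POSIX classes in the same bracketed expression.
(pregexp-match "[[:alpha:][:digit:]]+" "--r5rs--")       ; => ("r5rs")
```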

3.3 Quantifiers

The quantifiers *, +, and ? match, respectively, zero or more, one or more, and zero or one instances of the preceding subpattern.

(pregexp-match-positions "c[ad]*r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]*r" "cr")
=> ((0 . 2))

(pregexp-match-positions "c[ad]+r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]+r" "cr")
=> #f

(pregexp-match-positions "c[ad]?r" "cadaddadddr")
=> #f
(pregexp-match-positions "c[ad]?r" "cr")
=> ((0 . 2))
(pregexp-match-positions "c[ad]?r" "car")
=> ((0 . 3))

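3.3.1 Numeric quantifiers

You can use braces to specify much finer-tuned quantification than is possible with *, +, and ?.

The quantifier {m} matches exactly m instances of the preceding subpattern. m must be a nonnegative integer.

The quantifier {m,n} matches at least m and at most n instances. m and n must be nonnegative integers with m <= n. You may omit either or both numbers, in which case m defaults to 0 and n to infinity.

It is evident that + and ? are abbreviations for {1,} and {0,1} respectively. * abbreviates {,}, which is the same as {0,}.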

(pregexp-match "[aeiou]{3}" "vacuous")
=> ("uou")

(pregexp-match "[aeiou]{3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "zeugma")
=> ("eu")

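3.3.2 Non-greedy quantifiers

The quantifiers described above are greedy, i.e., they match the maximal number of instances that still allows the overall match to succeed.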

(pregexp-match "<.*>" "<tag1> <tag2> <tag3>")
=> ("<tag1> <tag2> <tag3>")

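To make these quantifiers non-greedy, append a ? to them. Non-greedy quantifiers match the minimal number of instances needed to ensure an overall match.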

(pregexp-match "<.*?>" "<tag1> <tag2> <tag3>")
=> ("<tag1>")

The non-greedy quantifiers are, respectively, *?, +?, ??, {m}?, and {m,n}?. Note the two distinct uses of the metacharacter ?.
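
A non-greedy quantifier is handy when collecting every occurrence of a delimited item. The sketch below uses the optional start and end arguments of pregexp-match-positions to resume the search after each match. all-matches is a hypothetical helper, not part of pregexp.scm, and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Hypothetical helper: collect every non-overlapping match of pat in text.
;; It stops at the first failure, or at an empty match to avoid looping.
(define (all-matches pat text)
  (let ((re (if (string? pat) (pregexp pat) pat)) ; compile once
        (n (string-length text)))
    (let loop ((start 0) (acc '()))
      (let ((m (pregexp-match-positions re text start n)))
        (if (or (not m) (= (caar m) (cdar m)))
            (reverse acc)
            (loop (cdar m)
                  (cons (substring text (caar m) (cdar m)) acc)))))))

;; Non-greedy: each tag is found separately.
(all-matches "<.*?>" "<tag1> <tag2> <tag3>")
;; => ("<tag1>" "<tag2>" "<tag3>")

;; Greedy: one match swallows everything up to the last >.
(all-matches "<.*>" "<tag1> <tag2> <tag3>")
;; => ("<tag1> <tag2> <tag3>")
```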

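3.4 Clusters

Clustering, i.e., enclosure within parentheses (...), identifies the enclosed subpattern as a single entity. It causes the matcher to capture the submatch, i.e., the portion of the text string that matches the subpattern, in addition to the overall match. The result list therefore begins with the overall match, followed by one submatch per cluster, in the order in which the clusters' opening parentheses appear in the regexp.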

(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1" "1970")

Clustering also causes a following quantifier to treat the entire enclosed subpattern as a single entity.

(pregexp-match "(poo )*" "poo poo platter")
=> ("poo poo " "poo ")

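The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.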

(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;")
=> ("lather; rinse; repeat;" " repeat;")

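Here the quantified subpattern matches three times, but only the last submatch is returned.

It is also possible for a quantified subpattern to fail to match even though the overall pattern matches. In such cases, the failing submatch is represented by #f.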

(define date-re
  ;match `month year' or `month day, year'.
  ;subpattern matches day, if present
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

(pregexp-match date-re "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1," "1970")

(pregexp-match date-re "jan 1970")
=> ("jan 1970" "jan" #f "1970")
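
Because a failed submatch shows up as #f, code that takes the result apart should check for it. The sketch below reuses the date regexp from above; parse-date is a hypothetical helper that returns the month, the day as a number (or #f), and the year as a number. The load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Same regexp as date-re above.
(define date-re
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

;; Hypothetical helper: returns (month day year), with day = #f if absent,
;; or #f if the text does not look like a date at all.
(define (parse-date text)
  (let ((m (pregexp-match date-re text)))
    (and m
         (let ((month (cadr m))
               (day   (caddr m))     ; e.g. "1," or #f
               (year  (cadddr m)))
           (list month
                 (and day            ; strip the trailing comma
                      (string->number
                       (substring day 0 (- (string-length day) 1))))
                 (string->number year))))))

(parse-date "jan 1, 1970")  ; => ("jan" 1 1970)
(parse-date "jan 1970")     ; => ("jan" #f 1970)
(parse-date "1970")         ; => #f
```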

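3.4.1 Backreferences

Submatches can be used in the insert string argument of the procedures pregexp-replace and pregexp-replace*. The insert string can use \n, where n is a number, as a backreference to the nth submatch, i.e., the substring that matched the nth subpattern. \0 refers to the entire match, and it can also be written as \&.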

(pregexp-replace "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the _pinta_, and the _santa maria_"

(pregexp-replace* "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the *pinta*, and the *santa maria*"

;recall: \S stands for non-whitespace character

(pregexp-replace "(\\S+) (\\S+) (\\S+)"
  "eat to live"
  "\\3 \\2 \\1")
=> "live to eat"

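Use \\ in the insert string to specify a literal backslash. Also, \$ stands for an empty string and is useful for separating a backreference \n from an immediately following digit.

Backreferences can also be used within the regexp pattern itself to refer back to a subpattern that has already been matched. \n stands for an exact repeat of the nth submatch.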

(pregexp-match "([a-z]+) and \\1"
  "billions and billions")
=> ("billions and billions" "billions")

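Note that a backreference is not simply a repeat of the earlier subpattern. Rather, it is a repeat of the particular substring already matched by that subpattern.

In the example above, the backreference can match only 'billions'. It will not match 'millions', even though the subpattern it refers back to, ([a-z]+), would have had no trouble doing so: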

(pregexp-match "([a-z]+) and \\1"
  "billions and millions")
=> #f

The following corrects doubled words:

(pregexp-replace* "(\\S+) \\1"
  "now is the the time for all good men to to come to the aid of of the party"
  "\\1")
=> "now is the time for all good men to come to the aid of the party"

The following marks all immediately repeating patterns in a number string:

(pregexp-replace* "(\\d+)\\1"
  "123340983242432420980980234"
  "{\\1,\\1}")
=> "12{3,3}40983{24,24}3242{098,098}0234"
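
Here are a few more insert-string backreferences, including \& and \$, as a sketch with made-up text strings. The load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Swap two captured words.
(pregexp-replace "(\\w+), (\\w+)" "Doe, Jane" "\\2 \\1")
;; => "Jane Doe"

;; \& stands for the entire match.
(pregexp-replace* "\\d+" "7 apples and 12 pears" "<\\&>")
;; => "<7> apples and <12> pears"

;; \$ separates backreference \1 from the literal digit 0 that follows;
;; without it, the insert string would refer to submatch number 10.
(pregexp-replace "(a)" "a" "\\1\\$0")
;; => "a0"
```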

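3.4.2 Non-capturing clusters

It is often necessary to specify a cluster (typically for quantification) without triggering the capture of submatch information. Such clusters are called non-capturing. In that case, use (?: instead of ( as the cluster opener. In the following example, a non-capturing cluster consumes the "directory" portion of a given pathname, and a capturing cluster identifies the basename.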

(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
  "/usr/local/bin/mzscheme")
=> ("/usr/local/bin/mzscheme" "mzscheme")

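3.4.3 Cloisters

The location between the ? and the : of a non-capturing cluster is called a cloister. You can put modifiers there that cause the enclosed subpattern to be treated specially. The modifier i causes the subpattern to match case-insensitively: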

(pregexp-match "(?i:hearth)" "HeartH")
=> ("HeartH")

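The modifier x causes the subpattern to match space-insensitively, i.e., spaces and comments within the subpattern are ignored. Comments are introduced as usual with a semicolon and extend to the end of the line. If you need to include a literal space or semicolon in a space-insensitized subpattern, escape it with a backslash.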

(pregexp-match "(?x: a   lot)" "alot")
=> ("alot")

(pregexp-match "(?x: a  \\  lot)" "a lot")
=> ("a lot")

(pregexp-match "(?x:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "a man; a plan; a canal")
=> ("a man; a plan; a canal")

The global variable *pregexp-comment-char* contains the comment character (#\;). To use Perl-style comments instead:

(set! *pregexp-comment-char* #\#)

You can put more than one modifier in the cloister:

(pregexp-match "(?ix:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "A Man; a Plan; a Canal")
=> ("A Man; a Plan; a Canal")

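A minus sign before a modifier inverts its meaning. Thus you can use -i and -x in a subcluster to overturn the insensitivities caused by an enclosing cluster.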

(pregexp-match "(?i:the (?-i:TeX)book)"
  "The TeXbook")
=> ("The TeXbook")

This regexp allows any casing for 'the' and 'book' but insists that 'TeX' not be differently cased.

3.5 Alternation

You can specify a list of alternate subpatterns by separating them by |. The | separates subpatterns in the nearest enclosing cluster (or in the entire pattern string if there are no enclosing parens).

(pregexp-match "f(ee|i|o|um)" "a small, final fee")
=> ("fi" "i")

(pregexp-replace* "([yi])s(e[sdr]?|ing|ation)"
   "it is energising to analyse an organisation
   pulsing with noisy organisms"
   "\\1z\\2")
=> "it is energizing to analyze an organization
   pulsing with noisy organisms"

Note again that if you wish to use clustering merely to specify a list of alternate subpatterns but do not want the submatch, use (?: instead of (.

(pregexp-match "f(?:ee|i|o|um)" "fun for all")
=> ("fo")

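An important thing to note about alternation is that the leftmost matching alternate is picked regardless of its length. Thus, if one of the alternates is a prefix of a later alternate, the latter may never get a chance to match.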

(pregexp-match "call|call-with-current-continuation"
  "call-with-current-continuation")
=> ("call")

To allow the longer alternate to have a shot at matching, place it before the shorter one:

(pregexp-match "call-with-current-continuation|call"
  "call-with-current-continuation")
=> ("call-with-current-continuation")

In any case, an overall match for the entire regexp is always preferred to an overall nonmatch. In the following, the longer alternate still wins, because its preferred shorter prefix fails to yield an overall match.

(pregexp-match "(?:call|call-with-current-continuation) constrained"
  "call-with-current-continuation constrained")
=> ("call-with-current-continuation constrained")
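
When the alternates come from a list of verbatim strings, the ordering advice can be automated. The sketch below sorts the strings longest first, quotes each with pregexp-quote, and joins them with |. All helper names are hypothetical (none are part of pregexp.scm), a simple insertion sort is used because portable Scheme has no standard sort procedure, and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Insert s into a list that is already sorted by decreasing length.
(define (insert-by-length s sorted)
  (cond ((null? sorted) (list s))
        ((>= (string-length s) (string-length (car sorted)))
         (cons s sorted))
        (else (cons (car sorted) (insert-by-length s (cdr sorted))))))

;; Insertion sort: longest strings first.
(define (longest-first strings)
  (let loop ((rest strings) (sorted '()))
    (if (null? rest)
        sorted
        (loop (cdr rest) (insert-by-length (car rest) sorted)))))

;; Join a list of strings with | between them.
(define (join-with-bar strings)
  (if (null? strings)
      ""
      (let loop ((rest (cdr strings)) (acc (car strings)))
        (if (null? rest)
            acc
            (loop (cdr rest) (string-append acc "|" (car rest)))))))

;; Build a U-regexp that matches any of the given verbatim strings,
;; trying longer strings before their shorter prefixes.
(define (alternation-of literals)
  (join-with-bar (map pregexp-quote (longest-first literals))))

(alternation-of '("call" "call-with-current-continuation" "call/cc"))
;; => "call-with-current-continuation|call/cc|call"

(pregexp-match (alternation-of '("call" "call-with-current-continuation"))
               "call-with-current-continuation")
;; => ("call-with-current-continuation")
```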

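3.6 Backtracking

We have already seen that greedy quantifiers match the maximal number of times, but the overriding priority is that the overall match succeed. Consider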

(pregexp-match "a*a" "aaaa")

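The regexp consists of two subregexps, a* followed by a. The subregexp a* cannot be allowed to match all four a's in the text string "aaaa", even though * is a greedy quantifier. It may match only the first three, leaving the last one for the second subregexp. This ensures that the full regexp matches successfully.

The regexp matcher accomplishes this via a process called backtracking. The matcher tentatively allows the greedy quantifier to match all four a's, but when it becomes clear that the overall match is in jeopardy, it backtracks to a less greedy match of three a's. If even this fails, as in the call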

(pregexp-match "a*aa" "aaaa")

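the matcher backtracks even further. Overall failure is conceded only when all possible backtracking has been tried with no success.

Backtracking is not restricted to greedy quantifiers. Non-greedy quantifiers match as few instances as possible and progressively backtrack to more and more instances in order to attain an overall match. There is backtracking in alternation too, as the more rightward alternates are tried when locally successful leftward ones fail to yield an overall match.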
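
Wrapping each subregexp in a cluster makes the outcome of backtracking visible, because the submatches show how much each part ended up with. A small sketch (the load path is an assumption):

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Greedy a* gives back one a so that the second cluster can match.
(pregexp-match "(a*)(a)" "aaaa")    ; => ("aaaa" "aaa" "a")

;; Here it has to give back two.
(pregexp-match "(a*)(aa)" "aaaa")   ; => ("aaaa" "aa" "aa")

;; Non-greedy a*? starts with nothing and grows until $ can match.
(pregexp-match "(a*?)(a)$" "aaaa")  ; => ("aaaa" "aaa" "a")
```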

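3.6.1 Disabling backtracking

Sometimes it is efficient to disable backtracking. For example, we may wish to commit to a choice, or we may know that trying alternatives is fruitless. A non-backtracking regexp is enclosed in (?>...).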

(pregexp-match "(?>a+)." "aaaa")
=> #f

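In this call, the subregexp (?>a+) greedily matches all four a's and is denied the opportunity to backtrack, so the overall match fails. The effect of the regexp is therefore to match one or more a's followed by something that is definitely not an a.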
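
For contrast, here is the same pattern with and without the non-backtracking cluster, as a small sketch (the load path is an assumption):

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Ordinary a+ backtracks: it gives up the last a so that . can match it.
(pregexp-match "a+." "aaaa")      ; => ("aaaa")

;; (?>a+) keeps all four a's, leaving nothing for the period.
(pregexp-match "(?>a+)." "aaaa")  ; => #f

;; With a non-a character after the a's, no backtracking is needed.
(pregexp-match "(?>a+)." "aaab")  ; => ("aaab")
```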

3.7 Looking ahead and behind

You can have assertions in your pattern that look ahead or behind to ensure that a subpattern does or does not occur. These “look around” assertions are specified by putting the subpattern checked for in a cluster whose leading characters are: ?= (for positive lookahead), ?! (negative lookahead), ?<= (positive lookbehind), ?<! (negative lookbehind). Note that the subpattern in the assertion does not generate a match in the final result. It merely allows or disallows the rest of the match.

3.7.1 Lookahead

Positive lookahead (?=) peeks ahead to ensure that its subpattern could match.

(pregexp-match-positions "grey(?=hound)"
  "i left my grey socks at the greyhound")
=> ((28 . 32))

The regexp "grey(?=hound)" matches grey, but only if it is followed by hound. Thus, the first grey in the text string is not matched.

Negative lookahead (?!) peeks ahead to ensure that its subpattern could not possibly match.

(pregexp-match-positions "grey(?!hound)"
  "the gray greyhound ate the grey socks")
=> ((27 . 31))

The regexp "grey(?!hound)" matches grey, but only if it is not followed by hound. Thus the grey just before socks is matched.

3.7.2 Lookbehind

Positive lookbehind (?<=) checks that its subpattern could match immediately to the left of the current position in the text string.

(pregexp-match-positions "(?<=grey)hound"
  "the hound in the picture is not a greyhound")
=> ((38 . 43))

The regexp (?<=grey)hound matches hound, but only if it is preceded by grey.

Negative lookbehind (?<!) checks that its subpattern could not possibly match immediately to the left.

(pregexp-match-positions "(?<!grey)hound"
  "the greyhound in the picture is not a hound")
=> ((38 . 43))

The regexp (?<!grey)hound matches hound, but only if it is not preceded by grey.

Lookaheads and lookbehinds can be convenient when they are not confusing.
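
One way they can confuse is through backtracking. In the sketch below, a negative lookahead after a greedy \d+ lets the matcher settle for fewer digits, which is probably not what was meant; wrapping the digits in a non-backtracking cluster (section 3.6.1) avoids that. The text strings are made up and the load path is an assumption.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

(define text "pay 100 dollars or 200 euros")

;; Positive lookahead: digits that are followed by " dollars".
(pregexp-match "\\d+(?= dollars)" text)     ; => ("100")

;; Pitfall: \d+ backtracks from "100" to "10", and "10" is indeed
;; not followed by " dollars" (it is followed by "0 dollars").
(pregexp-match "\\d+(?! dollars)" text)     ; => ("10")

;; With backtracking disabled, only a complete run of digits counts.
(pregexp-match "(?>\\d+)(?! dollars)" text) ; => ("200")

;; Positive lookbehind: digits preceded by a literal dollar sign.
(pregexp-match "(?<=\\$)\\d+" "cost: $45")  ; => ("45")
```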

Reposted from: https://www.cnblogs.com/zh-geek/p/6263026.html
