Regular Expression Lookaround Demystified

cunchi8090

Original article: https://www.experts-exchange.com/articles/4318/Regular-Expression-Lookaround-Demystified.html

As most anyone who uses or has come across them can attest, regular expressions (regex) are a complicated bit of magic. Packed succinctly within their cryptic syntax lies a great deal of power. It's not the "take over the world" kind of power, at least not to the average programmer, but it is the kind of power that can be used to save numerous lines of code. One of the more complicated regex tools I'd like to describe to you is lookaround. When executed properly, lookaround can supercharge your patterns to provide you pattern-matching capabilities otherwise achieved through numerous procedures and even more numerous lines of code.

Regular expression lookaround is not a glaringly simple concept when you first see it. For this reason, readers of this article should at least be familiar with regular expressions in general. EE contributor BatuhanCetin has written a nice introduction to regular expressions here: Regular Expressions Starter Guide.

Outside of its complexity, another thing to be mindful of is that not every regex engine supports lookaround. If you plan on experimenting with any of the patterns demonstrated in this article, you should confirm that your editor or language supports lookaround. As described in the section Types of Lookaround, the two directions of lookaround are lookahead and lookbehind. Regex engines can implement none, one, or both directions. Be sure you are using a utility which supports the type of lookaround you are testing.

Let me first start with a clarification. There is a theoretical concept of regular expression and a practical concept. Of course, the practical is based on the theoretical. The difference is that we don't have a concept of lookaround in theoretical regular expressions--at least not in the sense that we use them in the practical case. This article deals with the practical case, obviously!

What Is Lookaround?

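The overall concept of lookaround is simple: at my current position in the matching process, look ahead (or behind, depending), and see whether some pattern matches (or doesn't match, depending) before continuing. "Big deal! That's what a regex pattern does anyway. It matches text by checking every character," you say?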

Well, the first thing to be aware of when working with lookaround is that it is a non-consuming match. A non-consuming match is a match that is evaluated to see if it can succeed, but it is not actually consumed by the regex engine. To see what that means, consider ordinary consumption: when your regex engine evaluates a character and determines that it is still in line with the pattern, it "forgets" about this character and moves on to the next one. A lookaround performs the same kind of check, but the engine's position in the string does not move. During the course of this article you will see that the "forgetting" part is not entirely true; for the moment, accept that it is.

One way of thinking about this non-consumption idea is to think of it like going to the deli and taking a number. Let's say you pull number 5 and then you leave. At the time of your departure, you know you have 5, so you can safely assume that 6 is the next ticket (because you know the tickets are sequential, in this case). After ten minutes pass, you return to the deli. You look at the number dispenser and ask yourself, "What is the next number to be dispensed?" You are not going to actually take the number, you just want to look and see what it is. Why? Who knows. Perhaps you just like knowing that you got in-and-out before the next deli-lover arrived.
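
To make the non-consuming idea concrete, here is a minimal sketch in Python (my choice of language for illustration only; the article itself is language-neutral). It runs a pattern that consists of nothing but a lookahead and then inspects where the match ended. The sample string is an assumption.

```python
import re

text = "ab1c"

# The whole pattern is a single positive lookahead: "a digit exists ahead".
m = re.match(r"(?=.*?[0-9])", text)

print(m is not None)     # True: the lookahead succeeded
print(m.end())           # 0: the match pointer never left the start
print(repr(m.group()))   # '': no characters were consumed
```

The lookahead had to scan past "a" and "b" to find the "1", yet the overall match is empty and ends at position 0. That is the deli visit where you look at the dispenser but do not take a ticket.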

Types of Lookaround

What happened in the deli example could be considered a positive lookahead. In many (but not all) regex engines, we have two directions of lookaround: lookahead and lookbehind. Both of these directions are as they sound: lookahead peeks forward of the current position and lookbehind peeks backward. I previously said the regex engine forgets about a character once it has been determined to satisfy the pattern. Here's the contradiction: when you use a lookbehind, you can actually peek at characters the engine has already evaluated.

In addition to the directions, we also have two concepts of matching: positive (matching) and negative (not matching). When you use a positive lookaround, you are informing the regex engine that you would like to verify some pattern can be matched. With a negative lookaround, you want some pattern to not match. The thing to be mindful of in using a negative lookaround is that failing to match a pattern is actually a success. As with direction, not all regex engines implement both concepts of matching.

Here's a summary of the four primitive possibilities you can have with lookaround:

positive lookahead: ahead of current position, see if pattern matches

positive lookbehind: prior to current position, see if pattern matches

negative lookahead: ahead of current position, see if pattern does not match

negative lookbehind: prior to current position, see if pattern does not match
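
The four possibilities above can be sketched side by side. This is a small illustration in Python's re module, which happens to support all four; the sample string and the variable names are my own assumptions.

```python
import re

s = "a1 b2 c"

# positive lookahead: a letter that IS followed by a digit
pos_ahead = re.findall(r"[a-z](?=\d)", s)

# negative lookahead: a letter that is NOT followed by a digit
neg_ahead = re.findall(r"[a-z](?!\d)", s)

# positive lookbehind: a digit that IS preceded by a letter
pos_behind = re.findall(r"(?<=[a-z])\d", s)

# negative lookbehind: a letter that is NOT preceded by a space
neg_behind = re.findall(r"(?<! )[a-z]", s)

print(pos_ahead)    # ['a', 'b']
print(neg_ahead)    # ['c']
print(pos_behind)   # ['1', '2']
print(neg_behind)   # ['a']
```

Note that in every case only the character outside the lookaround ends up in the result; the lookaround part merely decides whether that character qualifies.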

Lookaround by Example

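Lookaround can be a bit mind-boggling when you are staring at the construct within a pattern. It may be easier if you think of a pattern with lookaround as having two pointers--one for the pattern itself (the consuming part) and one for the lookaround (the non-consuming part). Here are two demonstrations.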

Lookahead

Let's say you are interested in checking a password field, which can accept alpha-numeric characters, for the existence of at least one digit. There are a couple of ways you can approach this. You could write your pattern as:

^[a-zA-Z0-9]*[0-9][a-zA-Z0-9]*$

Alternatively, using lookahead, you could write it as:

^(?=.*?[0-9])[a-zA-Z0-9]+$

Notice there is a new construct in the pattern: (?= ... ). This denotes a lookahead, and it is positive ( = ). A negative lookahead would exchange the equals sign for an exclamation point ( ! ). This syntax is typical of most regex engines.

Now in this trivial example, the benefits aren't that bountiful. But for now, I'm going to stick with it for the subsequent illustrations. To see a more real-world-applicable example, see the "Real-world Examples" section of the article.

Let's initialize our engine with the password ab1c:

(Figure: the target string ab1c, with a red arrow marking the main pointer in front of the first character.)

Notice that the main pointer sits in the void before the "a". This is because you can match positions as opposed to characters with regex. If you have ever used ^ or $ to match the beginning or end of a string, respectively, then you have matched positions. In fact, ^ at the beginning of our pattern above matches the location of the red arrow in the figure.

Now we evaluate the lookahead, which lets us peek forward without forgetting our current position.

The first part of the lookahead specifies the non-greedy dot-star notation, which means it will match any character, zero-or-more times. The match will be minimal, so the first successful match will indicate success. In short, this part of the pattern will match the first two letters in the target string and advance our lookahead pointer to the only digit in the target string. The .*?[0-9] puts the engine in this state: the lookahead pointer sits just past the "1", while the main pointer is still in front of the "a".

Notice that our main pointer has not moved at all. Again, this is because our lookahead is non-consuming. Because the lookahead succeeded, we can continue processing the remainder of the pattern. Here is the state of the engine after the success of the lookahead:

Yes, it's the same as our initialized engine. In the interest of space, I will not show the progress of our main pointer--just realize that at this point, since our lookahead succeeded, the remainder of the processing of our pattern will occur as we expect, checking each character one-by-one until the end of string is reached. Because the non-lookahead portion of our pattern is [a-zA-Z0-9]+ and our string consists of only letters and a single digit, the pattern as a whole will match.

Had we made our lookahead negative instead of positive, the existence of the digit within our password would have caused the match to fail. Of course, it's a bit contradictory to specify that your character class be comprised of alphanumeric characters, and then have a lookahead that says to not find any digits. The point to bear in mind is that changing from positive to negative would cause failure in this example.
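
The walk-through above can be checked with a short sketch. I am using Python's re module purely as an assumed test bed; the three patterns are the plain version, the positive-lookahead version, and the negative variant just described, and the sample passwords other than ab1c are my own.

```python
import re

plain = re.compile(r"^[a-zA-Z0-9]*[0-9][a-zA-Z0-9]*$")
positive = re.compile(r"^(?=.*?[0-9])[a-zA-Z0-9]+$")
negative = re.compile(r"^(?!.*?[0-9])[a-zA-Z0-9]+$")

def check(pw):
    """Return (plain, positive, negative) match results as booleans."""
    return (
        plain.match(pw) is not None,
        positive.match(pw) is not None,
        negative.match(pw) is not None,
    )

print(check("ab1c"))    # (True, True, False): digit present
print(check("abcd"))    # (False, False, True): no digit at all
print(check("ab1c!"))   # (False, False, False): '!' is outside the class
```

The plain and positive-lookahead patterns agree on every input here, which is the point of the example: the lookahead states the "at least one digit" rule separately from the "only alphanumerics" rule.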

Lookbehind

Let's modify the previous password requirement set forth by our original pattern. We now want a pattern which will match passwords containing at least one digit, where that digit must not occur at the end of the string. Continuing with the same target password string, we could again accomplish this via a well-structured pattern (note that the patterns in this section check only the new condition, that the last character is not a digit):

^[a-zA-Z0-9]+[a-zA-Z]$

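This works by demanding a letter in the final position. Alternatively, we can state directly that a digit must not be found at the end of the target string. If we modify the pattern to use a negative lookbehind, we could end up with: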

^[a-zA-Z0-9]+(?<![0-9])$

Admittedly, there are not many keystroke savings in this pattern, but it will demonstrate how negative lookbehind works. As with the lookahead example, you will notice another novel construct in this pattern: (?<! ... ). This indicates the lookaround to be a lookbehind. This time, the matching concept is negative; to convert to positive, you would exchange the exclamation point with an equals sign.

Similarly to the lookahead example, our engine will have the same initialized state. The engine will process each character within our target string, consuming each character up to the end of the string. But, now we have a lookbehind to process. Here's what we see for the initialized lookbehind: the main pointer is at the end of the string, and the lookbehind pointer examines the character immediately behind it, the "c".

The lookbehind pattern looks for a digit, but since the lookbehind is negative, not finding a digit is a success. Since "c" is not a digit, our negative lookbehind succeeds, and subsequently our entire pattern succeeds. Had we used a positive lookbehind, our pattern would have failed since we wanted to find a digit, but instead found an alpha character.

Again, this is a very trivial example. Please have a look at the "Real-world Examples" section of the article for a more realistic use of this feature.
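
Here is a minimal sketch of the lookbehind walk-through, again in Python as an assumed test bed. It compares the negative lookbehind from the pattern above with its positive counterpart; the password abc1 is my own addition for contrast.

```python
import re

# Negative lookbehind: the character just before the end must NOT be a digit.
no_trailing_digit = re.compile(r"^[a-zA-Z0-9]+(?<![0-9])$")

# Positive lookbehind: the character just before the end MUST be a digit.
trailing_digit = re.compile(r"^[a-zA-Z0-9]+(?<=[0-9])$")

for pw in ["ab1c", "abc1"]:
    print(
        pw,
        no_trailing_digit.match(pw) is not None,
        trailing_digit.match(pw) is not None,
    )
# ab1c True False
# abc1 False True
```

Both lookbehinds run only after the character class has consumed the whole string, so each one peeks at a character the engine has already evaluated.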

Limitations of Lookaround

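Up to this point, lookaround has been presented as a powerful extension to regular expression matching. The unfortunate fact is that not every engine supports lookaround. Many engines implement lookahead, but some of them do not support lookbehind. The engines that most commonly lack lookaround are those that ship with IDEs (specifically, their find/replace dialogs). Check your language's (or IDE's) documentation for lookaround support.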

Another restriction of most regex engines is that lookbehind (and possibly some lookaround) cannot have unbounded patterns within them. An unbounded pattern would be one that can have an unlimited number of repetitions. Using star and plus quantifiers would be one example. The only language(s) I have personally encountered which do support unbounded quantifiers within lookaround are the .NET languages. One way to overcome this limitation would be to give some upper-bounded quantifier (we're talking curly braces here) that has a very large number. It's not very extensible, but it could get you by.
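
As one concrete data point, here is a sketch of how this restriction shows up in Python's standard re module. As far as I know, that module is stricter than the paragraph above describes: it requires lookbehind patterns to be fixed-width, so the large-upper-bound workaround would not be accepted there either. The sample patterns and string are assumptions.

```python
import re

# An unbounded quantifier inside a lookbehind is rejected at compile time.
try:
    re.compile(r"(?<=\d+)px")
    unbounded_ok = True
except re.error as exc:
    unbounded_ok = False
    print("rejected:", exc)

# A fixed-width lookbehind compiles and behaves as expected.
fixed = re.compile(r"(?<=\d{3})px")
m = fixed.search("width: 120px")
print(m.group(), m.start())   # px 10
```

Lookahead has no such restriction in this module, which matches the article's remark that the limitation applies mainly to lookbehind.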

Addendum: Having just participated in a question dealing with it, I have come to find out that the Boost C++ libraries support unbounded lookaround--at least v1.40 does.

Many of the languages which support lookaround also support capture groups within lookaround. The caveat with this feature is that some languages only preserve the capture within the lookaround itself; others allow the captured value to be backreferenced outside of the lookaround. Refer to your language's documentation to confirm the scope of capture groups.
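
To illustrate the capture-group point, here is a small sketch showing how Python's re module behaves; other engines may scope captures differently, so treat this as one example rather than a rule. The sample strings are assumptions.

```python
import re

# The group is captured inside the lookahead, which consumes nothing;
# \w+ then consumes the whole string on its own.
m = re.match(r"(?=(\d+))\w+", "123abc")
print(m.group(0))   # 123abc
print(m.group(1))   # 123

# A group captured inside the lookahead is backreferenced outside of it:
# "the next character, twice in a row".
doubled = re.compile(r"(?=(\w))\1\1")
print(doubled.match("aab").group(0))   # aa
print(doubled.match("abb"))            # None
```

In this engine the captured value survives after the lookahead finishes, so it can be read from the match object and used in a backreference.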

Real-world Examples

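After all that boring stuff, I'm sure you are ready to see how lookaround can be put to use effectively. Listed here are some real-world applications of lookaround, with an explanation of why each pattern works.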

Passwords Containing Special Characters and of a Specific Length

Scenario

You want to ensure that a password meets a set of criteria. The password should be between 8 and 15 characters, contain at least one upper-case alpha character, contain at least one lower-case alpha character, contain at least one digit, and contain at least one of the following: $, %, #, @, &.

Pattern

^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[$%#@&]).{8,15}$

Why This Works

We have four lookaheads: one for each of the above conditions specifying a type of character to be included in the password. After matching the beginning of the string, we evaluate each lookahead. Because lookaround is non-consuming, we never leave the void before the first character of the string upon completion of each lookahead. Each lookahead checks for the existence of one of the character restrictions specified, using dot-star to skip over any unimportant characters. By the time we have evaluated the last lookahead, all that is left is to evaluate the bounded dot of the pattern. Since dot matches any character, we bound the dot to restrict the length of our string (^ and $ are required to make the bound effective).
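
A quick way to convince yourself of the explanation above is to run the pattern against a few candidates. This is a minimal sketch in Python; the helper name is_valid and the sample passwords are hypothetical, and the second pattern drops ^ and $ only to show why they are required.

```python
import re

PASSWORD = re.compile(
    r"^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[$%#@&]).{8,15}$"
)

# Same pattern without the anchors, for comparison only.
UNANCHORED = re.compile(
    r"(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[$%#@&]).{8,15}"
)

def is_valid(pw):
    """True when pw satisfies every lookahead and the length bound."""
    return PASSWORD.match(pw) is not None

print(is_valid("Passw0rd#"))            # True: all four kinds, 9 characters
print(is_valid("passw0rd#"))            # False: no upper-case letter
print(is_valid("Pw0#"))                 # False: shorter than 8 characters
print(is_valid("Passw0rd#Passw0rd#"))   # False: longer than 15 characters

# Without ^ and $, the 18-character string still "matches" a 15-character slice.
print(UNANCHORED.search("Passw0rd#Passw0rd#") is not None)   # True
```

The last line shows the bounded dot losing its meaning once the anchors are gone: the engine simply matches the first 15 characters and stops.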

Caveats

One's first instinct might be to combine the lookaheads into one to save keystrokes. The reason not to do this is that if your requirement is that the characters can be at any position, in any order within the target string, then the four separate lookaheads are needed. If instead your requirement is that they occur at any position, but in a specific order, then you could combine the four into one lookaround, concatenating each required character condition with a dot-star to ignore unimportant characters.
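
The difference between separate and combined lookaheads is easy to see with two conditions instead of four. In this sketch (Python, with shortened patterns and sample strings of my own), the combined lookahead silently imposes an order.

```python
import re

# Two independent lookaheads: an upper-case letter and a digit, in any order.
separate = re.compile(r"^(?=.*?[A-Z])(?=.*?[0-9]).+$")

# One combined lookahead: an upper-case letter FOLLOWED LATER by a digit.
combined = re.compile(r"^(?=.*?[A-Z].*?[0-9]).+$")

for s in ["A1", "1A"]:
    print(s, separate.match(s) is not None, combined.match(s) is not None)
# A1 True True
# 1A True False
```

Each separate lookahead starts over from the same position, which is what makes the conditions order-independent.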

Use of the bounded dot at the end could be a security concern for you. I used it here for simplicity, but you would really want to provide further restrictions on what type of characters your password can consist of. No reason to accept null characters as valid input unless you really allow passwords to have them!

The ^ and $ would be required for this particular application. If you did not include them, then the bounded dot at the end of the pattern would be pointless.

Extract the Integer Portion of a Decimal Number, If and Only If it Has a Fractional Part

Scenario

You want to find the integer portion of a decimal number within text. You have a peculiar requirement that you only want values if they have a decimal part. Why? Who knows. No one ever said the business side had any sense :)

Pattern

\d+(?=\.\d+)

Why This Works

The engine finds one or more digits, then checks for the existence of a decimal point and one-or-more digits. Since the lookahead here is positive, the match will only succeed if the engine finds a decimal part. Since lookahead is non-consuming, the engine has only matched, overall, the integer portion of the decimal number.
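
Here is the pattern applied to a line of text, as a minimal Python sketch. The sample sentence is made up; it deliberately includes an integer, two decimals, and a number followed by a sentence-ending period.

```python
import re

text = "Totals: 42, 3.14, 100.5 and 7."

# Digits that are followed by a decimal point and at least one more digit.
integer_parts = re.findall(r"\d+(?=\.\d+)", text)

print(integer_parts)   # ['3', '100']
```

The 42 is skipped because no decimal point follows it, and the final 7 is skipped because the period after it is not followed by a digit.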

Caveats

None.

Find a Word That Is NOT Preceded by Some Word or Phrase

Scenario

You are looking for a particular word. The condition for finding this word, though, is that it not be preceded by some other particular word or phrase.

Pattern

(?<!Hello )World

Why This Works

The engine searches for the word "World". Once it finds it, it begins evaluating the lookbehind. If it finds the string "Hello" and a trailing space, then the lookbehind fails, since it is a negative lookbehind.
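
A short sketch of this pattern in Python, with sample strings of my own choosing. The third case previews the case-sensitivity caveat discussed next.

```python
import re

pattern = re.compile(r"(?<!Hello )World")

print(pattern.search("Hello World"))                   # None
print(pattern.search("Goodbye World").group())         # World
print(pattern.search("hello World") is not None)       # True: "hello " is not "Hello "
```

Only the word itself is returned; the text inspected by the lookbehind is never part of the match.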

Caveats

The patterns inside the lookbehind function the same as patterns outside the lookbehind. As such, just being inside the lookbehind doesn't implicitly make the search for "Hello" case-insensitive. Having a target string of "hello World" would cause the lookbehind to succeed. One interesting feature of some regex engines is that you can turn on case-insensitivity within certain scopes. Changing the pattern to:

(?<!(?i)Hello )World

would turn on case-insensitivity just for the lookbehind. Even in engines which support lookaround, this feature is not always available.
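
The exact syntax for scoped flags varies by engine. As one example, recent versions of Python's re module reject a bare (?i) in the middle of a pattern, but they accept the scoped form (?i: ... ), so the sketch below uses that variant instead of the pattern shown above. Treat it as an assumption about one engine, not as the general syntax.

```python
import re

# Case-insensitivity applies only inside the lookbehind; "World" stays case-sensitive.
scoped = re.compile(r"(?<!(?i:Hello ))World")

print(scoped.search("hello World"))               # None: lookbehind now sees "hello "
print(scoped.search("HELLO World"))               # None
print(scoped.search("Goodbye World").group())     # World
print(scoped.search("Goodbye world"))             # None: "world" is not "World"
```

The last line confirms that the flag did not leak outside the lookbehind.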

Split Pascal-cased Identifiers Into Component Words

Scenario

You follow good variable-naming conventions and your convention in use is Pascal casing (sometimes called CamelCase). You want to split a variable name into its component parts.

Pattern

(?<=[a-z])(?=[A-Z])

Why This Works

This one is a bit tricky to explain, but I'll do my best.

You could think of this, in a way, as looping through the voids between characters. While in each of these voids, we look backward to find a lower-case alpha character. If that succeeds, we look forward to find an upper-case alpha character. If both succeed, we do a replace substituting in a space. You can think of this void as being turned into a space.

You could accomplish the same thing by doing a search for a lower-case alpha adjoined on the right with an upper-case alpha, capture each character in its own capture group, and then enter backreferences, each separated by a space into the replacement.
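
Both approaches described above can be compared directly. This is a minimal Python sketch with a made-up identifier; note that splitting on an empty match with re.split needs Python 3.7 or later.

```python
import re

name = "SplitPascalCasedIdentifiers"

# Lookaround version: match the void between a lower-case and an upper-case letter.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
words = re.split(r"(?<=[a-z])(?=[A-Z])", name)

# Capture-group version: consume both letters, then put them back around a space.
spaced_with_groups = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)

print(spaced)               # Split Pascal Cased Identifiers
print(words)                # ['Split', 'Pascal', 'Cased', 'Identifiers']
print(spaced_with_groups)   # Split Pascal Cased Identifiers
```

The lookaround version never touches the letters themselves, which is why it also works as a split pattern without losing any characters.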

Caveats

None.

Note:

The next two examples are slightly specialized and I developed them for a question I answered here on EE (see the complete question and answer here).

Tokenizer

Scenario

The scenario here is to split a string into tokens. For the question's purposes, tokens were considered anything separated by spaces, punctuation, or other non-word characters. Let me stave off the faint-of-heart by saying that if you have not gotten comfortable with lookaround prior to this point, then I would suggest you avoid looking at this next pattern. Believe me, it was difficult enough to write!

Pattern

|(?<=\w)(?=\W)|(?<=\W)(?=\W)|(?<=\W)(?=\w)

Note: The first character in the above pattern is a space, which comes immediately before the first |.

Why This Works

What we have is a series of OR conditions. The first condition checks for a space; if we find a space then the split is trivial. The second condition, similar to the Split Pascal-cased Identifiers Into Component Words example above, effectively loops through the voids between characters. To the left of the void, we check for a word character; to the right, a non-word character. If both conditions are met, a split occurs on the void and both "words" are preserved. The remaining two conditions work the same way. The third condition checks for two non-word characters and the fourth condition checks for a non-word character on the left and a word character on the right.

Caveats

The downside of this approach is that because of the inner working of OR in regex, the splits produced by the lookahead parts of the pattern will end up producing null (empty string) entries in the output array. If you were to use this example, then be aware that you would need to check for null values in some of your array slots.
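
Here is a minimal sketch of the tokenizer in Python, including the clean-up step this caveat calls for. It assumes Python 3.7 or later, where re.split can split on empty matches; the sample sentence and the names TOKEN_SPLIT, raw, and tokens are my own.

```python
import re

# Leading alternative is a literal space, followed by the three void checks.
TOKEN_SPLIT = re.compile(r" |(?<=\w)(?=\W)|(?<=\W)(?=\W)|(?<=\W)(?=\w)")

raw = TOKEN_SPLIT.split("Hello, World!")

# The raw result contains an empty string where a space match and a void
# match occur back to back, so filter the empties out.
tokens = [t for t in raw if t]

print("" in raw)   # True
print(tokens)      # ['Hello', ',', 'World', '!']
```

The empty entry appears right after the space is consumed: the void that follows it also satisfies one of the lookaround alternatives, so the engine splits twice at what is effectively the same boundary.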

Note:

It's highly unlikely you'd want to do something like the following, but I'm including it to demonstrate how you can nest lookaround expressions.

Tokenizer on Steroids

Scenario

The requirement for the above tokenizer changed during the course of the question. The new requirement was for dates to be treated as single tokens rather than being split at the separators. Let me stave off the faint-of-heart by saying that if you have not gotten comfortable with lookaround prior to this point, then I would suggest you avoid looking at this next pattern. Believe me, it was difficult enough to write!

Pattern

\s+|(?<=\w)(?=\W)(?!(?<=\d)(?=([-/])\d\d?\1(?:\d\d){1,2}))(?!(?<=\d([-/])\d\d?)(?=\2(?:\d\d){1,2}))|(?<=\W)(?=\W)|

(?<=\W)(?=\w)(?!(?<=\d([-/]))(?=\d\d?\3(?:\d\d){1,2}))(?!(?<=\d\4\d\d?([-/]))(?=(?:\d\d){1,2}))

Note: I have split the pattern into two lines to prevent line breaks in awkward places when you are viewing this page. Take note that this is one pattern and should be treated as one string if you experiment with it.

Why This Works

The basic parts from Tokenizer are still employed here, but I added a few lookarounds to check whether, during the course of matching, the engine was currently looking at part of a date. The new lookarounds all function pretty much the same way: a negative lookahead is used to not match a condition, and within the negative lookahead, I use a combination of positive lookbehinds and positive lookaheads to see what is before and after the current position within the engine--if as a whole the engine finds what comprises a valid date structure, then the negative lookahead fails, and a split does not occur. If instead the engine does not find a valid date structure, the negative lookahead succeeds and I split at the current position.

Caveats

The same caveat in Tokenizer still applies here. As you can see, this one is parentheses-laden. It's easy to leave a parenthesis out when constructing patterns such as this. I advise having a text editor which provides bracket matching so you don't lose track of your parentheses!

Recall from the Lookaround by Example section that I said you can think of lookaround as an extra, temporary pointer that moves around in your target string independent of, but relative to, the match pointer. For each level of nesting you embed in your lookarounds, add an additional pointer, where subsequent levels are relative to their parent lookaround's current position pointer.
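
The full pattern above relies on features that not every engine accepts, so here is a much smaller nested lookaround, sketched in Python, that shows the same idea. It is my own simplified stand-in, not the pattern from the question: it finds word-to-non-word boundaries, except where a digit is followed by a date separator and another digit.

```python
import re

s = "due 12/25, ok"

# Boundary between a word character and a non-word character.
boundary = re.compile(r"(?<=\w)(?=\W)")

# Same boundary, with a nested check: a negative lookahead that itself
# contains a lookbehind and a lookahead describing "inside a date".
boundary_not_in_date = re.compile(r"(?<=\w)(?=\W)(?!(?<=\d)(?=[-/]\d))")

plain_positions = [m.start() for m in boundary.finditer(s)]
nested_positions = [m.start() for m in boundary_not_in_date.finditer(s)]

print(plain_positions)    # [3, 6, 9]
print(nested_positions)   # [3, 9]
```

Position 6 is the void between "12" and "/25". The inner lookbehind and lookahead both succeed there, so the enclosing negative lookahead fails and that boundary is dropped, which is the mechanism the date-aware tokenizer uses to keep a date in one piece.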

Summary

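As you have now seen, lookaround extends the base functionality of regular expressions. Although not every engine offers an implementation of lookaround, where it is available it lets you perform some interesting matching, replacing, and splitting operations within a compact unit.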

Be sure to confirm which types of lookaround your language and its engine support, including whether both positive and negative lookarounds are available. Just as with base regular expressions, you should always know what your inputs will be and test your pattern with a variety of inputs. Lookarounds can accept the same regular expressions you would normally use, so the same rules apply inside the lookaround as outside of it. Also, be sure to be attentive in writing your patterns--it is very easy to get lost in a sea of parentheses!

You've made it this far. Congratulations! I hope I haven't embedded Matrix-esque visions of regular expression symbols into your subconscious. For most everyday matching needs, you will find satisfaction with the base functionality provided by regular expressions. With the examples you have seen above, you should be able to get a sense of when using lookaround might be beneficial--either out of necessity or out of a preference to save keystrokes. It was my intent to make you more comfortable with regular expression lookaround. If I failed, don't stress; just stay positive and take a lookaround the 'net.
